Django Query with Case Insensitive Data Both Ways

There are a lot of similar questions, but I'm only finding partial solutions.
I have a group of users stored as objects, with a name attribute (User.name). I'm hoping to do a query with a user input (foo) such that I can, case-insensitively, find all users where either:
foo is in User.name
User.name is in foo
As an example, I want the user to be able to type in "Jeff William II" and return "Anderson Jeff William II", "jeff william iii", as well as "Jeff Will" and "william ii"
I know I can use Q objects to combine two queries, and I can use annotate() to transform User.name like so (though I welcome edits if you notice errors in this code):
users = User.objects.annotate(name_upper=Upper('name')).filter(Q(name_upper__icontains=foo) | Q(name_upper__in=foo))
But I'm running into trouble using __in to match multiple letters within a string. So if User.name is "F" I get a hit when inputting Jeff but if User.name is "JE" then it doesn't show up. How do I match multiple letters, or is there a better way to make this query?
SIDE NOTE: I initially solved this with the following, but would prefer making a query if possible.
for u in User.objects.all():
    if u.name in foo or foo in u.name:
        ...

Please do not use Upper. It is a common misconception that by converting two items to uppercase (or lowercase) you make a case-insensitive equality check. Certain characters, like ß, have no simple uppercase/lowercase counterpart and need more complicated rules (collation) to be considered equivalent. In Python one uses .casefold(…) [python-doc] for that.
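A minimal illustration in Python of why casefold() is preferred over plain case mapping (the strings are just sample values):
'STRASSE'.lower() == 'straße'.lower()        # False: 'ß' stays 'ß' when lowercased
'STRASSE'.casefold() == 'straße'.casefold()  # True: casefold() maps 'ß' to 'ss'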
For the database, you can simply make use of annotate, and then use two checks:
from django.db.models import CharField, F, Q, Value
foo = 'Jeff William II'
User.objects.annotate(foo=Value(foo, CharField())).filter(
    Q(name__icontains=foo) | Q(foo__icontains=F('name'))
)
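With the question's sample input foo = 'Jeff William II', the first Q matches names such as "Anderson Jeff William II" and "jeff william iii" (the input is contained in the name), while the second Q matches "Jeff Will" and "william ii" (the name is contained in the input).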

Related

ArangoDB query find name consisting of two or more parts

I am quite new to ArangoDB and AQL. What I am trying to do is find names and surnames with a query, where the name has two or more parts that may or may not be divided by spaces (" ").
As I went through the documentation, I figured that those fields should have full-text indexes and the query would be something like:
FOR u IN FULLTEXT(collection, field, "search") RETURN u
Now my question is how to utilize the above query for Persian/Arabic names such as
سید امیر حسین
Keep in mind that this name comes from user input and it could be in formats such as
سید امیر حسین/سید امیرحسین/سیدامیر حسین/سیدامیرحسین
Note the variants of the name with/without spaces. I tried the following query:
FOR u IN FULLTEXT(collection, field, "سید,|امیر,|حسین,|سیدامیرحسین") RETURN u
But the problem is that this will also fetch names such as
سید محمد
which is not my intention; I reckon this is because of the OR (|) operator. So what am I doing wrong or missing here?
Many thanks in advance and excuse my noobiness
PS: ArangoDB 3.8.3

How to make full-text search in PostgreSQL match any order of words in the input search_text, even when there are other words between them

I'm trying to use Django full-text search but I'm having a problem:
I've followed this documentation, and it works quite well. But my problem is that I don't want Postgres to consider the order or permutation of the words I search for.
I mean I want the result of searching "good boy" and "boy good" to be the same.
and also when I search "good boy" I want to see the "good bad boy" in the results.
But neither of these happens, and I can't find "good bad boy" by typing "boy good" or even "good boy" (because of the missing "bad").
I've tried to split search_text on spaces and then & the search queries together like this, in order to ignore word order, but it didn't work.
I changed this code:
search_query = SearchQuery(
    search_text
)
search_rank = SearchRank(search_vectors, search_query)
to this:
s = SearchQuery(search_text.split(' ')[0])
for x in search_text.split(' ')[1:]:
    s = s & SearchQuery(x)
search_query = s
search_rank = SearchRank(search_vectors, search_query)
You can achieve this quite easily in PostgreSQL.
I suggest reading carefully the documentation on Basic Text Matching.
What you need is the <-> (FOLLOWED BY) tsquery operator.
If you use an integer instead of the dash, you can express the desired proximity:
<N>, where N is an integer standing for the difference between the positions of the matching lexemes.
So you should find a way to make Python execute this PostgreSQL expression:
to_tsquery('good <1> boy | boy <1> good');
Maybe just calling SearchQuery with the string 'good <1> boy | boy <1> good' works, but you should refer to the documentation to find out how to use <-> with SearchQuery.
Edited after comment.
Looking at the Django source code you can see that the constructor of the SearchQuery class has a default parameter search_type='plain'.
You can pass a different parameter to implement a different kind of search.
In the Django project documentation, you can find one example for each accepted string value for the named parameter search_type: 'plain', 'phrase', 'raw', 'websearch'.
If you pass search_type='raw' you can provide a formatted search query with terms and operators, but I'm not sure whether Django supports the <N> operator mentioned above, so you should try it.
Have you tried a search with search_type='raw', and a value of "'good <1> boy | boy <1> good'"?
If the FOLLOWED BY operator (<->) is not supported, the closest you can get to what you want is calling SearchQuery with search_type='phrase'.
This will find both "good boy" and "boy good", but the exact behaviour depends on the stop words defined by the text search dictionary configured in PostgreSQL.
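A minimal sketch of how this might look in Django, assuming a hypothetical Article model with a text field body, a Django version whose SearchQuery accepts search_type='raw', and PostgreSQL 9.6+ (needed for <N>):
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# 'raw' hands the value straight to to_tsquery(), so tsquery operators such as <1> and | are allowed
query = SearchQuery('good <1> boy | boy <1> good', search_type='raw')
vector = SearchVector('body')

Article.objects.annotate(
    search=vector,
    rank=SearchRank(vector, query),
).filter(search=query).order_by('-rank')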

What's the best way to match strings in a file to case class in Scala?

We have a file that contains data that we want to match to a case class. I know enough to brute-force it, but I'm looking for an idiomatic way to do it in Scala.
Given File:
#record
name:John Doe
age: 34
#record
name: Smith Holy
age: 33
# some comment
#record
# another comment
name: Martin Fowler
age: 99
(field values on two lines are INVALID, e.g. name:John\n Smith should error)
And the case class
case class Record(name:String, age:Int)
I want to return a Seq type such as Stream:
val records: Stream[Record]
A couple of ideas I'm working with but haven't implemented so far:
Remove all newlines and treat the whole file as one long string. Then match on the string "((?!name).)+((?!age).)+age:([\s\d]+)" and create a new object of my case class for each match, but so far my regex-fu is weak and I can't match around comments.
Recursive idea: iterate through each line to find the first line that matches a record, then recursively call the function to match name, then age. Tail-recursively return Some(new Record(cumulativeMap.get(name), cumulativeMap.get(age))), or None when hitting the next record after name (i.e. age was never encountered).
?? Better Idea?
Thanks for reading! The file is more complicated than above but the same rules apply. For the curious: I'm trying to parse a custom M3U playlist file format.
I'd use kantan.regex for a fairly trivial regex based solution.
Without fancy shapeless derivation, you can write the following:
import kantan.regex._
import kantan.regex.implicits._
case class Record(name:String, age:Int)
implicit val decoder = MatchDecoder.ordered(Record.apply _)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
This yields:
List(Success(Record(John Doe,34)), Success(Record(Smith Holy,33)), Success(Record(Martin Fowler,99)))
Note that this solution requires you to hand-write the decoder, but it can often be automatically derived. If you don't mind a shapeless dependency, you could simply write:
import kantan.regex._
import kantan.regex.implicits._
import kantan.regex.generic._
case class Record(name:String, age:Int)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
And get the exact same result.
Disclaimer: I'm the library's author.
You could use Parser Combinators.
If you have the file format specification in BNF or can write one, then Scala can create a parser for you from those rules. This may be more robust than hand-made regex based parsers. It's certainly more "Scala".
I don't have much experience in Scala, but could these regexes work?
You could use (?<=name:).* to match the name value, and (?<=age:).* to match the age value. If you use this, trim spaces in the found matches; otherwise "name: bob" will match " bob" with a leading space, which you might not want.
If name: or any other tag appears inside a comment, or a comment comes after the value, it will still be matched. Please leave a comment if you want to avoid that.
You could try this:
import java.nio.file.{Files, Paths}
import java.nio.charset.Charset
import scala.jdk.CollectionConverters._  // use scala.collection.JavaConverters on Scala < 2.13

val file = Paths.get("file.txt")
val lines = Files.readAllLines(file, Charset.defaultCharset()).asScala
val records = lines
  .filter(s => s.startsWith("age:") || s.startsWith("name:"))
  .grouped(2).toList.map {
    case Seq(name, age) => Record(name.replaceAll("name:", "").trim,
                                  age.replaceAll("age:", "").trim.toInt)
  }

Is there a way to filter a django queryset based on string similarity (a la python difflib)?

I have a need to match cold leads against a database of our clients.
The leads come from a third-party provider in bulk (thousands of records) and sales is asking us to (in their words) "filter out our clients" so they don't try to sell our service to an established client.
Obviously, there are misspellings in the leads. Charles becomes Charlie, Joseph becomes Joe, etc. So I can't really just do a filter comparing lead_first_name to client_first_name, etc.
I need to use some sort of string similarity mechanism.
Right now I'm using the lovely difflib to compare the leads' first and last names to a list generated with Client.objects.all(). It works, but because of the number of clients it tends to be slow.
I know that most SQL databases have soundex and difference functions. See my test of them in the update below: they don't work as well as difflib.
Is there another solution? Is there a better solution?
Edit:
Soundex, at least in my db, doesn't behave as well as difflib.
Here is a simple test - look for "Joe Lopes" in a table containing "Joseph Lopes":
with temp (first_name, last_name) as (
select 'Joseph', 'Lopes'
union
select 'Joe', 'Satriani'
union
select 'CZ', 'Lopes'
union
select 'Blah', 'Lopes'
union
select 'Antonio', 'Lopes'
union
select 'Carlos', 'Lopes'
)
select first_name, last_name
from temp
where difference(first_name+' '+last_name, 'Joe Lopes') >= 3
order by difference(first_name+' '+last_name, 'Joe Lopes')
The above returns "Joe Satriani" as the only match. Even reducing the similarity threshold to 2 doesn't return "Joseph Lopes" as a potential match.
But difflib does a much better job:
difflib.get_close_matches('Joe Lopes', ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes', 'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes'])
['Joseph Lopes', 'CZ Lopes', 'Carlos Lopes']
Edit after gruszczy's response:
Before writing my own, I looked for and found a T-SQL implementation of Levenshtein Distance in the repository of all knowledge.
In testing it, it still won't do a better matching job than difflib.
Which led me to research what algorithm is behind difflib. It seems to be a modified version of the Ratcliff-Obershelp algorithm.
Unfortunately I can't seem to find another kind soul who has already created a T-SQL implementation based on difflib's algorithm... I'll try my hand at it when I can.
If nobody else comes up with a better answer in the next few days, I'll grant it to gruszczy. Thanks, kind sir.
soundex won't help you, because it's a phonetic algorithm. Joe and Joseph aren't similar phonetically, so soundex won't mark them as similar.
You can try Levenshtein distance, which is implemented in PostgreSQL. Maybe in your database too and if not, you should be able to write a stored procedure, which will calculate the distance between two strings and use it in your computation.
It's possible with trigram_similar lookups since Django 1.10; see the docs for PostgreSQL-specific lookups and full-text search.
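A rough sketch of what that can look like, assuming a first_name field on the Client model, 'django.contrib.postgres' in INSTALLED_APPS and the pg_trgm extension installed in the database:
from django.contrib.postgres.search import TrigramSimilarity

# rank clients by trigram similarity to the lead's first name
Client.objects.annotate(
    similarity=TrigramSimilarity('first_name', 'Joe'),
).filter(similarity__gt=0.3).order_by('-similarity')

# or use the lookup directly with the default similarity threshold
Client.objects.filter(first_name__trigram_similar='Joe')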
As per andilabs' answer, you can use the Levenshtein function to create your own custom function. The Postgres docs indicate that the levenshtein function has the following signatures:
levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
levenshtein(text source, text target) returns int
andilabs' answer uses only the second form. If you want a more advanced search with insertion/deletion/substitution costs, you can rewrite the function like this:
from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s', %(ins_cost)d, %(del_cost)d, %(sub_cost)d)"
    function = 'levenshtein'

    def __init__(self, expression, search_term, ins_cost=1, del_cost=1, sub_cost=1, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            ins_cost=ins_cost,
            del_cost=del_cost,
            sub_cost=sub_cost,
            **extras
        )
And call the function:
from django.db.models import F
Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka', 3, 3, 1)  # ins = 3, del = 3, sub = 1
).filter(
lev_dist__lte=2
)
If you need to get there with Django and Postgres and don't want to use the trigram similarity introduced in 1.10 (https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/lookups/#trigram-similarity), you can implement it using Levenshtein like this:
Required extension: fuzzystrmatch
You need to add the Postgres extension to your database in psql:
CREATE EXTENSION fuzzystrmatch;
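If you prefer a migration over running psql by hand, Django's CreateExtension operation can do the same thing; a sketch (the commented dependency is a placeholder for your app's latest migration):
from django.contrib.postgres.operations import CreateExtension
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [
        # ('yourapp', '0001_initial'),  # point at your app's latest migration
    ]
    operations = [
        CreateExtension('fuzzystrmatch'),
    ]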
Let's define a custom function with which we can annotate the queryset. It takes just one argument, the search_term, and uses the Postgres levenshtein function (see docs):
from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s')"
    function = "levenshtein"

    def __init__(self, expression, search_term, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            **extras
        )
Then, anywhere else in the project, we just import the defined Levenshtein and F to pass the Django field.
from django.db.models import F
Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka')
).filter(
lev_dist__lte=2
)

What is the simplest way to join __contains and __in?

I am writing a tag search function. A user can follow a lot of tags, which I get all in one tuple, and now I would like to find all texts which include at least one tag from the list.
Symbolic: text__contains__in=('asd','dsa')
My only idea is to do a loop, e.g.:
q = text.objects.all()
for t in tag_tuple:
    q.filter(data__contains=t)
For example:
input: a tuple of tags, ('car', 'cat', 'cinema')
output: all messages that contain at least one word from that tuple, e.g. "My cat is in the car", "cat is not allowed in the cinema", "I will drive my car to the cinema"
Thanks for help!
Here you go:
filter = Q()
for t in tag_tuple:
    filter = filter | Q(data__contains=t)
return text.objects.filter(filter)
A couple of tips:
You should be naming your model classes with a capital (i.e. Text, not text)
You may want __icontains instead if you're not worried about the case
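For what it's worth, the same OR filter can be built in one go with functools.reduce (a sketch; Text, data and tag_tuple are the names used in the question, with the model capitalized as suggested above):
from functools import reduce
from operator import or_
from django.db.models import Q

tag_tuple = ('car', 'cat', 'cinema')
# OR together one Q object per tag
query = reduce(or_, (Q(data__icontains=t) for t in tag_tuple))
results = Text.objects.filter(query)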
I don't know Django, so I have no idea how to apply this filter, but it seems you want a function like this one:
def contains_one_of(tags, text):
    text = text.split()  # tags should match complete words, not partial words
    return any(t in text for t in tags)