Django - Perform a complex operation in .update()

I have the following model:
class Address(models.Model):
    full_address = models.CharField(max_length=100)
Some full_address values end with "Region". Examples:
123 Main Street, Markham, York Region
1 Bloor Street, Mississauga, Peel Region
I want to remove "Region" from any full_address field that ends with it.
Here is one possible solution, but it is slow, since you need to go over each Address one by one:
for i in Address.objects.filter(full_address__endswith=' Region'):
    i.full_address = i.full_address[:-7]
    i.save()
Is there some way to achieve the above function using Address.objects.update()?

You can use Query Expressions here:
from django.db.models import F, Func, Value

Address.objects.filter(full_address__endswith=' Region').update(
    full_address=Func(
        F('full_address'),
        Value(' Region'), Value(''),
        function='replace',
    )
)
Note that if a string could contain the text ' Region' somewhere in the middle as well as at the end, this will replace both occurrences with the empty string. It seems unlikely, but if you want to be especially careful you could use regexp_replace instead of replace, with the appropriate pattern for the first Value (i.e. Value(' Region$')) to ensure you only match the part you want to replace.
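A minimal sketch of that more careful variant, assuming a backend that provides a regexp_replace function (e.g. PostgreSQL):

Address.objects.filter(full_address__endswith=' Region').update(
    full_address=Func(
        F('full_address'),
        Value(' Region$'), Value(''),  # anchored pattern: only the trailing ' Region' is replaced
        function='regexp_replace',
    )
)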

Related

Django Query with Case Insensitive Data Both Ways

There are a lot of similar questions, but I'm only finding partial solutions.
I have a group of users stored as objects, with a name attribute (User.name). I'm hoping to do a query with a user input (Foo) such that I can (without being case sensitive) find all users where either:
foo is in User.name
User.name is in foo
As an example, I want the user to be able to type in "Jeff William II" and return "Anderson Jeff William II", "jeff william iii", as well as "Jeff Will" and "william ii"
I know I can use the Q function to combine two queries, and I can use annotate() to transform User.name like so (though I welcome edits if you notice errors in this code):
users = User.objects.annotate(name_upper=Upper('name')).filter(Q(name_upper__icontains=foo) | Q(name_upper__in=foo))
But I'm running into trouble using __in to match multiple letters within a string. So if User.name is "F" I get a hit when inputting Jeff but if User.name is "JE" then it doesn't show up. How do I match multiple letters, or is there a better way to make this query?
SIDE NOTE: I initially solved this with the following, but would prefer making a query if possible.
for u in User.objects.all():
    if u.name in foo or foo in u.name:
        ...  # keep u as a match
Please do not use Upper. It is a common misconception that by converting two items to uppercase (or lowercase) you make a case-insensitive equality check. Certain characters, like ß, have no uppercase/lowercase and have more complicated rules (collation) to consider these equivalent. In Python one uses .casefold(…) [python-doc] for that.
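For example, in Python, lowercasing misses an equivalence that casefolding catches:
>>> 'straße'.lower() == 'strasse'.lower()
False
>>> 'straße'.casefold() == 'strasse'.casefold()
True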
For the database, you can simply make use of annotate, and then use two checks:
from django.db.models import CharField, F, Q, Value

foo = 'Jeff William II'
User.objects.annotate(foo=Value(foo, CharField())).filter(
    Q(name__icontains=foo) | Q(foo__icontains=F('name'))
)

How to get column names of a pandas.DataFrame from below given description of the data

Every column name ends with a colon, and the next column name starts on a new line after the previous line ends with a full stop, so there should be a way to get a list of column names from the string:
data_description = '''age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school.
education-num: continuous.'''
How can I get the output below?
Columns = ['age','workclass','fnlwgt','education','education-num']
The title of your post says get column names of a pandas.DataFrame, but I don't see any pandas code anywhere in your explanation.
You could do this very easily through pandas:
First create your dictionary like this:
data_description = {'age': ['continuous.'],
                    'workclass': ['Private, Self-emp-not-inc, Self-emp-inc, Federal-gov.'],
                    'fnlwgt': ['continuous.'],
                    'education': ['Bachelors, Some-college, 11th, HS-grad, Prof-school.'],
                    'education-num': ['continuous.']}
Then create a DataFrame using the above dict:
import pandas as pd

df = pd.DataFrame(data_description)
Then just say list(df.columns) and it will give you all column names in a list.
In [1009]: list(df.columns)
Out[1009]: ['age', 'education', 'education-num', 'fnlwgt', 'workclass']
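If you'd rather not type the dictionary out by hand, a small sketch that builds the same dict from the question's original string (assuming every line has the name: value shape):

data_description_dict = {
    line.split(':', 1)[0]: [line.split(':', 1)[1].strip()]
    for line in data_description.split('\n')
}
df = pd.DataFrame(data_description_dict)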
Try this:
>>> Columns = [i.split(':')[0] for i in data_description.split() if ':' in i]
>>> Columns
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
Using regular expressions, capture the non-space (\S+) characters before a colon; the parentheses are used to capture, and \S means the opposite of \s, i.e. a non-whitespace character. In this case, you can simply do:
import re
re.findall(r'(\S+):',data_description)
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
If you need to take the \n into consideration, perhaps because the data might contain tokens that are followed by a colon yet are not column names, then:
re.findall(r'(?:^|\n)(\S+):',data_description)
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
I would first remove all the \n characters that come with the string, and then apply the split() and filter() methods, like this:
data_description = data_description.replace("\n", "")
columns = [i.split(":")[0] for i in list(filter(None, data_description.split(".")))]
Now you get the name of each column:
columns
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
There is no general rule. For each case you have to think about how to remove leading and trailing whitespace, and try to use methods like split() in a way that gets you what you need.
This is a simple one-liner.
print([every_line.split(':')[0] for every_line in data_description.split('\n')])

What's the best way to match strings in a file to case class in Scala?

We have a file that contains data that we want to match to a case class. I know enough to brute-force it, but I'm looking for an idiomatic way in Scala.
Given File:
#record
name:John Doe
age: 34
#record
name: Smith Holy
age: 33
# some comment
#record
# another comment
name: Martin Fowler
age: 99
(field values on two lines are INVALID, e.g. name:John\n Smith should error)
And the case class
case class Record(name:String, age:Int)
I want to return a Seq type such as Stream:
val records: Stream[Record]
A couple of ideas I'm working with, but so far haven't implemented:
Remove all newlines and treat the whole file as one long string. Then grep-match on the string "((?!name).)+((?!age).)+age:([\s\d]+)" and create a new object of my case class for each match, but so far my regex-fu is low and I can't match around the comments.
Recursive idea: iterate through each line to find the first line that matches record, then recursively call the function to match name, then age. Tail-recursively return Some(new Record(cumulativeMap.get(name), cumulativeMap.get(age))), or None when hitting the next record after name (i.e. age was never encountered).
Or is there a better idea?
Thanks for reading! The file is more complicated than the above, but all the rules are the same. For the curious: I'm trying to parse a custom M3U playlist file format.
I'd use kantan.regex for a fairly trivial regex-based solution.
Without fancy shapeless derivation, you can write the following:
import kantan.regex._
import kantan.regex.implicits._
case class Record(name:String, age:Int)
implicit val decoder = MatchDecoder.ordered(Record.apply _)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
This yields:
List(Success(Record(John Doe,34)), Success(Record(Smith Holy,33)), Success(Record(Martin Fowler,99)))
Note that this solution requires you to hand-write decoder, but it can often be derived automatically. If you don't mind a shapeless dependency, you could simply write:
import kantan.regex._
import kantan.regex.implicits._
import kantan.regex.generic._
case class Record(name:String, age:Int)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
And get the exact same result.
Disclaimer: I'm the library's author.
You could use Parser Combinators.
If you have the file format specification in BNF, or can write one, then Scala can create a parser for you from those rules. This may be more robust than hand-made regex-based parsers. It's certainly more "Scala".
I don't have much experience in Scala, but could these regexes work:
You could use (?<=name:).* to match the name value, and (?<=age:).* to match the age value. If you use this, trim the spaces in the found matches; otherwise name: bob will match bob with a leading space, which you might not want.
If name: or any other tag is in comment, or comment is after value, something will be matched. Please leave a comment if you want to avoid that.
You could try this:
import java.nio.charset.Charset
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val file = Paths.get("file.txt")
val lines = Files.readAllLines(file, Charset.defaultCharset()).asScala.toList
val records = lines.filter(s => s.startsWith("age:") || s.startsWith("name:"))
  .grouped(2).toList.map {
    case List(a, b) => Record(a.replaceAll("name:", "").trim,
                              b.replaceAll("age:", "").trim.toInt)
  }

Is there a way to filter a django queryset based on string similarity (a la python difflib)?

I have a need to match cold leads against a database of our clients.
The leads come from a third-party provider in bulk (thousands of records), and sales is asking us to (in their words) "filter out our clients" so they don't try to sell our service to an established client.
Obviously, there are misspellings in the leads. Charles becomes Charlie, Joseph becomes Joe, etc. So I can't really just do a filter comparing lead_first_name to client_first_name, etc.
I need to use some sort of string similarity mechanism.
Right now I'm using the lovely difflib to compare the leads' first and last names to a list generated with Client.objects.all(). It works, but because of the number of clients it tends to be slow.
I know that most SQL databases have soundex and difference functions. See my test of it in the update below - it doesn't work as well as difflib.
Is there another solution? Is there a better solution?
Edit:
Soundex, at least in my db, doesn't behave as well as difflib.
Here is a simple test - look for "Joe Lopes" in a table containing "Joseph Lopes":
with temp (first_name, last_name) as (
select 'Joseph', 'Lopes'
union
select 'Joe', 'Satriani'
union
select 'CZ', 'Lopes'
union
select 'Blah', 'Lopes'
union
select 'Antonio', 'Lopes'
union
select 'Carlos', 'Lopes'
)
select first_name, last_name
from temp
where difference(first_name+' '+last_name, 'Joe Lopes') >= 3
order by difference(first_name+' '+last_name, 'Joe Lopes')
The above returns "Joe Satriani" as the only match. Even reducing the similarity threshold to 2 doesn't return "Joseph Lopes" as a potential match.
But difflib does a much better job:
difflib.get_close_matches('Joe Lopes', ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes', 'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes'])
['Joseph Lopes', 'CZ Lopes', 'Carlos Lopes']
Edit after gruszczy's response:
Before writing my own, I looked for and found a T-SQL implementation of Levenshtein Distance in the repository of all knowledge.
In testing it, it still won't do a better matching job than difflib.
Which led me to research what algorithm is behind difflib. It seems to be a modified version of the Ratcliff-Obershelp algorithm.
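For reference, difflib's underlying matcher can be called directly to see the score it assigns; a quick sketch:

import difflib
# ratio() is ~0.86 here, well above get_close_matches's default cutoff of 0.6
difflib.SequenceMatcher(None, 'Joe Lopes', 'Joseph Lopes').ratio()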
Unhappily I can't seem to find some other kind soul who has already created a T-SQL implementation based on difflib's... I'll try my hand at it when I can.
If nobody else comes up with a better answer in the next few days, I'll grant it to gruszczy. Thanks, kind sir.
soundex won't help you, because it's a phonetic algorithm. Joe and Joseph aren't similar phonetically, so soundex won't mark them as similar.
You can try Levenshtein distance, which is implemented in PostgreSQL. Maybe it's in your database too, and if not, you should be able to write a stored procedure which calculates the distance between two strings, and use it in your computation.
It's possible with trigram_similar lookups since Django 1.10, see docs for PostgreSQL specific lookups and Full text search
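A minimal sketch, assuming PostgreSQL with the pg_trgm extension activated and 'django.contrib.postgres' in INSTALLED_APPS (the Client model and first_name field are stand-ins based on the question):

# requires: CREATE EXTENSION pg_trgm; on the database
Client.objects.filter(first_name__trigram_similar='Joe')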
As per the answer of andilabs, you can use the Levenshtein function to create your own custom function. The Postgres docs indicate that the levenshtein function is as follows:
levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
levenshtein(text source, text target) returns int
andilabs' answer uses only the second function. If you want a more advanced search with insertion/deletion/substitution costs, you can rewrite the function like this:
from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s', %(ins_cost)d, %(del_cost)d, %(sub_cost)d)"
    function = 'levenshtein'

    def __init__(self, expression, search_term, ins_cost=1, del_cost=1, sub_cost=1, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            ins_cost=ins_cost,
            del_cost=del_cost,
            sub_cost=sub_cost,
            **extras
        )
And call the function:
from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka', 3, 3, 1)  # ins = 3, del = 3, sub = 1
).filter(
    lev_dist__lte=2
)
If you need to get there with Django and PostgreSQL, and don't want to use the trigram similarity introduced in 1.10 (https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/lookups/#trigram-similarity), you can implement it using Levenshtein like this:
Extension needed: fuzzystrmatch
You need to add the PostgreSQL extension to your database in psql:
CREATE EXTENSION fuzzystrmatch;
Let's define a custom function with which we can annotate the queryset. It takes just one argument, the search_term, and uses the PostgreSQL levenshtein function (see docs):
from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s')"
    function = "levenshtein"

    def __init__(self, expression, search_term, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            **extras
        )
Then, in any other place in the project, we just import the defined Levenshtein, and F to pass the Django field:
from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka')
).filter(
    lev_dist__lte=2
)
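One caveat worth noting: the template above splices search_term directly into the SQL string, so it should not be used with untrusted input. A sketch of a variant that lets Django parameterize the value instead, passing the term as a Value expression:

from django.db.models import F, Func, Value

Spot.objects.annotate(
    # levenshtein("name", 'Kfaka') with the term passed as a query parameter
    lev_dist=Func(F('name'), Value('Kfaka'), function='levenshtein')
).filter(
    lev_dist__lte=2
)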

What is the simplest way to join __contains and __in?

I am writing a tag search function. A user can observe a lot of tags; I get them all in one tuple, and now I would like to find all texts which include at least one tag from the list.
Symbolic: text__contains__in=('asd','dsa')
My only idea is to do a loop, e.g.:
q = text.objects.all()
for t in tag_tuple:
    q = q.filter(data__contains=t)
For example:
Input: a tuple of tags, e.g. ('car', 'cat', 'cinema')
Output: all messages that contain at least one word from that tuple, so "My cat is in the car", "cat is not allowed in the cinema", "i will drive my car to the cinema"
Thanks for help!
Here you go:
from django.db.models import Q

filter = Q()
for t in tag_tuple:
    filter = filter | Q(data__contains=t)
return text.objects.filter(filter)
A couple of tips:
You should be naming your model classes with a capital letter (i.e. Text, not text)
You may want __icontains instead if you're not worried about case
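As a side note, the same OR-combination can be written without the explicit loop; a minimal sketch using functools.reduce (Text being the capitalized model name suggested above):

from functools import reduce
from operator import or_
from django.db.models import Q

# OR together one __contains lookup per tag, starting from an empty Q
query = reduce(or_, (Q(data__contains=t) for t in tag_tuple), Q())
results = Text.objects.filter(query)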
I don't know Django, so I have no idea how to apply this filter, but it seems you want a function like this one:
def contains_one_of(tags, text):
    text = text.split()  # tags should match complete words, not partial words
    return any(t in text for t in tags)
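For example, applying it in plain Python (texts here is assumed to be an iterable of the message strings):

tags = ('car', 'cat', 'cinema')
matches = [msg for msg in texts if contains_one_of(tags, msg)]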