Django Sort By Calculated Field - django

Using the distance logic from this SO post, I'm getting back a properly-filtered set of objects with this code:
class LocationManager(models.Manager):
    def nearby_locations(self, latitude, longitude, radius, max_results=100, use_miles=True):
        if use_miles:
            distance_unit = 3959
        else:
            distance_unit = 6371
        from django.db import connection, transaction
        cursor = connection.cursor()
        sql = """SELECT id, (%f * acos( cos( radians(%f) ) * cos( radians( latitude ) ) *
        cos( radians( longitude ) - radians(%f) ) + sin( radians(%f) ) * sin( radians( latitude ) ) ) )
        AS distance FROM locations_location HAVING distance < %d
        ORDER BY distance LIMIT 0 , %d;""" % (distance_unit, latitude, longitude, latitude, int(radius), max_results)
        cursor.execute(sql)
        ids = [row[0] for row in cursor.fetchall()]
        return self.filter(id__in=ids)
The problem is I can't figure out how to keep the list/queryset sorted by the distance value. I don't want to do this as an extra() method call for performance reasons (one query versus one query on each potential location in my database). A couple of questions:
1) How can I sort my list by distance? Even taking off the native sort I've defined in my model and using "order_by()", it's still sorting by something else (id, I believe).
2) Am I wrong about the performance thing and Django will optimize the query, so I should use extra() instead?
3) Is this the totally wrong way to do this and I should use the geo library instead of hand-rolling this like a putz?

To take your questions in reverse order:
Re 3) Yes, you should definitely take advantage of PostGIS and GeoDjango if you're working with geospatial data. It's just silly not to.
Re 2) I don't think you could quite get Django to do this query for you using .extra() (barring acceptance of this ticket), but it is an excellent candidate for the new .raw() method in Django 1.2 (see below).
Re 1) You are getting a list of ids from your first query, and then using an "in" query to get a QuerySet of the objects corresponding to those ids. Your second query has no access to the calculated distance from the first query; it's just fetching a list of ids (and it doesn't care what order you provide those ids in, either).
Possible solutions (short of ditching all of this and using GeoDjango):
Upgrade to Django 1.2 beta and use the new .raw() method. This allows Django to intelligently interpret the results of a raw SQL query and turn it into a QuerySet of actual model objects. Which would reduce your current two queries into one, and preserve the ordering you specify in SQL. This is the best option if you are able to make the upgrade.
Don't bother constructing a Django queryset or Django model objects at all, just add all the fields you need into the raw SQL SELECT and then use those rows direct from the cursor. May not be an option if you need model methods etc later on.
Perform a third step in Python code, where you iterate over the queryset and construct a Python list of model objects in the same order as the ids list you got back from the first query. Return that list instead of a QuerySet. Won't work if you need to do further filtering down the line.
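The third option can be sketched in plain Python. This is only a sketch: order_by_ids is a hypothetical helper name, and the Location namedtuple stands in for real model instances so the example runs without Django. It assumes each object exposes an id attribute and that ids is the list returned by the raw SQL query, already sorted by distance.

```python
from collections import namedtuple


def order_by_ids(objects, ids):
    """Return objects arranged in the same order as ids.

    objects: any iterable of items exposing an ``id`` attribute
             (for example, the result of filter(id__in=ids)).
    ids:     the id list in the desired order (for example, sorted by distance).
    """
    by_id = {obj.id: obj for obj in objects}
    # Skip ids that have no matching object (e.g. a row deleted between queries).
    return [by_id[i] for i in ids if i in by_id]


# Stand-in for model instances, used only to show the behaviour.
Location = namedtuple("Location", ["id", "name"])

unordered = [Location(1, "a"), Location(2, "b"), Location(3, "c")]
print(order_by_ids(unordered, [3, 1, 2]))
```

The result is a list, not a QuerySet, so any further filtering has to happen before this step.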

Related

Improve Django queryset performance when using annotate Exists

I have a queryset that returns a lot of data. It can be filtered by year, which returns around 100k rows, or show everything, which returns around 1 million rows.
The objective of this annotate is to generate an xlsx spreadsheet.
Model representation (RelatedModel is the many-to-many table between Model and AnotherModel):
Model:
id
field1
field2
field3
RelatedModel:
foreign_key_model (Model)
foreign_key_another (AnotherModel)
This is the queryset. It annotates each row with whether the relation exists. The annotation is very slow and can take several minutes.
Model.objects.all().annotate(
    related_exists=Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
    related_column=Case(
        When(related_exists=True, then=Value('The relation exists!')),
        When(related_exists=False, then=Value("The relation doesn't exist!")),
        default=Value('This is the default value!'),
        output_field=CharField(),
    )
).values_list(
    'related_column',
    'field1',
    'field2',
    'field3'
)
If the only thing needed is to change how True / False is displayed in the xlsx, one option is to keep a single related_exists BooleanField annotation and customize how it is converted when the xlsx document is created, e.g. in a serializer. The database should store raw, unformatted values, and the app should prepare them for display to the user.
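A minimal sketch of that conversion step, assuming the rows come back as tuples shaped like (related_exists, field1, field2, field3), e.g. from values_list(). The names LABELS and format_rows are hypothetical, and the label wording is taken from the question.

```python
# Hypothetical labels; the wording is taken from the question's Case expression.
LABELS = {True: "The relation exists!", False: "The relation doesn't exist!"}


def format_rows(rows):
    """Yield export-ready rows, replacing the leading boolean with its label.

    rows is assumed to be an iterable of tuples shaped like
    (related_exists, field1, field2, field3).
    """
    for related_exists, *fields in rows:
        yield (LABELS[bool(related_exists)], *fields)


sample = [(True, "a", "b", "c"), (False, "d", "e", "f")]
for row in format_rows(sample):
    print(row)
```

Because format_rows is a generator, it does not build a second copy of a million-row result in memory before the spreadsheet writer consumes it.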
Other things to consider:
Indexes to speed-up filtering.
If you have millions of records after filtering, in one table - maybe table partitioning could be considered.
But let's look at the raw SQL of the original query. It will look like this:
SELECT [model_fields],
    EXISTS([CLIENT_SELECT]) AS related_exists,
    CASE
        WHEN EXISTS([CLIENT_SELECT]) = true THEN 'The relation exists!'
        WHEN EXISTS([CLIENT_SELECT]) = false THEN 'The relation does not exist!'
        ELSE 'This is the default value!'
    END AS related_column
FROM model;
Right away we can see that the nested EXISTS query (CLIENT_SELECT) appears 3 times. Even though it is exactly the same each time, it may be executed at least 2 and up to 3 times. The database may optimize this to be faster than 3x, but it is still not as good as running it once.
First, EXISTS returns either True or False, so we can keep just one check that it is True and make 'The relation does not exist!' the default value.
related_column=Case(
    When(related_exists=True, then=Value('The relation exists!')),
    default=Value('The relation does not exist!'),
    output_field=CharField(),
)
Why does related_column perform the same select again instead of taking the value of related_exists?
Because we cannot reference a calculated column while calculating another column. This is a database-level constraint that Django knows about, so it duplicates the expression.
Wait, then we do not actually need the related_exists column. Let's keep just related_column with a CASE statement and 1 EXISTS subquery.
Here comes Django: before 3.0, we cannot use expressions in filters without annotating them first.
So, in our case: in order to use Exists in When, we first need to add it as an annotation, but it will not be used as a reference; the full expression is copied.
Good news!
Since Django 3.0 we can use expressions that output BooleanField directly in QuerySet filters, without having to annotate first. Exists is one such BooleanField expression.
Model.objects.all().annotate(
    related_column=Case(
        When(
            Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
            then=Value('The relation exists!'),
        ),
        default=Value("The relation doesn't exist!"),
        output_field=CharField(),
    )
)
And only one nested select, and one annotated field.
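To see the shape of the resulting SQL outside Django, here is a small sketch using Python's built-in sqlite3 module as a stand-in database. The table and column names (model, relatedmodel, foreign_key_model_id) are assumptions modeled on the question, and the SQL is hand-written to mirror the idea, not copied from Django's output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model (id INTEGER PRIMARY KEY, field1 TEXT);
    CREATE TABLE relatedmodel (
        id INTEGER PRIMARY KEY,
        foreign_key_model_id INTEGER REFERENCES model(id)
    );
    INSERT INTO model (id, field1) VALUES (1, 'a'), (2, 'b');
    INSERT INTO relatedmodel (foreign_key_model_id) VALUES (1);
""")

# One correlated EXISTS subquery, used directly as the CASE condition.
rows = conn.execute("""
    SELECT m.id,
           CASE
               WHEN EXISTS (
                   SELECT 1 FROM relatedmodel r
                   WHERE r.foreign_key_model_id = m.id
               ) THEN 'The relation exists!'
               ELSE 'The relation does not exist!'
           END AS related_column
    FROM model m
    ORDER BY m.id
""").fetchall()
print(rows)
```

The subquery appears once in the statement, which is the point of dropping the separate related_exists annotation.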
Django 2.1, 2.2
Here's the commit that finalized support for boolean expressions, although many preconditions for it were added earlier. One of them is the conditional attribute on the expression object and a check for that attribute.
So, although it is neither recommended nor tested, the following small hack seems to work for Django 2.1 and 2.2 (earlier versions had no conditional check and would require more intrusive changes):
create Exists expression instance
monkey patch it with conditional = True
use it as condition in When statement
related_model_exists = Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id')))
setattr(related_model_exists, 'conditional', True)
Model.objects.all().annotate(
    related_column=Case(
        When(
            related_model_exists,
            then=Value('The relation exists!'),
        ),
        default=Value("The relation doesn't exist!"),
        output_field=CharField(),
    )
)
Related checks
A relatedmodel_set__isnull=True check is not suitable for several reasons:
it performs a LEFT OUTER JOIN, which is less efficient than EXISTS
because it joins tables, it is ONLY suitable in a filter() condition (not in annotate with When), and only for OneToOne or OneToMany relations (where the One is on the relatedmodel side)
You can considerably simplify your query to:
from django.db.models import Case, CharField, Value, When
Model.objects.all().annotate(
    related_column=Case(
        When(relatedmodel_set__isnull=True, then=Value("The relation doesn't exist!")),
        default=Value("The relation exists!"),
        output_field=CharField()
    )
)
Where relatedmodel_set is the related_name on your foreign key.

Django foreign keys in extra() expression

I'm trying to use the Django extra() method to filter all the objects in a certain radius, just like in this answer: http://stackoverflow.com/questions/19703975/django-sort-by-distance/26219292 but I'm having some problems with the 'gcd' expression as I have to reach the latitude and longitude through two foreign key relationships, instead of using direct model fields.
In particular, I have one Experience class:
class Experience(models.Model):
    starting_place_geolocation = models.ForeignKey(GooglePlaceMixin, on_delete=models.CASCADE,
                                                   related_name='experience_starting')
    visiting_place_geolocation = models.ForeignKey(GooglePlaceMixin, on_delete=models.CASCADE,
                                                   related_name='experience_visiting')
with two foreign keys to the same GooglePlaceMixin class:
class GooglePlaceMixin(models.Model):
    latitude = models.DecimalField(max_digits=20, decimal_places=15)
    longitude = models.DecimalField(max_digits=20, decimal_places=15)
    ...
Here is my code to filter the Experience objects by starting place location:
def search_by_proximity(self, experiences, latitude, longitude, proximity):
    gcd = """
        6371 * acos(
            cos(radians(%s)) * cos(radians(starting_place_geolocation__latitude))
            * cos(radians(starting_place_geolocation__longitude) - radians(%s)) +
            sin(radians(%s)) * sin(radians(starting_place_geolocation__latitude))
        )
    """
    gcd_lt = "{} < %s".format(gcd)
    return experiences \
        .extra(select={'distance': gcd},
               select_params=[latitude, longitude, latitude],
               where=[gcd_lt],
               params=[latitude, longitude, latitude, proximity],
               order_by=['distance'])
but when I try to reference the foreign key field "starting_place_geolocation__latitude" it returns this error:
column "starting_place_geolocation__latitude" does not exist
What should I do to reach the foreign key value? Thank you in advance
When you use extra() (which should be avoided, as stated in the documentation), you are actually writing raw SQL. As you probably know, to get a value through a ForeignKey you have to perform a JOIN. The Django ORM translates those fancy double underscores into the correct JOIN clause, but raw SQL cannot. And you also cannot add a JOIN manually. The correct way here is to stick with the ORM and define some custom database functions for sin, cos, radians and so on. That's pretty easy.
from django.db.models import Func

class Sin(Func):
    function = 'SIN'
Then use it like this:
qs = experiences.annotate(distance=Cos(Radians(F('starting_place_geolocation__latitude') )) * ( some other expressions))
Note that the fancy double underscores come back again and work as expected.
You get the idea.
Here is my full collection, if you like copy-pasting from SO :)
https://gist.github.com/tatarinov1997/3af95331ef94c6d93227ce49af2211eb
P.S. You may also hit the "set output_field" error. In that case, wrap the whole distance expression in ExpressionWrapper and give it an output_field=models.DecimalField() argument.

Django Postgres ArrayField aggregation and filtering

Following on from this question: Django Postgresql ArrayField aggregation
I have an ArrayField of Categories and I would like to retrieve all unique values it has - however the results should be filtered so that only values starting with the supplied string are returned.
What's the "most Django" way of doing this?
Given an Animal model that looks like this:
class Animal(models.Model):
    # ...
    categories = ArrayField(
        models.CharField(max_length=255, blank=True),
        default=list,
    )
    # ...
Then, as per the other question's answer, this works for finding all categories, unfiltered.
all_categories = (
    Animal.objects
    .annotate(categories_element=Func(F('categories'), function='unnest'))
    .values_list('categories_element', flat=True)
    .distinct()
)
However, when I now attempt to filter the result it fails, not just with __startswith but with all types of filter:
all_categories.filter(categories_element__startswith='ga')
all_categories.filter(categories_element='dog')
Bottom of stacktrace is:
DataError: malformed array literal: "dog"
...
DETAIL: Array value must start with "{" or dimension information.
... and it appears that it's because Django tries to do a second UNNEST - this is the SQL it generates:
...) WHERE unnest("animal"."categories") = dog::text[]
If I write the query in PSQL then it appears to require a subquery as a result of the UNNEST:
SELECT categories_element
FROM (
    SELECT UNNEST(animal.categories) AS categories_element
    FROM animal
) ul
WHERE ul.categories_element LIKE 'Ga%';
Is there a way to get Django ORM to make a working query? Or should I just give up on the ORM and use raw SQL?
You probably have the wrong database design.
Tip: Arrays are not sets; searching for specific array elements can be
a sign of database misdesign. Consider using a separate table with a
row for each item that would be an array element. This will be easier
to search, and is likely to scale better for a large number of
elements.
http://www.postgresql.org/docs/9.1/static/arrays.html
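As a rough illustration of the separate-table design from that tip, here is a sketch using Python's built-in sqlite3 module. The animal_category table, its columns, the sample data and the categories_starting_with helper are all hypothetical; in Django this would correspond to a related model with one row per category instead of an ArrayField.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE animal (id INTEGER PRIMARY KEY, name TEXT);
    -- One row per (animal, category) pair instead of an array column.
    CREATE TABLE animal_category (
        animal_id INTEGER REFERENCES animal(id),
        category TEXT NOT NULL
    );
    INSERT INTO animal (id, name) VALUES (1, 'Rex'), (2, 'Tom');
    INSERT INTO animal_category (animal_id, category) VALUES
        (1, 'dog'), (1, 'gazehound'), (1, 'garden pet'),
        (2, 'cat'), (2, 'garden pet');
""")


def categories_starting_with(prefix):
    """Return the distinct categories that start with prefix, sorted.

    Note: % and _ inside prefix are treated as LIKE wildcards in this sketch.
    """
    rows = conn.execute(
        "SELECT DISTINCT category FROM animal_category "
        "WHERE category LIKE ? ORDER BY category",
        (prefix + "%",),
    )
    return [category for (category,) in rows]


print(categories_starting_with("ga"))
```

With one row per element, the prefix filter is an ordinary WHERE clause and no UNNEST or subquery is needed.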

How to add distance from point as an annotation in GeoDjango

I have a Geographic Model with a single PointField, I'm looking to add an annotation for the distance of each model from a given point, which I can later filter on and do additional jiggery pokery.
There's the obvious queryset.distance(to_point) function, but this doesn't actually annotate the queryset, it just adds a distance attribute to each model in the queryset, meaning I can't then apply .filter(distance__lte=some_distance) to it later on.
I'm also aware of filtering by the field and distance itself like so:
queryset.filter(point__distance_lte=(to_point, D(mi=radius)))
but since I will want to do multiple filters (to get counts of models within different distance ranges), I don't really want to make the DB calculate the distance from the given point every time, since that could be expensive.
Any ideas? Specifically, is there a way to add this as a regular annotation rather than an inserted attribute of each model?
I couldn't find any baked-in way of doing this, so in the end I just created my own Aggregate class.
This only works with PostGIS, but making one for another geo DB shouldn't be too tricky.
from django.db.models import Aggregate, FloatField
from django.db.models.sql.aggregates import Aggregate as SQLAggregate

class Dist(Aggregate):
    def add_to_query(self, query, alias, col, source, is_summary):
        source = FloatField()
        aggregate = SQLDist(
            col, source=source, is_summary=is_summary, **self.extra)
        query.aggregates[alias] = aggregate

class SQLDist(SQLAggregate):
    sql_function = 'ST_Distance_Sphere'
    sql_template = "%(function)s(ST_GeomFromText('%(point)s'), %(field)s)"
This can be used as follows:
queryset.annotate(distance=Dist('longlat', point="POINT(1.022 -42.029)"))
If anyone knows a better way of doing this, please let me know (or tell me why mine is stupid).
One of the modern approaches is to set the output_field argument to avoid «Improper geometry input type: ». Without output_field, Django tries to convert the float result of ST_Distance_Sphere to a geometry field and fails.
queryset = self.objects.annotate(
    distance=Func(
        Func(
            F('addresses__location'),
            Func(
                Value('POINT(1.022 -42.029)'),
                function='ST_GeomFromText'
            ),
            function='ST_Distance_Sphere',
            output_field=models.FloatField()
        ),
        function='round'
    )
)
Doing it like this works for me, i.e. I can apply a filter on an annotation.
Broken up for readability.
from models import Address
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D
from django.contrib.gis.db.models.functions import Distance
intMiles = 200
destPoint = Point(5, 23)
queryset0 = Address.objects.all().order_by('-postcode')
queryset1 = queryset0.annotate(distance=Distance('myPointField', destPoint))
queryset2 = queryset1.filter(distance__lte=D(mi=intMiles))
Hope it helps somebody :)
You can use GeoQuerySet.distance:
cities = City.objects.distance(reference_pnt)
for city in cities:
    print(city.distance)
Link: GeoDjango distance documentation
Edit: Adding distance attribute along with distance filter queries
usr_pnt = fromstr('POINT(-92.69 19.20)', srid=4326)
City.objects.filter(point__distance_lte=(usr_pnt, D(km=700))).distance(usr_pnt).order_by('distance')
Supported distance lookups
distance_lt
distance_lte
distance_gt
distance_gte
dwithin
A way to annotate & sort w/out GeoDjango. This model contains a foreignkey to a Coordinates record which contains lat and lng properties.
import math
from django.db import models
from django.db.models import F, Func, Value

def get_nearby_coords(lat, lng, max_distance=10):
    """
    Return objects sorted by distance to specified coordinates
    which distance is less than max_distance given in kilometers
    """
    # Great circle distance formula
    R = 6371
    qs = Precinct.objects.all().annotate(
        distance=Value(R)*Func(
            Func(
                # Degrees to radians: multiply by pi/180 (not by sin(pi/180))
                F("coordinates__lat")*Value(math.pi/180),
                function="sin",
                output_field=models.FloatField()
            ) * Value(
                math.sin(lat*math.pi/180)
            ) + Func(
                F("coordinates__lat")* Value(math.pi/180),
                function="cos",
                output_field=models.FloatField()
            ) * Value(
                math.cos(lat*math.pi/180)
            ) * Func(
                Value(lng*math.pi/180) - F("coordinates__lng") * Value(math.pi/180),
                function="cos",
                output_field=models.FloatField()
            ),
            function="acos"
        )
    ).order_by("distance")
    if max_distance is not None:
        qs = qs.filter(distance__lt=max_distance)
    return qs
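For checking the numbers such an annotation returns, the same great-circle formula (the spherical law of cosines) can be written in plain Python. This is a sketch only: great_circle_km is a hypothetical helper name, and the Earth radius of 6371 km is the value used in the function above.

```python
import math


def great_circle_km(lat1, lng1, lat2, lng2, radius_km=6371):
    """Great-circle distance via the spherical law of cosines, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lng = math.radians(lng2 - lng1)
    cos_angle = (
        math.sin(phi1) * math.sin(phi2)
        + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lng)
    )
    # Rounding error can push the value slightly outside acos's domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius_km * math.acos(cos_angle)


# One degree of longitude along the equator is 2*pi*R/360 kilometers.
print(great_circle_km(0, 0, 0, 1))
```

The clamp before acos matters for points that are nearly identical, where floating-point error can produce a value just above 1.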

fast lookup for the last element in a Django QuerySet?

I've a model called Valor. Valor has a Robot. I'm querying like this:
Valor.objects.filter(robot=r).reverse()[0]
to get the last Valor of the robot r. Valor.objects.filter(robot=r).count() is about 200000, and getting the last item takes about 4 seconds on my PC.
How can I speed it up? Am I querying the wrong way?
The optimal mysql syntax for this problem would be something along the lines of:
SELECT * FROM table WHERE x=y ORDER BY z DESC LIMIT 1
The django equivalent of this would be:
Valor.objects.filter(robot=r).order_by('-id')[:1][0]
Notice how this solution utilizes django's slicing method to limit the queryset before compiling the list of objects.
If none of the earlier suggestions are working, I'd suggest taking Django out of the equation and run this raw sql against your database. I'm guessing at your table names, so you may have to adjust accordingly:
SELECT * FROM valor v WHERE v.robot_id = [robot_id] ORDER BY id DESC LIMIT 1;
Is that slow? If so, make your RDBMS (MySQL?) explain the query plan to you. This will tell you if it's doing any full table scans, which you obviously don't want with a table that large. You might also edit your question and include the schema for the valor table for us to see.
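Here is a small sketch of that workflow using Python's built-in sqlite3 module as a stand-in for MySQL. The valor table layout, the index name and the sample data are guesses, and SQLite spells the command EXPLAIN QUERY PLAN where MySQL uses EXPLAIN, so treat it as an illustration of the idea only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE valor (id INTEGER PRIMARY KEY, robot_id INTEGER NOT NULL);
    CREATE INDEX valor_robot_idx ON valor (robot_id);
""")
# 1000 rows spread over 5 robots; ids are assigned 1..1000 in insert order.
conn.executemany(
    "INSERT INTO valor (robot_id) VALUES (?)",
    [(n % 5,) for n in range(1000)],
)

query = "SELECT * FROM valor WHERE robot_id = ? ORDER BY id DESC LIMIT 1"

# Ask the database how it intends to run the query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (3,)).fetchall()
for step in plan:
    print(step[-1])  # human-readable description of each plan step

last_row = conn.execute(query, (3,)).fetchone()
print(last_row)
```

If the plan reports a scan of the whole table instead of an index search, an index on the filtered column is the first thing to check.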
Also, you can see the SQL that Django is generating by doing this (using the query set provided by Peter Rowell). Slice with [:1] rather than [0] so that you still have a queryset with a query attribute:
qs = Valor.objects.filter(robot=r).order_by('-id')[:1]
print(qs.query)
Make sure that SQL is similar to the 'raw' query I posted above. You can also make your RDBMS explain that query plan to you.
It sounds like your data set is going to be big enough that you may want to denormalize things a little bit. Have you tried keeping track of the last Valor object in the Robot object?
class Robot(models.Model):
    # ...
    last_valor = models.ForeignKey('Valor', null=True, blank=True)
And then use a post_save signal to make the update.
from django.db.models.signals import post_save

def record_last_valor(sender, **kwargs):
    if kwargs.get('created', False):
        instance = kwargs.get('instance')
        instance.robot.last_valor = instance
        # Persist the change; assigning the attribute alone does not hit the database
        instance.robot.save()

post_save.connect(record_last_valor, sender=Valor)
You will pay the cost of an extra db transaction when you create the Valor objects but the last_valor lookup will be blazing fast. Play with it and see if the tradeoff is worth it for your app.
Well, there's no order_by clause so I'm wondering about what you mean by 'last'. Assuming you meant 'last added',
Valor.objects.filter(robot=r).order_by('-id')[0]
might do the job for you.
django 1.6 introduces .first() and .last():
https://docs.djangoproject.com/en/1.6/ref/models/querysets/#last
So you could simply do:
Valor.objects.filter(robot=r).last()
Quite fast should also be:
qs = Valor.objects.filter(robot=r) # <-- it doesn't hit the database
count = qs.count() # <-- first hit the database, compute a count
last_item = qs[ count-1 ] # <-- second hit the database, get specified rownum
So, in practice you execute only 2 SQL queries ;)
Model_Name.objects.first()  # get the first element
Model_Name.objects.last()   # get the last element
In my case, last() did not help because there is only one row in the database.
Maybe this is helpful for you too :)
Is there a limit clause in django? This way you can have the db, simply return a single record.
mysql
select * from table where x = y limit 1
sql server
select top 1 * from table where x = y
oracle
select * from table where x = y and rownum = 1
I realize this isn't translated into django, but someone can come back and clean this up.
The correct way of doing this is to use the built-in QuerySet method latest() and feed it whichever column (field name) it should sort by. The drawback is that it can only sort by a single db column.
The current implementation looks like this and is optimized in the same sense as @Aaron's suggestion.
def latest(self, field_name=None):
    """
    Returns the latest object, according to the model's 'get_latest_by'
    option or optional given field_name.
    """
    latest_by = field_name or self.model._meta.get_latest_by
    assert bool(latest_by), "latest() requires either a field_name parameter or 'get_latest_by' in the model"
    assert self.query.can_filter(), \
        "Cannot change a query once a slice has been taken."
    obj = self._clone()
    obj.query.set_limits(high=1)
    obj.query.clear_ordering()
    obj.query.add_ordering('-%s' % latest_by)
    return obj.get()