How could I translate a spatial JOIN into Django query language?

Context
I have two tables, app_area and app_point, that are not related in any way (no foreign keys), except each has a geometry field (of polygon and point type, respectively) so we can query them spatially. The bare models look like:
from django.contrib.gis.db import models

class Point(models.Model):
    # ...
    geom = models.PointField(srid=4326)

class Area(models.Model):
    # ...
    geom = models.PolygonField(srid=4326)
I would like to create a query which filters out points that are not contained in any polygon.
If I had to perform this task with a PostGIS/SQL statement, I would issue this kind of query:
SELECT
    P.*
FROM
    app_area AS A
    JOIN app_point AS P ON ST_Contains(A.geom, P.geom);
This is simple and efficient when spatial indices are defined.
My concern is to write this query without hard-coded SQL in my Django application. Therefore, I would like to delegate it to the ORM using the classical Django query syntax.
Issue
I could not find a clear example of this kind of query on the internet. The solutions I have found:
Either rely on a predefined relation using ForeignKey or prefetch_related (but this relation does not exist in my case);
Or use a single hand-crafted geometry to represent the polygon (but this is not my use case, as I want to rely on another table as the polygon source).
I have the feeling this is definitely achievable with Django, but maybe I am too new to the framework, or it is not documented enough, or I have not found the right set of keywords to google it.
The best I could find in the official documentation is the FilteredRelation object, which seems to do what I want: defining the ON part of the JOIN clause. But I could not set it up properly; mainly, I don't understand how to reference the other table and point to the proper fields.
from django.db.models import F, Q, FilteredRelation

query = Location.objects.annotate(
    campus=FilteredRelation(<relation_name>, condition=Q(geom__contains=F("geom")))
)
Mainly, the relation_name parameter puzzles me. I would expect it to be the table I would join on (here Area), but it seems a field name is expected:
django.core.exceptions.FieldError: Cannot resolve keyword 'Area' into field. Choices are: created, geom, id, ...
But this list of fields comes from the Point table.
My question is: How could I translate my spatial JOIN into Django query language?
Note: there is no requirement to rely on the FilteredRelation object; it is just the best match I have found for now!
Update
I am able to emulate the expected output using extra:
results = models.Point.objects.extra(
    where=["ST_intersects(app_area.geom, app_point.geom)"],
    tables=["app_area"],
)
This returns a QuerySet, but it still requires injecting a plain SQL statement, and the generated SQL is not equivalent in terms of clauses:
SELECT "app_point"."id", "app_point"."geom"::bytea
FROM "app_point", "app_area"
WHERE (ST_intersects(app_area.geom, app_point.geom))
And, according to EXPLAIN, the performance differs.

I think the best solution would be to aggregate the areas and then do an intersection with the points.
from django.db.models import Q
from django.contrib.gis.db.models.aggregates import Union

multipolygon_area = Area.objects.aggregate(area=Union("geom"))["area"]

# Get all points inside areas
Point.objects.filter(geom__intersects=multipolygon_area)

# Get all points outside areas
Point.objects.filter(~Q(geom__intersects=multipolygon_area))
This is quite efficient, as it is calculated completely at the database level.
The idea was found here
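As an aside, another ORM-only translation of the original JOIN, which avoids aggregating all areas into one multipolygon, is a correlated EXISTS subquery. This is a sketch assuming Django 3.0+ (where Exists can be passed directly to filter()); it keeps the semi-join semantics of the SQL above:

from django.db.models import Exists, OuterRef

# Points that fall inside at least one Area (semi-join via EXISTS)
points_in_areas = Point.objects.filter(
    Exists(Area.objects.filter(geom__contains=OuterRef("geom")))
)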

Related

How can I annotate another annotate group by query in Django?

I have two queries:
Proyecto.objects.filter().order_by('tipo_proyecto')
Proyecto.objects.values('tipo_proyecto').annotate(total=Sum('techo_presupuestario'))
How can I make this in only one query? I want the first query to contain annotated data representing the sum of techo_presupuestario for each tipo_proyecto. Is this possible?
If I understand you correctly, you'd like to add a conditionally aggregated sum over one field to each object, so you get each object together with the sum fitting its tipo_proyecto. Right?
I don't know if this makes sense, but it can be done anyway using Subquery:
from django.db.models import Sum, Subquery, OuterRef

sq = Subquery(
    Proyecto.objects.filter(tipo_proyecto=OuterRef("tipo_proyecto"))
    .values("tipo_proyecto")
    .annotate(techoSum=Sum("techo_presupuestario"))
    .values("techoSum")
)

Proyecto.objects.all().annotate(tipoTechoSum=sq).order_by("tipo_proyecto")
Nonetheless, I wouldn't recommend this, as it puts some heavy load on your database. (In MySQL there will be a nested SELECT statement referring to the same table, which might be pretty unpleasant depending on the table's size.)
I'd say the better approach is to "collect" your aggregated sums separately and add the values to the model objects in your code.
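For illustration, a minimal sketch of that two-step approach: one grouped query, then attaching the values in Python (the tipoTechoSum attribute name is just an example, not a model field):

from django.db.models import Sum

# One grouped query: {tipo_proyecto: sum of techo_presupuestario}
sums = dict(
    Proyecto.objects.values_list("tipo_proyecto")
    .annotate(techoSum=Sum("techo_presupuestario"))
)

proyectos = list(Proyecto.objects.order_by("tipo_proyecto"))
for p in proyectos:
    p.tipoTechoSum = sums[p.tipo_proyecto]  # plain attribute, not a DB field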

Django bulk update with data two tables over

I want to bulk update a table with data from two tables over. A solution has been given for the simpler case mentioned in the documentation:
Entry.objects.update(headline=F('blog__name'))
For that solution, see
https://stackoverflow.com/a/50561753/1092940
Expanding from the example, imagine that Entry has a ForeignKey reference to Blog via a field named blog, and that Blog has a ForeignKey reference to User via a field named author. I want the equivalent of:
Entry.objects.update(author_name=F('blog__author__username'))
As in the prior solution, the answer is expected to employ Subquery and OuterRef.
The reason I ask here is because I lack confidence where this problem starts to employ multiple OuterRefs, and confusion arises about which outer ref it refers to.
It does not require multiple outer references; you can update with:
from django.db.models import OuterRef, Subquery

author_name = Author.objects.filter(
    blogs__id=OuterRef('blog_id')
).values_list(
    'username'
)[:1]

Entry.objects.update(
    author_name=Subquery(author_name)
)
You thus specify that you look for the Author whose related Blog has an id equal to the blog_id of the Entry.
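Equivalently, one could start the subquery from Blog and traverse to the author in one step. A sketch, assuming the blog and author field names from the question:

from django.db.models import OuterRef, Subquery

# Entry -> Blog (matched on blog_id) -> the author's username
author_name = Blog.objects.filter(
    id=OuterRef('blog_id')
).values('author__username')[:1]

Entry.objects.update(author_name=Subquery(author_name))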

Ordering Django querysets using a JSONField's properties

I have a model that kinda looks like this:
from django.contrib.postgres.fields import JSONField
from django.db import models

class Person(models.Model):
    data = JSONField()
The data field has two properties, name and age. Now, let's say I want to get a paginated queryset (each page containing 20 people), with a filter where age is greater than 25, and the queryset is to be ordered in descending order. In a usual setup, that is, a normalized database, I can write this query like so:
person_list_page_1 = Person.objects.filter(age__gt=25).order_by('-age')[:20]
Now, what is the equivalent of the above when filtering and ordering using keys stored in the JSONField? I have researched this, and it seems it was meant to be a feature in 2.1, but I can't seem to find anything relevant.
Link to the ticket about it being implemented in the future
I also have another question. Let's say we filter and order using the JSONField. Will the ORM have to fetch all the objects, then filter and order them, before sending the first 20? That is, will performance be legitimately slower?
Obviously, I know a normalized database is far better for these things, but my hands are kinda tied.
You can use PostgreSQL's SQL syntax to extract subfields. They can then be used just like any other field on the model in queryset filters.
from django.db.models.expressions import RawSQL

Person.objects.annotate(
    age=RawSQL("(data->>'age')::int", [])
).filter(age__gte=25).order_by('-age')[:20]
See the PostgreSQL docs for other operators and functions: https://www.postgresql.org/docs/current/static/functions-json.html
In some cases, you might have to add explicit typecasts (::int, for example).
Performance will be slower than with a proper field, but it's not bad.
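As an aside, the same query can be expressed without raw SQL using key transforms and a cast. A sketch (note the KeyTextTransform import path varies by Django version; in older releases it lives in django.contrib.postgres.fields.jsonb):

from django.db.models import IntegerField
from django.db.models.fields.json import KeyTextTransform
from django.db.models.functions import Cast

# Extract data->>'age' as text, cast it to int, then filter/order on it
Person.objects.annotate(
    age=Cast(KeyTextTransform('age', 'data'), IntegerField())
).filter(age__gt=25).order_by('-age')[:20]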

Why is the database used by Django subqueries not sticky?

I have a concern with Django subqueries using the Django ORM. When we fetch a queryset or perform a DB operation, I have the option of bypassing any assumptions Django might make about which database to use by forcing usage of the specific database that I want:
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
The above disregards any database routers I might have set and goes straight to 'some_db'.
But say my models look approximately like so:
class Author(models.Model):
    author_name = models.CharField(max_length=255)
    author_address = models.CharField(max_length=255)

class Book(models.Model):
    book_name = models.CharField(max_length=255)
    author = models.ForeignKey(Author, null=True, on_delete=models.CASCADE)
And I fetch a QuerySet representing all books that are called Mark like so:
b_det = Book.objects.using('some_db').filter(book_name='Mark')
Then later, if somewhere in the code I trigger a subquery by doing something like:
if b_det:
    auth_address = b_det[0].author.author_address
Then this does not make use of the original database 'some_db' that I had specified early on for the main query. This again goes through the routers and picks up (possibly) the incorrect database.
Why does Django do this? IMHO, if I have forced the usage of a specific database for the original query, then the same database should be used for the subquery as well. Why must the database routers come into the picture at all?
This is not a subquery in the strict SQL sense of the word. What you are actually doing here is executing one query and using its result to find related items.
You can chain filters and do lots of other operations on a queryset, but it will not be executed until you iterate over it, take a slice of it, or call something like .values() on it. Here you are actually taking a slice:
auth_address = b_det[0].author.author_address
So you have a materialized query, and you are now trying to find the address of the related author. That requires another query, but you are not specifying the database with using(), so Django is free to choose which database to use. You can overcome this by using select_related.
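For illustration, a sketch of the select_related variant: the author row is fetched by the same JOINed query, so it also comes from 'some_db' and the routers never get a say:

b_det = Book.objects.using('some_db').select_related('author').filter(book_name='Mark')
if b_det:
    # No second query here: the related author was loaded by the first one
    auth_address = b_det[0].author.author_address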

Django ORM: Optimizing queries involving many-to-many relations

I have the following model structure:
class Container(models.Model):
    pass

class Generic(models.Model):
    name = models.CharField(max_length=255, unique=True)
    cont = models.ManyToManyField(Container, null=True)
    # It is possible to have a Generic object not associated with any
    # container, that's why null=True

class Specific1(Generic):
    ...

class Specific2(Generic):
    ...

...

class SpecificN(Generic):
    ...
Say I need to retrieve all Specific-type models that have a relationship with a particular Container.
The SQL for that is more or less trivial, but that is not the question. Unfortunately, I am not very experienced with ORMs (Django's in particular), so I might be missing a pattern here.
When done in a brute-force manner:
c = Container.objects.get(name='somename')  # this gets me the container
items = c.generic_set.all()
# this gets me all Generic objects that are related to the container
# Now what? I need to get to the actual Specific objects, so I need to somehow
# get the type of the underlying Specific object and get it
for item in items:
    spec = getattr(item, item.get_my_specific_type())
this results in a ton of DB hits (one for each Generic record that relates to the Container), so this is obviously not the way to do it. Now, it could perhaps be done by getting the SpecificX objects directly:
s = Specific1.objects.filter(cont__name='somename')
# This gets me all Specific1 objects for the specified container
...
# do it for every Specific type
that way the DB will be hit once for each Specific type (acceptable, I guess).
I know that .select_related() doesn't work with m2m relationships, so it is not of much help here.
To reiterate, the end result has to be a collection of SpecificX objects (not Generic).
I think you've already outlined the two easy possibilities. Either you do a single filter query against Generic and then cast each item to its Specific subtype (results in n+1 queries, where n is the number of items returned), or you make a separate query against each Specific table (results in k queries, where k is the number of Specific types).
It's actually worth benchmarking to see which of these is faster in reality. The second seems better because it's (probably) fewer queries, but each one of those queries has to perform a join with the m2m intermediate table. In the former case you only do one join query, and then many simple ones. Some database backends perform better with lots of small queries than fewer, more complex ones.
If the second is actually significantly faster for your use case, and you're willing to do some extra work to clean up your code, it should be possible to write a custom manager method for the Generic model that "pre-fetches" all the subtype data from the relevant Specific tables for a given queryset, using only one query per subtype table; similar to how this snippet optimizes generic foreign keys with a bulk prefetch. This would give you the same queries as your second option, with the DRYer syntax of your first option.
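A rough sketch of what such a bulk "pre-fetch" could look like (the helper name and the subtype list here are hypothetical; the real version would live on a custom manager for Generic):

def fetch_specifics(container):
    """Return Specific objects for a container, one query per subtype."""
    ids = list(container.generic_set.values_list('pk', flat=True))
    results = []
    # With multi-table inheritance, each Specific row shares its pk with
    # its Generic parent, so pk__in works directly against the subtypes.
    for model in (Specific1, Specific2):  # ... extend with every Specific type
        results.extend(model.objects.filter(pk__in=ids))
    return results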
Not a complete answer, but you can avoid a great number of hits by doing this:
items = list(items)
for item in items:
    spec = getattr(item, item.get_my_specific_type())
instead of this:
for item in items:
    spec = getattr(item, item.get_my_specific_type())
Indeed, by forcing a cast to a Python list, you force the Django ORM to load all the elements in your queryset. It then does this in one query.
I accidentally stumbled upon the following post, which pretty much answers your question:
http://lazypython.blogspot.com/2008/11/timeline-view-in-django.html