Is distinct an expensive query in Django?

I have three models: Product, Category and Place.
Product has ManyToMany relation with Category and Place.
I need to get a list of categories with at least one product matching a specific place.
For example, I might need to get all the categories that have at least one product from Boston.
I have 100 categories, 500 places and 100,000 products.
In SQLite with 10K products the query takes about a second.
In production I'll use postgresql.
I'm using:
categories = Category.objects.distinct().filter(product__place__name="Boston")
Is this query going to be expensive?
Is there a better way to do this?
This is the result of connection.queries
{'time': '0.929', 'sql': u'SELECT DISTINCT "catalog_category"."id", "catalog_category"."name" FROM "catalog_category" INNER JOIN "catalog_product_categories" ON ("catalog_category"."id" = "catalog_product_categories"."category_id") INNER JOIN "catalog_product" ON ("catalog_product_categories"."product_id" = "catalog_product"."id") INNER JOIN "catalog_product_places" ON ("catalog_product"."id" = "catalog_product_places"."product_id") INNER JOIN "catalog_place" ON ("catalog_product_places"."car_id" = "catalog_car"."id") WHERE "catalog_place"."name" = Boston ORDER BY "catalog_category"."name" ASC'}]
Thanks

This is not just a Django issue; DISTINCT is slow on most SQL implementations because it's a relatively hard operation. Here is a good discussion of why it's slow in Postgres specifically.
One way to handle this would be to use Django's excellent caching mechanism on this query, assuming that the results don't change often and minor staleness isn't a problem. Another approach would be to keep a separate list of just the distinct categories, perhaps in another table.

Although Chase is right that DISTINCT is generally a slow operation, in this case it is also completely pointless. As you can see from the generated SQL, the DISTINCT is being done on the combination of ID and name - which will never be duplicated anyway. So there is no need for the distinct() call in this query.
Generally, Django does not return duplicate results from a simple filter. The main time when distinct() is useful is when you are accessing a related queryset via a ManyToMany or ForeignKey relationship, where multiple items might be related to the same instance, and distinct will remove the duplicates.
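A quick way to see whether the DISTINCT does any work is to run the same kind of join on a toy dataset. The sketch below uses Python's built-in sqlite3 module with a simplified, assumed schema (the table and column names are illustrative, not the ones Django generates). On this data the join returns a category once per matching product, which suggests distinct() is not redundant when the filter spans a many-to-many relation. It also shows an EXISTS subquery as one possible way to get unique categories without DISTINCT.

```python
import sqlite3

# Toy schema; table and column names are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_categories (product_id INTEGER, category_id INTEGER);
    CREATE TABLE product_places (product_id INTEGER, place_id INTEGER);

    INSERT INTO category VALUES (1, 'Books'), (2, 'Toys');
    INSERT INTO place VALUES (1, 'Boston'), (2, 'Denver');
    INSERT INTO product VALUES (1, 'Atlas'), (2, 'Novel'), (3, 'Robot');
    -- Two Boston products share the Books category; the Toys product is in Denver.
    INSERT INTO product_categories VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO product_places VALUES (1, 1), (2, 1), (3, 2);
""")

# Same shape as the query in the question: join through both link tables.
JOIN_SQL = """
    SELECT {distinct} c.id, c.name
    FROM category c
    JOIN product_categories pc ON pc.category_id = c.id
    JOIN product_places pp ON pp.product_id = pc.product_id
    JOIN place pl ON pl.id = pp.place_id
    WHERE pl.name = ?
    ORDER BY c.name
"""

# Alternative: ask only whether at least one matching product exists.
EXISTS_SQL = """
    SELECT c.id, c.name
    FROM category c
    WHERE EXISTS (
        SELECT 1
        FROM product_categories pc
        JOIN product_places pp ON pp.product_id = pc.product_id
        JOIN place pl ON pl.id = pp.place_id
        WHERE pc.category_id = c.id AND pl.name = ?
    )
    ORDER BY c.name
"""

plain_rows = conn.execute(JOIN_SQL.format(distinct=""), ("Boston",)).fetchall()
distinct_rows = conn.execute(JOIN_SQL.format(distinct="DISTINCT"), ("Boston",)).fetchall()
exists_rows = conn.execute(EXISTS_SQL, ("Boston",)).fetchall()

print(plain_rows)     # [(1, 'Books'), (1, 'Books')]
print(distinct_rows)  # [(1, 'Books')]
print(exists_rows)    # [(1, 'Books')]
```

Whether the EXISTS form is actually faster than DISTINCT on PostgreSQL with 100,000 products is something to measure with EXPLAIN ANALYZE on real data. In Django 1.11 and later, a query of this shape can be built with Exists and OuterRef from django.db.models.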

Related

Django Postgres ArrayField vs One-to-Many relationship

For a model in my database I need to store around 300 values for a specific field. What would be the drawbacks, in terms of performance and simplicity in query, if I use Postgres-specific ArrayField instead of a separate table with One-to-Many relationship?
If you use an array field
The size of each row in your DB is going to be fairly large, so Postgres is going to rely much more on TOAST tables (http://www.postgresql.org/docs/9.5/static/storage-toast.html)
Every time you get the row, unless you specifically defer the field (https://docs.djangoproject.com/en/1.9/ref/models/querysets/#defer) or otherwise exclude it from the query via only, values, or something similar, you are paying the cost of loading all those values every time you iterate across that row. If that's what you need, then so be it.
Filtering based on values in that array, while possible, isn't going to be as nice, and the Django ORM doesn't make it as obvious as it does for M2M tables.
If you use M2M
You can filter more easily on those related values
Those related values are not loaded by default; you can use prefetch_related if you need them, and then get fancy if you want only a subset of those values loaded
Total storage in the DB is going to be slightly higher with M2M because of keys, and extra id fields
The cost of the joins in this case is completely negligible because of keys.
Personally I'd say go with the M2M tables, but I don't know your specific application. If you're going to be working with a massive amount of data it's likely worth grabbing a representative dataset and testing both methods with it.

Django does a useless join when filtering

I have an optional foreign key for a parameter:
class Club(models.Model):
    ...
    locationid = models.ForeignKey(location_models.Location, null=True)
    ...
I want to find Club entries where this foreign key is not set. Here's the ORM query:
print(Club.objects.filter(locationid=None).only('name').query)
Produces
SELECT `club_club`.`id`, `club_club`.`name` FROM `club_club`
LEFT OUTER JOIN `location_location`
ON (`club_club`.`locationid_id` = `location_location`.`id`)
WHERE `location_location`.`id` IS NULL
The same query is produced when I use filter(locationid_id__isnull=True).
What I want is to query on locationid_id without involving a JOIN. I know I can write raw SQL, but is there an ORM-al way of doing this?
This seems to be quite a persistent issue, and the patch that fixed it had other side effects, so it never got applied to a release version of Django.
A solution to this is to use the extra method. This requires raw SQL, but only a small amount. IS NULL is standard SQL, though the backtick quoting below is MySQL-style (matching the query in the question) and would need adjusting for other databases:
location_null = '`%s`.`%s` IS NULL' % (Club._meta.db_table, Club.locationid.field.column)
Club.objects.extra(where=[location_null])
You can add this as a manager/queryset method for a more DRY solution.
The other option is to just take the performance hit. This is what I would recommend, unless benchmarking shows that the performance hit really is unacceptable in your specific case.
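To make the trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Django itself. It compares the joined query from the question with a filter on the foreign key column alone. The table and column names follow the SQL shown in the question; the sample rows and the city column are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location_location (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE club_club (
        id INTEGER PRIMARY KEY,
        name TEXT,
        locationid_id INTEGER REFERENCES location_location (id)
    );
    -- Made-up sample data: one club with a location, two without.
    INSERT INTO location_location VALUES (1, 'Boston');
    INSERT INTO club_club VALUES (1, 'Chess', 1), (2, 'Rowing', NULL), (3, 'Debate', NULL);
""")

# The shape of the query the ORM generated in the question.
WITH_JOIN = """
    SELECT club_club.id, club_club.name
    FROM club_club
    LEFT OUTER JOIN location_location
        ON club_club.locationid_id = location_location.id
    WHERE location_location.id IS NULL
    ORDER BY club_club.id
"""

# The shape of the query the extra(where=[...]) workaround aims for.
WITHOUT_JOIN = """
    SELECT id, name
    FROM club_club
    WHERE locationid_id IS NULL
    ORDER BY id
"""

joined = conn.execute(WITH_JOIN).fetchall()
direct = conn.execute(WITHOUT_JOIN).fetchall()

# The two forms agree as long as every non-NULL locationid_id points at an
# existing location row; a dangling reference would show up only in the
# joined result.
print(joined)  # [(2, 'Rowing'), (3, 'Debate')]
print(direct)  # [(2, 'Rowing'), (3, 'Debate')]
```

This only shows that the two queries return the same rows on consistent data. How much time the join costs on a real MySQL table is still a matter for benchmarking, as suggested above.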

Django: how to filter() after distinct()

If we chain a call to filter() after a call to distinct(), the filter is applied to the query before the distinct. How do I filter the results of a query after applying distinct?
Example.objects.order_by('a','foreignkey__b').distinct('a').filter(foreignkey__b='something')
The where clause in the SQL resulting from filter() means the filter is applied to the query before the distinct. I want to filter the queryset resulting from the distinct.
This is probably pretty easy, but I just can't quite figure it out and I can't find anything on it.
Edit 1:
I need to do this in the ORM...
SELECT z.column1, z.column2, z.column3
FROM (
    SELECT DISTINCT ON (b.column1, b.column2) b.column1, b.column2, c.column3
    FROM table1 a
    INNER JOIN table2 b ON (a.id = b.id)
    INNER JOIN table3 c ON (b.id = c.id)
    ORDER BY b.column1 ASC, b.column2 ASC, c.column4 DESC
) z
WHERE z.column3 = 'Something';
(I am using Postgres by the way.)
So I guess what I am asking is "How do you nest subqueries in the ORM? Is it possible?" I will check the documentation.
Sorry if I was not specific earlier. It wasn't clear in my head.
This is an old question, but when using Postgres you can do the following to force nested queries on your 'Distinct' rows:
foo = Example.objects.order_by('a','foreign_key__timefield').distinct('a')
bar = Example.objects.filter(pk__in=foo).filter(some_field=condition)
bar is the nested query as requested in OP without resorting to raw/extra etc. Tested working in 1.10 but docs suggest it should work back to at least 1.7.
My use case was to filter up a reverse relationship. If Example has some ForeignKey to model Toast then you can do:
Toast.objects.filter(pk__in=bar.values_list('foreign_key', flat=True))
This gives you all instances of Toast where the most recent associated example meets your filter criteria.
Big health warning about performance though: if bar is likely to be a huge queryset, you're probably going to have a bad time using this.
Thanks a ton for the help guys. I tried both suggestions and could not bend either of those suggestions to work, but I think it started me in the right direction.
I ended up using
from django.db.models import Max, F
Example.objects.annotate(latest=Max('foreignkey__timefield')).filter(foreignkey__timefield=F('latest'), foreign__a='Something')
This checks what the latest foreignkey__timefield is for each Example, and if it is the latest one and a=something then keep it. If it is not the latest or a!=something for each Example then it is filtered out.
This does not nest subqueries, but it gives me the output I am looking for, and it is fairly simple. If there is a simpler way I would really like to know.
No, you can't do this in one simple SELECT.
As you said in the comments, in the Django ORM filter is mapped to the SQL WHERE clause and distinct is mapped to DISTINCT. In SQL, DISTINCT always happens after WHERE, operating on the result set that WHERE has already narrowed; see the SQLite documentation for an example.
But you could write a subquery to nest SELECTs. How to do that depends on the actual goal (I'm not sure exactly what yours is; could you elaborate?).
Also, in your query, distinct('a') only keeps the first occurrence of Example among rows having the same a. Is that what you want?
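The difference between the two orderings can be checked on a small table. The sketch below uses Python's built-in sqlite3 module with made-up data and column names (a, b, ts). SQLite has no DISTINCT ON, so "keep one row per a" is emulated here with a correlated MAX(ts) subquery. That is a stand-in for the Postgres feature, not the same syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE example (id INTEGER PRIMARY KEY, a TEXT, b TEXT, ts INTEGER);
    -- Made-up rows: for k1 the latest row has b = 'y', for k2 it has b = 'x'.
    INSERT INTO example VALUES
        (1, 'k1', 'x', 1),
        (2, 'k1', 'y', 2),
        (3, 'k2', 'x', 1),
        (4, 'k2', 'x', 2);
""")

# Filter first, then keep the latest row per a. This is the effect of a
# WHERE clause sitting in the same SELECT as the "distinct" step.
FILTER_THEN_PICK = """
    SELECT e.id
    FROM example e
    WHERE e.b = ?
      AND e.ts = (SELECT MAX(e2.ts) FROM example e2
                  WHERE e2.a = e.a AND e2.b = ?)
    ORDER BY e.id
"""

# Keep the latest row per a in an inner query, then filter the outer one.
PICK_THEN_FILTER = """
    SELECT z.id
    FROM (
        SELECT e.id, e.b
        FROM example e
        WHERE e.ts = (SELECT MAX(e2.ts) FROM example e2 WHERE e2.a = e.a)
    ) z
    WHERE z.b = ?
    ORDER BY z.id
"""

filter_first = [row[0] for row in conn.execute(FILTER_THEN_PICK, ("x", "x"))]
pick_first = [row[0] for row in conn.execute(PICK_THEN_FILTER, ("x",))]

print(filter_first)  # [1, 4]: k1 falls back to its older 'x' row
print(pick_first)    # [4]: k1 is dropped because its latest row has b = 'y'
```

The nested form is what the pk__in approach above produces: the inner queryset picks one row per a, and the outer filter only sees those rows.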

querying a result of cfquery or filtering cfquery

I have two tables of data: categories and category relations. This is a setup to allow infinite levels of parent/child relationships.
I could put it into a linked list too I guess, but this might allow one child to have multiple parents if the need ever comes up.
I have a query that combines these two as well as some other tables, and using those I have a list of all categories. I want to sort them by the level they're at, without counting and updating the database with the depth level.
And now the question:
Is there a way to do a where operation on a query, or filter it based on a value such as parentID=2?
Check out "About Query of Queries" in the ColdFusion documentation.

Django ORM: Optimizing queries involving many-to-many relations

I have the following model structure:
class Container(models.Model):
    pass
class Generic(models.Model):
    # CharField requires max_length; 255 is an assumed value
    name = models.CharField(max_length=255, unique=True)
    cont = models.ManyToManyField(Container, null=True)
    # It is possible to have a Generic object not associated with any container,
    # that's why null=True
class Specific1(Generic):
    ...
class Specific2(Generic):
    ...
...
class SpecificN(Generic):
    ...
Say, I need to retrieve all Specific-type models, that have a relationship with a particular Container.
The SQL for that is more or less trivial, but that is not the question. Unfortunately, I am not very experienced at working with ORMs (Django's ORM in particular), so I might be missing a pattern here.
When done in a brute-force manner:
c = Container.objects.get(name='somename') # this gets me the container
items = c.generic_set.all()
# this gets me all Generic objects, that are related to the container
# Now what? I need to get to the actual Specific objects, so I need to somehow
# get the type of the underlying Specific object and get it
for item in items:
    spec = getattr(item, item.get_my_specific_type())
this results in a ton of db hits (one for each Generic record, that relates to a Container), so this is obviously not the way to do it. Now, it could, perhaps, be done by getting the SpecificX objects directly:
s = Specific1.objects.filter(cont__name='somename')
# This gets me all Specific1 objects for the specified container
...
# do it for every Specific type
that way the db will be hit once for each Specific type (acceptable, I guess).
I know that .select_related() doesn't work with m2m relationships, so it is not of much help here.
To reiterate, the end result has to be a collection of SpecificX objects (not Generic).
I think you've already outlined the two easy possibilities. Either you do a single filter query against Generic and then cast each item to its Specific subtype (results in n+1 queries, where n is the number of items returned), or you make a separate query against each Specific table (results in k queries, where k is the number of Specific types).
It's actually worth benchmarking to see which of these is faster in reality. The second seems better because it's (probably) fewer queries, but each one of those queries has to perform a join with the m2m intermediate table. In the former case you only do one join query, and then many simple ones. Some database backends perform better with lots of small queries than fewer, more complex ones.
If the second is actually significantly faster for your use case, and you're willing to do some extra work to clean up your code, it should be possible to write a custom manager method for the Generic model that "pre-fetches" all the subtype data from the relevant Specific tables for a given queryset, using only one query per subtype table; similar to how this snippet optimizes generic foreign keys with a bulk prefetch. This would give you the same queries as your second option, with the DRYer syntax of your first option.
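To compare the query counts of the two approaches without setting up Django, here is a rough sketch using Python's built-in sqlite3 module. The tables loosely mimic multi-table inheritance, but every name here is an assumption for illustration. In particular, the kind column stands in for whatever get_my_specific_type() relies on, and the extra1/extra2 columns are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical layout: one parent table, one table per subtype.
    CREATE TABLE generic (id INTEGER PRIMARY KEY, name TEXT, kind TEXT);
    CREATE TABLE generic_cont (generic_id INTEGER, container_id INTEGER);
    CREATE TABLE specific1 (generic_ptr_id INTEGER PRIMARY KEY, extra1 TEXT);
    CREATE TABLE specific2 (generic_ptr_id INTEGER PRIMARY KEY, extra2 TEXT);
    INSERT INTO generic VALUES
        (1, 'a', 'specific1'), (2, 'b', 'specific1'),
        (3, 'c', 'specific2'), (4, 'd', 'specific2');
    INSERT INTO generic_cont VALUES (1, 10), (2, 10), (3, 10), (4, 99);
    INSERT INTO specific1 VALUES (1, 's1-a'), (2, 's1-b');
    INSERT INTO specific2 VALUES (3, 's2-c'), (4, 's2-d');
""")

# Subtype table -> its extra column (fixed mapping, never user input).
SPECIFIC_TABLES = {"specific1": "extra1", "specific2": "extra2"}
executed = []  # every SQL string sent through run()

def run(sql, params=()):
    """Execute a query and record it so the two strategies can be compared."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()

def fetch_one_by_one(container_id):
    """One query for the parent rows, then one query per row (n + 1)."""
    generics = run(
        "SELECT g.id, g.name, g.kind FROM generic g "
        "JOIN generic_cont gc ON gc.generic_id = g.id "
        "WHERE gc.container_id = ? ORDER BY g.id",
        (container_id,),
    )
    results = []
    for generic_id, name, kind in generics:
        column = SPECIFIC_TABLES[kind]
        rows = run(f"SELECT {column} FROM {kind} WHERE generic_ptr_id = ?", (generic_id,))
        results.append((name, rows[0][0]))
    return results

def fetch_per_table(container_id):
    """One joined query per subtype table (k queries)."""
    results = []
    for table, column in SPECIFIC_TABLES.items():
        results.extend(run(
            f"SELECT g.name, s.{column} FROM {table} s "
            "JOIN generic g ON g.id = s.generic_ptr_id "
            "JOIN generic_cont gc ON gc.generic_id = g.id "
            "WHERE gc.container_id = ? ORDER BY g.id",
            (container_id,),
        ))
    return results

one_by_one = fetch_one_by_one(10)
one_by_one_queries = len(executed)

executed.clear()
per_table = fetch_per_table(10)
per_table_queries = len(executed)

print(one_by_one_queries, per_table_queries)  # 4 2
```

Both strategies return the same three objects here. Query count alone does not decide which is faster, so the benchmarking advice above still applies.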
Not a complete answer, but you can avoid a great number of hits by doing this:
items = list(items)
for item in items:
    spec = getattr(item, item.get_my_specific_type())
instead of this:
for item in items:
    spec = getattr(item, item.get_my_specific_type())
Indeed, by forcing a cast to a Python list, you force the Django ORM to load all elements of your queryset at once, which it does in one query.
I accidentally stumbled upon the following post, which pretty much answers your question:
http://lazypython.blogspot.com/2008/11/timeline-view-in-django.html