Django: how to filter() after distinct()

If we chain a call to filter() after a call to distinct(), the filter is applied to the query before the distinct. How do I filter the results of a query after applying distinct?
Example.objects.order_by('a','foreignkey__b').distinct('a').filter(foreignkey__b='something')
The where clause in the SQL resulting from filter() means the filter is applied to the query before the distinct. I want to filter the queryset resulting from the distinct.
This is probably pretty easy, but I just can't quite figure it out and I can't find anything on it.
Edit 1:
I need to do this in the ORM...
SELECT z.column1, z.column2, z.column3
FROM (
SELECT DISTINCT ON (b.column1, b.column2) b.column1, b.column2, c.column3
FROM table1 a
INNER JOIN table2 b ON ( a.id = b.id )
INNER JOIN table3 c ON ( b.id = c.id)
ORDER BY b.column1 ASC, b.column2 ASC, c.column4 DESC
) z
WHERE z.column3 = 'Something';
(I am using Postgres by the way.)
So I guess what I am asking is "How do you nest subqueries in the ORM? Is it possible?" I will check the documentation.
Sorry if I was not specific earlier. It wasn't clear in my head.

This is an old question, but when using Postgres you can do the following to force nested queries on your 'Distinct' rows:
foo = Example.objects.order_by('a','foreign_key__timefield').distinct('a')
bar = Example.objects.filter(pk__in=foo).filter(some_field=condition)
bar is the nested query as requested in OP without resorting to raw/extra etc. Tested working in 1.10 but docs suggest it should work back to at least 1.7.
My use case was to filter up a reverse relationship. If Example has some ForeignKey to model Toast then you can do:
Toast.objects.filter(pk__in=bar.values_list('foreign_key', flat=True))
This gives you all instances of Toast where the most recent associated example meets your filter criteria.
Big health warning about performance, though: if bar is likely to be a huge queryset, you're probably going to have a bad time.
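To see why the nesting matters, here is a minimal sketch using Python's built-in sqlite3 rather than Django and Postgres. SQLite has no DISTINCT ON, so a correlated MAX() subquery stands in for "one row per a" (an assumption, as are the table, the status column and the sample rows). It compares filtering after picking one row per a with filtering before.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE example (id INTEGER PRIMARY KEY, a TEXT, t INTEGER, status TEXT);
    INSERT INTO example (id, a, t, status) VALUES
        (1, 'x', 1, 'open'),
        (2, 'x', 2, 'closed'),
        (3, 'y', 1, 'closed'),
        (4, 'y', 2, 'open');
""")

# Stand-in for the DISTINCT ON queryset: ids of the latest row for each value of a.
latest_per_a = """
    SELECT e.id FROM example e
    WHERE e.t = (SELECT MAX(e2.t) FROM example e2 WHERE e2.a = e.a)
"""

# Nested form: pick one row per a first, then filter the survivors.
nested = conn.execute(
    f"SELECT id, a FROM example WHERE id IN ({latest_per_a}) AND status = ? ORDER BY id",
    ("open",),
).fetchall()

# Flat form: the filter runs first, so "latest" is chosen among matching rows only.
filtered_first = conn.execute(
    """
    SELECT e.id, e.a FROM example e
    WHERE e.status = ?
      AND e.t = (SELECT MAX(e2.t) FROM example e2
                 WHERE e2.a = e.a AND e2.status = ?)
    ORDER BY e.id
    """,
    ("open", "open"),
).fetchall()

print(nested)          # only groups whose latest row is 'open'
print(filtered_first)  # the latest 'open' row of every group that has one
```

The nested form drops group x entirely because its latest row is closed, while the flat form falls back to an older open row for x.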

Thanks a ton for the help guys. I tried both suggestions and could not bend either of those suggestions to work, but I think it started me in the right direction.
I ended up using
from django.db.models import Max, F
Example.objects.annotate(latest=Max('foreignkey__timefield')).filter(foreignkey__timefield=F('latest'), foreignkey__a='Something')
This checks what the latest foreignkey__timefield is for each Example, and if it is the latest one and a=something then keep it. If it is not the latest or a!=something for each Example then it is filtered out.
This does not nest subqueries, but it gives me the output I am looking for, and it is fairly simple. If there is a simpler way I would really like to know.

No, you can't do this in one simple SELECT.
As you said in the comments, the Django ORM maps filter() to the SQL WHERE clause and distinct() to DISTINCT. In SQL, DISTINCT always happens after WHERE because it operates on the result set; see the SQLite docs for an example.
But you could write a subquery to nest SELECTs. How to do that depends on the actual goal (I don't know exactly what yours is yet. Could you elaborate?)
Also, in your query, distinct('a') keeps only the first occurrence of Example for each value of a. Is that what you want?

Related

How can I annotate another annotate group by query in Django?

I have two queries:
Proyecto.objects.filter().order_by('tipo_proyecto')
Proyecto.objects.values('tipo_proyecto').annotate(total=Sum('techo_presupuestario'))
How can I do this in a single query? I want the first query to include an annotation holding the sum of techo_presupuestario for each object's tipo_proyecto. Is this possible?
If I understand you correctly, you'd like to add a conditionally aggregated sum over one field to each object, so you get each object with the sum matching its tipo_proyecto. Right?
I don't know if this makes sense, but it can be done using Subquery:
from django.db.models import Sum, Subquery, OuterRef
sq = Subquery(Proyecto.objects.filter(tipo_proyecto=OuterRef("tipo_proyecto")).values("tipo_proyecto").annotate(techoSum=Sum("techo_presupuestario")).values("techoSum"))
Proyecto.objects.all().annotate(tipoTechoSum = sq).order_by('tipo_proyecto')
Nonetheless, I wouldn't recommend this, as it puts some heavy load on your database. (In MySQL there will be a nested SELECT statement referring to the same table, which might be pretty unpleasant depending on the table's size.)
I'd say the better approach is to "collect" your aggregated sums separately and add the values to the model objects in your code.
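A minimal sketch of that "collect separately, then attach" idea, in plain Python so it runs without a database. The dataclass is a hypothetical stand-in for the Proyecto model, and nombre, tipo_techo_sum and attach_totals are made-up names. In Django the totals could instead come from the values('tipo_proyecto').annotate(...) query in the question.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Proyecto:
    # Hypothetical stand-in for the Django model; only the fields used here.
    nombre: str
    tipo_proyecto: str
    techo_presupuestario: int
    tipo_techo_sum: int = 0  # filled in by attach_totals()


def attach_totals(proyectos):
    """Sum techo_presupuestario per tipo_proyecto and copy the sum onto each object."""
    totals = defaultdict(int)
    for p in proyectos:
        totals[p.tipo_proyecto] += p.techo_presupuestario
    for p in proyectos:
        p.tipo_techo_sum = totals[p.tipo_proyecto]
    # Same ordering as order_by('tipo_proyecto'); sorted() is stable.
    return sorted(proyectos, key=lambda p: p.tipo_proyecto)


proyectos = attach_totals([
    Proyecto("A", "obra", 100),
    Proyecto("B", "servicio", 50),
    Proyecto("C", "obra", 25),
])
for p in proyectos:
    print(p.nombre, p.tipo_proyecto, p.tipo_techo_sum)
```

This trades the correlated subquery for one extra pass over the objects in application code.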

Django does a useless join when filtering

I have an optional foreign key for a parameter:
class Club(models.Model):
...
locationid = models.ForeignKey(location_models.Location, null=True)
...
I want to find entries of club where this foreignkey is not set. Here's the ORM query:
print Club.objects.filter(locationid=None).only('name').query
Produces
SELECT `club_club`.`id`, `club_club`.`name` FROM `club_club`
LEFT OUTER JOIN `location_location`
ON (`club_club`.`locationid_id` = `location_location`.`id`)
WHERE `location_location`.`id` IS NULL
Same query is produced when I do filter(locationid_id__isnull=True)
What I want is to query on locationid_id without involving a JOIN. I know I can write raw SQL, but is there an ORM-al way of doing this?
This seems to be quite a persistent issue, and the patch that fixed it had other side effects, so it never got applied to a release version of Django.
A solution to this is to use the extra method. This will require raw SQL, but only a limited amount and using SQL standards, so it should be compatible with all SQL databases:
location_null = '`%s`.`%s` IS NULL' % (Club._meta.db_table, Club.locationid.field.column)
Club.objects.extra(where=[location_null])
You can add this as a manager/queryset method for a more DRY solution.
The other option is to just take the performance hit. This is what I would recommend, unless benchmarking shows that the performance hit really is unacceptable in your specific case.

Doctrine2 DQL join with unrelated tables to fetch both entities

My DQL query returns only the FROM object, which is nice if the other object were related, but it isn't.
My Query:
$query = $this->em->createQuery('SELECT c, s FROM MyBundle:Person c, MyBundle:Spot s
JOIN s.geo_data g JOIN g.features f WHERE f.active = true AND
ST_Distance(f.location, c.location) < :distance GROUP BY c, s');
This works perfectly in SQL, giving me all the spots and all the persons within :distance of them. But in DQL, it only returns the person object, and since on the database level they are not related, I have no way to fetch the correct spot.
My database setup is correct, I'm using a PostGIS backend and spots and persons are not related in any way. They just happen to be on the same map and I'm querying for spatial relationships.
According to documentation, it's intended behaviour, from what I read, s is being hydrated, but not returned anywhere at all, good job!
How can I teach DQL to please return me what I told it in SELECT? Where's the "I mean what I say, stop being a smartass" switch?
Doctrine cannot give you both entities if they are not related. If they were related, you would get the first entity c and could reach s through the relation.
What you can try is selecting all fields of both entities like
SELECT c.location, ..., s.geo_data, ...
This will give you an array for each column that contains all fields from both entities.
Maybe you can use result set mapping to get the entities if desired.
If you want to stick with Doctrine, you HAVE TO define a OneToMany relation between places and people. That way, you could set up the PeopleRepository with a method like getPeopleByLocationAndMaxDistance(Location $location, $distance)
SELECT p
FROM People AS p
LEFT JOIN Places AS pl
WHERE ST_Distance(p.location, pl.location) < :distance

Django Sum & Count

I have some MySQL code that looks like this:
SELECT
visitor AS team,
COUNT(*) AS rg,
SUM(vscore>hscore) AS rw,
SUM(vscore<hscore) AS rl
FROM `gamelog` WHERE status='Final'
AND date(start_et) BETWEEN %s AND %s GROUP BY visitor
I'm trying to translate this into a Django version of that query, without making multiple queries. Is this possible? I read up on how to do Sum(), and Count(), but it doesn't seem to work when I want to compare two fields like I'm doing.
Here's the best I could come up with so far, but it didn't work...
vrecord = GameLog.objects.filter(start_et__range=[start,end],visitor=i['id']
).aggregate(
Sum('vscore'>'hscore'),
Count('vscore'>'hscore'))
I also tried using 'vscore>hscore' in there, but that didn't work either. Any ideas? I need to use as few queries as possible.
Aggregation only works on single fields in the Django ORM. I looked at the code for the various aggregation functions, and noticed that the single-field restriction is hardwired. Basically, when you use, say, Sum(field), it just records that for later, then it passes it to the database-specific backend for conversion to SQL and execution. Apparently, aggregation and annotation are not standardized in SQL.
Anyway, you probably need to use a raw SQL query.
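A minimal sketch of the raw-SQL route, run against an in-memory SQLite database so it works without Django or MySQL. The sample rows and team codes are invented. SUM(CASE WHEN ...) is used in place of SUM(vscore>hscore) because it is the more portable spelling. In a Django project the same statement could presumably be sent through a database cursor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gamelog (visitor TEXT, vscore INTEGER, hscore INTEGER,
                          status TEXT, start_et TEXT);
    INSERT INTO gamelog VALUES
        ('NYY', 5, 3, 'Final', '2012-04-01 19:05:00'),
        ('NYY', 2, 4, 'Final', '2012-04-02 19:05:00'),
        ('BOS', 1, 0, 'Final', '2012-04-02 13:35:00'),
        ('BOS', 7, 2, 'Live',  '2012-04-03 13:35:00');
""")

# One query: games played, wins and losses per visiting team.
sql = """
    SELECT visitor AS team,
           COUNT(*) AS rg,
           SUM(CASE WHEN vscore > hscore THEN 1 ELSE 0 END) AS rw,
           SUM(CASE WHEN vscore < hscore THEN 1 ELSE 0 END) AS rl
    FROM gamelog
    WHERE status = 'Final' AND date(start_et) BETWEEN ? AND ?
    GROUP BY visitor
    ORDER BY visitor
"""
rows = conn.execute(sql, ("2012-04-01", "2012-04-30")).fetchall()
for team, rg, rw, rl in rows:
    print(team, rg, rw, rl)
```

Newer Django versions (1.8 and later) also ship conditional expressions (Case, When), which may make the raw query unnecessary.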

is distinct an expensive query in django?

I have three models: Product, Category and Place.
Product has ManyToMany relation with Category and Place.
I need to get a list of categories with at least one product matching a specific place.
For example, I might need to get all the categories that have at least one product from Boston.
I have 100 categories, 500 places and 100,000 products.
In sqlite with 10K products the query takes ~ a second.
In production I'll use postgresql.
I'm using:
categories = Category.objects.distinct().filter(product__place__name="Boston")
Is this query going to be expensive?
Is there a better way to do this?
This is the result of connection.queries
{'time': '0.929', 'sql': u'SELECT DISTINCT "catalog_category"."id", "catalog_category"."name" FROM "catalog_category" INNER JOIN "catalog_product_categories" ON ("catalog_category"."id" = "catalog_product_categories"."category_id") INNER JOIN "catalog_product" ON ("catalog_product_categories"."product_id" = "catalog_product"."id") INNER JOIN "catalog_product_places" ON ("catalog_product"."id" = "catalog_product_places"."product_id") INNER JOIN "catalog_place" ON ("catalog_product_places"."car_id" = "catalog_car"."id") WHERE "catalog_place"."name" = Boston ORDER BY "catalog_category"."name" ASC'}]
Thanks
This is not just a Django issue; DISTINCT is slow on most SQL implementations because it's a relatively hard operation. Here is a good discussion of why it's slow in Postgres specifically.
One way to handle this would be to use Django's excellent caching mechanism on this query, assuming that the results don't change often and minor staleness isn't a problem. Another approach would be to keep a separate list of just the distinct categories, perhaps in another table.
Although Chase is right that DISTINCT is generally a slow operation, in this case it is also completely pointless. As you can see from the generated SQL, the DISTINCT is being done on the combination of ID and name - which will never be duplicated anyway. So there is no need for the distinct() call in this query.
Generally, Django does not return duplicate results from a simple filter. The main time when distinct() is useful is when you are accessing a related queryset via a ManyToMany or ForeignKey relationship, where multiple items might be related to the same instance, and distinct will remove the duplicates.
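A rough sketch of the duplication that the last paragraph describes, using sqlite3 and a simplified, hypothetical schema (the place is stored directly on the product, and the table names are made up). It runs the same join with and without DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, place TEXT);
    CREATE TABLE product_categories (product_id INTEGER, category_id INTEGER);
    INSERT INTO category VALUES (1, 'Books'), (2, 'Toys');
    INSERT INTO product VALUES
        (1, 'Atlas', 'Boston'), (2, 'Novel', 'Boston'), (3, 'Kite', 'Denver');
    INSERT INTO product_categories VALUES (1, 1), (2, 1), (3, 2);
""")

# Join from category through the many-to-many table to product.
join = """
    FROM category c
    JOIN product_categories pc ON pc.category_id = c.id
    JOIN product p ON p.id = pc.product_id
    WHERE p.place = 'Boston'
"""
plain = conn.execute("SELECT c.id, c.name " + join + " ORDER BY c.id").fetchall()
deduped = conn.execute("SELECT DISTINCT c.id, c.name " + join + " ORDER BY c.id").fetchall()

print(plain)    # one row per matching product, so a category can repeat
print(deduped)  # each category once
```

Here two Boston products share a category, so the plain join returns that category twice and DISTINCT collapses it to one row.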