Django 1.11 annotation GROUP BY is wrong

With model relations like this
Risk <-- RiskGroup --> Group
I am trying to achieve a query resembling this
SELECT risk.id,
(
SELECT array_agg(bg.name) as names
FROM hazards_riskgroup rgroup
JOIN (SELECT * FROM base_group ORDER BY name) as bg
on rgroup.group_id = bg.id
WHERE risk_id = risk.id
GROUP BY risk_id
ORDER BY risk_id
) AS conf
FROM hazards_risk risk
ORDER BY conf NULLS LAST
;
And I have gotten as far as
group_names = (
    RiskGroup.objects
    .filter(risk_id=OuterRef('pk'))
    .annotate(names=ArrayAgg('group__name'))
    .order_by()
    .order_by("group__name")
    .values('names')
)
qs = Risk.objects.annotate(
    conf=Subquery(group_names)
).order_by(F('conf').asc(nulls_last=True))
But that will produce something along the lines of this query
SELECT risk.id,
(SELECT ARRAY_AGG(U2."name") AS "names"
FROM "hazards_riskgroup" U0
INNER JOIN "base_group" U2 ON (U0."group_id" = U2."id")
WHERE U0."risk_id" = (risk.id)
GROUP BY U0."id", U2."name"
ORDER BY U2."name" ASC) AS "conf"
FROM hazards_risk risk
ORDER BY conf NULLS LAST
Notice that the generated GROUP BY becomes GROUP BY U0."id", U2."name", where I want just GROUP BY U2."name".
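To see why the extra column in the GROUP BY matters, here is a minimal sketch using Python's built-in sqlite3 module instead of PostgreSQL. It assumes made-up sample rows, and group_concat stands in for array_agg. Grouping by the RiskGroup primary key puts every row in its own group, so the subquery yields one row per RiskGroup rather than one aggregated row per risk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_group (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE hazards_risk (id INTEGER PRIMARY KEY);
    CREATE TABLE hazards_riskgroup (
        id INTEGER PRIMARY KEY, risk_id INTEGER, group_id INTEGER
    );
    -- Hypothetical data: one risk that belongs to two groups.
    INSERT INTO base_group VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO hazards_risk VALUES (10);
    INSERT INTO hazards_riskgroup VALUES (100, 10, 1), (101, 10, 2);
""")


def rows_per_risk(group_by):
    """Count how many rows the inner subquery returns for risk 10."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT group_concat(bg.name) AS names
            FROM hazards_riskgroup rg
            JOIN base_group bg ON rg.group_id = bg.id
            WHERE rg.risk_id = 10
            GROUP BY {group_by}
        )
    """
    return conn.execute(sql).fetchone()[0]


# Grouping as in the generated query: one group per RiskGroup row.
rows_grouped_by_row = rows_per_risk("rg.id, bg.name")
# Grouping as in the hand-written query: one group per risk.
rows_grouped_by_risk = rows_per_risk("rg.risk_id")
print(rows_grouped_by_row, rows_grouped_by_risk)
```

With this data the first grouping returns two rows and the second returns one, which is why only the second form is usable as a scalar subquery.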
As best I can tell, it is related to some default ordering on the models, but as you can see, I've already accounted for that by inserting .order_by().
So I'm a little lost. I tried annotating with Raw sql instead, but in that case, I can't seem to be able to reference risk.id which is important to get the right group names for each risk.

Related

Django ORM and GROUP BY

Newcomer to Django here.
I'm currently trying to fetch some data from my model with a query that would need a GROUP BY in SQL.
Here is my simplified model:
class Message(models.Model):
mmsi = models.CharField(max_length=16)
time = models.DateTimeField()
point = models.PointField(geography=True)
I'm basically trying to get the last Message from every distinct mmsi number.
In SQL that would translate to something like this:
select a.* from core_message a
inner join
(select mmsi, max(time) as time from core_message group by mmsi) b
on a.mmsi=b.mmsi and a.time=b.time;
After some tries, I managed to have something working similarly with Django ORM:
>>> mf=Message.objects.values('mmsi').annotate(Max('time'))
>>> Message.objects.filter(mmsi__in=mf.values('mmsi'),time__in=mf.values('time__max'))
That works, but I find my Django solution quite clumsy. Not sure it's the proper way to do it.
Looking at the underlying query this looks like this :
>>> print(Message.objects.filter(mmsi__in=mf.values('mmsi'),time__in=mf.values('time__max')).query)
SELECT "core_message"."id", "core_message"."mmsi", "core_message"."time", "core_message"."point"::bytea FROM "core_message" WHERE ("core_message"."mmsi" IN (SELECT U0."mmsi" FROM "core_message" U0 GROUP BY U0."mmsi") AND "core_message"."time" IN (SELECT MAX(U0."time") AS "time__max" FROM "core_message" U0 GROUP BY U0."mmsi"))
I'd appreciate if you could propose a better solution for this problem.
Thanks !
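One caveat with the two independent `__in` filters: they check mmsi and time separately, so a row can match if its time happens to be the maximum time of a different mmsi. The sketch below replays both SQL statements on an in-memory SQLite table. It assumes integer timestamps and made-up rows, and leaves out the PointField.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE core_message (id INTEGER PRIMARY KEY, mmsi TEXT, time INTEGER);
    -- Hypothetical data: ship 'A' reports at t=1 and t=2, ship 'B' only at t=1.
    INSERT INTO core_message VALUES (1, 'A', 1), (2, 'A', 2), (3, 'B', 1);
""")

# The hand-written join: mmsi and max(time) are matched as a pair.
join_sql = """
    SELECT a.id FROM core_message a
    INNER JOIN (SELECT mmsi, MAX(time) AS time FROM core_message GROUP BY mmsi) b
    ON a.mmsi = b.mmsi AND a.time = b.time
    ORDER BY a.id
"""

# The shape of the ORM-generated query: two unrelated IN conditions.
two_in_sql = """
    SELECT id FROM core_message
    WHERE mmsi IN (SELECT mmsi FROM core_message GROUP BY mmsi)
      AND time IN (SELECT MAX(time) FROM core_message GROUP BY mmsi)
    ORDER BY id
"""

join_ids = [row[0] for row in conn.execute(join_sql)]
two_in_ids = [row[0] for row in conn.execute(two_in_sql)]
print(join_ids, two_in_ids)
```

Here the join returns only the latest message per ship, while the two-IN version also returns message 1 (ship A at t=1), because t=1 is the latest time for ship B.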
You only need something like this:
Message.objects.all().distinct('mmsi').values('mmsi', 'time').order_by('mmsi','-id')
or like this:
Message.objects.all().values('mmsi').annotate(date_last=Max('time'))
Note: Django translates the latter into this SQL query:
SELECT "message"."mmsi", MAX("message"."time") AS "date_last" FROM "message" GROUP BY "message"."mmsi", "message"."time" ORDER BY "message"."time" DESC
Using the answers and comments, I managed to solve this using a subquery or a simple distinct order by.
Simple distinct order by solution inspired by @Oriphiel's answer:
Message.objects.distinct('mmsi').order_by('mmsi','-time')
The underlying SQL query looks like this :
SELECT DISTINCT ON ("core_message"."mmsi") "core_message"."id", "core_message"."mmsi", "core_message"."time", "core_message"."point"::bytea
FROM "core_message"
ORDER BY "core_message"."mmsi" ASC, "core_message"."time" DESC
Simple and straightforward.
Subquery solution inspired by @DanielRoseman's comment:
from django.db.models import OuterRef, Subquery
time_order = Message.objects.filter(mmsi=OuterRef('mmsi')).order_by('-time')
Message.objects.filter(id__in=Subquery(time_order.values('id')[:1]))
The underlying SQL query looks like this :
SELECT "core_message"."id", "core_message"."mmsi", "core_message"."time", "core_message"."point"::bytea
FROM "core_message"
WHERE "core_message"."id" IN
(SELECT U0."id" FROM "core_message" U0 WHERE U0."mmsi" = ("core_message"."mmsi") ORDER BY U0."time" DESC LIMIT 1)
A tad more complex, but it gives more flexibility. If I wanted to get the five most recent messages for every MMSI, I'd just need to change the LIMIT value. In Django, it would look like this:
Message.objects.filter(id__in=Subquery(time_order.values('id')[:5]))
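The correlated subquery with a LIMIT can be tried outside Django as well. This is a rough sketch on SQLite through Python's sqlite3 module, assuming a cut-down core_message table with integer timestamps and invented rows. The helper name latest_per_mmsi is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE core_message (id INTEGER PRIMARY KEY, mmsi TEXT, time INTEGER);
    INSERT INTO core_message VALUES
        (1, 'A', 10), (2, 'A', 20), (3, 'A', 30),
        (4, 'B', 5),  (5, 'B', 15);
""")


def latest_per_mmsi(n):
    """Return ids of the n most recent messages for every mmsi."""
    sql = """
        SELECT m.id FROM core_message m
        WHERE m.id IN (
            SELECT u.id FROM core_message u
            WHERE u.mmsi = m.mmsi      -- correlation with the outer row
            ORDER BY u.time DESC
            LIMIT ?
        )
        ORDER BY m.mmsi, m.time DESC
    """
    return [row[0] for row in conn.execute(sql, (n,))]


latest_one = latest_per_mmsi(1)
latest_two = latest_per_mmsi(2)
print(latest_one, latest_two)
```

Only the LIMIT value changes between the two calls, which mirrors changing the slice from [:1] to [:5] in the queryset.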

SqlAlchemy core union_all not adding parentheses

I have the following sample code:
queries = []
q1 = select([columns]).where(table.c.id == #).limit(#)
queries.append(q1)
q2 = select([columns]).where(table.c.id == #).limit(#)
queries.append(q2)
final_query = union_all(*queries)
The generated SQL should be this:
(select columns from table where id = # limit #)
UNION ALL
(select columns from table where id = # limit #)
But, I'm getting
select columns from table where id = # limit #
UNION ALL
select columns from table where id = # limit #
I tried using subquery, as follows for my queries:
q1 = subquery(select([columns]).where(table.c.id == #).limit(#))
The generated query then looks like this:
SELECT UNION ALL SELECT UNION ALL
I also tried doing
q1 = (select([columns]).where(table.c.id == #).limit(#)).subquery()
But, I get the error:
'Select' object has no attribute 'subquery'
Any help to get the desired output with my subqueries wrapped in parentheses?
Note: this is not a duplicate of this question, because I'm not using Session.
EDIT
Okay, this works, but I don't believe it is very efficient, and it's adding an extra select * from (my sub query), but it works.
q1 = select('*').select_from((select(columns).where(table.c.id == #).limit(#)).alias('q1'))
So, if anyone has any ideas to optimize, or let me know if this is as good as it gets. I would appreciate it.
The author of SQLAlchemy seems to be aware of this and mentions a workaround for it on the SQLAlchemy 1.1 changelog page. The general idea is to do .alias().select() on each select.
stmt1 = select([table1.c.x]).order_by(table1.c.y).limit(1).alias().select()
stmt2 = select([table2.c.x]).order_by(table2.c.y).limit(2).alias().select()
stmt = union(stmt1, stmt2)
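In plain SQL terms, the workaround turns each limited SELECT into a derived table and selects from it, so the LIMIT stays scoped to its own branch of the union. Here is a small sketch of that shape run on SQLite with Python's sqlite3 module. The table contents and the alias names anon_1 and anon_2 are illustrative assumptions, and UNION ALL is used to match the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (x INTEGER, y INTEGER);
    CREATE TABLE table2 (x INTEGER, y INTEGER);
    INSERT INTO table1 VALUES (1, 30), (2, 10), (3, 20);
    INSERT INTO table2 VALUES (7, 3), (8, 1), (9, 2);
""")

# Each branch wraps its ORDER BY / LIMIT inside a derived table.
wrapped_sql = """
    SELECT x FROM (SELECT x FROM table1 ORDER BY y LIMIT 1) AS anon_1
    UNION ALL
    SELECT x FROM (SELECT x FROM table2 ORDER BY y LIMIT 2) AS anon_2
"""

# Sort in Python because UNION ALL itself promises no row order.
rows = sorted(row[0] for row in conn.execute(wrapped_sql))
print(rows)
```

The first branch contributes one row and the second contributes two, so each LIMIT applies per branch rather than to the combined result.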

How to use subquery in django?

I want to get a list of the latest purchase of each customer, sorted by the date.
The following query does what I want except for the date:
(Purchase.objects
.all()
.distinct('customer')
.order_by('customer', '-date'))
It produces a query like:
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
I am forced to use customer_id as the first ORDER BY expression because of DISTINCT ON.
I want to sort by the date, so what the query I really need should look like this:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
)
AS result
ORDER BY date DESC;
I don't want to sort in Python because I still have to paginate the query. There can be tens of thousands of rows in the database.
In fact, the sorting is currently done in Python and is causing very long page load times, which is why I'm trying to fix this.
Basically I want something like this https://stackoverflow.com/a/9796104/242969. Is it possible to express it with django querysets instead of writing raw SQL?
The actual models and methods are several pages long, but here is the set of models required for the queryset above.
class Customer(models.Model):
    user = models.OneToOneField(User)

class Purchase(models.Model):
    customer = models.ForeignKey(Customer)
    date = models.DateField(auto_now_add=True)
    item = models.CharField(max_length=255)
If I have data like:
Customer A -
Purchase(item=Chair, date=January),
Purchase(item=Table, date=February)
Customer B -
Purchase(item=Speakers, date=January),
Purchase(item=Monitor, date=May)
Customer C -
Purchase(item=Laptop, date=March),
Purchase(item=Printer, date=April)
I want to be able to extract the following:
Purchase(item=Monitor, date=May)
Purchase(item=Printer, date=April)
Purchase(item=Table, date=February)
There is at most one purchase in the list per customer. The purchase is each customer's latest. It is sorted by latest date.
This query will be able to extract that:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
)
AS result
ORDER BY date DESC;
I'm trying to find a way not to have to use raw SQL to achieve this result.
This may not be exactly what you're looking for, but it might get you closer. Take a look at Django's annotate.
Here is an example of something that may help:
from django.db.models import Max
Customer.objects.all().annotate(most_recent_purchase=Max('purchase__date'))
This will give you a list of your customer models each one of which will have a new attribute called "most_recent_purchase" and will contain the date on which they made their last purchase. The sql produced looks like this:
SELECT "demo_customer"."id",
"demo_customer"."user_id",
MAX("demo_purchase"."date") AS "most_recent_purchase"
FROM "demo_customer"
LEFT OUTER JOIN "demo_purchase" ON ("demo_customer"."id" = "demo_purchase"."customer_id")
GROUP BY "demo_customer"."id",
"demo_customer"."user_id"
Another option, would be adding a property to your customer model that would look something like this:
@property
def latest_purchase(self):
    return self.purchase_set.order_by('-date')[0]
You would obviously need to handle the case where there aren't any purchases in this property, and this would potentially not perform very well (since you would be running one query for each customer to get their latest purchase).
I've used both of these techniques in the past and they've both worked fine in different situations. I hope this helps. Best of luck!
Whenever there is a difficult query to write using the Django ORM, I first try the query in psql (or whatever client you use). The SQL that you want is not this:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id" "shop_purchase.id" "shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC, "shop_purchase.date" DESC;
) AS result
ORDER BY date DESC;
In the above SQL, the inner SQL is looking for distinct on a combination of (customer_id, id, and date) and since id will be unique for all, you will get all records from the table. I am assuming id is the primary key as per convention.
If you need to find the last purchase of every customer, you need to do something like:
SELECT shop_purchase.customer_id, max(shop_purchase.date)
FROM shop_purchase
GROUP BY 1
But the problem with the above query is that it will give you only the customer ID and date. Using that will not help you in finding the records when you use these results in a subquery.
To use IN you need a list of unique parameters to identify a record, e.g., id
If in your records id is a serial key, then you can leverage the fact that the latest date will be the maximum id as well. So your SQL becomes:
SELECT max(shop_purchase.id)
FROM shop_purchase
GROUP BY shop_purchase.customer_id;
Note that I kept only one field (id) in the selected clause to use it in a subquery using IN.
The complete SQL will now be:
SELECT *
FROM shop_purchase
WHERE shop_purchase.id IN
(SELECT max(shop_purchase.id)
FROM shop_purchase
GROUP BY shop_purchase.customer_id);
and using the Django ORM it looks like:
(Purchase.objects.filter(
id__in=Purchase.objects
.values('customer_id')
.annotate(latest=Max('id'))
.values_list('latest', flat=True)))
Hope it helps!
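As a quick check of the max-id approach against the sample data from the question, here is a sketch using Python's sqlite3 module. It assumes that ids grow with purchase date, as the answer does, and it invents concrete dates because the question only gives months. The outer ORDER BY date DESC is an addition to get the ordering the question asks for; on the queryset side that would presumably be an .order_by('-date') on the outer Purchase query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shop_purchase (
        id INTEGER PRIMARY KEY, customer_id TEXT, item TEXT, date TEXT
    );
    -- Rows are inserted in date order so that ids increase with date.
    INSERT INTO shop_purchase (customer_id, item, date) VALUES
        ('A', 'Chair',   '2020-01-01'), ('B', 'Speakers', '2020-01-01'),
        ('A', 'Table',   '2020-02-01'), ('C', 'Laptop',   '2020-03-01'),
        ('C', 'Printer', '2020-04-01'), ('B', 'Monitor',  '2020-05-01');
""")

sql = """
    SELECT item FROM shop_purchase
    WHERE id IN (SELECT MAX(id) FROM shop_purchase GROUP BY customer_id)
    ORDER BY date DESC
"""
latest_items = [row[0] for row in conn.execute(sql)]
print(latest_items)
```

This yields Monitor, Printer, Table, which matches the list the question wants. If ids do not follow date order, the max-id shortcut no longer picks the latest purchase.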
I have a similar situation and this is how I'm planning to go about it:
query = Purchase.objects.distinct('customer').order_by('customer').query
query = 'SELECT * FROM ({}) AS result ORDER BY sent DESC'.format(query)
return Purchase.objects.raw(query)
Upside it gives me the query I want. Downside is that it is raw query and I can't append any other queryset filters.
This is my approach when I need some subset of data (N items) along with the Django query. This example uses PostgreSQL and the handy json_build_object() function (Postgres 9.4+), but you can use other aggregate functions in other database systems in the same way. For older PostgreSQL versions you can use a combination of the array_agg() and array_to_string() functions.
Imagine you have Article and Comment models and along with every article in the list you want to select 3 recent comments (change LIMIT 3 to adjust size of subset or ORDER BY c.id DESC to change sorting of subset).
qs = Article.objects.all()
qs = qs.extra(select = {
'recent_comments': """
SELECT
json_build_object('comments',
array_agg(
json_build_object('id', id, 'user_id', user_id, 'body', body)
)
)
FROM (
SELECT
c.id,
c.user_id,
c.body
FROM app_comment c
WHERE c.article_id = app_article.id
ORDER BY c.id DESC
LIMIT 3
) sub
"""
})
for article in qs:
    print(article.recent_comments)
# Output:
# {u'comments': [{u'user_id': 1, u'id': 3, u'body': u'foo'}, {u'user_id': 1, u'id': 2, u'body': u'bar'}, {u'user_id': 1, u'id': 1, u'body': u'joe'}]}
# ....

subquery in join with doctrine dql

I want to use DQL to create a query which looks like this in SQL:
select
e.*
from
e
inner join (
select
uuid, max(locale) as locale
from
e
where
locale = 'nl_NL' or
locale = 'nl'
group by
uuid
) as e_ on e.uuid = e_.uuid and e.locale = e_.locale
I tried to use QueryBuilder to generate the query and subquery. I think they do the right thing by themselves, but I can't combine them in the join statement. Does anybody know if this is possible with DQL? I can't use native SQL because I want to return real objects and I don't know for which object this query is run (I only know the base class, which has the uuid and locale properties).
$subQueryBuilder = $this->_em->createQueryBuilder();
$subQueryBuilder
->addSelect('e.uuid, max(e.locale) as locale')
->from($this->_entityName, 'e')
->where($subQueryBuilder->expr()->in('e.locale', $localeCriteria))
->groupBy('e.uuid');
$queryBuilder = $this->_em->createQueryBuilder();
$queryBuilder
->addSelect('e')
->from($this->_entityName, 'e')
->join('('.$subQueryBuilder.') as', 'e_')
->where('e.uuid = e_.uuid')
->andWhere('e.locale = e_.locale');
You cannot put a subquery in the FROM clause of your DQL.
I will assume that your PK is {uuid, locale}, as per our discussion on IRC. Since you also have two different columns in your query, this can become ugly.
What you can do is put it into the WHERE clause:
select
e
from
MyEntity e
WHERE
e.uuid IN (
select
e2.uuid
from
MyEntity e2
where
e2.locale IN (:selectedLocales)
group by
e2.uuid
)
AND e.locale IN (
select
max(e3.locale) as locale
from
MyEntity e3
where
e3.locale IN (:selectedLocales)
group by
e3.uuid
)
Please note that I used a comparison against a (non-empty) array of locales that you bind to :selectedLocales. This is to avoid destroying the query cache if you want to match against additional locales.
I also wouldn't suggest building this with the query builder if there's no real advantage in doing so since it will just make it simpler to break the query cache if you add conditionals dynamically (also, it's 3 query builders involved!)

Django: Distinct foreign keys

class Log:
    project = ForeignKey(Project)
    msg = CharField(...)
    date = DateField(...)
I want to select the four most recent Log entries where each Log entry must have a unique project foreign key. I've tried the solutions from a Google search, but none of them works, and the Django documentation isn't very good for lookups.
I tried stuff like:
Log.objects.all().distinct('project')[:4]
Log.objects.values('project').distinct()[:4]
Log.objects.values_list('project').distinct('project')[:4]
But these either return nothing or Log entries of the same project.
Any help would be appreciated!
Queries don't work like that - either in Django's ORM or in the underlying SQL. If you want to get unique IDs, you can only query for the ID. So you'll need to do two queries to get the actual Log entries. Something like:
id_list = Log.objects.order_by('-date').values_list('project_id').distinct()[:4]
entries = Log.objects.filter(id__in=id_list)
Actually, you can get the project_ids in SQL. Assuming that you want the unique project ids for the four projects with the latest log entries, the SQL would look like this:
SELECT project_id, max(logs.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4;
Now, you actually want all of the log information. In PostgreSQL 8.4 and later you can use windowing functions, but that doesn't work on other versions/databases, so I'll do it the more complex way:
SELECT logs.*
FROM logs JOIN (
SELECT project_id, max(logs.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4 ) as latest
ON logs.project_id = latest.project_id
AND logs.date = latest.max_date;
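The join query above can be tried on a small table. This is a sketch with Python's sqlite3 module, assuming a logs table with invented rows for five projects. The outer ORDER BY logs.date DESC is an addition so the result comes back newest first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (
        id INTEGER PRIMARY KEY, project_id INTEGER, msg TEXT, date TEXT
    );
    -- Hypothetical data: five projects, two of them with an older entry.
    INSERT INTO logs (project_id, msg, date) VALUES
        (1, 'p1 old', '2020-01-01'), (1, 'p1 new', '2020-01-09'),
        (2, 'p2 new', '2020-01-08'), (3, 'p3 new', '2020-01-07'),
        (4, 'p4 new', '2020-01-06'), (5, 'p5 new', '2020-01-05'),
        (5, 'p5 old', '2020-01-02');
""")

sql = """
    SELECT logs.msg
    FROM logs JOIN (
        SELECT project_id, MAX(date) AS max_date
        FROM logs
        GROUP BY project_id
        ORDER BY max_date DESC LIMIT 4   -- four most recently active projects
    ) AS latest
    ON logs.project_id = latest.project_id
    AND logs.date = latest.max_date
    ORDER BY logs.date DESC
"""
messages = [row[0] for row in conn.execute(sql)]
print(messages)
```

Project 5 is dropped because its latest entry is older than those of the other four, and each remaining project contributes exactly one row. Two entries sharing a project's latest date would both be returned.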
Now, if you have access to windowing functions, it's a bit neater (I think anyway), and certainly faster to execute:
SELECT * FROM (
SELECT logs.field1, logs.field2, logs.field3, logs.date,
rank() over ( partition by project_id
order by "date" DESC ) as dateorder
FROM logs ) as logsort
WHERE dateorder = 1
ORDER BY logsort.date DESC LIMIT 4;
OK, maybe it's not easier to understand, but take my word for it, it runs worlds faster on a large database.
I'm not entirely sure how that translates to object syntax, though, or even if it does. Also, if you wanted to get other project data, you'd need to join against the projects table.
I know this is an old post, but in Django 2.0, I think you could just use:
Log.objects.values('project').distinct().order_by('project')[:4]
You need two querysets. The good thing is it still results in a single trip to the database (though there is a subquery involved).
latest_ids_per_project = Log.objects.values_list(
'project').annotate(latest=Max('date')).order_by(
'-latest').values_list('project')
log_objects = Log.objects.filter(
id__in=latest_ids_per_project[:4]).order_by('-date')
This looks a bit convoluted, but it actually results in a surprisingly compact query:
SELECT "log"."id",
"log"."project_id",
"log"."msg",
"log"."date"
FROM "log"
WHERE "log"."id" IN
(SELECT U0."id"
FROM "log" U0
GROUP BY U0."project_id"
ORDER BY MAX(U0."date") DESC
LIMIT 4)
ORDER BY "log"."date" DESC