Django queryset on related field - django

Django is making a query much more complicated than it needs to be.
A Sentiment may have a User and a Card, and I am trying to get the Cards which are not in the passed User's Sentiments.
This is the query:
Card.objects.all().exclude(sentiments__in=user.sentiments.all())
This is what Django runs:
SELECT * FROM "cards_card"
WHERE NOT ("cards_card"."id" IN (
SELECT V1."card_id" AS "card_id"
FROM "sentiments_sentiment" V1
WHERE V1."id" IN (
SELECT U0."id"
FROM "sentiments_sentiment" U0
WHERE U0."user_id" = 1
)
)
)
This is a version I came up with which didn't do an N-Times full table scan:
Card.objects.raw('''
SELECT DISTINCT "id"
FROM "cards_card"
WHERE NOT "id" IN (
SELECT "card_id"
FROM "sentiments_sentiment"
WHERE "user_id" = %s
)
''', [user_id])
I don't know why Django has to do it with the N-Times scan. I've been scouring the web for answers, but nothing so far. Any suggestions on how to keep the performance but not have to fall back to raw SQL?

A better way of writing this query without the subqueries would be:
Card.objects.all().exclude(sentiments__user__id=user.id)
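To check that a single subquery selects the same cards as the nested one Django generated, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the SQL in the question; the sample rows are made up, and SQLite only stands in for whatever database the project really uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cards_card (id INTEGER PRIMARY KEY);
    CREATE TABLE sentiments_sentiment (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        card_id INTEGER NOT NULL REFERENCES cards_card(id)
    );
    INSERT INTO cards_card (id) VALUES (1), (2), (3), (4);
    -- hypothetical data: user 1 rated cards 1 and 3, user 2 rated cards 1 and 2
    INSERT INTO sentiments_sentiment (id, user_id, card_id) VALUES
        (1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 2, 2);
""")

# Shape of the query Django produced for exclude(sentiments__in=...)
NESTED = """
    SELECT id FROM cards_card
    WHERE NOT (id IN (
        SELECT V1.card_id FROM sentiments_sentiment V1
        WHERE V1.id IN (
            SELECT U0.id FROM sentiments_sentiment U0 WHERE U0.user_id = ?
        )
    ))
    ORDER BY id
"""

# Shape of the hand-written query: one subquery filtered on user_id
SINGLE = """
    SELECT id FROM cards_card
    WHERE id NOT IN (
        SELECT card_id FROM sentiments_sentiment WHERE user_id = ?
    )
    ORDER BY id
"""


def card_ids(sql, user_id):
    """Run one of the queries above and return the matching card ids."""
    return [row[0] for row in conn.execute(sql, (user_id,))]


nested_result = card_ids(NESTED, 1)
single_result = card_ids(SINGLE, 1)
print(nested_result, single_result)
```

Both queries return the cards user 1 has no sentiment on, so the only difference is the extra level of nesting.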

Related

Django ORM and GROUP BY

Newcomer to Django here.
I'm currently trying to fetch some data from my model with a query that would need a GROUP BY in SQL.
Here is my simplified model:
from django.contrib.gis.db import models  # GeoDjango models, needed for PointField
class Message(models.Model):
    mmsi = models.CharField(max_length=16)
    time = models.DateTimeField()
    point = models.PointField(geography=True)
I'm basically trying to get the last Message from every distinct mmsi number.
In SQL, that would translate to something like this:
select a.* from core_message a
inner join
(select mmsi, max(time) as time from core_message group by mmsi) b
on a.mmsi=b.mmsi and a.time=b.time;
After some tries, I managed to have something working similarly with Django ORM:
>>> mf=Message.objects.values('mmsi').annotate(Max('time'))
>>> Message.objects.filter(mmsi__in=mf.values('mmsi'),time__in=mf.values('time__max'))
That works, but I find my Django solution quite clumsy. Not sure it's the proper way to do it.
The underlying query looks like this:
>>> print(Message.objects.filter(mmsi__in=mf.values('mmsi'),time__in=mf.values('time__max')).query)
SELECT "core_message"."id", "core_message"."mmsi", "core_message"."time", "core_message"."point"::bytea FROM "core_message" WHERE ("core_message"."mmsi" IN (SELECT U0."mmsi" FROM "core_message" U0 GROUP BY U0."mmsi") AND "core_message"."time" IN (SELECT MAX(U0."time") AS "time__max" FROM "core_message" U0 GROUP BY U0."mmsi"))
I'd appreciate if you could propose a better solution for this problem.
Thanks !
You only need something like this:
Message.objects.all().distinct('mmsi').values('mmsi', 'time').order_by('mmsi','-id')
or like this:
Message.objects.all().values('mmsi').annotate(date_last=Max('time'))
Note: Django translates the latter into this SQL query:
SELECT "message"."mmsi", MAX("message"."time") AS "date_last" FROM "message" GROUP BY "message"."mmsi", "message"."time" ORDER BY "message"."time" DESC
Using the answers and comments, I managed to solve this using a subquery or a simple distinct order by.
Simple distinct order by solution inspired by @Oriphiel's answer:
Message.objects.distinct('mmsi').order_by('mmsi','-time')
The underlying SQL query looks like this :
SELECT DISTINCT ON ("core_message"."mmsi") "core_message"."id", "core_message"."mmsi", "core_message"."time", "core_message"."point"::bytea
FROM "core_message"
ORDER BY "core_message"."mmsi" ASC, "core_message"."time" DESC
Simple and straightforward.
Subquery solution inspired by @DanielRoseman's comment:
from django.db.models import OuterRef, Subquery
time_order=Message.objects.filter(mmsi=OuterRef('mmsi')).order_by('-time')
Message.objects.filter(id__in=Subquery(time_order.values('id')[:1]))
The underlying SQL query looks like this :
SELECT "core_message"."id", "core_message"."mmsi", "core_message"."time", "core_message"."point"::bytea
FROM "core_message"
WHERE "core_message"."id" IN
(SELECT U0."id" FROM "core_message" U0 WHERE U0."mmsi" = ("core_message"."mmsi") ORDER BY U0."time" DESC LIMIT 1)
A tad more complex, but it gives more flexibility. If I wanted to get the five latest messages for every MMSI, I'd just need to change the LIMIT value. In Django, it would look like this:
Message.objects.filter(id__in=Subquery(time_order.values('id')[:5]))
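For anyone who wants to try the "latest N per MMSI" subquery outside Django, here is a minimal sketch with Python's built-in sqlite3 module. It assumes a cut-down core_message table (the point column is left out) and made-up rows; the SQL has the same shape as the query above, with the LIMIT passed as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE core_message (
        id INTEGER PRIMARY KEY,
        mmsi TEXT NOT NULL,
        time TEXT NOT NULL
    );
    -- hypothetical sample rows
    INSERT INTO core_message (id, mmsi, time) VALUES
        (1, 'A', '2020-01-01 10:00'),
        (2, 'A', '2020-01-01 12:00'),
        (3, 'A', '2020-01-01 11:00'),
        (4, 'B', '2020-01-01 09:00'),
        (5, 'B', '2020-01-01 08:00');
""")

# Correlated subquery: for each outer row, look up the newest N ids
# that share its mmsi, and keep the row only if its id is among them.
LATEST_N = """
    SELECT m.id FROM core_message m
    WHERE m.id IN (
        SELECT u.id FROM core_message u
        WHERE u.mmsi = m.mmsi
        ORDER BY u.time DESC
        LIMIT ?
    )
    ORDER BY m.mmsi, m.time DESC
"""


def latest_ids(limit):
    """Return message ids of the newest `limit` messages per mmsi."""
    return [row[0] for row in conn.execute(LATEST_N, (limit,))]


latest_one = latest_ids(1)
latest_two = latest_ids(2)
print(latest_one, latest_two)
```

Changing the limit from 1 to 2 is the only difference between the two calls, which is the flexibility described above.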

How to enforce Django to use "JOIN VALUES"

I'm having a performance problem where I need to replace a section of my query statement. Right now I have the following:
select count(*) FROM "mytable" WHERE "field" IN ('v1', 'v2', ..., 'vN');
this can be translated to Django ORM:
Mytable.objects.all().filter(field__in=[myvalues]).count()
I need to do the following though:
select count(*) FROM "mytable" JOIN (values ('v1'), ('v2'), ..., ('vN')) as lookup(value) on lookup.value = "mytable".field;
Is there a way to add this to the ORM? I need to do it with the ORM because I already have other filters. Worst case, I thought of getting the query string and adding it there manually...
I'm using PostgreSQL 9.6
I found a way after reading the documentation over and over. I even found a patch for this that was never merged.
It doesn't really do the join, but it works much faster than a plain __in.
What I'm doing is wrapping the query in a RawSQL() expression (available since Django 1.8) and feeding its result to __in again.
So here is a code example:
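from django.db.models.expressions import RawSQL
query = """select myfield from mytable join (values
('v1'), ('v2'), ..., ('vN')
) as lookup(value) on lookup.value = mytable.myfield"""
r = RawSQL(query, [])
# mymodel: assumed to be the existing queryset (or Model.objects) with the other filters
mymodel.filter(myfield__in=r)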
Now it takes milliseconds instead of minutes!
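If the values come from user input, it is probably safer not to paste them into the SQL string. Here is a minimal sketch of a helper that builds the same join with one placeholder per value. The function name build_values_lookup_sql is made up for illustration, and the %s placeholder style assumes the string will be handed to Django (for example as RawSQL(sql, params)).

```python
def build_values_lookup_sql(table, column, values):
    """Return (sql, params) for a 'join (values ...)' lookup query.

    `table` and `column` are interpolated as-is, so they must be trusted
    identifiers; only the values are passed as parameters.
    """
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    # One "(%s)" row per value, so every value becomes its own VALUES row
    rows = ", ".join(["(%s)"] * len(values))
    sql = (
        f"select {column} from {table} "
        f"join (values {rows}) as lookup(value) "
        f"on lookup.value = {table}.{column}"
    )
    return sql, values


sql, params = build_values_lookup_sql("mytable", "myfield", ["v1", "v2", "v3"])
print(sql)
print(params)
```

The resulting pair could then replace the hand-written query and the empty parameter list in the RawSQL call above. I have not benchmarked whether the placeholders change the speed-up.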

Sort posts by newest child's timestamp or own timestamp

I have models in my django app that have a post/reply relationship and am trying to sort the posts by the time of their latest reply OR, if there are no replies, their own timestamp. This is what I have now:
threads = ConversationThread.objects.extra(select={'sort_date':
"""select case when (select count(*) from conversation_conversationpost
where conversation_conversationpost.thread_id = conversation_conversationthread.id) > 0
then (select max(conversation_conversationpost.post_date)
from conversation_conversationpost where conversation_conversationpost.thread_id = conversation_conversationthread.id)
else conversation_conversationthread.post_date end"""}).order_by('-sort_date')
Though it works, I have a hunch that this isn't the most succinct or efficient way to do this. What would be a better way?
SELECT *,
(
SELECT COALESCE(MAX(cp.post_date), ct.post_date)
FROM conversation_conversationpost cp
WHERE cp.thread_id = ct.id
) AS sort_date
FROM conversation_conversationthread ct
ORDER BY
sort_date DESC
Correlated subqueries are notoriously slow.
A LEFT JOIN might be substantially faster (depending on data distribution):
SELECT t.*, COALESCE(p.max_post_date, t.post_date) AS sort_date
FROM conversation_conversationthread t
LEFT JOIN (
SELECT thread_id, MAX(post_date) AS max_post_date
FROM conversation_conversationpost
GROUP BY thread_id
) p ON p.thread_id = t.id
ORDER BY sort_date DESC;
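A minimal sketch to compare the two queries side by side, using Python's built-in sqlite3 module. The tables are cut down to the columns the queries touch and the rows are made up; it only shows that both forms give the same ordering, not which one is faster on a real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE conversation_conversationthread (
        id INTEGER PRIMARY KEY,
        post_date TEXT NOT NULL
    );
    CREATE TABLE conversation_conversationpost (
        id INTEGER PRIMARY KEY,
        thread_id INTEGER NOT NULL,
        post_date TEXT NOT NULL
    );
    -- hypothetical data: thread 2 has no replies
    INSERT INTO conversation_conversationthread (id, post_date) VALUES
        (1, '2021-01-01'), (2, '2021-01-03'), (3, '2021-01-02');
    INSERT INTO conversation_conversationpost (id, thread_id, post_date) VALUES
        (1, 1, '2021-01-05'), (2, 3, '2021-01-04'), (3, 3, '2021-01-02');
""")

# Correlated subquery: newest reply date, or the thread's own date
CORRELATED = """
    SELECT ct.id,
           (SELECT COALESCE(MAX(cp.post_date), ct.post_date)
            FROM conversation_conversationpost cp
            WHERE cp.thread_id = ct.id) AS sort_date
    FROM conversation_conversationthread ct
    ORDER BY sort_date DESC
"""

# LEFT JOIN against a pre-aggregated "newest reply per thread" table
LEFT_JOIN = """
    SELECT t.id, COALESCE(p.max_post_date, t.post_date) AS sort_date
    FROM conversation_conversationthread t
    LEFT JOIN (
        SELECT thread_id, MAX(post_date) AS max_post_date
        FROM conversation_conversationpost
        GROUP BY thread_id
    ) p ON p.thread_id = t.id
    ORDER BY sort_date DESC
"""

correlated_rows = conn.execute(CORRELATED).fetchall()
left_join_rows = conn.execute(LEFT_JOIN).fetchall()
print(correlated_rows)
print(left_join_rows)
```

Thread 2 has no replies, so it is sorted by its own post_date in both versions.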

Doctrine 2 edit DQL in entity

I have several database tables with 2 primary keys, id and date. I do not update the records but instead insert a new record with the updated information. This new record has the same id and the date field is NOW(). I will use a product table to explain my question.
I want to be able to request the product details at a specific date. I therefore use the following subquery in DQL, which works fine:
WHERE p.date = (
SELECT MAX(pp.date)
FROM Entity\Product pp
WHERE pp.id = p.id
AND pp.date < :date
)
This product table has some referenced tables, like category. This category table has the same id and date primary key combination. I want to be able to request the product details and the category details at a specific date. I therefore expanded the DQL as shown above to the following, which also works fine:
JOIN p.category c
WHERE p.date = (
SELECT MAX(pp.date)
FROM Entity\Product pp
WHERE pp.id = p.id
AND pp.date < :date
)
AND c.date = (
SELECT MAX(cc.date)
FROM Entity\ProductCategory cc
WHERE cc.id = c.id
AND cc.date < :date
)
However, as you can see, if I have multiple referenced tables I will have to copy the same piece of DQL. I want to somehow add these subqueries to the entities so that every time an entity is called it adds this subquery.
I have thought of adding this in a __construct($date) or some kind of setUp($date) method, but I'm kind of stuck here. Also, would it help to add @Id to Entity\Product::date?
I hope someone can help me. I do not expect a complete solution, one step in a good direction would be very much appreciated.
I think I've found my solution. The trick was (first, to update to Doctrine 2.2 and) using a filter:
namespace Filter;
use Doctrine\ORM\Mapping\ClassMetadata,
Doctrine\ORM\Query\Filter\SQLFilter;
class VersionFilter extends SQLFilter {
public function addFilterConstraint(ClassMetadata $targetEntity, $targetTableAlias) {
$return = $targetTableAlias . '.date = (
SELECT MAX(sub.date)
FROM ' . $targetEntity->table['name'] . ' sub
WHERE sub.id = ' . $targetTableAlias . '.id
AND sub.date < ' . $this->getParameter('date') . '
)';
return $return;
}
}
Add the filter to the configuration:
$configuration->addFilter("version", "Filter\VersionFilter");
And enable it in my repository:
$this->_em->getFilters()->enable("version")->setParameter('date', $date);

Django: Distinct foreign keys

class Log:
    project = ForeignKey(Project)
    msg = CharField(...)
    date = DateField(...)
I want to select the four most recent Log entries, where each Log entry must have a unique project foreign key. I've tried the solutions I found through Google, but none of them work, and the Django documentation isn't very good on this kind of lookup.
I tried stuff like:
Log.objects.all().distinct('project')[:4]
Log.objects.values('project').distinct()[:4]
Log.objects.values_list('project').distinct('project')[:4]
But these either return nothing or Log entries of the same project.
Any help would be appreciated!
Queries don't work like that - either in Django's ORM or in the underlying SQL. If you want to get unique IDs, you can only query for the ID. So you'll need to do two queries to get the actual Log entries. Something like:
id_list = Log.objects.order_by('-date').values_list('project_id').distinct()[:4]
entries = Log.objects.filter(id__in=id_list)
Actually, you can get the project_ids in SQL. Assuming that you want the unique project ids for the four projects with the latest log entries, the SQL would look like this:
SELECT project_id, max(logs.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4;
Now, you actually want all of the log information. In PostgreSQL 8.4 and later you can use windowing functions, but that doesn't work on other versions/databases, so I'll do it the more complex way:
SELECT logs.*
FROM logs JOIN (
SELECT project_id, max(logs.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4 ) as latest
ON logs.project_id = latest.project_id
AND logs.date = latest.max_date;
Now, if you have access to windowing functions, it's a bit neater (I think anyway), and certainly faster to execute:
SELECT * FROM (
SELECT logs.field1, logs.field2, logs.field3, logs.date,
rank() over ( partition by project_id
order by "date" DESC ) as dateorder
FROM logs ) as logsort
WHERE dateorder = 1
ORDER BY logsort.date DESC LIMIT 4;
OK, maybe it's not easier to understand, but take my word for it, it runs worlds faster on a large database.
I'm not entirely sure how that translates to object syntax, though, or even if it does. Also, if you wanted to get other project data, you'd need to join against the projects table.
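To make the join version easy to try, here is a minimal sketch with Python's built-in sqlite3 module. The logs table and its rows are made up for illustration, and an outer ORDER BY is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        msg TEXT NOT NULL,
        "date" TEXT NOT NULL
    );
    -- hypothetical data: five projects, some with several entries
    INSERT INTO logs (id, project_id, msg, "date") VALUES
        (1, 1, 'p1 old', '2021-01-01'),
        (2, 1, 'p1 new', '2021-01-10'),
        (3, 2, 'p2',     '2021-01-09'),
        (4, 3, 'p3 old', '2021-01-02'),
        (5, 3, 'p3 new', '2021-01-08'),
        (6, 4, 'p4',     '2021-01-07'),
        (7, 5, 'p5',     '2021-01-03');
""")

# Inner query: the four projects with the newest log entry.
# Outer join: fetch the full log row matching each project's newest date.
LATEST_PER_PROJECT = """
    SELECT logs.project_id, logs.msg, logs."date"
    FROM logs JOIN (
        SELECT project_id, max("date") AS max_date
        FROM logs
        GROUP BY project_id
        ORDER BY max_date DESC LIMIT 4
    ) AS latest
    ON logs.project_id = latest.project_id
    AND logs."date" = latest.max_date
    ORDER BY logs."date" DESC
"""

latest_rows = conn.execute(LATEST_PER_PROJECT).fetchall()
for row in latest_rows:
    print(row)
```

Project 5 is dropped because its newest entry is older than those of the other four projects. Note that if a project had two entries with the same newest date, this join would return both of them.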
I know this is an old post, but in Django 2.0, I think you could just use:
Log.objects.values('project').distinct().order_by('project')[:4]
You need two querysets. The good thing is it still results in a single trip to the database (though there is a subquery involved).
latest_ids_per_project = Log.objects.values_list(
'project').annotate(latest=Max('date')).order_by(
'-latest').values_list('project')
log_objects = Log.objects.filter(
id__in=latest_ids_per_project[:4]).order_by('-date')
This looks a bit convoluted, but it actually results in a surprisingly compact query:
SELECT "log"."id",
"log"."project_id",
"log"."msg",
"log"."date"
FROM "log"
WHERE "log"."id" IN
(SELECT U0."id"
FROM "log" U0
GROUP BY U0."project_id"
ORDER BY MAX(U0."date") DESC
LIMIT 4)
ORDER BY "log"."date" DESC