Django ORM subquery with window function


I'm trying to do this query with Django's ORM:
SELECT
    id,
    pn,
    revision,
    description
FROM (SELECT
          id,
          pn,
          revision,
          MAX(revision) OVER (PARTITION BY pn) max_rev,
          description
      FROM table) maxarts
WHERE revision = max_rev
The result needs to be a queryset. I have tried every combination of Window/OuterRef/Subquery I know, with no success.
Do I have to use a raw query?
Thanks in advance,
Marco
EDIT #1:
I'll try to explain better. I have a model that looks like this:
class Article(models.Model):
    pn = models.CharField()
    revision = models.CharField()
    description = models.CharField()
    class Meta:
        unique_together = [("pn", "revision")]
The data is something like:
pn1 rev1 description
pn1 rev2 description
pn2 rev1 anotherdescription
pn1 rev3 description
pn2 rev2 anotherdescription
I need a queryset containing only the rows with the Max("revision") value for each pn. The revision increments every time a user makes a modification to the object.
I hope that is clearer now. Thanks!
EDIT #2
As suggested, I'm writing down what I've already tried:
Raw SQL using the query from the first message, selecting only the id field and passing it to the ORM as id__in=ids. Slow as hell, unusable.
Declaring a Window function to use as a filter:
Article.objects.annotate(
    max_rev=Window(expression=Max("revision"), partition_by=F("pn"))
).filter(revision=F("max_rev"))
But Django complained that I cannot use a window function in a WHERE clause (that's correct).
Then I tried to use the window as a subquery:
window_query = Article.objects.annotate(
    max_rev=Window(expression=Max("revision"), partition_by=F("pn"))
)
result = Article.objects.filter(revision=Subquery(window_query))
I also tried OuterRef, to use the max_rev annotation as a join, with no luck.
I'm out of ideas!

I think you can get what you are after, not much different from what you had, by using FirstValue rather than Max:
>>> from django.db.models import F, Subquery, Window
>>> from django.db.models.functions import FirstValue
>>> window_query = Article.objects.annotate(max_id=Window(
...     expression=FirstValue("id"),
...     partition_by=F("pn"),
...     order_by=F("revision").desc()
... )).values("max_id")
>>> list(Article.objects.filter(id__in=Subquery(window_query)))
[<Article: Article object (4)>, <Article: Article object (5)>]
This produces SQL like: SELECT * FROM articles_article WHERE id IN (SELECT FIRST_VALUE(id) OVER (PARTITION BY pn ORDER BY revision DESC) AS max_id FROM articles_article).
The subquery says order your window by revision descending, partitioned by pn, and take the first ID from each partition; then we use that in the parent query to fetch the relevant Articles for those IDs.
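To see what that SQL returns without setting up a Django project, here is a minimal sketch that runs the same query shape through Python's built-in sqlite3 module. The table name `article` and the rows are assumptions taken from the sample data in the question, and it needs an SQLite build with window function support (3.25 or newer).

```python
import sqlite3

# Hypothetical stand-in for the articles_article table, filled with the
# sample rows from the question (ids are assigned in insertion order).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article ("
    "id INTEGER PRIMARY KEY, pn TEXT, revision TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO article (pn, revision, description) VALUES (?, ?, ?)",
    [
        ("pn1", "rev1", "description"),
        ("pn1", "rev2", "description"),
        ("pn2", "rev1", "anotherdescription"),
        ("pn1", "rev3", "description"),
        ("pn2", "rev2", "anotherdescription"),
    ],
)

# The window function lives in the subquery; the outer WHERE only sees ids.
latest = conn.execute(
    """
    SELECT id, pn, revision
    FROM article
    WHERE id IN (
        SELECT FIRST_VALUE(id) OVER (PARTITION BY pn ORDER BY revision DESC)
        FROM article
    )
    ORDER BY id
    """
).fetchall()
print(latest)  # [(4, 'pn1', 'rev3'), (5, 'pn2', 'rev2')]
```

Because revision is a text column, the ordering is lexicographic, so a value like "rev10" would sort before "rev2".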
On PostgreSQL, you could also do:
>>> Article.objects.order_by('pn', '-revision').distinct('pn')
<QuerySet [<Article: Article object (4)>, <Article: Article object (5)>]>
This produces SQL like SELECT DISTINCT ON (pn) * FROM articles_article ORDER BY pn ASC, revision DESC.

So every time a revision is made to an article, a new row is created in the table?
If so, all you would need to do is perform a count query that counts all rows and groups them according to the 'pn' field. If you want to use the Max function, then I would suggest replacing the 'pn' field with an IntegerField or DecimalField rather than using a CharField. Although depending on where your application is at, that might be pretty difficult.
from django.db.models import Count
Article.objects.values('pn').annotate(maxvalues=Count('pn'))

Related

Get information from a model using unrelated field

I have these two models:
class A(models.Model):
    name = models.CharField(max_length=10)
class D(models.Model):
    code = models.IntegerField()
The code field can hold a number that exists in model A, but it can't be a real relation due to other factors. What I want now is to list the items from A whose value is the same as code.
items=D.objects.values('code__name')
would work if the models were related, but since they are not related and cannot be, how can I handle that?

You can use Subquery() expressions in Django 1.11 or newer.
from django.db.models import OuterRef, Subquery
code_subquery = A.objects.filter(id=OuterRef('code'))
qs = D.objects.annotate(code_name=Subquery(code_subquery.values('name')))
The output of qs is a queryset of objects D with an added field code_name.
Footnotes:
It is compiled to very similar SQL (like Bear Brown's solution with the "extra" method, but without the disadvantages of that solution, see there):
SELECT app_d.id, app_d.code,
       (SELECT U0.name FROM app_a U0 WHERE U0.id = (app_d.code)) AS code_name
FROM app_d
If dictionary output is required, it can be converted with .values() at the end. It can work like a left join: if the pseudo-related field allows null (code = models.IntegerField(null=True)), the D objects are not restricted and the resulting code_name value can be None. A limitation of Subquery is that it returns only one field, so the expression must be repeated for each additional field. (That is similar to extra(select={...: "SELECT ..."}), but thanks to the object syntax it can be customized more readably than explicit SQL.)
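The left-join-like behaviour can be checked directly against the SQL shown above. The following is only a sketch using Python's built-in sqlite3 module rather than Django; the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE app_a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE app_d (id INTEGER PRIMARY KEY, code INTEGER);
    INSERT INTO app_a (id, name) VALUES (1, 'alpha'), (2, 'beta');
    -- code 99 matches no row in app_a, and the last code is NULL
    INSERT INTO app_d (id, code) VALUES (10, 1), (11, 2), (12, 99), (13, NULL);
    """
)

# Same shape as the compiled query: one correlated scalar subquery per row.
rows = conn.execute(
    """
    SELECT app_d.id, app_d.code,
           (SELECT U0.name FROM app_a U0 WHERE U0.id = app_d.code) AS code_name
    FROM app_d
    ORDER BY app_d.id
    """
).fetchall()
print(rows)
```

Rows of app_d without a matching app_a row are kept, and their code_name comes back as None.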

You can use Django's extra(); replace YOUAPP with your real app name:
D.objects.extra(select={'a_name': 'select name from YOUAPP_a where id=code'}).values('a_name')
# Replace YOUAPP^^^^^

Django ORM: is it possible to inject subqueries?

I have a Django model that looks something like this:
class Result(models.Model):
    date = models.DateTimeField()
    subject = models.ForeignKey('myapp.Subject')
    test_type = models.ForeignKey('myapp.TestType')
    summary = models.PositiveSmallIntegerField()
    # more fields about the result like its location, tester ID and so on
Sometimes we want to retrieve all the test results, other times we only want the most recent result of a particular test type for each subject. This answer has some great options for SQL that will find the most recent result.
Also, we sometimes want to bucket the results into different chunks of time so that we can graph the number of results per day / week / month.
We also want to filter on various fields, and for elegance I'd like a QuerySet that I can then make all the filter() calls on, and annotate for the counts, rather than making raw SQL calls.
I have got this far:
qs = Result.objects.extra(select={
    'date_range': "date_trunc('{0}', time)".format("day"),  # Chunking into time buckets
    'rn': "ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)"})
qs = qs.values('date_range', 'result_summary', 'rn')
qs = qs.order_by('-date_range')
which results in the following SQL:
SELECT (ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)) AS "rn", (date_trunc('day', time)) AS "date_range", "myapp_result"."result_summary" FROM "myapp_result" ORDER BY "date_range" DESC
which is kind of approaching what I'd like, but now I need to somehow filter to only get the rows where rn = 1. I tried using the 'where' field in extra(), which gives me the following SQL and error:
SELECT (ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)) AS "rn", (date_trunc('day', time)) AS "date_range", "myapp_result"."result_summary" FROM "myapp_result" WHERE "rn"=1 ORDER BY "date_range" DESC ;
ERROR: column "rn" does not exist
So I think the query that finds "rn" needs to be a subquery - but is it possible to do that somehow, perhaps using extra()?
I know I could do this with raw SQL but it just looks ugly! I'd love to find a nice neat way where I have a filterable QuerySet.
I guess the other option is to have a field in the model that indicates whether it is actually the most recent result of that test type for that subject...

I've found a way!
qs = Result.objects.extra(where=["NOT EXISTS(SELECT * FROM myapp_result as T2 WHERE (T2.test_type_id = myapp_result.test_type_id AND T2.subject_id = myapp_result.subject_id AND T2.time > myapp_result.time))"])
This is based on a different option from the answer I referenced earlier. I can filter or annotate qs with whatever I want.
As an aside, on the way to this solution I tried this:
qq = Result.objects.extra(where=["NOT EXISTS(SELECT * FROM myapp_result as T2 WHERE (T2.test_type_id = myapp_result.test_type_id AND T2.subject_id = myapp_result.subject_id AND T2.time > myapp_result.time))"])
qs = Result.objects.filter(id__in=qq)
Django embeds the subquery just as you want it to:
SELECT ...some fields... FROM "myapp_result"
WHERE ("myapp_result"."id" IN (SELECT "myapp_result"."id" FROM "myapp_result"
WHERE (NOT EXISTS(SELECT * FROM myapp_result as T2
WHERE (T2.subject_id = myapp_result.subject_id AND T2.test_type_id = myapp_result.test_type_id AND T2.time > myapp_result.time)))))
I realised this had more subqueries than I need, but I note it here as I can imagine it being useful to know that you can filter one queryset with another and Django does exactly what you'd hope for in terms of embedding the subquery (rather than, say, executing it and embedding the returned values, which would be horrid.)
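For readers who want to check the NOT EXISTS condition in isolation, here is a small sketch using Python's built-in sqlite3 module instead of Django. The table layout mirrors the columns used in the query above; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE myapp_result (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER,
        test_type_id INTEGER,
        time TEXT
    );
    INSERT INTO myapp_result (id, subject_id, test_type_id, time) VALUES
        (1, 1, 1, '2020-01-01'),
        (2, 1, 1, '2020-02-01'),
        (3, 1, 2, '2020-01-15'),
        (4, 2, 1, '2020-01-10'),
        (5, 2, 1, '2020-01-05');
    """
)

# A row is kept only if no later row exists for the same subject and test type.
latest_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id FROM myapp_result
        WHERE NOT EXISTS (
            SELECT 1 FROM myapp_result AS T2
            WHERE T2.test_type_id = myapp_result.test_type_id
              AND T2.subject_id = myapp_result.subject_id
              AND T2.time > myapp_result.time
        )
        ORDER BY id
        """
    )
]
print(latest_ids)  # [2, 3, 4]
```

If two rows of the same group shared the latest time, both would be returned, since neither has a strictly later sibling.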

Django ORM: Retrieving posts and latest comments without performing N+1 queries

I have a very standard, basic social application -- with status updates (i.e., posts), and multiple comments per post.
Given the following simplified models, is it possible, using Django's ORM, to efficiently retrieve all posts and the latest two comments associated with each post, without performing N+1 queries? (That is, without performing a separate query to get the latest comments for each post on the page.)
class Post(models.Model):
    title = models.CharField(max_length=255)
    text = models.TextField()
class Comment(models.Model):
    text = models.TextField()
    post = models.ForeignKey(Post, related_name='comments')
    class Meta:
        ordering = ['-pk']
Post.objects.prefetch_related('comments').all() fetches all posts and comments, but I'd like to retrieve a limited number of comments per post only.
UPDATE:
I understand that, if this can be done at all using Django's ORM, it probably must be done with some version of prefetch_related. Multiple queries are totally okay, as long as I avoid making N+1 queries per page.
What is the typical/recommended way of handling this problem in Django?
UPDATE 2:
There seems to be no direct and easy way to do this efficiently with a simple query using the Django ORM. There are a number of helpful solutions/approaches/workarounds in the answers below, including:
Caching the latest comment IDs in the database
Performing a raw SQL query
Retrieving all comment IDs and doing the grouping and "joining" in python
Limiting your application to displaying the latest comment only
I didn't know which one to mark as correct because I haven't gotten a chance to experiment with all of these methods yet -- but I awarded the bounty to hynekcer for presenting a number of options.
UPDATE 3:
I ended up using @user1583799's solution.

If you're using Django 1.7 the new Prefetch objects—allowing you to customize the prefetch queryset—could prove helpful.
Unfortunately I can't think of a simple way to do exactly what you're asking. If you're on PostgreSQL and are willing to get just the latest comment for each post, the following should work in two queries:
from django.db.models import Prefetch
comments = Comment.objects.order_by('post_id', '-id').distinct('post_id')
posts = Post.objects.prefetch_related(Prefetch('comments',
                                               queryset=comments,
                                               to_attr='latest_comments'))
for post in posts:
    latest_comment = post.latest_comments[0] if post.latest_comments else None
Another variation: if your comments had a timestamp and you wanted to limit the comments to the most recent ones by date, that would look something like:
comments = Comment.objects.filter(timestamp__gt=one_day_ago)
...and then as above. Of course, you could still post-process the resulting list to limit the display to a maximum of two comments.

This solution is optimized for memory, since you consider that important. It needs three queries: the first fetches the posts, the second fetches only (post_id, pk) tuples of the comments, and the third fetches the details of the filtered latest comments.
from itertools import groupby, islice
posts = Post.objects.filter(...some your filter...)
# comments sorted by post and then by id (or by date), newest first
all_comments = (Comment.objects.filter(post__in=posts)
                .values_list('post_id', 'pk')
                .order_by('post_id', '-pk'))
last_comments = []
# The queryset is evaluated now. With iterator(), only a chunk of rows is in
# memory at once during iteration.
for post_id, related_comments in groupby(all_comments.iterator(), lambda x: x[0]):
    # keep the pks of at most the two newest comments per post
    last_comments.extend(pk for _, pk in islice(related_comments, 2))
results = {}
for comment in Comment.objects.filter(pk__in=last_comments):
    results.setdefault(comment.post_id, []).append(comment)
# output
for post in posts:
    print(post.title, [x.text for x in results.get(post.id, [])])
But I think that for many database backends it will be faster to combine the second and third queries into one and ask for all comment fields immediately. Unneeded comments are then discarded right away.
The fastest solution would use nested queries. The algorithm is like the one above, but everything is done in raw SQL. It is limited to some backends, such as PostgreSQL.
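The grouping step above can be tried without a database. This is a minimal sketch in which the (post_id, pk) pairs are made-up values standing in for what the second query would return.

```python
from itertools import groupby, islice
from operator import itemgetter

# Hypothetical (post_id, comment_pk) pairs, already ordered by post_id and
# then by descending pk, as the second query is expected to return them.
pairs = [(1, 9), (1, 7), (1, 4), (3, 8), (5, 6), (5, 5), (5, 2)]

latest_pks = []
for post_id, group in groupby(pairs, key=itemgetter(0)):
    # Take at most two pairs from each post's run and keep only the pk.
    latest_pks.extend(pk for _, pk in islice(group, 2))

print(latest_pks)  # [9, 7, 8, 6, 5]
```

groupby only merges adjacent items, which is why the input has to be ordered by post_id first.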
EDIT
I agree that this is not useful for you:
... prefetch loads into memory thousands of comments, 99% of which will not be shown.
That is why I wrote that relatively complicated solution, in which 99% of them are read sequentially without being held in memory.
EDIT
All examples assume that you want post_id in [1, 3, 5] (anything selected earlier by categories etc.).
In all cases, create an index for Comment on the fields ['post', 'pk'].
A) Nested query for PostgreSQL
SELECT post_id, id, text FROM
(SELECT post_id, id, text, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment WHERE post_id in (1, 3, 5)) sub
WHERE rank <= 2
ORDER BY post_id, id
Or explicitly require less memory, if we don't trust the optimizer. The two inner selects should read data only from the index, which is much less data than reading the table:
SELECT post_id, id, text FROM app_comment WHERE id IN
(SELECT id FROM
(SELECT id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment WHERE post_id in (1, 3, 5)) sub
WHERE rank <= 2)
ORDER BY post_id, id
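The first nested query can be exercised on a toy table. This sketch uses Python's built-in sqlite3 module instead of PostgreSQL, so the window column gets an explicit alias (`rnk`, a name chosen here) because SQLite does not name it `rank` automatically. The rows are invented, and an SQLite build with window functions (3.25 or newer) is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_comment (id INTEGER PRIMARY KEY, post_id INTEGER, text TEXT)"
)
conn.executemany(
    "INSERT INTO app_comment (id, post_id, text) VALUES (?, ?, ?)",
    [
        (1, 1, "a"), (2, 1, "b"), (3, 3, "c"), (4, 1, "d"),
        (5, 2, "e"), (6, 5, "f"), (7, 5, "g"), (8, 5, "h"),
    ],
)

# Rank comments inside each post, newest first, and keep the top two.
top_two = conn.execute(
    """
    SELECT post_id, id, text FROM
        (SELECT post_id, id, text,
                rank() OVER (PARTITION BY post_id ORDER BY id DESC) AS rnk
         FROM app_comment WHERE post_id IN (1, 3, 5)) sub
    WHERE rnk <= 2
    ORDER BY post_id, id
    """
).fetchall()
print(top_two)
```

Post 1 keeps comments 2 and 4, post 3 keeps its only comment, post 5 keeps 7 and 8, and post 2 is filtered out by the IN list.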
B) With a cached ID of the oldest displayed comment
Add a field "oldest_displayed" to Post:
class Post(models.Model):
    oldest_displayed = models.IntegerField()
Then filter comments by pk for the posts of interest (those you selected earlier by categories etc.):
import pprint
from django.db.models import F
qs = Comment.objects.filter(
    post__pk__in=[1, 3, 5],
    post__oldest_displayed__lte=F('pk')
).order_by('post_id', 'pk')
pprint.pprint([(x.post_id, x.pk) for x in qs])
Hmm, very nice... and how is it compiled by Django?
>>> print(qs.query.get_compiler('default').as_sql()[0]) # added white space
SELECT "app_comment"."id", "app_comment"."text", "app_comment"."post_id"
FROM "app_comment"
INNER JOIN "app_post" ON ( "app_comment"."post_id" = "app_post"."id" )
WHERE ("app_comment"."post_id" IN (%s, %s, %s)
AND "app_post"."oldest_displayed" <= ("app_comment"."id"))
ORDER BY "app_comment"."post_id" ASC, "app_comment"."id" ASC
Prepare all "oldest_displayed" values initially with one nested SQL statement (first setting zero, which remains for posts with fewer than two comments):
UPDATE app_post SET oldest_displayed = 0;
UPDATE app_post SET oldest_displayed = qq.id FROM
(SELECT post_id, id FROM
(SELECT post_id, id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment ) sub
WHERE rank = 2) qq
WHERE qq.post_id = app_post.id;
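The logic of variant B can be illustrated in plain Python, without a database. This is only a sketch: the (post_id, comment_id) pairs are invented, and the dictionary stands in for the oldest_displayed column.

```python
from collections import defaultdict

# Hypothetical (post_id, comment_id) pairs standing in for app_comment rows.
comments = [(1, 1), (1, 2), (1, 4), (3, 3), (5, 6), (5, 7), (5, 8)]

ids_by_post = defaultdict(list)
for post_id, comment_id in comments:
    ids_by_post[post_id].append(comment_id)

# oldest_displayed is the second-newest comment id, or 0 when a post has
# fewer than two comments (so that every one of its comments passes).
oldest_displayed = {
    post_id: sorted(ids, reverse=True)[1] if len(ids) >= 2 else 0
    for post_id, ids in ids_by_post.items()
}

# Same condition as post__oldest_displayed__lte=F('pk'), ordered by post, pk.
displayed = sorted(
    (post_id, comment_id)
    for post_id, comment_id in comments
    if oldest_displayed[post_id] <= comment_id
)
print(oldest_displayed)  # {1: 2, 3: 0, 5: 7}
print(displayed)         # [(1, 2), (1, 4), (3, 3), (5, 7), (5, 8)]
```

The cached value has to be refreshed whenever a comment is added or deleted, otherwise more than two comments per post will pass the filter.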

prefetch_related('comments') will fetch all comments of the posts.
I had the same problem, and the database is Postgresql. I found a way:
Add an extra field, related_replies. Note that the field type is ArrayField, which is supported in the Django 1.8 development version. I copied the code into my project (on Django 1.7) and only had to change 2 lines for it to work. (Alternatively, use djorm-pg-array.)
class Post(models.Model):
    related_replies = ArrayField(models.IntegerField(), size=10, null=True)
And use two queries:
from itertools import chain
posts = models.Post.objects.filter()
# related_replies is nullable, so treat None as an empty list
related_replies_id = chain(*[p.related_replies or [] for p in posts])
related_replies = models.Comment.objects.filter(
    id__in=related_replies_id).select_related('created_by')[::1]  # cache queryset
for p in posts:
    p.get_related_replies = [r for r in related_replies if r.post_id == p.id]
When a new comment arrives, update related_replies.
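The update step is not shown in the answer. One possible sketch of it follows, assuming the newest id is kept first and the list is capped at the field's size; `push_reply` is a hypothetical helper name, not part of Django.

```python
def push_reply(related_replies, comment_id, size=10):
    """Return a new list with comment_id first, trimmed to at most `size` ids.

    `related_replies` may be None because the field is declared null=True.
    """
    current = related_replies or []
    return ([comment_id] + current)[:size]


print(push_reply(None, 5))               # [5]
print(push_reply([3, 2, 1], 4, size=3))  # [4, 3, 2]
```

In a view or signal handler the result would be assigned back to post.related_replies and saved; where exactly to hook that in is left open here.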

Grouping Django model entries by day using its datetime field

I'm working with an Article-like model that has a DateTimeField(auto_now_add=True) to capture the publication date (pub_date). It looks something like the following:
class Article(models.Model):
    text = models.TextField()
    pub_date = models.DateTimeField(auto_now_add=True)
I want to do a query that counts how many article posts or entries have been added per day. In other words, I want to query the entries and group them by day (and eventually month, hour, second, etc.). This would look something like the following in the SQLite shell:
select pub_date, count(id) from "myapp_article"
where id = 1
group by strftime("%d", pub_date)
;
Which returns something like:
2012-03-07 18:08:57.456761|5
2012-03-08 18:08:57.456761|9
2012-03-09 18:08:57.456761|1
I can't seem to figure out how to get that result from a Django QuerySet. I am aware of how to get a similar result using itertools.groupby, but that isn't possible in this situation (explanation to follow).
The end result of this query will be used in a graph showing the number of posts per day. I'm attempting to use the Django Chartit package to achieve this goal. Chartit puts a constraint on the data source (DataPool). The source must be a Model, Manager, or QuerySet, so using itertools.groupby is not an option as far as I can tell.
So the question is... How do I group or aggregate the entries by day and end up with a QuerySet object?

Create an extra field that stores only the date (not the time) and annotate with Count:
from django.db.models import Count
Article.objects.extra({'published': "date(pub_date)"}).values('published').annotate(count=Count('id'))
Result will be:
published,count
2012-03-07,5
2012-03-08,9
2012-03-09,1
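The grouping that this queryset asks the database to do can be sketched with Python's built-in sqlite3 module. This is not the SQL Django generates verbatim, only the same idea; the table contents and the column alias `n` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myapp_article (id INTEGER PRIMARY KEY, text TEXT, pub_date TEXT)"
)
conn.executemany(
    "INSERT INTO myapp_article (text, pub_date) VALUES (?, ?)",
    [
        ("a", "2012-03-07 18:08:57"),
        ("b", "2012-03-07 21:30:00"),
        ("c", "2012-03-08 09:15:00"),
    ],
)

# date() drops the time part, so rows from the same day fall into one group.
per_day = conn.execute(
    """
    SELECT date(pub_date) AS published, COUNT(id) AS n
    FROM myapp_article
    GROUP BY published
    ORDER BY published
    """
).fetchall()
print(per_day)  # [('2012-03-07', 2), ('2012-03-08', 1)]
```

Grouping by month or hour follows the same pattern with a different truncation expression, for example strftime('%Y-%m', pub_date) in SQLite.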

What is the internal function in django to add new tables to a queryset in a sensible way?

In django 1.2:
I have a queryset with an extra parameter which refers to a table which is not currently included in the query django generates for this queryset.
If I add an order_by to the queryset which refers to the other table, django adds joins to the other table in the proper way and the extra works. But without the order_by, the extra parameter is failing. I could just add a useless secondary order_by to something in the other table, but I think there should be a better way to do it.
What is the django function to add joins in a sensible way? I know this must be getting called somewhere.
Here is some sample code. It selects all readings for a given user, and annotates the results with the rating (if any) given by another user stored in 'friend'.
class Book(models.Model):
    name = models.CharField(max_length=200)
    urlname = models.CharField(max_length=200)
    entrydate = models.DateTimeField(auto_now_add=True)
class Reading(models.Model):
    book = models.ForeignKey(Book, related_name='readings')
    user = models.ForeignKey(User)
    rating = models.IntegerField()
    entrydate = models.DateTimeField(auto_now_add=True)
readings = Reading.objects.filter(user=user).order_by('entrydate')
friendrating = '(select rating from proj_reading where user_id=%d and \
book_id=proj_book.id and rating in (1,2,3,4,5,6))' % friend.id
readings = readings.extra(select={'friendrating': friendrating})
At the moment, readings won't work because the join to readings is not set up correctly. However, if I add an order_by such as:
.order_by('entrydate','reading__entrydate')
Django magically knows to add an inner join through the foreign key and I get what I want.
Additional information:
print readings.query ==>
select ((select rating from proj_reading where user_id=2 and book_id=proj_book.id and rating in (1,2,3,4,5,6)) as 'hisrating', proj_reading.id, proj_reading.user_id, proj_reading.rating, proj_reading.entrydate from proj_reading where proj_reading.user_id=1;
assuming
user.id=1
friend.id=2
the error is:
OperationalError: Unknown column proj_book.id in 'where clause'
and it happens because the table proj_book is not included in the query. To restate what I said above - if I now do readings2=readings.order_by('book__entrydate') I can see the proper join is set up and the query works.
Ideally I'd just like to figure out what the name of the qs.query function is that looks at two tables and figures out how they are joined by foreign keys, and just call that manually.

Your generated query:
select ((select rating from proj_reading where user_id=2 and book_id=proj_book.id and rating in (1,2,3,4,5,6)) as 'hisrating', proj_reading.id, proj_reading.user_id, proj_reading.rating, proj_reading.entrydate from proj_reading where proj_reading.user_id=1;
The database has no way of knowing what proj_book refers to, since it is not included in the FROM tables or in an inner join.
You get the expected results when you add order_by because that order_by adds an inner join between proj_book and proj_reading.
As far as I understand, if you refer to any other column in Book, not just in order_by, you will get similar results.
Q1 = Reading.objects.filter(user=user).exclude(book__name='')  # exclude() forces Django to add the JOIN
Q2 = "select rating from proj_reading where user_id=%d" % user.id
result = Q1.extra(select={"foo": Q2})
This way, at step Q1, you force Django to add a join on the Book table, which it does not do by default unless you access a field of the Book table.
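The effect of the missing join can be reproduced outside Django. The sketch below uses Python's built-in sqlite3 module with invented rows; the table and column names follow the question, while the alias `fr` and the sample data are assumptions. SQLite reports the problem with a different message than the MySQL error quoted in the question, but the cause is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE proj_book (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE proj_reading (
        id INTEGER PRIMARY KEY, book_id INTEGER, user_id INTEGER, rating INTEGER
    );
    INSERT INTO proj_book (id, name) VALUES (1, 'Book A'), (2, 'Book B');
    INSERT INTO proj_reading (id, book_id, user_id, rating) VALUES
        (1, 1, 1, 4), (2, 2, 1, 5), (3, 1, 2, 3);
    """
)

# Correlated subquery that refers to proj_book, like the extra() select.
friend_rating = (
    "(SELECT rating FROM proj_reading fr "
    "WHERE fr.user_id = 2 AND fr.book_id = proj_book.id)"
)

# Without proj_book in the FROM clause the reference cannot be resolved.
try:
    conn.execute(
        f"SELECT {friend_rating} AS friendrating, r.id "
        "FROM proj_reading r WHERE r.user_id = 1"
    )
    failed = False
except sqlite3.OperationalError:
    failed = True

# With the join in place, the same subquery resolves for every row.
rows = conn.execute(
    f"SELECT {friend_rating} AS friendrating, r.id "
    "FROM proj_reading r "
    "INNER JOIN proj_book ON proj_book.id = r.book_id "
    "WHERE r.user_id = 1 ORDER BY r.id"
).fetchall()
print(failed, rows)
```

An alternative that avoids the join entirely is to correlate on the outer reading row's own book_id column instead of proj_book.id.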

You mean:
class SomeModel(models.Model):
    id = models.IntegerField()
    ...
class SomeOtherModel(models.Model):
    otherfield = models.ForeignKey(SomeModel)
qrst = SomeOtherModel.objects.filter(otherfield__id=1)
You can use "__" to create table joins.
EDIT:
It won't work because you do not define the table join correctly.
myrating = '(select rating from proj_reading inner join proj_book on (proj_book.id=proj_reading.book_id) where proj_reading.user_id=%d and rating in (1,2,3,4,5,6))' % user.id
This is pseudocode and it is not tested.
However, I advise you to use Django filters instead of writing SQL queries.
read = Reading.objects.filter(book__urlname__icontains="smith", user_id=user.id, rating__in=(1,2,3,4,5,6)).values('rating')
See the Django documentation for more details.
