Grouping Django model entries by day using its datetime field - django

I'm working with an Article-like model that has a DateTimeField(auto_now_add=True) to capture the publication date (pub_date). It looks something like the following:
class Article(models.Model):
    text = models.TextField()
    pub_date = models.DateTimeField(auto_now_add=True)
I want to do a query that counts how many article posts or entries have been added per day. In other words, I want to query the entries and group them by day (and eventually month, hour, second, etc.). This would look something like the following in the SQLite shell:
select pub_date, count(id) from "myapp_article"
group by strftime("%d", pub_date)
;
Which returns something like:
2012-03-07 18:08:57.456761|5
2012-03-08 18:08:57.456761|9
2012-03-09 18:08:57.456761|1
I can't seem to figure out how to get that result from a Django QuerySet. I am aware of how to get a similar result using itertools.groupby, but that isn't possible in this situation (explanation to follow).
The end result of this query will be used in a graph showing the number of posts per day. I'm attempting to use the Django Chartit package to achieve this goal. Chartit puts a constraint on the data source (DataPool). The source must be a Model, Manager, or QuerySet, so using itertools.groupby is not an option as far as I can tell.
So the question is... How do I group or aggregate the entries by day and end up with a QuerySet object?

Create an extra field that only stores the date (not the time) and annotate with Count:
from django.db.models import Count

Article.objects.extra(select={'published': 'date(pub_date)'}).values('published').annotate(count=Count('id'))
Result will be:
published,count
2012-03-07,5
2012-03-08,9
2012-03-09,1
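On Django 1.10 or newer (an assumption about your version), a sketch of the same grouping without .extra(), using TruncDate; it still returns a QuerySet, so it should keep Chartit's DataPool constraint satisfied:

from django.db.models import Count
from django.db.models.functions import TruncDate

# group by the date part of pub_date and count articles per day
Article.objects.annotate(published=TruncDate('pub_date')) \
    .values('published') \
    .annotate(count=Count('id')) \
    .order_by('published')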

Related

Django: Joining on fields other than IDs (Using a date field in one model to pull data from a second model)

I'm attempting to use Django to build a simple website. I have a set of blog posts that have a date field attached to indicate the day they were published. I have a table that contains a list of dates and temperatures. On each post, I would like to display the temperature on the day it was published.
The two models are as follows:
class Post(models.Model):
    title = models.CharField(max_length=200)
    text = models.TextField()
    date = models.DateField()

class Temperature(models.Model):
    date = models.DateField()
    temperature = models.IntegerField()
I would like to be able to reference the temperature field from the second table using the date field from the first. Is this possible?
In SQL, this is a simple query. I would do the following:
SELECT temperature FROM Temperature t JOIN Post p ON t.date = p.date
I think I really have two questions:
Is it possible to brute force this, even if it's not best practice? I've googled a lot and tried using raw SQL and objects.extra, but I can't get them to do what I want. I'm also wary of relying on them for the long haul.
Since this seems to be a simple task, it seems likely that I'm overcomplicating it by having my models set up sub-optimally. Is there something I'm missing about how I should design my models? That is, what's the best practice for doing something like this? (I've successfully pulled the temperature into my blog post by using a foreign key in the Temperature model. But if I go that route, I don't see how I could easily make sure that my temperature dates get the correct foreign key assigned to them so that the temperature date maps to the correct post date.)
There will likely be better answers than this one, but I'll throw in my 2¢ anyway.
You could try a property inside the Post model that returns the temperature:
@property
def temperature(self):
    try:
        return Temperature.objects.values_list('temperature', flat=True).get(date=self.date)
    except Temperature.DoesNotExist:
        return None
(code not tested)
About your Models:
If you will be displaying the temperature in a Post list (a list of Posts with their temperatures), then maybe it will be simpler to code and a faster query to just add a temperature field to your Post model.
You can keep the Temperature model. Then:
Assuming you have the temperature data already present in your Temperature model at the time of Post instance creation, you can fill that new field in a custom save method (see the sketch after these notes).
If you get the temperature data after Post creation, you can fill in that new temperature field through a background job (maybe triggered by crontab or similar).
Sometimes database normalization (not repeating info across tables) is not the best strategy. Just something to think about, depending on how often you will be querying the Post models and how simple you want to keep that query code.
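A minimal sketch of that denormalized approach, assuming you add a temperature field to Post; the field name and the null-handling are my own illustration, not from the original post:

class Post(models.Model):
    title = models.CharField(max_length=200)
    text = models.TextField()
    date = models.DateField()
    temperature = models.IntegerField(null=True, blank=True)  # denormalized copy

    def save(self, *args, **kwargs):
        # look up the temperature for this date once, at save time
        if self.temperature is None:
            self.temperature = (Temperature.objects
                                .filter(date=self.date)
                                .values_list('temperature', flat=True)
                                .first())
        super(Post, self).save(*args, **kwargs)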
I think this might be a basic approach to solve the problem:
post_dates = Post.objects.all().values('date')
result_temperature = Temperature.objects.filter(date__in=post_dates).values('temperature')
Subqueries could be your friend here. Something like the following should work:
from django.db.models import OuterRef, Subquery
temps = Temperature.objects.filter(date=OuterRef('date'))
posts = Post.objects.annotate(temperature=Subquery(temps.values('temperature')[:1]))
Then you can just iterate through posts and access the temperature off each post instance:
for post in posts:
    temperature = post.temperature

Django sum values of a column after 'group by' in another column

I found some solutions here and in the django documentation, but I could not manage to make one query work the way I wanted.
I have the following model:
class Inventory(models.Model):
    blindid = models.CharField(max_length=20)
    massug = models.IntegerField()
I want to count the number of entries per blindid and then sum massug after grouping.
My current Django ORM query:
samples = Inventory.objects.values('blindid', 'massug').annotate(aliquots=Count('blindid'), total=Sum('massug'))
It's not counting correctly (it shows only one), and thus it's not summing correctly either. It seems it is only getting the first result... I tried Count('blindid', distinct=True) and Count('blindid', distinct=False) as well.
This is the query result using samples.query. Django is grouping by the two columns...
SELECT "inventory"."blindid", "inventory"."massug", COUNT("inventory"."blindid") AS "aliquots", SUM("inventory"."massug") AS "total" FROM "inventory" GROUP BY "inventory"."blindid", "inventory"."massug"
This should be the raw SQL:
SELECT blindid,
Count(blindid) AS aliquots,
Sum(massug) AS total
FROM inventory
GROUP BY blindid
Try this:
samples = Inventory.objects.values('blindid').annotate(aliquots=Count('blindid'), total=Sum('massug'))
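For reference, a sketch of the same query with its imports; keeping only blindid in values() is what makes it the sole GROUP BY column (the field names are taken from the question):

from django.db.models import Count, Sum

samples = (Inventory.objects
           .values('blindid')                     # GROUP BY blindid only
           .annotate(aliquots=Count('blindid'),   # rows per blindid
                     total=Sum('massug')))        # summed mass per group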

How to write a group by query

I have model named IssueFlags with columns:
id, created_at, flags_id, issue_id, comments
I want to get, for each unique issue_id, the latest created row along with its flags_id, created_at and comments.
In SQL it works like this:
SELECT created_at, flags_id, issue_id, comments
FROM Issues_issueflags
group by issue_id
How can I do the same in Django? I tried writing something in the shell, but there is no group by attribute.
IssueFlags.objects.order_by('-created_at')
The above only returns the ordered list of data.
Try doing this way:
from django.db.models import Count
IssueFlags.objects.values('created_at', 'flags_id', 'issue_id', 'comments').order_by('-created_at').annotate(total=Count('issue_id'))
I have written annotate(total=Count('issue_id')) assuming that you would have multiple entries per unique issue_id. (Note that you can do all the usual aggregations, like Sum, Count, Max and Avg, inside annotate().) There are already existing answers for doing group by in Django; also have a look at this link or this link. Also, read the Django documentation to get a clear idea of when to place values() before annotate() and when to place it after, and then apply that to your requirement.
Would be happy to help if you have any further doubts.
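If you're on Django 1.11 or newer, here is a hedged sketch (my own, not from the answer above) of getting exactly the latest row per issue_id with a correlated subquery:

from django.db.models import OuterRef, Subquery

# for each row, find the pk of the newest IssueFlags with the same issue_id,
# then keep only the rows that are that newest one
latest = (IssueFlags.objects
          .filter(issue_id=OuterRef('issue_id'))
          .order_by('-created_at'))
qs = (IssueFlags.objects
      .filter(pk=Subquery(latest.values('pk')[:1]))
      .values('created_at', 'flags_id', 'issue_id', 'comments'))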

Django ORM: Retrieving posts and latest comments without performing N+1 queries

I have a very standard, basic social application -- with status updates (i.e., posts), and multiple comments per post.
Given the following simplified models, is it possible, using Django's ORM, to efficiently retrieve all posts and the latest two comments associated with each post, without performing N+1 queries? (That is, without performing a separate query to get the latest comments for each post on the page.)
class Post(models.Model):
    title = models.CharField(max_length=255)
    text = models.TextField()

class Comment(models.Model):
    text = models.TextField()
    post = models.ForeignKey(Post, related_name='comments')

    class Meta:
        ordering = ['-pk']
Post.objects.prefetch_related('comments').all() fetches all posts and comments, but I'd like to retrieve a limited number of comments per post only.
UPDATE:
I understand that, if this can be done at all using Django's ORM, it probably must be done with some version of prefetch_related. Multiple queries are totally okay, as long as I avoid making N+1 queries per page.
What is the typical/recommended way of handling this problem in Django?
UPDATE 2:
There seems to be no direct and easy way to do this efficiently with a simple query using the Django ORM. There are a number of helpful solutions/approaches/workarounds in the answers below, including:
Caching the latest comment IDs in the database
Performing a raw SQL query
Retrieving all comment IDs and doing the grouping and "joining" in Python
Limiting your application to displaying the latest comment only
I didn't know which one to mark as correct because I haven't gotten a chance to experiment with all of these methods yet -- but I awarded the bounty to hynekcer for presenting a number of options.
UPDATE 3:
I ended up using @user1583799's solution.
If you're using Django 1.7 the new Prefetch objects—allowing you to customize the prefetch queryset—could prove helpful.
Unfortunately I can't think of a simple way to do exactly what you're asking. If you're on PostgreSQL and are willing to get just the latest comment for each post, the following should work in two queries:
from django.db.models import Prefetch

comments = Comment.objects.order_by('post_id', '-id').distinct('post_id')
posts = Post.objects.prefetch_related(
    Prefetch('comments', queryset=comments, to_attr='latest_comments'))

for post in posts:
    latest_comment = post.latest_comments[0] if post.latest_comments else None
Another variation: if your comments had a timestamp and you wanted to limit the comments to the most recent ones by date, that would look something like:
comments = Comment.objects.filter(timestamp__gt=one_day_ago)
...and then as above. Of course, you could still post-process the resulting list to limit the display to a maximum of two comments.
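A sketch of that timestamp variation, assuming Comment has a timestamp field (it isn't in the simplified model above) and combining it with the Prefetch approach; limiting to two comments then happens in Python:

from datetime import timedelta
from django.db.models import Prefetch
from django.utils import timezone

one_day_ago = timezone.now() - timedelta(days=1)
recent = Comment.objects.filter(timestamp__gt=one_day_ago).order_by('-timestamp')
posts = Post.objects.prefetch_related(
    Prefetch('comments', queryset=recent, to_attr='recent_comments'))

for post in posts:
    shown = post.recent_comments[:2]  # display at most two comments per post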
This solution is optimized for memory requirements, since you indicated that matters. It needs three queries. The first query asks for the posts, the second only for (id, post_id) tuples, and the third for the details of the filtered latest comments.
from itertools import groupby, islice

posts = Post.objects.filter(...)  # whatever filter you need

# second query: only (pk, post_id) tuples, sorted by post, newest comment first
all_comments = (Comment.objects.filter(post__in=posts)
                .values_list('pk', 'post_id')
                .order_by('post_id', '-pk'))

last_comment_ids = []
# the queryset is evaluated now; only chunks of roughly 100 items are held in
# memory at once during iteration
for post_id, related_comments in groupby(all_comments, lambda row: row[1]):
    last_comment_ids.extend(pk for pk, _ in islice(related_comments, 2))

# third query: full details of the filtered latest comments
results = {}
for comment in Comment.objects.filter(pk__in=last_comment_ids):
    results.setdefault(comment.post_id, []).append(comment)

# output
for post in posts:
    print(post.title, [c.text for c in results.get(post.id, [])])
But I think that for many database backends it will be faster to combine the second and the third query into one, asking immediately for all fields of the comments; unneeded comments are discarded immediately.
The fastest solution would be with nested queries. The algorithm is like the one above, but everything is done in raw SQL. It is limited to some backends such as PostgreSQL.
EDIT
I agree that this is not useful for you:
... prefetch loads into memory thousands of comments, 99% of which will not be shown.
and therefore I wrote that relatively complicated solution, in which 99% of the comments are read sequentially without being loaded into memory.
EDIT
All examples are for the condition that you want post_id in [1, 3, 5] (anything selected earlier by categories etc.).
In all cases, create an index on Comment over the fields ['post', 'pk'].
A) Nested query for PostgreSQL
SELECT post_id, id, text FROM
(SELECT post_id, id, text, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment WHERE post_id in (1, 3, 5)) sub
WHERE rank <= 2
ORDER BY post_id, id
Or, if we don't trust the optimizer, require less memory explicitly. The two inner selects should read data only from the index, which is much less data than reading the table:
SELECT post_id, id, text FROM app_comment WHERE id IN
(SELECT id FROM
(SELECT id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment WHERE post_id in (1, 3, 5)) sub
WHERE rank <= 2)
ORDER BY post_id, id
B) With a cached ID of the oldest displayed comment
Add field "oldest_displayed" to Post
class Post(models.Model):
    oldest_displayed = models.IntegerField()
Filter the comments by pk for the interesting posts (the ones you selected earlier by category etc.):
import pprint

from django.db.models import F

qs = Comment.objects.filter(
    post__pk__in=[1, 3, 5],
    post__oldest_displayed__lte=F('pk')
).order_by('post_id', 'pk')
pprint.pprint([(x.post_id, x.pk) for x in qs])
Hmm, very nice... and how is it compiled by Django?
>>> print(qs.query.get_compiler('default').as_sql()[0]) # added white space
SELECT "app_comment"."id", "app_comment"."text", "app_comment"."post_id"
FROM "app_comment"
INNER JOIN "app_post" ON ( "app_comment"."post_id" = "app_post"."id" )
WHERE ("app_comment"."post_id" IN (%s, %s, %s)
AND "app_post"."oldest_displayed" <= ("app_comment"."id"))
ORDER BY "app_comment"."post_id" ASC, "app_comment"."id" ASC
Prepare all "oldest_displayed" values with one nested SQL statement initially (and set zero for posts with fewer than two comments):
UPDATE app_post SET oldest_displayed = 0;
UPDATE app_post SET oldest_displayed = qq.id FROM
(SELECT post_id, id FROM
(SELECT post_id, id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment ) sub
WHERE rank = 2) qq
WHERE qq.post_id = app_post.id;
prefetch_related('comments') will fetch all comments of the posts.
I had the same problem, and the database is Postgresql. I found a way:
Add an extra field related_replies. Note that the field type is ArrayField, which is supported in Django 1.8 dev. I copied the code into my project (which runs Django 1.7), changed 2 lines, and it works (or use djorm-pg-array).
from django.contrib.postgres.fields import ArrayField

class Post(models.Model):
    related_replies = ArrayField(models.IntegerField(), size=10, null=True)
And use two queries:
from itertools import chain

posts = models.Post.objects.filter()
related_replies_id = chain(*[p.related_replies or [] for p in posts])
related_replies = models.Comment.objects.filter(
    id__in=related_replies_id).select_related('created_by')[::1]  # cache queryset

for p in posts:
    p.get_related_replies = [r for r in related_replies if r.post_id == p.id]
When a new comment comes in, update related_replies (a sketch follows below).
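A hedged sketch of that update step; the helper name is my own illustration, and the cap of 10 ids mirrors the size=10 field above:

def add_comment(post, text):
    # create the comment, then keep its id at the front of the cached array
    comment = Comment.objects.create(post=post, text=text)
    ids = post.related_replies or []
    post.related_replies = ([comment.pk] + ids)[:10]  # newest first, keep 10
    post.save(update_fields=['related_replies'])
    return comment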

django: time range based aggregate query

I have the following models, Art and ArtScore:
class Art(models.Model):
    title = models.CharField()

class ArtScore(models.Model):
    art = models.ForeignKey(Art)
    date = models.DateField(auto_now_add=True)
    amount = models.IntegerField()
Certain user actions results in an ArtScore entry, for instance whenever you click 'I like this art', I save a certain amount of ArtScore for that Art.
Now I'm trying to show a page for 'most popular this week', so I need a query aggregating only ArtScore amounts for that time range.
I built the below query but it's flawed...
popular = Art.objects.filter(
    artscore__date__range=(weekago, today)
).annotate(
    score=Sum('artscore__amount')
).order_by('-score')
... because it only excludes Art that doesn't have an ArtScore record in the date range, but does not exclude the ArtScore records outside the date range.
Any pointers how to accomplish this would be appreciated!
Thanks,
Martin
It looks like, according to this:
http://docs.djangoproject.com/en/dev/topics/db/aggregation/#order-of-annotate-and-filter-clauses
what you have should already do what you want. Is the documentation wrong? A bug, maybe?
"the second query will only include
good books in the annotated count."
with regard to:
>>> Publisher.objects.filter(book__rating__gt=3.0).annotate(num_books=Count('book'))
Translate this to your query (forget about the ordering for now):
Art.objects.filter(
    artscore__date__range=(weekago, today)
).annotate(
    score=Sum('artscore__amount')
)
Now we can say:
"the query will only include ArtScores in the date range in the annotated sum."
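A minimal runnable sketch of the full query as discussed, with its imports; weekago and today mirror the names in the question, and how they are computed here is my own assumption:

from datetime import date, timedelta
from django.db.models import Sum

today = date.today()
weekago = today - timedelta(days=7)

# the filter() before annotate() restricts which ArtScore rows feed the Sum
popular = (Art.objects
           .filter(artscore__date__range=(weekago, today))
           .annotate(score=Sum('artscore__amount'))
           .order_by('-score'))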