Django: Get latest N number of records per group

Let's say I have the following Django model:
class Team(models.Model):
    name = models.CharField(max_length=255)
    created_at = models.DateTimeField(auto_now_add=True)
I want to write a query to fetch the latest N number of records per team name.
If N=1, the query is easy (assuming I'm using Postgres, because it's the only DB that supports distinct(*fields)):
Team.objects.order_by("name", "-created_at").distinct("name")
If N is greater than 1 (let's say 3), then it gets tricky. How can I write this query in Django?

Not sure how you can get duplicate names per team if name has unique=True. But if you plan to remove that to support non-unique names, you can use a subquery like this:
from django.db.models import OuterRef

top_3_per_team_name = Team.objects.filter(
    name=OuterRef("name")
).order_by("-created_at")[:3]

Team.objects.filter(
    id__in=top_3_per_team_name.values("id")
)
This can be a bit slow, though, so make sure you have the right indexes set up; see the sketch below.
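As an illustration only (the exact index depends on your data and queries), a composite index matching the subquery's partitioning and ordering might look like this; the Meta block is an assumption, not part of the original model:

class Team(models.Model):
    name = models.CharField(max_length=255)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            # Matches the per-name, newest-first ordering used by the subquery.
            models.Index(fields=["name", "-created_at"]),
        ]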
Also note that ideally this could be solved with Window functions [Django-doc] using DenseRank [Django-doc], but unfortunately the latest Django version can't filter on window expressions:
from django.db.models import F
from django.db.models.expressions import Window
from django.db.models.functions import DenseRank

Team.objects.annotate(
    rank=Window(
        expression=DenseRank(),
        partition_by=[F('name')],
        order_by=F('created_at').desc(),
    ),
).filter(rank__in=range(1, 4))  # 4 is N + 1 for N = 3
With the above you get:
NotSupportedError: Window is disallowed in the filter clause.
But there is a plan to support this in Django 4.2, so theoretically the above should work once that is released.

I'm assuming you'll be getting your N from a GET request or something, but as long as you have a number you can try limiting your queryset:
Team.objects.order_by("name", "-created_at").distinct("name")[:3] # for N = 3

Related

Sum greatest values of each day from period with Django Query

My project has a model that goes like:
class Data(Model):
    data = FloatField(verbose_name='Data', null=True, blank=True)
    created_at = DateTimeField(verbose_name='Created at')
And my app creates a few hundred logs of this model per day.
I'm trying to sum only the greatest values of each day, without having to iterate over them (using only Django queries).
Is it possible without writing SQL queries?
PS: I'm able to get the greatest 'data' of each day, so the current logic iterates over days and sums the greatest value of each day. But that solution is becoming too slow and I'd like to solve it directly at the DB level.
Annotations and aggregates to the rescue:
from django.db.models import Sum, Max
from django.db.models.functions import Trunc
report = (Data.objects
          .annotate(day=Trunc('created_at', 'day'))
          .values('day')
          .annotate(greatest=Max('data'))
          .values('greatest')
          .aggregate(total=Sum('greatest')))
print(report['total'])
The resulting SQL is almost simpler than the code:
SELECT SUM("greatest")
FROM
(SELECT MAX("app_data"."data_id") AS "greatest"
FROM "app_data"
GROUP BY DATE_TRUNC('day', "app_data"."created_at")) subquery
If you are using a database backend that supports DISTINCT ON fields (like Postgres does), you can do:
Data.objects.order_by('created_at__date', '-data').distinct('created_at__date')
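That gives one row per day (the greatest data value of that day), but it still has to be summed somewhere. A hedged sketch of keeping the whole thing in the database, assuming the DISTINCT ON query above works for your setup, is to feed those rows' ids back into an aggregate:

from django.db.models import Sum

# One id per day: the row holding that day's greatest data value.
daily_max_ids = (Data.objects
                 .order_by('created_at__date', '-data')
                 .distinct('created_at__date')
                 .values('id'))
total = Data.objects.filter(id__in=daily_max_ids).aggregate(total=Sum('data'))['total']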

how does django query work?

My models are designed like so:
class Warehouse:
    name = ...
    sublocation = FK(Sublocation)

class Sublocation:
    name = ...
    city = FK(City)

class City:
    name = ...
    state = FK(State)
Now if I run this query:
wh = Warehouse.objects.values_list('name', 'sublocation__name',
                                   'sublocation__city__name').first()
it returns the correct result, but internally how many queries does it make? Is Django fetching the data in one request?
Django makes only one query to the database for getting the data you described.
When you do:
wh = Warehouse.objects.values_list(
    'name', 'sublocation__name', 'sublocation__city__name').first()
It translates into this query:
SELECT "myapp_warehouse"."name", "myapp_sublocation"."name", "myapp_city"."name"
FROM "myapp_warehouse" INNER JOIN "myapp_sublocation"
ON ("myapp_warehouse"."sublocation_id" = "myapp_sublocation"."id")
INNER JOIN "myapp_city" ON ("myapp_sublocation"."city_id" = "myapp_city"."id")'
It gets the result in a single query. You can count number of queries in your shell like this:
from django.db import connection as c, reset_queries as rq
In [42]: rq()
In [43]: len(c.queries)
Out[43]: 0
In [44]: wh = Warehouse.objects.values_list('name', 'sublocation__name', 'sublocation__city__name').first()
In [45]: len(c.queries)
Out[45]: 1
My suggestion would be to write a test for this using assertNumQueries (see the Django testing docs).
from django.test import TestCase
from yourproject.models import Warehouse

class TestQueries(TestCase):

    def test_query_num(self):
        """
        Assert the values_list query executes 1 database query.
        """
        values = ['name', 'sublocation__name', 'sublocation__city__name']
        with self.assertNumQueries(1):
            Warehouse.objects.values_list(*values).first()
FYI, I'm not sure how many queries are actually sent to the database; 1 is my current best guess. Adjust the expected number of queries until it passes in your project and pin the requirement.
There is extensive documentation on how and when querysets are evaluated in Django docs: QuerySet API Reference.
The pretty much standard way to get good insight into how many and which queries take place during a page render is to use the Django Debug Toolbar. It can tell you precisely how many times this recordset is evaluated.
You can use django-debug-toolbar to see the real queries sent to the DB.
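As a rough setup sketch (check the django-debug-toolbar docs for your version, since the exact wiring can differ):

# settings.py
INSTALLED_APPS += ['debug_toolbar']
MIDDLEWARE += ['debug_toolbar.middleware.DebugToolbarMiddleware']
INTERNAL_IPS = ['127.0.0.1']

# urls.py
from django.urls import include, path
urlpatterns += [path('__debug__/', include('debug_toolbar.urls'))]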

django most efficient way to count same field values in a query

Let's say I have a model that has lots of fields, but I only care about one CharField. That CharField can be anything, so I don't know the possible values, but I know that the values frequently overlap. So I could have 20 objects with "abc" and 10 objects with "xyz", or I could have 50 objects with "def" and 80 with "stu", and I have 40,000 with no overlap which I really don't care about.
How do I count the objects efficiently? What I would like returned is something like:
{'abc': 20, 'xyz': 10, 'other': 10000}
or something like that, without making a ton of SQL calls.
EDIT:
I don't know if anyone will see this since I am editing it kind of late, but...
I have this model:
class Action(models.Model):
    author = models.CharField(max_length=255)
    purl = models.CharField(max_length=255, null=True)
and from the answers, I have done this:
groups = Action.objects.filter(author='James').values('purl').annotate(count=Count('purl'))
but...
this is what groups is:
{"purl": "waka"},{"purl": "waka"},{"purl": "waka"},{"purl": "waka"},{"purl": "mora"},{"purl": "mora"},{"purl": "mora"},{"purl": "mora"},{"purl": "mora"},{"purl": "lora"}
(I just filled purl with dummy values)
what I want is
{'waka': 4, 'mora': 5, 'lora': 1}
Hopefully someone will see this edit...
EDIT 2:
Apparently my database (BigTable) does not support the aggregate functions of Django and this is why I have been having all the problems.
You want something similar to "count ... group by". You can do this with the aggregation features of Django's ORM:
from django.db.models import Count

fieldname = 'myCharField'
(MyModel.objects
    .values(fieldname)
    .order_by(fieldname)
    .annotate(the_count=Count(fieldname)))
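To turn that queryset into a plain {value: count} dict like the one in the question, you can build it in Python from the same single query (a sketch, reusing fieldname and Count from the snippet above; the_count is just the annotation name):

counts = {
    row[fieldname]: row['the_count']
    for row in MyModel.objects
                      .values(fieldname)
                      .order_by(fieldname)
                      .annotate(the_count=Count(fieldname))
}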
Previous questions on this subject:
How to query as GROUP BY in django?
Django equivalent of COUNT with GROUP BY
This is called aggregation, and Django supports it directly.
You can get your exact output by filtering the values you want to count, getting the list of values, and counting them, all in one set of database calls:
from django.db.models import Count

MyModel.objects.filter(myfield__in=('abc', 'xyz')) \
    .values('myfield').annotate(Count('myfield'))
You can use Django's Count aggregation on a queryset to accomplish this. Something like this:
from django.db.models import Count

queryset = MyModel.objects.all().annotate(count=Count('my_charfield'))
for each in queryset:
    print("%s: %s" % (each.my_charfield, each.count))
Unless your field value is always guaranteed to be in a specific case, it may be useful to transform it prior to performing a count, i.e. so 'apple' and 'Apple' would be treated as the same.
from django.db.models import Count
from django.db.models.functions import Lower

(MyModel.objects.annotate(lower_title=Lower('title'))
    .values('lower_title').annotate(num=Count('lower_title')).order_by('num'))

Can Django do nested queries and exclusions

I need some help putting together this query in Django. I've simplified the example here to just cut right to the point.
class MyModel(models.Model):
    created = models.DateTimeField()
    user = models.ForeignKey(User)
    data = models.BooleanField()
The query I'd like to create in English would sound like:
Give me every record that was created yesterday for which data is False where in that same range data never appears as True for the given user
Here's an example input/output in case that wasn't clear.
Table Values
ID  Created   User   Data
1   1/1/2010  admin  False
2   1/1/2010  joe    True
3   1/1/2010  admin  False
4   1/1/2010  joe    False
5   1/2/2010  joe    False
Output Queryset
1   1/1/2010  admin  False
3   1/1/2010  admin  False
What I'm looking to do is to exclude record #4. The reason for this is because in the given range "yesterday", data appears as True once for the user in record #2, therefore that would exclude record #4.
In a sense, it almost seems like there are 2 queries taking place. One to determine the records in the given range, and one to exclude records which intersect with the "True" records.
How can I do this query with the Django ORM?
You don't need a nested query. You can generate a list of bad users' PKs and then exclude records containing those PKs in the next query.
bad = list(set(MyModel.objects.filter(data=True).values_list('user', flat=True)))
# list(set(list_object)) will remove duplicates
# not needed but might save the DB some work

rs = MyModel.objects.filter(datequery).exclude(user__pk__in=bad)
# might not need the pk in user__pk__in - try it
You could condense that down into one line but I think that's as neat as you'll get. 2 queries isn't so bad.
Edit: You might want to read the docs on this:
http://docs.djangoproject.com/en/dev/ref/models/querysets/#in
It makes it sound like it auto-nests the query (so only one query fires in the database) if it's like this:
bad = MyModel.objects.filter(data=True).values('pk')
rs = MyModel.objects.filter(datequery).exclude(user__pk__in=bad)
But MySQL doesn't optimise this well so my code above (2 full queries) can actually end up running a lot faster.
Try both and race them!
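A hedged sketch of how you might race them from ./manage.py shell: wrap each variant in a function, then compare query counts with CaptureQueriesContext and wall-clock time with timeit (the queryset inside the lambda is a placeholder for either variant above):

import timeit
from django.db import connection
from django.test.utils import CaptureQueriesContext

def run_variant(queryset_fn):
    # Evaluate the queryset fully and report (row count, number of SQL statements run).
    with CaptureQueriesContext(connection) as ctx:
        results = list(queryset_fn())
    return len(results), len(ctx.captured_queries)

# run_variant(lambda: MyModel.objects.filter(...).exclude(user__pk__in=bad))
# timeit.timeit(lambda: run_variant(...), number=100)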
Looks like you could use:
from django.db.models import F

MyModel.objects.filter(datequery).filter(data=False).filter(data=F('data'))
F objects are available from version 1.0.
Please, test it, I'm not sure.
Thanks to lazy evaluation, you can break your query up into a few different variables to make it easier to read. Here is some ./manage.py shell play time in the style that Oli already presented.
> from django.db import connection
> connection.queries = []
> target_day_qs = MyModel.objects.filter(created='2010-1-1')
> bad_users = target_day_qs.filter(data=True).values('user')
> result = target_day_qs.exclude(user__in=bad_users)
> [r.id for r in result]
[1, 3]
> len(connection.queries)
1
You could also say result.select_related() if you wanted to pull in the user objects in the same query.

Django DB, finding Categories whose Items are all in a subset

I have two models:
class Category(models.Model):
    pass

class Item(models.Model):
    cat = models.ForeignKey(Category)
I am trying to return all Categories for which all of that category's items belong to a given subset of item ids. For example, all categories for which all of the items associated with that category have ids in the set [1, 3, 5].
How could this be done using Django's query syntax (as of 1.1 beta)? Ideally, all the work should be done in the database.
Category.objects.filter(item__id__in=[1, 3, 5])
Django creates the reverse relationship on the model without the foreign key. You can filter on it by using its related name (usually just the model name lowercased, but it can be manually overridden), two underscores, and the field name you want to query on.
Let's say you require all items to be in the following set:
allowable_items = set([1, 3, 4])
One brute-force solution would be to check the item_set for every category, like so:
categories_with_allowable_items = [
    category for category in Category.objects.all()
    if set([item.id for item in category.item_set.all()]) <= allowable_items
]
But we don't really have to check all categories, as categories_with_allowable_items is always going to be a subset of the categories related to the items with ids in allowable_items... so that's all we have to check (and this should be faster):
categories_with_allowable_items = set([
    item.cat for item in
    Item.objects.select_related('cat').filter(pk__in=allowable_items)
    if set([siblingitem.id for siblingitem in item.cat.item_set.all()]) <= allowable_items
])
If performance isn't really an issue, then the latter of these two (if not the former) should be fine. If these are very large tables, you might have to come up with a more sophisticated solution. Also, if you're using a particularly old version of Python, remember that you'll have to import the sets module.
I've played around with this a bit. If QuerySet.extra() accepted a "having" parameter I think it would be possible to do it in the ORM with a bit of raw SQL in the HAVING clause. But it doesn't, so I think you'd have to write the whole query in raw SQL if you want the database doing the work.
EDIT:
This is the query that gets you part way there:
from django.db.models import Count
Category.objects.annotate(num_items=Count('item')).filter(num_items=...)
The problem is that for the query to work, "..." needs to be a correlated subquery that looks up, for each category, the number of its items in allowed_items. If .extra had a "having" argument, you'd do it like this:
Category.objects.annotate(num_items=Count('item')).extra(having="num_items=(SELECT COUNT(*) FROM app_item WHERE app_item.id in % AND app_item.cat_id = app_category.id)", having_params=[allowed_item_ids])
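For completeness, a hedged pure-ORM sketch that keeps the work in the database on current Django versions: exclude every Category that has at least one Item outside the allowed set. (Note that categories with no items at all will also match; adjust if that's not what you want. allowed_ids is just an illustrative name.)

allowed_ids = [1, 3, 5]
disallowed_items = Item.objects.exclude(id__in=allowed_ids)
categories = Category.objects.exclude(item__in=disallowed_items)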