Spanning relationship to obtain field value with .annotate() and .values() clauses - django

I'm making some computations using values & annotate on queryset. Let's consider this model:
class Foo(models.Model):
    fk_bar = models.ForeignKey(to=Bar, ....)
    foo_val = models.IntegerField(...)
class Bar(models.Model):
    attr = models.CharField(...)
    val = models.IntegerField(...)
So I can do:
Foo.objects.all().values("fk_bar")
in order to group by the foreign relationship (some Foo might point to the same Bar).
Then I can do
Foo.objects.all().values("fk_bar").annotate(qte=Sum("foo_val"))
to get the sum of foo_val for all the objects with the same fk_bar, which yields something like:
{"fk_bar":<int>, "qte": <int>}
However, I want the resulting dictionary to also contain Bar.attr, e.g. something like:
Foo.objects.all().values("fk_bar").annotate(qte=Sum("foo_val")).annotate(bar_attr="fk_bar__attr")
To get something like:
{"fk_bar":<int>, "qte": <int>, "bar_attr":<str>}
However, that fails (TypeError: QuerySet.annotate() received non-expression(s)). Is there a way around this?

One option is to specify the additional values to "keep" from before the aggregation in the values() clause, like so:
Foo.objects.all().values("fk_bar", "fk_bar__attr").annotate(qte=Sum("foo_val"))
This yields roughly what was asked for.
In general, any field added to the .values() clause will be available in the output dictionary. Any "additional" fields (i.e. fields that are not raw queryset data) that need to be computed beforehand (i.e. before the aggregation) can be "created" with .annotate() (like qte=Sum(...) above) and then included in a subsequent .values("fk_bar", "qte", ...). A few examples:
Yields the sum for all the Foos with the same fk_bar:
Foo.objects.all().values("fk_bar").annotate(qte=Sum("foo_val"))
This groups the Foos by each unique (fk_bar, foo_val) combination, then sums the related fk_bar__val field within each group:
Foo.objects.all().values("fk_bar", "foo_val").annotate(qte=Sum("fk_bar__val"))
The thing to note is that the more fields are specified in .values(), the more rows the aggregation produces (i.e. one dictionary for each unique combination of values).
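To make the grouping behaviour concrete, here is a minimal plain-Python sketch of what .values("fk_bar", "fk_bar__attr").annotate(qte=Sum("foo_val")) computes. The rows and the helper name group_and_sum are made up for illustration; this only mimics the result, it is not how Django implements it. (Wrapping the lookup in an F() expression, i.e. .annotate(bar_attr=F("fk_bar__attr")), is another way to hand annotate() a field reference instead of a bare string.)

```python
from collections import defaultdict

# Hypothetical rows standing in for Foo joined to Bar:
# (fk_bar id, Bar.attr, foo_val)
rows = [
    (1, "red", 10),
    (1, "red", 5),
    (2, "blue", 7),
]


def group_and_sum(rows):
    """Group by (fk_bar, fk_bar__attr) and sum foo_val, one dict per group."""
    totals = defaultdict(int)
    for fk_bar, attr, foo_val in rows:
        totals[(fk_bar, attr)] += foo_val
    return [
        {"fk_bar": fk_bar, "fk_bar__attr": attr, "qte": qte}
        for (fk_bar, attr), qte in sorted(totals.items())
    ]


result = group_and_sum(rows)
```

Because attr is fully determined by fk_bar, adding it to the grouping key does not create extra groups; adding an unrelated field such as foo_val would.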

Related

AND query against foreign key table in django ORM

Given:
class Video(models.Model):
    tags = models.ManyToManyField(Tag)
class Tag(models.Model):
    name = models.CharField(max_length=20)
I know I can use Video.objects.filter(tags__name__in=['foo','bar']) to find all Videos that have either foo OR bar tags, but in order to find those that have foo AND bar, I'd have to join the foreign key twice (if I was handwriting the SQL). Is there a way to accomplish this in Django?
I've already tried .filter(Q(tags__name='foo') & Q(tags__name='bar')) but that just creates the impossible-to-satisfy query where a single Tag has both foo and bar as its name.
This is not as straightforward as it might look. Furthermore, JOINing the same table several times is typically not a good idea: imagine that your list contains ten elements. Are you going to JOIN ten times? That quickly becomes infeasible.
What we can do however, is count the overlap. So if we are given a list of elements, we first make sure those elements are unique:
tag_list = ['foo', 'bar']
tag_set = set(tag_list)
Next we count the number of tags of the Video that are actually in the set, and we then check if that number is the same as the number of elements in our set, like:
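from django.db.models import Count, Q
# Keep only the rows for wanted tags (or for videos with no tags at all),
# count them per video, and require the count to equal the set size.
Video.objects.filter(
    Q(tags__name__in=tag_set) | Q(tags__isnull=True)
).annotate(
    overlap=Count('tags')
).filter(
    overlap=len(tag_set)
)
Note that Q(tags__isnull=True) is there to include Videos without tags. This might look unnecessary, but if tag_list is empty we want to obtain all videos (since every video then has zero tags in common with the set), including untagged ones.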
We also make the assumption that the names of the Tags are unique, otherwise some tags might be counted twice.
Behind the curtains, we will perform a query like:
SELECT `video`.*, COUNT(`video_tag`.`tag_id`) AS overlap
FROM `video`
LEFT JOIN `video_tag` ON `video_tag`.`video_id` = `video`.`id`
LEFT JOIN `tag` ON `tag`.`id` = `video_tag`.`tag_id`
WHERE `tag`.`name` IN ('foo', 'bar')
GROUP BY `video`.`id`
HAVING overlap = 2
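The overlap-counting idea can be checked on in-memory data. The following is a small plain-Python sketch, not the ORM query itself; the video_tags mapping and the function name videos_with_all_tags are hypothetical stand-ins for the Video/Tag tables.

```python
# Hypothetical data: video id -> set of tag names attached to that video.
video_tags = {
    1: {"foo", "bar", "baz"},
    2: {"foo"},
    3: {"bar", "foo"},
    4: set(),
}


def videos_with_all_tags(video_tags, wanted):
    """Return ids of videos whose tags include every wanted tag."""
    wanted = set(wanted)  # de-duplicate first, like tag_set above
    return sorted(
        video_id
        for video_id, tags in video_tags.items()
        # The overlap must be as large as the wanted set itself.
        if len(tags & wanted) == len(wanted)
    )


matches = videos_with_all_tags(video_tags, ["foo", "bar", "foo"])
everything = videos_with_all_tags(video_tags, [])
```

With an empty wanted list the required overlap is zero, so every video qualifies, including the untagged one; that is the case the isnull condition is meant to cover in the query.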

Return object when aggregating grouped fields in Django

Assuming the following example model:
# models.py
class event(models.Model):
    location = models.CharField(max_length=10)
    type = models.CharField(max_length=10)
    date = models.DateTimeField()
    attendance = models.IntegerField()
I want to get the attendance number for the latest date of each event location and type combination, using Django ORM. According to the Django Aggregation documentation, we can achieve something close to this, using values preceding the annotation.
... the original results are grouped according to the unique combinations of the fields specified in the values() clause. An annotation is then provided for each unique group; the annotation is computed over all members of the group.
So using the example model, we can write:
event.objects.values('location', 'type').annotate(latest_date=Max('date'))
which does indeed group events by location and type, but does not return the attendance field, which is what I actually need.
Another approach I tried was to use distinct i.e.:
event.objects.distinct('location', 'type').annotate(latest_date=Max('date'))
but I get an error
NotImplementedError: annotate() + distinct(fields) is not implemented.
I found some answers which rely on database specific features of Django, but I would like to find a solution which is agnostic to the underlying relational database.
Alright, I think this one might actually work for you. It is based upon an assumption, which I think is correct.
When you create your model objects, they should all be unique. It seems highly unlikely that you would have two events on the same date, in the same location, of the same type. So with that assumption, let's begin. (As a formatting note, class names tend to start with a capital letter to differentiate classes from variables or instances.)
from django.db.models import Max
# First, get the latest date for each location/type combination.
results = Event.objects.values('location', 'type').annotate(latest_date=Max('date'))
# Make an empty list to store the objects you want.
results_list = []
# Then iterate through 'results', looking up the matching object
# for each group and adding it to the list (one query per group).
for r in results:
    result = Event.objects.get(location=r['location'], type=r['type'], date=r['latest_date'])
    results_list.append(result)
# Now you have a list of objects that you can do whatever you want with.
You might have to check the exact output of Max('date'), but this should get you on the right path.
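For reference, the result this two-step lookup is after (the full record carrying the latest date in each location/type group) can be sketched in plain Python. The sample events and the helper name latest_per_group are invented for illustration, and the sketch relies on the same uniqueness assumption as the answer.

```python
from datetime import datetime

# Hypothetical event rows, one dict per record.
events = [
    {"location": "A", "type": "talk", "date": datetime(2020, 1, 1), "attendance": 10},
    {"location": "A", "type": "talk", "date": datetime(2020, 3, 1), "attendance": 25},
    {"location": "B", "type": "talk", "date": datetime(2020, 2, 1), "attendance": 7},
]


def latest_per_group(events):
    """Keep, for each (location, type) pair, the event with the latest date."""
    latest = {}
    for event in events:
        key = (event["location"], event["type"])
        if key not in latest or event["date"] > latest[key]["date"]:
            latest[key] = event
    return latest


latest = latest_per_group(events)
```

Each value in latest is the whole record, so attendance is available alongside the grouping fields.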

Django Sum on distinct values

I have three models:
ModelA, ModelB and ModelC
ModelB has a field called value, which I am trying to aggregate. However, to find the correct ModelB instances I have to filter on fields of ModelC, which means I get duplicates. Using distinct on the ModelB instances means I cannot use the Sum aggregate, because Django then raises a NotImplementedError (aggregate() + distinct(fields) not implemented).
Query:
ModelB.objects.filter(model_a=some_model_a, model_c__in=[some_vals]).distinct('id').aggregate(Sum('value'))
I could do something like this:
models_b = ModelB.objects.filter(model_a=some_model_a, model_c__in=[some_vals]).distinct('id')
total = 0
for model_b in models_b:
    total += model_b.value
This is obviously quite heavy and slow. Is there any way to circumvent the NotImplementedError?
I already tried SubQueries, pg_utils with DistinctSum (almost what I need, but I need the distinction on id not on value) and some stuff with values.
Edit: I forgot to mention that ModelC has a ForeignKey to ModelB and ModelB has a ForeignKey to ModelA. Therefore 1 ModelA has N ModelBs, and 1 ModelB has N ModelCs.
Edit2: I forgot to mention that I have the whole thing mapped out as a SQL Query and it works. However, I need the flexibility from the DjangoORM. Otherwise I have my headaches at a different spot. There I used group by clauses instead of distinct values, but I do not know how to achieve this in DjangoORM.
I think this should work:
# Collect the distinct ids first, then aggregate over exactly those rows.
model_b_ids = ModelB.objects.filter(model_a=some_model_a, model_c__in=[some_vals]).distinct('id').values_list('id', flat=True)
stats = ModelB.objects.filter(id__in=model_b_ids).aggregate(sum=Sum('value'))
Letting the database do the work is faster than looping in Python.
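Why the id-based second step matters can be seen on a toy example. This is a plain-Python sketch with made-up joined rows, where each ModelB row is repeated once per matching ModelC; the variable names are hypothetical.

```python
# Hypothetical result of joining ModelB to ModelC: (model_b id, value).
# Rows repeat because one ModelB can match several ModelC rows.
joined_rows = [(1, 10), (1, 10), (2, 5), (3, 20), (3, 20), (3, 20)]

# Summing over the joined rows counts duplicated ModelB rows several times.
naive_total = sum(value for _, value in joined_rows)

# Step 1: collect the distinct ids (the values_list('id', flat=True) step).
distinct_ids = {row_id for row_id, _ in joined_rows}

# Step 2: sum once per id (the filter(id__in=...).aggregate(...) step).
value_by_id = dict(joined_rows)
distinct_total = sum(value_by_id[row_id] for row_id in distinct_ids)
```

Deduplicating on value instead of id would be wrong here whenever two different ModelB rows share the same value, which is why the distinction has to be made on id.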

difference between values() and only()

What's the difference between using:
Blabla.objects.values('field1', 'field2', 'field3')
and
Blabla.objects.only('field1', 'field2', 'field3')
Assuming Blabla has the fields in your question, as well as field4,
Blabla.objects.only('field1', 'field2', 'field3')[0].field4
will return the value of that object's field4 (with a new database query to retrieve that info), whereas
Blabla.objects.values('field1', 'field2', 'field3')[0].field4
will give
AttributeError: 'dict' object has no attribute 'field4'
This is because .values() returns a QuerySet that yields dictionaries (essentially a list of dicts) rather than model instances (Blabla).
The answers given so far are correct, but they don't mention some of the differences in terms of queries. I will just report them.
.get()
# User.objects.get(email='user#email.com').username
SELECT "users_user"."id", "users_user"."password", "users_user"."last_login",
"users_user"."is_superuser", "users_user"."username", "users_user"."first_name",
"users_user"."last_name", "users_user"."email", "users_user"."is_staff", "users_user"."is_active",
"users_user"."date_joined", "users_user"."name"
FROM "users_user"
WHERE "users_user"."email" = 'user#email.com'; args=('user#email.com',)
.only().get()
# User.objects.only('username').get(email='user#email.com')
SELECT "users_user"."id", "users_user"."username"
FROM "users_user"
WHERE "users_user"."email" = 'user#email.com'; args=('user#email.com',)
.values().get()
# User.objects.values('username').get(email='user#email.com')
SELECT "users_user"."username"
FROM "users_user"
WHERE "users_user"."email" = 'user#email.com'; args=('user#email.com',)
As you can see, only() will also select the id of the record. This is probably because of the fact that it will output a model that you can later use, as the other answers mentioned.
.values() gives you "less than a model"; the items it returns are closer to dictionaries than full models, which means you don't get the model attributes but you also don't have to initialize full models.
.only() restricts the field list in the SQL to specific fields you care about, but still initializes a full model; it defers loading of the other fields until you access them (if at all).
values() returns a QuerySet which, when iterated, yields dictionaries representing the model. It does not return model objects.
only() is a way of restricting the columns returned, ensuring that only those columns are loaded immediately, which is why it is sometimes referred to as the opposite of defer(). It is the equivalent of saying SELECT foo, bar, zoo FROM rather than the normal SELECT [all columns] FROM. It returns a QuerySet that can be chained further.

Django DB, finding Categories whose Items are all in a subset

I have two models:
class Category(models.Model):
    pass
class Item(models.Model):
    cat = models.ForeignKey(Category)
I am trying to return all Categories for which all of that category's items belong to a given subset of item ids. For example, all categories for which all of the items associated with that category have ids in the set [1,3,5].
How could this be done using Django's query syntax (as of 1.1 beta)? Ideally, all the work should be done in the database.
Category.objects.filter(item__id__in=[1, 3, 5])
Django creates the reverse relationship on the model without the foreign key. You can filter on it using its related name (by default the lowercase model name, though it can be overridden manually), two underscores, and the field name you want to query on.
Let's say you require all items to be in the following set:
allowable_items = set([1, 3, 4])
One brute-force solution would be to check the item_set of every category, like so:
categories_with_allowable_items = [
    category for category in
    Category.objects.all() if
    set([item.id for item in category.item_set.all()]) <= allowable_items
]
But we don't really have to check all categories, as categories_with_allowable_items is always going to be a subset of the categories related to the items with ids in allowable_items, so those are all we have to check (and this should be faster):
categories_with_allowable_items = set([
    item.cat for item in
    Item.objects.select_related('cat').filter(pk__in=allowable_items) if
    set([siblingitem.id for siblingitem in item.cat.item_set.all()]) <= allowable_items
])
If performance isn't really an issue, then the latter of these two (if not the former) should be fine. If these are very large tables, you might have to come up with a more sophisticated solution. Also, if you're using a particularly old version of Python, remember that you'll have to import the sets module.
I've played around with this a bit. If QuerySet.extra() accepted a "having" parameter I think it would be possible to do it in the ORM with a bit of raw SQL in the HAVING clause. But it doesn't, so I think you'd have to write the whole query in raw SQL if you want the database doing the work.
EDIT:
This is the query that gets you part way there:
from django.db.models import Count
Category.objects.annotate(num_items=Count('item')).filter(num_items=...)
The problem is that for the query to work, "..." needs to be a correlated subquery that looks up, for each category, the number of its items in allowed_items. If .extra had a "having" argument, you'd do it like this:
Category.objects.annotate(num_items=Count('item')).extra(having="num_items=(SELECT COUNT(*) FROM app_item WHERE app_item.id IN %s AND app_item.cat_id = app_category.id)", having_params=[allowed_item_ids])
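The comparison that this HAVING clause is meant to express (a category qualifies when its total item count equals its count of allowed items) can be sketched in plain Python. The items list and category labels below are made up, and the code only illustrates the counting logic, not a working ORM query.

```python
from collections import Counter

# Hypothetical (item id, category) pairs standing in for the Item table.
items = [(1, "c1"), (3, "c1"), (5, "c2"), (6, "c2"), (7, "c3")]
allowed_item_ids = {1, 3, 5}

# Total number of items per category (the num_items annotation).
total_per_category = Counter(cat for _, cat in items)

# Number of items per category that fall inside the allowed set
# (the correlated subquery).
allowed_per_category = Counter(
    cat for item_id, cat in items if item_id in allowed_item_ids
)

# A category qualifies when both counts agree (the HAVING comparison).
qualifying = sorted(
    cat
    for cat in total_per_category
    if total_per_category[cat] == allowed_per_category[cat]
)
```

Categories with no items at all never appear in this data, so whether they should count as qualifying is a separate decision that the sketch leaves open.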