Django Sum on distinct values

I have three models:
ModelA, ModelB and ModelC
ModelB has a field called value, which I am trying to aggregate. However, to find the correct ModelB instances I have to filter through fields of ModelC, which means I will have duplicates. Using the distinct clause on ModelB instances means that I cannot use the Sum aggregate, because that raises a NotImplementedError from Django (Distinct + Aggregate is not implemented).
Query:
ModelB.objects.filter(model_a=some_model_a, model_c__in=[some_vals]).distinct('id').aggregate(Sum('value'))
I could do something like this:
models_b = ModelB.objects.filter(model_a=some_model_a, model_c__in=[some_vals]).distinct('id')
total = 0
for model_b in models_b:
    total += model_b.value
This is obviously quite heavy and slow. Is there any way to circumvent the NotImplementedError?
I already tried SubQueries, pg_utils with DistinctSum (almost what I need, but I need the distinction on id not on value) and some stuff with values.
Edit: I forgot to mention that ModelC has a ForeignKey to ModelB and ModelB has a ForeignKey to ModelA. Therefore 1 ModelA has N ModelBs, and 1 ModelB has N ModelCs.
Edit2: I forgot to mention that I have the whole thing mapped out as a SQL query and it works. However, I need the flexibility of the Django ORM; otherwise I just move my headaches to a different spot. In the SQL I used GROUP BY clauses instead of distinct values, but I do not know how to achieve this in the Django ORM.

I think this should work:
model_b_ids = ModelB.objects.filter(model_a=some_model_a, model_c__in=[some_vals]).distinct('id').values_list('id', flat=True)
stats = ModelB.objects.filter(id__in=model_b_ids).aggregate(sum=Sum('value'))
Database queries are faster than a Python loop.
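To see why selecting the ids first avoids the double counting, here is a minimal sketch using Python's built-in sqlite3 module rather than the ORM. The table names, the kind column and the sample rows are made up for illustration, and the SQL only approximates the shape of what a filter across the relation and an id__in subquery would produce; it is not the exact SQL Django emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model_b (id INTEGER PRIMARY KEY, model_a_id INTEGER, value INTEGER);
    CREATE TABLE model_c (id INTEGER PRIMARY KEY, model_b_id INTEGER, kind TEXT);
    INSERT INTO model_b VALUES (1, 1, 10), (2, 1, 20), (3, 2, 99);
    -- ModelB 1 matches two ModelC rows, so a plain join returns it twice.
    INSERT INTO model_c VALUES (1, 1, 'x'), (2, 1, 'y'), (3, 2, 'x');
""")

# Summing directly over the join counts ModelB 1 twice.
joined_sum = conn.execute("""
    SELECT SUM(b.value)
    FROM model_b b JOIN model_c c ON c.model_b_id = b.id
    WHERE b.model_a_id = 1 AND c.kind IN ('x', 'y')
""").fetchone()[0]

# Selecting the matching ids first, then summing over id IN (...),
# counts each ModelB row once no matter how many ModelC rows matched.
distinct_sum = conn.execute("""
    SELECT SUM(value) FROM model_b
    WHERE id IN (
        SELECT b.id
        FROM model_b b JOIN model_c c ON c.model_b_id = b.id
        WHERE b.model_a_id = 1 AND c.kind IN ('x', 'y')
    )
""").fetchone()[0]

print(joined_sum, distinct_sum)
```

Because IN only tests membership, duplicates in the inner query do not matter, which suggests the distinct('id') in the first step is probably not needed for correctness.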

Related

Spanning relationship to obtain field value with .annotate() and .values() clauses

I'm making some computations using values & annotate on queryset. Let's consider this model:
class Foo(models.Model):
    fk_bar = models.ForeignKey(to=Bar, ....)
    foo_val = models.IntegerField(...)
class Bar(models.Model):
    attr = models.CharField(...)
    val = models.IntegerField(...)
So I can do:
Foo.objects.all().values("fk_bar")
in order to group by the foreign relationship (some Foo might point to the same Bar).
Then I can do
Foo.objects.all().values("fk_bar").annotate(qte=Sum("foo_val"))
to get the sum of foo_val for all the objects with the same fk_bar, which yields something like:
{"fk_bar":<int>, "qte": <int>}
However, I want the resulting dictionary to also contain Bar.attr, e.g. something like:
Foo.objects.all().values("fk_bar").annotate(qte=Sum("foo_val")).annotate(bar_attr="fk_bar__attr")
To get something like:
{"fk_bar":<int>, "qte": <int>, "bar_attr":<str>}
However, that fails (TypeError: QuerySet.annotate() received a non-expression). Is there any way around this?
One option is to specify the additional values to "keep" from before the aggregation in the values() clause, like so:
Foo.objects.all().values("fk_bar", "fk_bar__attr").annotate(qte=Sum("foo_val"))
Which will yield roughly what's asked.
In general, if one puts more fields in the .values() clause, those fields will be available in the output dictionary. Any "additional" fields (i.e. not from the raw queryset data) that need to be computed beforehand (i.e. before aggregation) can be "created" with .annotate() (like qte=Sum(...) above), and then included in the subsequent .values("fk_bar", "qte", ...). A few examples:
Yields the sum for all the Foos with the same fk_bar:
Foo.objects.all().values("fk_bar").annotate(qte=Sum("foo_val"))
This will group all the Foos by fk_bar and foo_val. Then, for each group, it will sum up the values of the fk_bar__val field:
Foo.objects.all().values("fk_bar", "foo_val").annotate(qte=Sum("fk_bar__val"))
The thing to note is that the more fields are specified in .values(), the more rows the result will contain (i.e. 1 dictionary for each unique combination of values).
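The grouping behaviour described above can be sketched in plain SQL with Python's built-in sqlite3 module. This is an illustration under assumptions, not the SQL Django generates: the table names, the sample rows and the column aliases are invented, and .values(...).annotate(Sum(...)) is treated as roughly equivalent to a GROUP BY on the listed columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
conn.executescript("""
    CREATE TABLE bar (id INTEGER PRIMARY KEY, attr TEXT, val INTEGER);
    CREATE TABLE foo (id INTEGER PRIMARY KEY, fk_bar_id INTEGER, foo_val INTEGER);
    INSERT INTO bar VALUES (1, 'red', 5), (2, 'blue', 7);
    INSERT INTO foo VALUES (1, 1, 10), (2, 1, 10), (3, 1, 30), (4, 2, 4);
""")

# Roughly .values("fk_bar", "fk_bar__attr").annotate(qte=Sum("foo_val")):
# bar.attr is constant per bar, so adding it does not split the groups.
by_bar = [dict(row) for row in conn.execute("""
    SELECT foo.fk_bar_id AS fk_bar, bar.attr AS bar_attr, SUM(foo.foo_val) AS qte
    FROM foo JOIN bar ON bar.id = foo.fk_bar_id
    GROUP BY foo.fk_bar_id, bar.attr
    ORDER BY foo.fk_bar_id
""")]

# Roughly .values("fk_bar", "foo_val").annotate(qte=Sum("fk_bar__val")):
# foo_val varies within a bar, so there is one row per (fk_bar, foo_val) pair.
by_bar_and_val = [dict(row) for row in conn.execute("""
    SELECT foo.fk_bar_id AS fk_bar, foo.foo_val AS foo_val, SUM(bar.val) AS qte
    FROM foo JOIN bar ON bar.id = foo.fk_bar_id
    GROUP BY foo.fk_bar_id, foo.foo_val
    ORDER BY foo.fk_bar_id, foo.foo_val
""")]

print(by_bar)
print(by_bar_and_val)
```

The first query returns one dictionary per Bar, while the second returns one per distinct (fk_bar, foo_val) combination, which is the "more rows" effect noted above.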

Using django select_related with an additional filter

I'm trying to find an optimal way to execute a query, but got myself confused with the prefetch_related and select_related use cases.
I have a three-table foreign key relationship: A has 1-to-many B, and B has 1-to-many C.
class A(models.Model):
    ...
class B(models.Model):
    a = models.ForeignKey(A)
class C(models.Model):
    b = models.ForeignKey(B)
    data = models.TextField(max_length=50)
I'm trying to get a list of all C.data for all instances of A that match a criteria (an instance of A and all its children), so I have something like this:
qs1 = A.objects.all().filter(Q(id=12345)|Q(parent_id=12345))
qs2 = C.objects.select_related('b__a').filter(b__a__in=qs1)
But I'm wary of the prefetch_related docs stating that:
any subsequent chained methods which imply a different database query
will ignore previously cached results, and retrieve data using a fresh
database query
I don't know if that applies here (because I'm using select_related), but reading it makes it seem as if anything gained from doing select_related is lost as soon as I do the filter.
Is my two-part query as optimal as it can be? I don't think I need prefetch as far as I'm aware, although I noticed I can swap out select_related with prefetch_related and get the same result.
I think your question is driven by a misconception. select_related (and prefetch_related) are an optimisation, specifically for returning values in related models along with the original query. They are never required.
What's more, neither has any impact at all on filter. Django will automatically do the relevant joins and subqueries in order to make your query, whether or not you use select_related.
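As a rough illustration of that point, the sketch below uses Python's built-in sqlite3 module to compare two hand-written queries. The tables, sample rows and SQL are assumptions chosen to mimic the A/B/C models; the second query only approximates what select_related adds (extra columns from the joined tables) and is not the SQL Django actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, parent_id INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
    CREATE TABLE c (id INTEGER PRIMARY KEY, b_id INTEGER, data TEXT);
    INSERT INTO a VALUES (12345, NULL), (2, 12345), (3, NULL);
    INSERT INTO b VALUES (1, 12345), (2, 2), (3, 3);
    INSERT INTO c VALUES (1, 1, 'root'), (2, 2, 'child'), (3, 3, 'other');
""")

# The joins and the filter are needed for the WHERE clause either way.
JOINS = "FROM c JOIN b ON b.id = c.b_id JOIN a ON a.id = b.a_id"
WHERE = "WHERE a.id IN (SELECT id FROM a WHERE id = 12345 OR parent_id = 12345)"

# Without select_related: only columns of C are fetched.
plain = conn.execute(
    f"SELECT c.id, c.data {JOINS} {WHERE} ORDER BY c.id"
).fetchall()

# With something like select_related: the same rows, plus related columns.
with_related = conn.execute(
    f"SELECT c.id, c.data, b.id, a.id {JOINS} {WHERE} ORDER BY c.id"
).fetchall()

plain_data = [row[1] for row in plain]
related_data = [row[1] for row in with_related]
print(plain_data, related_data)
```

Both queries match the same C rows; the only difference is how many columns come back with them. If only C.data is needed, the extra columns are unused.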

django values not working

When I try to call values with more than 3 fields it seems to 'break' (i.e. it doesn't group duplicate entries together).
My model is a through model with three fields, 2 ForeignKey and one DateTimeField
class ProjectView(models.Model):
    user = models.ForeignKey(User)
    project = models.ForeignKey(Project)
    datetime_created = models.DateTimeField()
I want to do:
ProjectView.objects.filter(datetime_created__gt=yesterday).values('project__id', 'project__title', 'project__thumbnail', 'project__creator_username')
If I get rid of any one of the values fields, it groups them by project without duplicates; with 4 values it seems to do no grouping. Am I doing something wrong?
If you take a look at the docs for values, you'll see no guarantee of grouping or distinct. If you want that functionality, you'll have to call .order_by() and/or .distinct() when making your call to the ORM.
That it works at all is probably just a side effect of the SQL generated. If you want to see the SQL, take a look at Django-debug-toolbar
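The difference can be sketched with Python's built-in sqlite3 module. The table layout and sample rows are invented for illustration; the point is only that selecting columns across a join returns one row per matching view unless DISTINCT is requested.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE project_view (
        id INTEGER PRIMARY KEY, user_id INTEGER,
        project_id INTEGER, datetime_created TEXT
    );
    INSERT INTO project VALUES (1, 'Alpha'), (2, 'Beta');
    -- Project 1 was viewed by two users, project 2 by one.
    INSERT INTO project_view VALUES
        (1, 10, 1, '2024-01-02'), (2, 11, 1, '2024-01-02'), (3, 10, 2, '2024-01-02');
""")

BASE = """
    FROM project_view v JOIN project p ON p.id = v.project_id
    WHERE v.datetime_created > '2024-01-01'
    ORDER BY p.id
"""

# Plain column selection: one row per view, so project 1 appears twice.
without_distinct = conn.execute(f"SELECT p.id, p.title {BASE}").fetchall()

# SELECT DISTINCT collapses identical rows.
with_distinct = conn.execute(f"SELECT DISTINCT p.id, p.title {BASE}").fetchall()

print(without_distinct)
print(with_distinct)
```

In the ORM, chaining .distinct() after .values(...) asks for SELECT DISTINCT. Calling .order_by() with no arguments first clears any default ordering, which matters because ordering columns can end up in the SELECT list and make otherwise identical rows differ.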

difference between values() and only()

what's the difference between using:
Blabla.objects.values('field1', 'field2', 'field3')
and
Blabla.objects.only('field1', 'field2', 'field3')
Assuming Blabla has the fields in your question, as well as field4,
Blabla.objects.only('field1', 'field2', 'field3')[0].field4
will return the value of that object's field4 (with a new database query to retrieve that info), whereas
Blabla.objects.values('field1', 'field2', 'field3')[0].field4
will give
AttributeError: 'dict' object has no attribute 'field4'
This is because .values() returns a QuerySet that returns dictionaries, which is essentially a list of dicts, rather than model instances (Blabla).
The answers given so far are correct, but they don't mention some of the differences in terms of queries. I will just report them.
.get()
# User.objects.get(email='user@email.com').username
SELECT "users_user"."id", "users_user"."password", "users_user"."last_login",
"users_user"."is_superuser", "users_user"."username", "users_user"."first_name",
"users_user"."last_name", "users_user"."email", "users_user"."is_staff", "users_user"."is_active",
"users_user"."date_joined", "users_user"."name"
FROM "users_user"
WHERE "users_user"."email" = 'user@email.com'; args=('user@email.com',)
.only().get()
# User.objects.only('username').get(email='user@email.com')
SELECT "users_user"."id", "users_user"."username"
FROM "users_user"
WHERE "users_user"."email" = 'user@email.com'; args=('user@email.com',)
.values().get()
# User.objects.values('username').get(email='user@email.com')
SELECT "users_user"."username"
FROM "users_user"
WHERE "users_user"."email" = 'user@email.com'; args=('user@email.com',)
As you can see, only() will also select the id of the record. This is probably because of the fact that it will output a model that you can later use, as the other answers mentioned.
.values() gives you "less than a model"; the items it returns are closer to dictionaries than full models, which means you don't get the model attributes but you also don't have to initialize full models.
.only() restricts the field list in the SQL to specific fields you care about, but still initializes a full model; it defers loading of the other fields until you access them (if at all).
values() returns a QuerySet which, when iterated, returns dictionaries representing the model. It does not return model objects.
only() is a way of restricting the columns returned, and ensuring that only those columns are returned immediately, which is why it is sometimes referred to as the opposite of defer(). It is the equivalent of saying SELECT foo, bar, zoo FROM rather than the normal SELECT [all columns] FROM. It will return a QuerySet that can be chained further.
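The contrast between "a dict of the requested columns" and "an object that loads the rest on demand" can be mimicked with a toy example on Python's built-in sqlite3 module. This is an analogy, not Django's implementation: values_like, OnlyLike, the blabla table and the query log are all hypothetical names invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blabla (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT, field4 TEXT);
    INSERT INTO blabla VALUES (1, 'a', 'b', 'd');
""")

queries = []  # every SQL statement issued, so the extra lookup is visible


def run(sql, params=()):
    queries.append(sql)
    return conn.execute(sql, params).fetchone()


def values_like(pk, fields):
    """Return a plain dict holding only the requested columns."""
    row = run(f"SELECT {', '.join(fields)} FROM blabla WHERE id = ?", (pk,))
    return dict(zip(fields, row))


class OnlyLike:
    """Toy object: loads id plus the requested columns, fetches others on access."""

    def __init__(self, pk, fields):
        row = run(f"SELECT id, {', '.join(fields)} FROM blabla WHERE id = ?", (pk,))
        self.__dict__.update(zip(["id", *fields], row))

    def __getattr__(self, name):
        # Python calls this only when the attribute was not loaded up front.
        # Column names are interpolated directly: fine for a toy, unsafe otherwise.
        value = run(f"SELECT {name} FROM blabla WHERE id = ?", (self.id,))[0]
        self.__dict__[name] = value  # cache so later accesses need no query
        return value


as_dict = values_like(1, ["field1", "field2"])
as_obj = OnlyLike(1, ["field1", "field2"])

queries_before = len(queries)
deferred_value = as_obj.field4  # not loaded initially, so one more query runs
queries_after = len(queries)
```

The dict simply has no field4 key, whereas the object quietly issues a third query to fetch it, mirroring the extra database hit described for .only().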

Sort by number of matches on queries based on m2m field

I hope the title is not misleading.
Anyway, I have two models, both have m2m relationships with a third model.
class Model1(models.Model):
    keywords = models.ManyToManyField(Keyword)
class Model2(models.Model):
    keywords = models.ManyToManyField(Keyword)
Given the keywords for a Model2 instance like this:
keywords2 = model2_instance.keywords.all()
I need to retrieve the Model1 instances which have at least a keyword that is in keywords2, something like:
Model1.objects.filter(keywords__in=keywords2)
and sort them by the number of keywords that match (I don't think that's possible via the 'in' field lookup). The question is, how do I do this?
I'm thinking of just manually iterating through each of the Model1 instances, appending them to a dictionary of results for every match, but I need this to scale to, say, tens of thousands of records. Here is how I imagined it would look:
result = {}
keywords2_ids = model2_instance.keywords.all().values_list('id', flat=True)
for model1 in Model1.objects.all():
    keywords_matched = model1.keywords.filter(id__in=keywords2_ids).count()
    # Group the instances under their match count.
    result.setdefault(str(keywords_matched), []).append(model1)
There must be a faster way to do this. Any ideas?
You can just switch to raw SQL. What you have to do is write a custom manager for Model1 that returns the sorted set of ids of Model1 objects based on the keyword match counts. The SQL is as simple as joining the two many-to-many tables (Django automatically creates a table to represent a many-to-many relationship) on keyword ids and then grouping on Model1 ids for the COUNT SQL function. An ORDER BY clause on those counts then produces the sorted Model1 id list you need. In MySQL:
SELECT appname_model1_keywords.model1_id, count(*) as match_count FROM appname_model1_keywords
JOIN appname_model2_keywords
ON (appname_model1_keywords.keyword_id = appname_model2_keywords.keyword_id)
WHERE appname_model2_keywords.model2_id = model2_object_id
GROUP BY appname_model1_keywords.model1_id
ORDER BY match_count
Here model2_object_id is the model2_instance id. This will definitely be faster and more scalable.
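Here is a small runnable sketch of that query using Python's built-in sqlite3 module instead of MySQL. The join tables are simplified (the tables Django creates also have their own id column), the sample rows are invented, ranked_model1_ids is a hypothetical helper name, and DESC is added as an assumption so that the best matches come first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE appname_model1_keywords (model1_id INTEGER, keyword_id INTEGER);
    CREATE TABLE appname_model2_keywords (model2_id INTEGER, keyword_id INTEGER);
    -- Model2 #1 has keywords 1, 2 and 3.
    INSERT INTO appname_model2_keywords VALUES (1, 1), (1, 2), (1, 3);
    -- Model1 #10 shares one keyword, #11 shares three, #12 shares none.
    INSERT INTO appname_model1_keywords VALUES
        (10, 1), (10, 9), (11, 1), (11, 2), (11, 3), (12, 8);
""")


def ranked_model1_ids(model2_object_id):
    """Return (model1_id, match_count) pairs, most shared keywords first."""
    rows = conn.execute("""
        SELECT m1.model1_id, COUNT(*) AS match_count
        FROM appname_model1_keywords m1
        JOIN appname_model2_keywords m2 ON m1.keyword_id = m2.keyword_id
        WHERE m2.model2_id = ?
        GROUP BY m1.model1_id
        ORDER BY match_count DESC
    """, (model2_object_id,))
    return rows.fetchall()


ranking = ranked_model1_ids(1)
print(ranking)
```

Model1 instances with no shared keyword never appear in the join, so they are excluded automatically. The same idea can probably be expressed in the ORM by filtering on keywords__in, annotating with Count('keywords') and ordering by the annotation descending, though that should be checked against the generated SQL.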