Django-ORM: distinct is needed. Why? - django

I am playing around with the Django ORM:
import django
django.setup()
from django.contrib.auth.models import User, Group
from django.db.models import Count
# All users
print(User.objects.all().count())
# --> 742
# Should be: All users which are in a group.
# But the result is different. I don't understand this.
print(User.objects.filter(groups__in=Group.objects.all()).count())
# --> 1731
# All users which are in a group.
# distinct needed
print(User.objects.filter(groups__in=Group.objects.all()).distinct().count())
# --> 543
# All users which are in a group. Without distinct, annotate seems to do this.
print(User.objects.filter(groups__in=Group.objects.all()).annotate(Count('pk')).count())
# --> 543
# All users which are in no group
print(User.objects.filter(groups__isnull=True).count())
# --> 199
# 199 + 543 = 742 (nice)
I don't understand the second query, which returns 1731.
I know that I can use distinct().
Nevertheless, 1731 looks like a bug to me.
Why is the query below not distinct/unique by design?
User.objects.filter(groups__in=Group.objects.all())

The raw MySQL query looks roughly like this:
SELECT user.id, group.id FROM user LEFT JOIN group ON user.group_id = group.id
The result contains one row for every user/group pair, and I guess some users belong to more than one group, so those users appear several times.

You are trying to fetch all users from all groups, but a user can be present in multiple groups; that's why distinct is required. If you want the users in a specific group, filter on that group instead of using all().

I assume that User.groups is a ForeignKey or some other relationship that associates each User with zero to many Group instances.
So the query which confuses you:
User.objects.filter(groups__in=Group.objects.all())
That query can be described as:
Access the Group model manager (Group.objects).
Make a QuerySet:
Return all Group instances (Group.objects.all()).
Access the User model manager (User.objects).
Make a QuerySet:
Join to the Group model, on the User.groups foreign key.
Return every (User + Group) row which has an associated Group.
That is not “all users which are in a group”; instead, it is “All user–group pairs where the group exists”.
By filtering on the multi-valued User.groups field, you are implying that the query must contain a join from User to Group rows.
Instead, you want:
Access the User model manager (User.objects).
Make a QuerySet:
Return all rows which have groups not empty.
User.objects.filter(groups__isnull=False)
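Note that this (“All users which have a non-empty set of associated groups”) is the inverse of another example query you have (“All users which are in no group”). Because this filter still joins through the many-to-many table, a user in several groups can still appear more than once, so add .distinct() before counting.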

Since groups is a ManyToManyField, the query is translated into an INNER JOIN statement.
If you print the following you will see the query generated by the QuerySet:
>>> print(User.objects.filter(groups__in=Group.objects.all()).query)
SELECT `auth_user`.`id`, .... , `auth_user`.`date_joined` FROM `auth_user` INNER JOIN `auth_user_groups` ON (`auth_user`.`id` = `auth_user_groups`.`user_id`) WHERE `auth_user_groups`.`group_id` IN (SELECT `auth_group`.`id` FROM `auth_group`)
As you can see, the query joins the auth_user and auth_user_groups tables.
auth_user_groups is the ManyToManyField (through) table, not the table for the Group model, so a user appears once for every group membership.
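The row multiplication described above can be reproduced outside Django. This is a minimal sketch using an in-memory SQLite database; the two tables are a cut-down, hypothetical imitation of auth_user and auth_user_groups, and the three users are made up for illustration.

```python
import sqlite3

# Hypothetical miniature of auth_user and auth_user_groups (made-up data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE auth_user_groups (user_id INTEGER, group_id INTEGER);
    INSERT INTO auth_user VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    -- alice is in two groups, bob in one, carol in none
    INSERT INTO auth_user_groups VALUES (1, 10), (1, 20), (2, 10);
""")

# The INNER JOIN yields one row per membership, so alice shows up twice.
joined_ids = [row[0] for row in conn.execute("""
    SELECT auth_user.id FROM auth_user
    INNER JOIN auth_user_groups ON auth_user.id = auth_user_groups.user_id
    ORDER BY auth_user.id
""")]

# DISTINCT collapses the repeated user rows again.
distinct_ids = [row[0] for row in conn.execute("""
    SELECT DISTINCT auth_user.id FROM auth_user
    INNER JOIN auth_user_groups ON auth_user.id = auth_user_groups.user_id
    ORDER BY auth_user.id
""")]

print(joined_ids)    # [1, 1, 2]
print(distinct_ids)  # [1, 2]
```

The first list has three entries for two users, which is the same effect that turns 543 users into 1731 rows in the question.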
You would want to use annotate to get the users that have groups. In my case the numbers are the following:
$ ./manage.py shell
>>>
>>> from django.contrib.auth.models import User, Group
>>> from django.db.models import Count
>>>
# All users
>>> print(User.objects.all().count())
556
>>>
# All users which are not in a group.
>>> print(User.objects.annotate(group_count=Count('groups')).filter(group_count=0).count())
44
>>>
# All users which are in a group.
>>> print(User.objects.annotate(group_count=Count('groups')).filter(group_count__gt=0).count())
512
>>>
annotate() behaves similarly to distinct() here: it produces a GROUP BY query, so each user appears only once. You can inspect the generated query as follows:
>>> print(User.objects.annotate(group_count=Count('groups')).filter(group_count__gt=0).query)
SELECT `auth_user`.`id`, `auth_user`.`password`, `auth_user`.`last_login`, `auth_user`.`is_superuser`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`date_joined`, COUNT(`auth_user_groups`.`group_id`) AS `group_count` FROM `auth_user` LEFT OUTER JOIN `auth_user_groups` ON (`auth_user`.`id` = `auth_user_groups`.`user_id`) GROUP BY `auth_user`.`id` HAVING COUNT(`auth_user_groups`.`group_id`) > 0 ORDER BY NULL
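To see why the GROUP BY form counts each user once, and why users without groups get a count of 0, here is a small sketch against in-memory SQLite. The schema is a simplified, hypothetical stand-in for the auth tables and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE auth_user_groups (user_id INTEGER, group_id INTEGER);
    INSERT INTO auth_user VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO auth_user_groups VALUES (1, 10), (1, 20), (2, 10);
""")

# LEFT OUTER JOIN keeps carol (no groups); COUNT(column) ignores the NULL
# group_id on her row, so her count is 0 rather than 1.
group_counts = conn.execute("""
    SELECT auth_user.username, COUNT(auth_user_groups.group_id) AS group_count
    FROM auth_user
    LEFT OUTER JOIN auth_user_groups ON auth_user.id = auth_user_groups.user_id
    GROUP BY auth_user.id
    ORDER BY auth_user.id
""").fetchall()

# Equivalent of the HAVING clause, applied in Python for readability.
users_in_a_group = [name for name, count in group_counts if count > 0]
users_in_no_group = [name for name, count in group_counts if count == 0]

print(group_counts)       # [('alice', 2), ('bob', 1), ('carol', 0)]
print(users_in_a_group)   # ['alice', 'bob']
print(users_in_no_group)  # ['carol']
```

Because the grouping is by user id, the two lists always partition the full set of users, which matches the 512 + 44 = 556 split above.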

When you run a DISTINCT query against a database, you end up with a listing of each distinct row in the result set. The reason you get more rows without DISTINCT in your Django result is that the join multiplies rows: each user is repeated once for every group it belongs to.
Other answers have mentioned all of this, but since you're asking the why:
The ORM, in this join, would probably allow you to pull fields attached to the group from the query. So if you wanted, say, all these users and all the groups and the group contact for some kind of massive weird mail merge, you could get them.
The post-processing brought on by DISTINCT is narrowing your results down according to the fields you have pulled rather than the rows in the query. If you were to use the PyCharm debugger or something, you might find that the groups aren't as easy to access using various ORM syntax when you have the distinct as when you don't.

Related

Django-ORM: Count number of groups of each user

I know that I can get the number of groups each user has with this query:
User.objects.filter(groups__in=Group.objects.all()).annotate(Count('pk'))
But something is missing:
The users which are in no group at all.
How can I use the Django ORM to get all users annotated with their group count, including users with no group?
You can use Count on the groups relation:
from django.db.models import Count
User.objects.annotate(group_count=Count('groups'))

Django get all values Group By particular one field

I want to execute a simple query like:
select *,count('id') from menu_permission group by menu_id
In Django format I have tried:
MenuPermission.objects.all().values('menu_id').annotate(Count('id'))
It selects only menu_id. The executed query is:
SELECT `menu_permission`.`menu_id`, COUNT(`menu_permission`.`id`) AS `id__count` FROM `menu_permission` GROUP BY `menu_permission`.`menu_id`
But I need other fields also. If I try:
MenuPermission.objects.all().values('id', 'menu_id').annotate(Count('id'))
It adds 'id' in group by condition.
GROUP BY `menu_permission`.`id`
As a result I am not getting the expected result. How can I get all fields in the output but group by a single one?
You can try subqueries to do what you need.
In my case I have two tables, Item and Transaction, where Transaction.item_id links to Item.
First, I prepare a Transaction subquery grouped by item_id, in which I sum all amount fields and match item_id against the pk of the outer query.
from django.db.models import OuterRef, Subquery, Sum
per_item_total = Transaction.objects.values('item_id').annotate(total=Sum('amount')).filter(item_id=OuterRef('pk'))
Then I select all rows from Item plus the subquery result as the total field.
items_with_total = Item.objects.annotate(total=Subquery(per_item_total.values('total')))
This produces the following SQL:
SELECT `item`.`id`, {all other item fields},
(SELECT SUM(U0.`amount`) AS `total` FROM `transaction` U0
WHERE U0.`item_id` = `item`.`id` GROUP BY U0.`item_id` ORDER BY NULL) AS `total` FROM `item`
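The shape of that SQL can be tried directly. Below is a minimal sketch with in-memory SQLite; the table name item_transaction and all rows are hypothetical, chosen only to show what the correlated subquery returns per item.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_transaction (
        id INTEGER PRIMARY KEY, item_id INTEGER, amount INTEGER
    );
    INSERT INTO item VALUES (1, 'pen'), (2, 'ink'), (3, 'pad');
    INSERT INTO item_transaction (item_id, amount) VALUES (1, 5), (1, 7), (2, 4);
""")

# Every item row is kept; the subquery is evaluated once per item and
# returns NULL (None in Python) when the item has no transactions.
items_with_total = conn.execute("""
    SELECT item.id, item.name,
           (SELECT SUM(t.amount) FROM item_transaction AS t
            WHERE t.item_id = item.id GROUP BY t.item_id) AS total
    FROM item
    ORDER BY item.id
""").fetchall()

print(items_with_total)  # [(1, 'pen', 12), (2, 'ink', 4), (3, 'pad', None)]
```

All columns of the outer table stay available because the grouping happens only inside the subquery.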
You are trying to achieve this SQL:
select *, count('id') from menu_permission group by menu_id
But normally SQL requires that when a GROUP BY clause is used, the select list contains only the columns you are grouping by (plus aggregates of other columns). This is not a Django matter; it is how SQL GROUP BY works.
The rows are grouped by those columns so those columns can be included in select and other columns can be aggregated if you want them to into a value. You can't include other columns directly as they may have more than one value (since the rows are grouped).
For example if you have a column called "permission_code", you could ask for an array of the values in the "permission_code" column when the rows are grouped by menu_id.
Depending on the SQL flavor you are using, this could be in PostgreSQL something like this:
select menu_id, array_agg(permission_code), count(id) from menu_permission group by menu_id
Similarly, a Django queryset can be constructed for this.
Hopefully this helps, but if needed please share more about what you need to do and what your data models are.
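As a rough illustration of aggregating a non-grouped column, the sketch below uses SQLite's group_concat, which plays a role similar to PostgreSQL's array_agg. The permission_code column comes from the hypothetical example above and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menu_permission (
        id INTEGER PRIMARY KEY, menu_id INTEGER, permission_code TEXT
    );
    INSERT INTO menu_permission (menu_id, permission_code) VALUES
        (1, 'view'), (1, 'edit'), (2, 'view');
""")

rows = conn.execute("""
    SELECT menu_id, group_concat(permission_code), COUNT(id)
    FROM menu_permission
    GROUP BY menu_id
    ORDER BY menu_id
""").fetchall()

# group_concat does not guarantee element order, so sort before comparing.
per_menu = {
    menu_id: (sorted(codes.split(",")), count)
    for menu_id, codes, count in rows
}
print(per_menu)  # {1: (['edit', 'view'], 2), 2: (['view'], 1)}
```

Each menu_id yields exactly one row, and the other column survives only as an aggregate of its values.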
The only way this currently works as expected is to have your query based on the model you want the GROUP BY to be based on.
In your case it looks like you have a Menu model (menu_id field foreign key) so doing this would give you what you want and will allow getting other aggregate information from your MenuPermission model but will only group by the Menu.id field:
Menu.objects.annotate(perm_count=Count('menupermission__id')).values('perm_count')
Of course there is no need for the "annotate" intermediate step if all you want is that single count.
query = MenuPermission.objects.values('menu_id').annotate(menu_id_count=Count('menu_id'))
You can check the generated SQL with print(query.query).
This solution doesn't work, all fields end up in the group by clause, leaving it here because it may still be useful to someone.
model_fields = queryset.model._meta.get_fields()
queryset = queryset.values('menu_id') \
    .annotate(
        count=Count('id'),
        **{field.name: F(field.name) for field in model_fields}
    )
What I'm doing is getting the list of fields of our model and setting up a dictionary with the field name as key and an F instance with the field name as a parameter.
When unpacked (the **) it gets interpreted as named arguments passed into the annotate function.
For example, if we had a "name" field on our model, this annotate call would end up being equal to this:
queryset = queryset.values('menu_id') \
    .annotate(
        count=Count('id'),
        name=F("name")
    )
You can use the following code:
MenuPermission.objects.values('menu_id').annotate(Count('id')).values('field1', 'field2', 'field3'...)

Django join query for permissions and group?

I want to somehow join a group with all my permissions.
I want to query ALL permissions, and for each permission query a boolean indicator if the group has it.
So assume this
group = Group.objects.get(pk=1) # specific group
Permission.objects.all().annotate( group has it ??)
group.permissions.all()
won't help since I want to query all permissions.
UPDATE:
Clear explanation:
Assume my Permission table is (pk values): 1, 2, 3 - total three rows.
Group table: one group with pk=1.
Group-permission (many-to-many) table: group with pk 1, has permission 1,2 (so two rows)
I want to display all the permissions, with an indicator near them whether the group has it.
So in our case I should get :
1 True
2 True
3 False
Because the group doesn't have the permission with pk=3.
I think the below query would work for you
from django.db.models import BooleanField, Case, Max, When
group_name = 'your group name'
Permission.objects.annotate(
    has_perm=Max(Case(
        When(group__name=group_name, then=1),
        default=0,
        output_field=BooleanField()
    ))
).values_list('name', 'has_perm').order_by('has_perm')
Conditional expressions were introduced in Django 1.8, and aggregation (annotate) is also well documented.
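The idea behind the Max(Case(When(...))) annotation can be sketched in plain SQL. This example uses in-memory SQLite with hypothetical, simplified tables; the second group (pk=2) is an addition to the data in the question, included to show that only the chosen group's permissions are flagged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE permission (id INTEGER PRIMARY KEY);
    CREATE TABLE group_permissions (group_id INTEGER, permission_id INTEGER);
    INSERT INTO permission VALUES (1), (2), (3);
    -- group 1 has permissions 1 and 2; a hypothetical group 2 has 1 and 3
    INSERT INTO group_permissions VALUES (1, 1), (1, 2), (2, 1), (2, 3);
""")

target_group_id = 1

# Each permission may join to several groups. The CASE marks rows that
# belong to the target group, and MAX keeps 1 if any such row exists.
rows = conn.execute("""
    SELECT permission.id,
           MAX(CASE WHEN group_permissions.group_id = ? THEN 1 ELSE 0 END)
    FROM permission
    LEFT OUTER JOIN group_permissions
        ON permission.id = group_permissions.permission_id
    GROUP BY permission.id
    ORDER BY permission.id
""", (target_group_id,)).fetchall()

indicators = [(pk, bool(flag)) for pk, flag in rows]
print(indicators)  # [(1, True), (2, True), (3, False)]
```

Permission 3 is held only by the other group, so it is listed with False, as the question asks.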

Hourly grouping of rows using Django

I have been trying to group the rows of a table by hour using a DateTimeField.
SQL:
SELECT strftime('%H', created_on), count(*)
FROM users_test
GROUP BY strftime('%H', created_on);
This query works fine, but the corresponding Django query does not.
Django queries I've tried:
Test.objects.extra({'hour': 'strftime("%%H", created_on)'}).values('hour').annotate(count=Count('id'))
# SELECT (strftime("%H", created_on)) AS "hour", COUNT("users_test"."id") AS "count" FROM "users_test" GROUP BY (strftime("%H", created_on)), "users_test"."created_on" ORDER BY "users_test"."created_on" DESC
It adds an additional GROUP BY on "users_test"."created_on", which I guess is giving incorrect results.
It would be great if anyone could explain this to me and provide a solution as well.
Environment:
Python 3
Django 1.8.1
Thanks in Advance
References (Possible Duplicates) (But None helping out):
Grouping Django model entries by day using its datetime field
Django - Group By with Date part alone
Django aggregate on .extra values
To fix it, add order_by() to the query chain. This overrides the model's default Meta ordering, so the ordering column is no longer added to the GROUP BY. Like this:
(Test.objects
    .extra({'hour': 'strftime("%%H", created_on)'})
    .order_by()  # <------ here
    .values('hour')
    .annotate(count=Count('id')))
In my environment ( Postgres also ):
>>> print(Material.objects
...       .extra({'hour': 'strftime("%%H", data_creacio)'})
...       .order_by()
...       .values('hour')
...       .annotate(count=Count('id'))
...       .query)
SELECT (strftime("%H", data_creacio)) AS "hour",
COUNT("material_material"."id") AS "count"
FROM "material_material"
GROUP BY (strftime("%H", data_creacio))
Learn more in the order_by() section of the Django docs:
If you don’t want any ordering to be applied to a query, not even the default ordering, call order_by() with no parameters.
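The effect of the extra GROUP BY column is easy to reproduce in SQLite itself. This is a small sketch with a hypothetical three-row users_test table, comparing grouping by hour alone with grouping by hour plus the full timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_test (id INTEGER PRIMARY KEY, created_on TEXT);
    INSERT INTO users_test (created_on) VALUES
        ('2015-05-01 09:05:00'),
        ('2015-05-01 09:40:00'),
        ('2015-05-02 17:15:00');
""")

# Intended query: one row per hour of day.
by_hour = conn.execute("""
    SELECT strftime('%H', created_on) AS hour, COUNT(id)
    FROM users_test
    GROUP BY strftime('%H', created_on)
    ORDER BY hour
""").fetchall()

# What the default ordering causes: created_on joins the GROUP BY, so
# every distinct timestamp becomes its own group.
by_hour_and_timestamp = conn.execute("""
    SELECT strftime('%H', created_on) AS hour, COUNT(id)
    FROM users_test
    GROUP BY strftime('%H', created_on), created_on
    ORDER BY created_on
""").fetchall()

print(by_hour)                # [('09', 2), ('17', 1)]
print(by_hour_and_timestamp)  # [('09', 1), ('09', 1), ('17', 1)]
```

With the timestamp in the GROUP BY, hour 09 is split into two groups of one row each, which is the incorrect result seen in the question.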
Side note:
Using extra() may introduce an SQL injection vulnerability into your code. Use it with caution and escape any parameters that a user can supply. Compare with the docs:
Warning
You should be very careful whenever you use extra(). Every time you
use it, you should escape any parameters that the user can control by
using params in order to protect against SQL injection attacks.
Please read more about SQL injection protection.

Django Model QuerySet Group By Specific Fields

Considering this model & data:
Ad_id + Date = primary key
Ad_id   date      clicks
------------------------------
3       8/10/12   124
3       7/10/12   433
3       6/10/12   99
4       8/10/12   23
4       7/10/12   80
I'm trying to group by ad_id to return the sum of all clicks.
In SQL terms:
select Ad_id, date, sum(clicks) from ads group by Ad_id
The problem is that Django automatically groups by every field in the model, so the GROUP BY is not really working (because each row is unique).
Solutions I've already checked:
I know it is possible to do something like this:
Ad.objects.values('ad_id').annotate(clicks_sum=Sum('clicks'))
But that is no good, as it returns dictionaries rather than Ad model instances.
I also can't use raw SQL because it is not chainable.
I also tried to set
MyQuerySet.group_by = ['ad_id']
That doesn't work either.
So I really need to group by only the fields I need, with the result being Ad model instances.
You can perform raw SQL queries using Manager.raw(), in your case that would be:
Ad.objects.raw('select Ad_id, date, sum(clicks) from ads group by Ad_id')
This method takes a raw SQL query, executes it, and returns a RawQuerySet instance. This RawQuerySet instance can be iterated over just like a normal QuerySet to provide object instances.
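To check what the grouped query itself returns for the sample data in the question, here is a sketch against in-memory SQLite. The ads table layout is an assumption based on the question, and the date column is left out of the select list because the value of a non-aggregated column in a grouped query is not well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ads (
        ad_id INTEGER, date TEXT, clicks INTEGER,
        PRIMARY KEY (ad_id, date)
    );
    INSERT INTO ads VALUES
        (3, '8/10/12', 124), (3, '7/10/12', 433), (3, '6/10/12', 99),
        (4, '8/10/12', 23), (4, '7/10/12', 80);
""")

# One row per ad_id with the summed clicks.
clicks_per_ad = conn.execute(
    "SELECT ad_id, SUM(clicks) FROM ads GROUP BY ad_id ORDER BY ad_id"
).fetchall()

print(clicks_per_ad)  # [(3, 656), (4, 103)]
```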