How to do nested Group By with django orm? - django

I have the following data:
publisher             title
--------------------  ------------------------
New Age Books         Life Without Fear
New Age Books         Life Without Fear
New Age Books         Sushi, Anyone?
Binnet & Hardley      Life Without Fear
Binnet & Hardley      The Gourmet Microwave
Binnet & Hardley      Silicon Valley
Algodata Infosystems  But Is It User Friendly?
Algodata Infosystems  But Is It User Friendly?
Algodata Infosystems  But Is It User Friendly?
Here is what I want to do: for each publisher, I want to count how many books it published under each title, with all of a publisher's counts collected in a single object.
I want to get the following result:
{publisher: New Age Books, titles: {Life Without Fear: 2, Sushi Anyone?: 1}},
{publisher: Binnet & Hardley, titles: {The Gourmet Microwave: 1, Silicon Valley: 1, Life Without Fear: 1}},
{publisher: Algodata Infosystems, titles: {But Is It User Friendly?: 3}}
My solution goes something along the lines of:
query_set.values('publisher', 'title').annotate(count=Count('title'))
But it is not producing the desired result.

You can post-process the results of the query with the groupby(…) function of the itertools package:
from django.db.models import Count
from itertools import groupby
from operator import itemgetter
qs = query_set.values('publisher', 'title').annotate(
    count=Count('pk')
).order_by('publisher', 'title')
result = [
    {
        'publisher': p,
        'titles': {r['title']: r['count'] for r in rs}
    }
    for p, rs in groupby(qs, itemgetter('publisher'))
]
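To see the post-processing step in isolation, here is a minimal sketch that runs without a database. The `rows` list is a hypothetical stand-in for what the ordered `values(...).annotate(...)` queryset would yield for the sample data, and `nest_by_publisher` is a name made up for this example; only the `groupby` part mirrors the answer above.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for the queryset rows, already sorted by publisher.
# groupby only merges *adjacent* rows that share the same key.
rows = [
    {'publisher': 'Algodata Infosystems', 'title': 'But Is It User Friendly?', 'count': 3},
    {'publisher': 'Binnet & Hardley', 'title': 'Life Without Fear', 'count': 1},
    {'publisher': 'Binnet & Hardley', 'title': 'Silicon Valley', 'count': 1},
    {'publisher': 'Binnet & Hardley', 'title': 'The Gourmet Microwave', 'count': 1},
    {'publisher': 'New Age Books', 'title': 'Life Without Fear', 'count': 2},
    {'publisher': 'New Age Books', 'title': 'Sushi, Anyone?', 'count': 1},
]


def nest_by_publisher(rows):
    """Fold flat (publisher, title, count) rows into one dict per publisher."""
    return [
        {'publisher': publisher, 'titles': {r['title']: r['count'] for r in group}}
        for publisher, group in groupby(rows, key=itemgetter('publisher'))
    ]


result = nest_by_publisher(rows)
for entry in result:
    print(entry)
```

This is also why the `.order_by('publisher', 'title')` call matters: with unsorted input, the same publisher could end up in several separate groups.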

Related

How to do nested Group By with Annotation in django orm?

I have the following data:
publisher             title
--------------------  ------------------------
New Age Books         Life Without Fear
New Age Books         Life Without Fear
New Age Books         Sushi, Anyone?
Binnet & Hardley      Life Without Fear
Binnet & Hardley      The Gourmet Microwave
Binnet & Hardley      Silicon Valley
Algodata Infosystems  But Is It User Friendly?
Algodata Infosystems  But Is It User Friendly?
Algodata Infosystems  But Is It User Friendly?
Here is what I want to do: I want to count how many books with the same title are published by each publisher.
I want to get the following result:
{publisher: New Age Books, title: Life Without Fear, count: 2},
{publisher: New Age Books, title: Sushi Anyone?, count: 1},
{publisher: Binnet & Hardley, title: The Gourmet Microwave, count: 1},
{publisher: Binnet & Hardley, title: Silicon Valley, count: 1},
{publisher: Binnet & Hardley, title: Life Without Fear, count: 1},
{publisher: Algodata Infosystems, title: But Is It User Friendly?, count: 3}
My solution goes something along the lines of:
query_set.values('publisher', 'title').annotate(count=Count('title'))
But it is not producing the desired result.
There is a peculiarity in Django here: an ordering already present on the queryset (such as a model's default ordering) can add extra columns to the GROUP BY, so the rows are not grouped the way you expect. You can set an explicit .order_by() clause to control this:
query_set.values('publisher', 'title').annotate(
    count=Count('pk')
).order_by('publisher', 'title')
With this ordering, the rows "fold" into one group per (publisher, title) pair, and we count the number of primary keys in each group.
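As a rough illustration of the grouping the ORM is being asked for, the following sketch runs comparable SQL against an in-memory SQLite table. The table name `book` and its columns are assumptions made for this example; they are not taken from the question's models, and the SQL is not claimed to be exactly what Django emits.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical table standing in for the model behind query_set.
conn.execute('CREATE TABLE book (id INTEGER PRIMARY KEY, publisher TEXT, title TEXT)')
conn.executemany(
    'INSERT INTO book (publisher, title) VALUES (?, ?)',
    [
        ('New Age Books', 'Life Without Fear'),
        ('New Age Books', 'Life Without Fear'),
        ('New Age Books', 'Sushi, Anyone?'),
        ('Binnet & Hardley', 'Life Without Fear'),
        ('Binnet & Hardley', 'The Gourmet Microwave'),
        ('Binnet & Hardley', 'Silicon Valley'),
        ('Algodata Infosystems', 'But Is It User Friendly?'),
        ('Algodata Infosystems', 'But Is It User Friendly?'),
        ('Algodata Infosystems', 'But Is It User Friendly?'),
    ],
)

# One row per (publisher, title) pair, with the number of primary keys in it.
counts = conn.execute(
    'SELECT publisher, title, COUNT(id) FROM book '
    'GROUP BY publisher, title ORDER BY publisher, title'
).fetchall()
conn.close()

for publisher, title, count in counts:
    print(publisher, '|', title, '|', count)
```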

JSON_CONTAINS not working in django queryset but working fine at mysql workbench

I have a list of devices in my database, like this:
id: 1, category_json: [1, 3]
id: 2, category_json: [1, 4]
id: 3, category_json: [4, 35]
The field 'category_json' is a JSONField. I get an array of the categories that the user wants from the frontend, like this:
categories = [3]
I need to make a query in Django that looks up all devices that have any of the categories above. I did this in Django:
category_filtering = None
for category in categories:
    if not category_filtering:
        category_filtering = Q(category_json__contains=category)
    else:
        category_filtering = category_filtering | Q(category_json__contains=category)
all_devices: QuerySet = Device.objects.filter(Q(status=1) & Q(company__in=companies) & category_filtering)
The code above produces the following query:
SELECT `devices`.`id`, `devices`.`serial`, `devices`.`title`, `devices`.`product_id`, `devices`.`company_id`, `devices`.`server_access_key`, `devices`.`device_local_access_key`, `devices`.`mac_address`, `devices`.`status`, `devices`.`owner_device_id`, `devices`.`connected_equipament`, `devices`.`gps_x`, `devices`.`gps_y`, `devices`.`profile_id`, `devices`.`softwareVer`, `devices`.`hardwareVer`, `devices`.`isVirtual`, `devices`.`timezone`, `devices`.`installed_phase`, `devices`.`installation_local`, `devices`.`group`, `devices`.`relay_state`, `devices`.`last_sync`, `devices`.`phases_ordering`, `devices`.`created_at`, `devices`.`installation_date`, `devices`.`installation_state`, `devices`.`last_change`, `devices`.`last_online`, `devices`.`comments`, `devices`.`enable_bbd_home`, `devices`.`enable_bbd_pro`, `devices`.`enable_tseries_analytics`, `devices`.`enable_gd_production`, `devices`.`card_profile_id`, `devices`.`firmware_app_ver`, `devices`.`firmware_metrol_ver`, `devices`.`commissioning_date`, `devices`.`state`, `devices`.`area_square_meters`, `devices`.`category`, `devices`.`local`, `devices`.`code`, `devices`.`enable_advanced_monitoring`, `devices`.`send_data`, `devices`.`olson_timezone`, `devices`.`language`, `devices`.`extra_fields`, `devices`.`category_json` FROM `devices` WHERE (`devices`.`status` = 1 AND `devices`.`company_id` IN (SELECT U0.`id` FROM `companies` U0 WHERE (U0.`id` IN ((
WITH RECURSIVE ids(id) as (
select 297 as id
union all
select companies.id as id from ids join companies where ids.id = companies.company_owner
)
select id from ids)) AND U0.`is_active` = True)) AND JSON_CONTAINS(`devices`.`category_json`, (CAST("3" AS JSON))))
In Django it returns an empty QuerySet, like this:
all_devices <QuerySet []>
But when I copy this query and run it in MySQL Workbench, it works fine and returns the device I want. What am I doing wrong?
Sorry for my bad english
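For cross-checking the ORM result, it can help to state the intended "any of these categories" rule outside the database. The sketch below applies that rule to the three sample rows in plain Python. Here `devices` is a hand-written copy of the data above and `matching_ids` is a hypothetical helper; the code only shows which rows the query is expected to return and says nothing about why the Django query comes back empty.

```python
# Hand-written copy of the sample rows from the question.
devices = [
    {'id': 1, 'category_json': [1, 3]},
    {'id': 2, 'category_json': [1, 4]},
    {'id': 3, 'category_json': [4, 35]},
]


def matching_ids(devices, categories):
    """Return ids of devices whose category list shares any value with `categories`."""
    wanted = set(categories)
    return [d['id'] for d in devices if wanted & set(d['category_json'])]


ids_for_3 = matching_ids(devices, [3])
ids_for_1_or_35 = matching_ids(devices, [1, 35])
print(ids_for_3, ids_for_1_or_35)
```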

Weird behavior in Django queryset union of values

I want to join the sum of related values from users with the users that do not have those values.
Here's a simplified version of my model structure:
class Answer(models.Model):
    person = models.ForeignKey(Person)
    points = models.PositiveIntegerField(default=100)
    correct = models.BooleanField(default=False)

class Person(models.Model):
    # irrelevant model fields
Sample dataset:
Person | Answer.Points
------ | ------
3 | 50
3 | 100
2 | 100
2 | 90
Person 4 has no answers and therefore no points.
With the query below, I can achieve the sum of points for each person:
people_with_points = Person.objects.\
    filter(answer__correct=True).\
    annotate(points=Sum('answer__points')).\
    values('pk', 'points')
<QuerySet [{'pk': 2, 'points': 190}, {'pk': 3, 'points': 150}]>
But, since some people might not have any related Answer entries, they will have 0 points and with the query below I use Coalesce to "fake" their points, like so:
people_without_points = Person.objects.\
    exclude(pk__in=people_with_points.values_list('pk')).\
    annotate(points=Coalesce(Sum('answer__points'), 0)).\
    values('pk', 'points')
<QuerySet [{'pk': 4, 'points': 0}]>
Both of these work as intended but I want to have them in the same queryset so I use the union operator | to join them:
everyone = people_with_points | people_without_points
Now, for the problem:
After this, the people without points have their points value turned into None instead of 0.
<QuerySet [{'pk': 2, 'points': 190}, {'pk': 3, 'points': 150}, {'pk': 4, 'points': None}]>
Does anyone have any idea why this happens?
Thanks!
I should mention that I can fix that by annotating the queryset again and coalescing the null values to 0, like this:
everyone.\
    annotate(real_points=Concat(Coalesce(F('points'), 0), Value(''))).\
    values('pk', 'real_points')
<QuerySet [{'pk': 2, 'real_points': 190}, {'pk': 3, 'real_points': 150}, {'pk': 4, 'real_points': 0}]>
But I wish to understand why the union does not work as I expected in my original question.
EDIT:
I think I got it. A friend instructed me to use django-debug-toolbar to check my SQL queries to investigate further on this situation and I found out the following:
Since it's a union of two queries, the second query annotation is somehow not considered and the COALESCE to 0 is not used. By moving that to the first query it is propagated to the second query and I could achieve the expected result.
Basically, I changed the following:
# Moved the "Coalesce" to the initial query
people_with_points = Person.objects.\
    filter(answer__correct=True).\
    annotate(points=Coalesce(Sum('answer__points'), 0)).\
    values('pk', 'points')
# Second query does not have it anymore
people_without_points = Person.objects.\
    exclude(pk__in=people_with_points.values_list('pk')).\
    values('pk', 'points')
# We will have the values with 0 here!
everyone = people_with_points | people_without_points
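One way to look at the goal is as a single query: every person, with the sum of the points of their correct answers, falling back to 0. The sketch below shows that idea in raw SQL on an in-memory SQLite database. The table and column names are assumptions that loosely follow the models above, and all four sample answers are assumed to be correct. It illustrates the target result only; it is not the SQL Django generates for `|`. (In Django, `|` on two querysets combines their filter conditions with OR, while a SQL UNION is what `QuerySet.union()` produces.)

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical schema loosely mirroring Person and Answer.
conn.executescript('''
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE answer (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(id),
        points INTEGER NOT NULL,
        correct INTEGER NOT NULL
    );
    INSERT INTO person (id) VALUES (2), (3), (4);
    INSERT INTO answer (person_id, points, correct) VALUES
        (3, 50, 1), (3, 100, 1), (2, 100, 1), (2, 90, 1);
''')

# LEFT JOIN keeps people without answers; COALESCE turns their NULL sum into 0.
totals = conn.execute('''
    SELECT p.id, COALESCE(SUM(a.points), 0) AS points
    FROM person AS p
    LEFT JOIN answer AS a ON a.person_id = p.id AND a.correct = 1
    GROUP BY p.id
    ORDER BY p.id
''').fetchall()
conn.close()

print(totals)
```

In the ORM, a conditional aggregate such as `Coalesce(Sum('answer__points', filter=Q(answer__correct=True)), 0)` on `Person.objects` may express the same thing in one queryset, though that has not been checked against the models in the question.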

What is the most efficient method to parse this line of text?

The following is a row that I have extracted from the web:
AIG $30 AIG is an international renowned insurance company listed on the NYSE. A period is required. Manual Auto Active 3 0.0510, 0.0500, 0.0300 [EXTRACT]
I would like to create 5 separate variables by parsing the text and retrieving the relevant data. However, I seriously don't understand the regex documentation! Can anyone guide me on how I can do it correctly with this example?
Name = AIG
CurrentPrice = $30
Status = Active
World_Ranking = 3
History = 0.0510, 0.0500, 0.0300
It's not entirely clear what you want to achieve here, but there's no need to use regexps; you could just use str.split:
>>> line = "AIG $30 AIG is an international renowned insurance company listed on the NYSE. A period is required. Manual Auto Active 3 0.0510, 0.0500, 0.0300 [EXTRACT]"
>>> tokens = line.split()
>>> fields = {"Name": tokens[0], "CurrentPrice": tokens[1], "Status": tokens[19], "WorldRanking": tokens[20], "History": ' '.join(tokens[21:24])}
>>> fields
{'Name': 'AIG', 'CurrentPrice': '$30', 'Status': 'Active', 'WorldRanking': '3', 'History': '0.0510, 0.0500, 0.0300'}
Instead of using tokens[19] and so on, you may want to use tokens[-n] so that the indexes do not depend on the length of the company's description. Like this:
>>> history = ' '.join(tokens[-4:-1])
>>> history
'0.0510, 0.0500, 0.0300'
If the position or number of the history values can vary, it could be easier to use re:
>>> import re
>>> history = re.findall(r"\d\.\d{4}", line)
>>> history
['0.0510', '0.0500', '0.0300']
To identify the world ranking and the status, you could get the indexes of the history values and then step back by one and by two from the first of them:
>>> [i for i, substr in enumerate(tokens) if re.match(r"\d\.\d{4}", substr)]
[21, 22, 23]
>>> tokens[21:24]
['0.0510,', '0.0500,', '0.0300']
>>> ranking = tokens[20]
>>> ranking
'3'
>>> status = tokens[19]
>>> status
'Active'
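Since the question asked about regular expressions, here is a minimal sketch of a single pattern with named groups that pulls out all five fields at once. It assumes every row has the same layout as the sample: name, price, free-text description, then status, ranking, a comma-separated history, and a trailing "[EXTRACT]". The group names and the `parse_row` helper are made up for this example.

```python
import re

LINE = ("AIG $30 AIG is an international renowned insurance company listed on "
        "the NYSE. A period is required. Manual Auto Active 3 "
        "0.0510, 0.0500, 0.0300 [EXTRACT]")

ROW = re.compile(
    r"^(?P<name>\S+)\s+"                          # first token: company name
    r"(?P<price>\$\d+(?:\.\d+)?)\s+"              # second token: price such as $30
    r".*\s"                                       # description and anything else
    r"(?P<status>\S+)\s+"                         # token right before the ranking
    r"(?P<ranking>\d+)\s+"                        # integer ranking
    r"(?P<history>\d\.\d{4}(?:,\s*\d\.\d{4})*)"   # comma-separated history values
    r"\s+\[EXTRACT\]$"                            # literal trailer
)


def parse_row(line):
    """Return a dict of the five fields, or None if the line does not match."""
    match = ROW.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["ranking"] = int(fields["ranking"])
    fields["history"] = [float(x) for x in re.findall(r"\d\.\d{4}", fields["history"])]
    return fields


parsed = parse_row(LINE)
print(parsed)
```

Returning None for a non-matching line keeps malformed rows from silently producing wrong fields, which is the main advantage over fixed split indexes.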

Using the "extra fields " from django many-to-many relationships with extra fields

The Django documentation gives an example of associating extra data with a many-to-many (M2M) relationship. Although that is straightforward, now that I am trying to make use of the extra data in my views it is feeling very clumsy (which typically means "I'm doing it wrong").
For example, using the models defined in that documentation example, I can do the following:
# Some people
ringo = Person.objects.create(name="Ringo Starr")
paul = Person.objects.create(name="Paul McCartney")
me = Person.objects.create(name="Me the rock Star")
# Some bands
beatles = Group.objects.create(name="The Beatles")
my_band = Group.objects.create(name="My Imaginary band")
# The Beatles form
m1 = Membership.objects.create(person=ringo, group=beatles,
                               date_joined=date(1962, 8, 16),
                               invite_reason="Needed a new drummer.")
m2 = Membership.objects.create(person=paul, group=beatles,
                               date_joined=date(1960, 8, 1),
                               invite_reason="Wanted to form a band.")
# My Imaginary band forms
m3 = Membership.objects.create(person=me, group=my_band,
                               date_joined=date(1980, 10, 5),
                               invite_reason="Want to be a star.")
m4 = Membership.objects.create(person=paul, group=my_band,
                               date_joined=date(1980, 10, 5),
                               invite_reason="Wanted to form a better band.")
Now if I want to print a simple table that for each person gives the date that they joined each band, at the moment I am doing this:
bands = Group.objects.all().order_by('name')
for person in Person.objects.all():
    print(person.name, end=' ')
    for band in bands:
        print(band.name, end=' ')
        try:
            m = person.membership_set.get(group=band.pk)
            print(m.date_joined, end=' ')
        except Membership.DoesNotExist:
            # This person never joined this band
            print('NA', end=' ')
    print()
Which feels very ugly, especially the "m = person.membership_set.get(group=band.pk)" bit. Am I going about this whole thing wrong?
Now say I wanted to order the people by the date that they joined a particular band (say, the Beatles): is there an order_by clause I can put on Person.objects.all() that would let me do that?
Any advice would be greatly appreciated.
You should query the Membership model instead:
members = Membership.objects.select_related('person', 'group').order_by('date_joined')
for m in members:
    print(m.group.name, m.person.name, m.date_joined)
By using select_related here, we avoid the 1 + n queries problem: it tells the ORM to perform the joins and select everything in a single query.
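Once the memberships come back from a single query, the per-person table from the question can be built with a dictionary lookup instead of one `get()` call per cell. The sketch below shows only that lookup step, using plain tuples as a hypothetical stand-in for the `Membership` rows; in a real view, the people and bands would come from their own querysets so that people without any membership still get a row.

```python
from datetime import date

# Hypothetical stand-in rows: (person name, group name, date_joined)
memberships = [
    ('Ringo Starr', 'The Beatles', date(1962, 8, 16)),
    ('Paul McCartney', 'The Beatles', date(1960, 8, 1)),
    ('Me the rock Star', 'My Imaginary band', date(1980, 10, 5)),
    ('Paul McCartney', 'My Imaginary band', date(1980, 10, 5)),
]

# Index the join dates by (person, group) for constant-time lookups.
joined = {(person, group): when for person, group, when in memberships}
people = sorted({person for person, _, _ in memberships})
bands = sorted({group for _, group, _ in memberships})


def table_rows():
    """Build one row per person: name, then a join date or 'NA' per band."""
    rows = []
    for person in people:
        cells = [str(joined.get((person, band), 'NA')) for band in bands]
        rows.append([person] + cells)
    return rows


rows = table_rows()
for row in rows:
    print(' | '.join(row))
```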