Django: annotate a model with multiple counts is super slow

I'm trying to annotate a model that has multiple relationships, with multiple counts of those relationships. But the query is super slow.
Campaign.objects.annotate(
    num_characters=Count("character", distinct=True),
    num_factions=Count("faction", distinct=True),
    num_locations=Count("location", distinct=True),
    num_quests=Count("quest", distinct=True),
    num_loot=Count("loot", distinct=True),
    num_entries=Count("entry", distinct=True),
)
When I say super slow, I mean it: it takes multiple minutes on my local MacBook Pro with the M1 Max 😰 And there aren't even that many rows in these tables.
If I simply fetch all campaigns, loop over them, and then get the counts of all these related objects in separate queries, it's a LOT faster:
campaigns = Campaign.objects.all()
for campaign in campaigns:
    campaign.num_characters = campaign.character_set.count()
    campaign.num_factions = campaign.faction_set.count()
    campaign.num_locations = campaign.location_set.count()
    campaign.num_quests = campaign.quest_set.count()
    campaign.num_loot = campaign.loot_set.count()
    campaign.num_entries = campaign.entry_set.count()
But this is doing a lot of queries of course, which isn't ideal either. Can't this query be optimized somehow?

While a bit ugly, you should be able to speed up the query by using subqueries instead:
from django.db.models import OuterRef, Subquery, Count
Campaign.objects.annotate(
    num_characters=Subquery(Character.objects.filter(campaign=OuterRef('pk')).order_by().values('campaign').annotate(count=Count('campaign')).values('count')),
    num_factions=Subquery(Faction.objects.filter(campaign_id=OuterRef('pk')).order_by().values('campaign').annotate(count=Count('campaign')).values('count')),
    num_locations=Subquery(Location.objects.filter(campaign_id=OuterRef('pk')).order_by().values('campaign').annotate(count=Count('campaign')).values('count')),
    num_quests=Subquery(Quest.objects.filter(campaign_id=OuterRef('pk')).order_by().values('campaign').annotate(count=Count('campaign')).values('count')),
    num_loot=Subquery(Loot.objects.filter(campaign_id=OuterRef('pk')).order_by().values('campaign').annotate(count=Count('campaign')).values('count')),
    num_entries=Subquery(Entry.objects.filter(campaign_id=OuterRef('pk')).order_by().values('campaign').annotate(count=Count('campaign')).values('count')),
)
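A rough way to see why the subquery form tends to be faster: with several Count(..., distinct=True) annotations, the ORM typically joins all the related tables in one statement, so the number of intermediate rows per campaign is the product of the related row counts. The sketch below reproduces that effect with Python's built-in sqlite3 module rather than Django; the table names (characters, quests, loot_items) and the row counts are made up for illustration. It also shows that a grouped correlated subquery yields NULL for a campaign with no related rows, so wrapping it in COALESCE (in Django, Coalesce from django.db.models.functions) may be worth considering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaign (id INTEGER PRIMARY KEY);
    CREATE TABLE characters (id INTEGER PRIMARY KEY, campaign_id INTEGER);
    CREATE TABLE quests (id INTEGER PRIMARY KEY, campaign_id INTEGER);
    CREATE TABLE loot_items (id INTEGER PRIMARY KEY, campaign_id INTEGER);
""")
conn.executemany("INSERT INTO campaign (id) VALUES (?)", [(1,), (2,)])
# Campaign 1 gets 20 rows in each related table; campaign 2 gets none.
for table in ("characters", "quests", "loot_items"):
    conn.executemany(
        f"INSERT INTO {table} (campaign_id) VALUES (?)", [(1,)] * 20
    )

# Roughly the shape of the multi-join query: every related table is
# joined at once, so the rows multiply before anything is counted.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM campaign c
    LEFT JOIN characters ch ON ch.campaign_id = c.id
    LEFT JOIN quests q ON q.campaign_id = c.id
    LEFT JOIN loot_items l ON l.campaign_id = c.id
    WHERE c.id = 1
""").fetchone()[0]

# One correlated subquery per count: each related table is counted on
# its own. The first count is wrapped in COALESCE, the second is not.
subquery_counts = conn.execute("""
    SELECT c.id,
           COALESCE((SELECT COUNT(*) FROM characters ch
                     WHERE ch.campaign_id = c.id
                     GROUP BY ch.campaign_id), 0),
           (SELECT COUNT(*) FROM quests q
            WHERE q.campaign_id = c.id
            GROUP BY q.campaign_id)
    FROM campaign c
    ORDER BY c.id
""").fetchall()

print(joined_rows)      # 8000, i.e. 20 * 20 * 20 rows for a single campaign
print(subquery_counts)  # [(1, 20, 20), (2, 0, None)]
```

With six related tables instead of three, the intermediate result grows by the same multiplication, which would be consistent with the multi-minute runtime described in the question.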

Related

Django viewset annotate in subquery based on filterset field

It seems like this should be a common use case, but I can't find an existing answer online.
I am trying to annotate a count based on a query that is filtered using a filterset field in DRF.
class SurveyViewset(viewsets.ModelViewSet):
    entries = models.SurveyEntry.objects.filter(
        survey=OuterRef("id")
    ).order_by().annotate(
        count=Func(F('id'), function='Count')
    ).values('count')
    queryset = models.Survey.objects.annotate(
        total_entries=Subquery(entries)
    ).all().order_by("id")
    serializer_class = serializers.SurveySerializer
    filter_backends = (
        SurveyQueryParamsValidator,
        CaseInsensitiveOrderingFilter,
        django_filters.DjangoFilterBackend,
        SearchFilter,
    )
    filterset_fields = {
        "surveyaddressgroup": ("exact",),
    }
I have surveys and I want to count the number of SurveyEntry objects for a particular address group.
For example, I run a survey in several shopping centres, and I want to see the results when I choose only one particular centre. At the moment, I get the total count regardless of the filter on the main query.
How can I make the subquery take the filterset choice into account?

Django: querying models not related with FK

I'm developing a Django project. I need to make many queries with the following pattern:
I have two models, not related by a FK, but that can be related by some fields (not their PKs).
I need to query the first model and annotate it with results from the second model, joined on that field, which is not the PK.
I can do it with a Subquery and an OuterRef function.
M2_queryset = M2.objects.filter(f1 = OuterRef('f2'))
M1.objects.annotate(b_f3 = Subquery(M2_queryset.values('f3')))
But if I need to annotate two columns, I need to do this:
M2_queryset = M2.objects.filter(f1 = OuterRef('f2'))
M1.objects.annotate(b_f3 = Subquery(M2_queryset.values('f3'))).annotate(b_f4 = Subquery(M2_queryset.values('f4')))
It's very inefficient because of the two identical subqueries.
It would be nice to be able to do something like this (hypothetical syntax):
M2_queryset = M2.objects.filter(f1 = OuterRef('f2'))
M1.objects.annotate(b_f3, b_f4 = Subquery(M2_queryset.values('f3','f4')))
or, even better, something like this that avoids subqueries altogether:
M1.objects.join(M2 on M2.f1 = M1.f2)...
For example in this model:
[image: db]
I need to do this regular query:
select m1.id, m1.f5, sum(m2.f2), sum(m2.f3)
from M1, M2
where M1.f1 = M2.f2
group by 1, 2
without an FK between f1 and f2.
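To make the target result concrete, here is a small sketch that runs the query above against an in-memory SQLite database using Python's sqlite3 module (not the Django ORM). The sample rows and column types are invented for illustration. It also runs the correlated-subquery formulation, to show that both forms return the same sums on this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE m1 (id INTEGER PRIMARY KEY, f1 INTEGER, f5 TEXT);
    CREATE TABLE m2 (id INTEGER PRIMARY KEY, f2 INTEGER, f3 INTEGER);
    INSERT INTO m1 (id, f1, f5) VALUES (1, 10, 'a'), (2, 20, 'b');
    INSERT INTO m2 (f2, f3) VALUES (10, 1), (10, 2), (20, 5);
""")

# Join on m1.f1 = m2.f2; no foreign key is declared between the tables.
joined = conn.execute("""
    SELECT m1.id, m1.f5, SUM(m2.f2), SUM(m2.f3)
    FROM m1 JOIN m2 ON m1.f1 = m2.f2
    GROUP BY m1.id, m1.f5
    ORDER BY m1.id
""").fetchall()

# The same sums computed with one correlated subquery per column.
correlated = conn.execute("""
    SELECT m1.id, m1.f5,
           (SELECT SUM(m2.f2) FROM m2 WHERE m2.f2 = m1.f1),
           (SELECT SUM(m2.f3) FROM m2 WHERE m2.f2 = m1.f1)
    FROM m1
    ORDER BY m1.id
""").fetchall()

print(joined)      # [(1, 'a', 20, 3), (2, 'b', 20, 5)]
print(correlated)  # same rows as the join
```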

Django distinct related querying

I have two models:
Model A is an AbstractUserModel, and Model B is:
class ModelB:
    user = ForeignKey(User, related_name='modelsb')
    timestamp = DateTimeField(auto_now_add=True)
What I want to find is how many users have at least one ModelB object created on at least 3 of the past 7 days.
So far, I have found a way to do it but I know for sure there is a better one and that is why I am posting this question.
I basically split the query into 2 parts.
Part1:
I added a foo method inside the User Model that checks if a user meets the above conditions
def foo(self):
    past_limit = starting_date - timedelta(days=7)
    return self.modelsb.filter(timestamp__gte=past_limit).order_by('timestamp__day').distinct('timestamp__day').count() > 2
Part 2:
In the Custom User Manager, I find the users that have more than 2 modelsb objects in the last 7 days and iterate through them applying the foo method for each one of them.
By doing this I narrow down the iterations of the required for loop. (Basically it's a filter function, but you get the point.)
def boo(self):
    past_limit = timezone.now() - timedelta(days=7)
    candidates = super().get_queryset().annotate(rc=Count('modelsb', filter=Q(modelsb__timestamp__gte=past_limit))).filter(rc__gt=2)
    return list(filter(lambda x: x.foo(), candidates))
However, I want to know if there is a more efficient way to do this, that is without the for loop.
You can use conditional annotation.
I haven't been able to test this query, but something like this should work:
from datetime import timedelta
from django.db.models import Q, Count
past_limit = starting_date - timedelta(days=7)
users = (
    User.objects.annotate(
        modelsb_in_last_seven_days=Count(
            'modelsb__timestamp__day',
            filter=Q(modelsb__timestamp__gte=past_limit),
            distinct=True,
        )
    )
    .filter(modelsb_in_last_seven_days__gte=3)
)
EDIT:
This solution did not work, because the distinct option does not specify which field makes an entry distinct.
I did some experimenting on my own Django instance and found a way to make this work using Subquery. The idea is to generate a subquery in which we make the distinction ourselves.
from django.db.models import Count, IntegerField, OuterRef, Subquery
counted_modelb = (
    ModelB.objects
    .filter(user=OuterRef('pk'), timestamp__gte=past_limit)
    .values('timestamp__day')
    .distinct()
    .annotate(count=Count('timestamp__day'))
    .values('count')
)
query = (
    User.objects
    .annotate(modelsb_in_last_seven_days=Subquery(counted_modelb, output_field=IntegerField()))
    .filter(modelsb_in_last_seven_days__gt=2)
)
This annotates each row in the queryset with the count of all distinct days in modelb for the user, with a timestamp later than past_limit.
In the subquery I use values('timestamp__day') so that I can call distinct() (a combination of distinct('timestamp__day') and annotate() is unsupported).
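For reference, the result the question asks for can be written in plain SQL as a GROUP BY with HAVING COUNT(DISTINCT ...). The sketch below shows this with Python's sqlite3 module instead of the ORM, so it only illustrates the target query, not a tested Django solution. The table name modelb, the sample rows, and the fixed "now" are assumptions made so the example is repeatable. It counts distinct calendar dates with SQLite's date() function rather than the day of the month.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE modelb (id INTEGER PRIMARY KEY, user_id INTEGER, timestamp TEXT)"
)

now = datetime(2024, 1, 10, 12, 0, 0)  # fixed "now" so the run is repeatable
past_limit = now - timedelta(days=7)

rows = [
    # User 1: four rows spread over three distinct days -> qualifies.
    (1, now - timedelta(days=1)),
    (1, now - timedelta(days=1, hours=2)),
    (1, now - timedelta(days=2)),
    (1, now - timedelta(days=3)),
    # User 2: three rows, all on the same day -> does not qualify.
    (2, now - timedelta(days=1)),
    (2, now - timedelta(days=1, hours=1)),
    (2, now - timedelta(days=1, hours=2)),
    # User 3: three distinct days, but one lies outside the 7-day window.
    (3, now - timedelta(days=1)),
    (3, now - timedelta(days=2)),
    (3, now - timedelta(days=9)),
]
conn.executemany(
    "INSERT INTO modelb (user_id, timestamp) VALUES (?, ?)",
    [(user_id, ts.isoformat(sep=" ")) for user_id, ts in rows],
)

# Users with rows on at least 3 distinct dates since past_limit.
active_users = [
    row[0]
    for row in conn.execute(
        """
        SELECT user_id
        FROM modelb
        WHERE timestamp >= ?
        GROUP BY user_id
        HAVING COUNT(DISTINCT date(timestamp)) >= 3
        ORDER BY user_id
        """,
        (past_limit.isoformat(sep=" "),),
    )
]

print(active_users)  # [1]
```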

QuerySet Optimisations in Django

I was just wondering, I have the following two pseudo-related queries:
organisation = Organisation.objects.get(pk=org_id)
employees = Employee.objects.filter(organisation=organisation).filter(is_active=True)
Each Employee has a ForeignKey relationship with Organisation.
I was wondering if there is anything I can leverage to do the above in one Query in the native Django ORM?
Also, would:
employees = Employee.objects.filter(organisation__id=organisation.id).filter(is_active=True)
Be a quicker way to fetch employees?
For Willem's reference, employees is then used as:
# Before constructing **parameters, it is necessary to filter out any superfluous key-value pairs that do not correspond to model attributes:
if len(request.GET.getlist('gender[]')) > 0:
    parameters['gender__in'] = request.GET.getlist('gender[]')
employees = employees.filter(**parameters)
if len(request.GET.getlist('age_group[]')) > 0:
    parameters['age_group__in'] = request.GET.getlist('age_group[]')
employees = employees.filter(**parameters)
results = SurveyResult.objects.filter(
    user__in=employees,
    created_date__range=date_range,
).annotate(
    date=TruncDate('created_date'),
).values(
    'survey',
    'date',
).annotate(
    score=Sum('normalized_score'),
    participants=Count('user'),
).order_by(
    'survey',
    'date',
)
I omitted this as it seemed like an unnecessary complication to my original goal.
Also, would:
employees = Employee.objects.filter(organisation__id=organisation.id).filter(is_active=True)
Be a quicker way to fetch employees?
No, or perhaps marginally, since that is in essence what the Django ORM will do itself: it will simply obtain the primary key of the organisation and then make a query like the one you describe.
If you do not need the organisation itself, you can query with:
employees = Employee.objects.filter(organisation_id=org_pk, is_active=True)
Furthermore, you can for example perform a .select_related(..) [Django-doc] on the organisation to load the organisation's data in the same query as the employee's, although saving one extra query usually does not make much of a difference. Performance is more of an issue if it results in N+1 queries.
We can for example "piggyback" fetching the Organisation details with fetching the employees, like:
employees = list(
    Employee.objects.select_related('organisation').filter(
        organisation_id=org_pk, is_active=True
    )
)
if employees:  # at least one employee
    organisation = employees[0].organisation
But anyway, as said before, the difference between one and two queries is not that much. It is usually more of a problem if you have N+1 queries. It is a bit of a pity that Django/Python does not seem to have a Haxl [GitHub] equivalent, to enable fast retrieval of (remote) resources through algebraic analysis.
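To illustrate the N+1 point, the sketch below counts round trips for the two access patterns using Python's sqlite3 module instead of Django. The schema, the sample rows, and the run helper are invented for this example; run simply records each statement it executes so the queries can be counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organisation (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        name TEXT,
        is_active INTEGER,
        organisation_id INTEGER REFERENCES organisation(id)
    );
    INSERT INTO organisation (id, name) VALUES (1, 'Acme');
    INSERT INTO employee (name, is_active, organisation_id)
    VALUES ('Ann', 1, 1), ('Bob', 1, 1), ('Cid', 1, 1);
""")

query_log = []


def run(sql, params=()):
    """Execute a query and record it, so round trips can be counted."""
    query_log.append(sql)
    return conn.execute(sql, params).fetchall()


# N+1 pattern: one query for the employees, then one more per employee.
employees = run(
    "SELECT id, name, organisation_id FROM employee "
    "WHERE is_active = 1 ORDER BY id"
)
names_n_plus_one = []
for _id, name, org_id in employees:
    org_name = run("SELECT name FROM organisation WHERE id = ?", (org_id,))[0][0]
    names_n_plus_one.append((name, org_name))
n_plus_one_queries = len(query_log)

query_log.clear()

# Join pattern (what select_related aims for): a single query.
names_joined = run("""
    SELECT e.name, o.name
    FROM employee e
    JOIN organisation o ON o.id = e.organisation_id
    WHERE e.is_active = 1
    ORDER BY e.id
""")
joined_queries = len(query_log)

print(n_plus_one_queries, joined_queries)  # 4 1
```

Both patterns return the same data here; the difference is that the first one grows by one query for every extra employee.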
In case you are interested in the Employee survey results, you can query with:
results = SurveyResult.objects.filter(
    user__organisation_id=org_pk,
    created_date__range=date_range,
).annotate(
    date=TruncDate('created_date'),
).values(
    'survey',
    'date',
).annotate(
    score=Sum('normalized_score'),
    participants=Count('user'),
).order_by(
    'survey',
    'date',
)
You can thus omit a separate querying of Employees if you do not need these anyway.
You can furthermore add the filters to your query, like:
emp_filter = {}
genders = request.GET.getlist('gender[]')
if genders:
    emp_filter['user__gender__in'] = genders
age_groups = request.GET.getlist('age_group[]')
if age_groups:
    emp_filter['user__age_group__in'] = age_groups
results = SurveyResult.objects.filter(
    user__organisation_id=org_pk,
    created_date__range=date_range,
    **emp_filter
).annotate(
    date=TruncDate('created_date'),
).values(
    'survey',
    'date',
).annotate(
    score=Sum('normalized_score'),
    participants=Count('user'),
).order_by(
    'survey',
    'date',
)
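If more optional filters get added later, the dictionary-building step could be factored into a small helper. The sketch below is plain Python with no Django dependency; the function name build_employee_filters, the field_for_param mapping, and the user__ prefix are assumptions for illustration, and a plain dict of lists stands in for request.GET.

```python
from typing import Dict, List, Mapping, Sequence


def build_employee_filters(
    params: Mapping[str, Sequence[str]], prefix: str = "user__"
) -> Dict[str, List[str]]:
    """Translate multi-valued request parameters into ``__in`` lookups."""
    # Hypothetical mapping from query-string keys to model field names.
    field_for_param = {"gender[]": "gender", "age_group[]": "age_group"}
    filters: Dict[str, List[str]] = {}
    for param, field in field_for_param.items():
        values = list(params.get(param, []))
        if values:  # skip parameters that are absent or empty
            filters[f"{prefix}{field}__in"] = values
    return filters


emp_filter = build_employee_filters({"gender[]": ["F", "M"], "age_group[]": []})
print(emp_filter)  # {'user__gender__in': ['F', 'M']}
```

The resulting dictionary can then be unpacked into .filter(**emp_filter) in the same way as in the query above.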
If you have a foreign key relation between organisation and employees, then you can get the employees using select_related like this:
employees = Employee.objects.select_related('organisation').filter(is_active=True)
OR
organisation = Organisation.objects.get(pk=org_id)
employees = organisation.employee_set.all()  # <your_employee_model_name>_set.all()

filtering the order_by relationship in Django ORM

In the below, product has many writers through contributor, and contributor.role_code defines the exact kind of contribution made to the product. Is it possible with the Django ORM to filter the contributors referenced by the order_by() method below? E.g. I want to order products only by contributors such that contributor.role_code in ['A01', 'B01'].
Product.objects.filter(
    product_type__name=choices.PRODUCT_BOOK
).order_by(
    'contributor__writer__last_name'  # filter which contributors it uses?
)
You can do this via an annotation subquery:
1. Define the Subquery that represents the thing we want to order by.
2. Annotate the original QuerySet with the Subquery.
3. Order by the annotation.
from django.db.models import OuterRef, Subquery
contributors_for_ordering = Contributor.objects.filter(  # 1
    product=OuterRef('pk'),
    role_code__in=['A01', 'B01'],
).values('writer__last_name')
queryset = Product.objects.filter(
    product_type__name=choices.PRODUCT_BOOK
).annotate(  # 2
    writer_last_name=Subquery(contributors_for_ordering[:1])  # Slice [:1] to ensure a single result
).order_by(  # 3
    'writer_last_name'
)
Note, however, that there is a potential quirk here. If a Product has contributors with both 'A01' and 'B01', we haven't controlled which one will be used for ordering; we'll get whichever the database returns first. You can add an order_by clause to contributors_for_ordering to deal with that.
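The effect of adding an ordering inside the subquery can be sketched at the SQL level with Python's sqlite3 module (this is not the ORM, and the schema and rows are invented for illustration). Ordering the inner query by role_code and then last name makes the choice of contributor deterministic; in Django the rough equivalent would be an order_by('role_code', 'writer__last_name') on contributors_for_ordering before slicing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE contributor (
        id INTEGER PRIMARY KEY,
        product_id INTEGER,
        role_code TEXT,
        writer_last_name TEXT
    );
    INSERT INTO product (id, title) VALUES (1, 'First'), (2, 'Second');
    INSERT INTO contributor (product_id, role_code, writer_last_name) VALUES
        (1, 'B01', 'Adams'),
        (1, 'A01', 'Young'),
        (1, 'E07', 'Aaron'),
        (2, 'A01', 'Miller');
""")

# For each product, pick exactly one contributor among roles A01/B01,
# preferring A01, then sort the products by that contributor's last name.
ordered = conn.execute("""
    SELECT p.title,
           (SELECT c.writer_last_name
            FROM contributor c
            WHERE c.product_id = p.id
              AND c.role_code IN ('A01', 'B01')
            ORDER BY c.role_code, c.writer_last_name
            LIMIT 1) AS sort_name
    FROM product p
    ORDER BY sort_name
""").fetchall()

print(ordered)  # [('Second', 'Miller'), ('First', 'Young')]
```

Product 'First' is sorted by 'Young' (role A01) rather than 'Adams' (role B01), and the 'E07' contributor is ignored entirely.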
To filter on specific values, first build your list of accepted values:
accepted_values = ['A01', 'B01']
Then filter for values in this list:
Product.objects.filter(
    product_type__name=choices.PRODUCT_BOOK
).filter(
    contributor__role_code__in=accepted_values
).order_by(
    'contributor__writer__last_name'
)
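One thing to watch with this join-based approach: filtering and ordering across a multi-valued relationship can return the same product more than once, once per matching contributor. The sketch below shows the effect in plain SQL with Python's sqlite3 module, assuming the filter and the ordering end up sharing a single join; the schema and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE contributor (
        id INTEGER PRIMARY KEY,
        product_id INTEGER,
        role_code TEXT,
        writer_last_name TEXT
    );
    INSERT INTO product (id, title) VALUES (1, 'First'), (2, 'Second');
    INSERT INTO contributor (product_id, role_code, writer_last_name) VALUES
        (1, 'B01', 'Adams'),
        (1, 'A01', 'Young'),
        (2, 'A01', 'Miller');
""")

# Join, filter on role_code, and order by the joined last name.
titles = [
    row[0]
    for row in conn.execute("""
        SELECT p.title
        FROM product p
        JOIN contributor c ON c.product_id = p.id
        WHERE c.role_code IN ('A01', 'B01')
        ORDER BY c.writer_last_name
    """)
]

print(titles)  # ['First', 'Second', 'First']
```

Product 'First' appears twice because it has two matching contributors, which is the situation the subquery approach avoids by selecting a single row per product.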