One query is more than 10 times faster than the other; it looks like prefetch_related has no effect. How do I do this correctly?
# 400ms
test = PZItem.objects.all().aggregate(Sum('quantity'))
# 4000ms
test = PZ.objects.prefetch_related('pzitem_set').aggregate(Sum('pzitem__quantity'))
.aggregate() is already a database function; there is no need to prefetch all the related items first. aggregate() returns a plain dict, so the prefetch gives it no benefit.
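Either single-query form works; a minimal sketch, assuming the PZ/PZItem models from the question:

from django.db.models import Sum

# Sum directly on the related model - one query, no join.
total = PZItem.objects.aggregate(total=Sum('quantity'))['total']

# Or sum through the parent - still one query, but with a join,
# which is likely why it measures slower here.
total = PZ.objects.aggregate(total=Sum('pzitem__quantity'))['total']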
You do need prefetch_related if, for example, you iterate over the related items:
from django.db.models import Prefetch

test = PZ.objects.all().prefetch_related(
    Prefetch(
        'pzitem_set',
        queryset=PZItem.objects.filter(active=True),  # optional queryset filter
        to_attr='pzitem_list'
    )
)
# version 1: with prefetch
for pz in test:
    for item in pz.pzitem_list:
        # this does NOT hit the db every time, as pzitem_list is prefetched
        print(item)

# version 2: without prefetch
for pz in test:
    for item in pz.pzitem_set.all():
        # this hits the database for every pz in the parent loop
        print(item)
I would like to limit the number of times the database is queried for categories in the following code.
I'm aware a QuerySet can be constructed, filtered, sliced, and generally passed around without hitting the database. No database activity occurs until you do something to evaluate the queryset.
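For instance, a trivial illustration of that laziness (a hypothetical filter, just for demonstration):

qs = Category.objects.filter(code__startswith='A')  # builds the queryset; no query yet
qs = qs[:10]                                        # slicing still doesn't hit the database
first_ten = list(qs)                                # the query runs here, on evaluation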
In the code below I rarely have to create a new category, since one usually already exists, but the code appears to query the database on every item in the response.json loop due to the get_or_create line:
categories = Category.objects.all()
__new_transactions = []

for item in response.json:
    category, _ = Category.objects.get_or_create(code=item['category_code'])
    transaction = Transaction(
        category=category
    )
    __new_transactions.append(transaction)

# Save the transactions to the database in bulk.
Transaction.objects.bulk_create(__new_transactions)
Can I make the above more efficient?
My initial idea was to check whether the transaction's category is already in the categories list. However, categories is an extensive list (5,000 items), so the membership check item['category'] in categories is itself expensive:
categories = Category.objects.all()
__new_transactions = []

for item in response.json:
    # check if the transaction category is in the categories list
    if item['category'] in categories:
        category = categories.get(code=item['category'])
    else:
        category = Category.objects.create(code=item['category'], name=item['category'])
    transaction = Transaction(
        category=category
    )
    __new_transactions.append(transaction)

# Save the transactions to the database in bulk.
Transaction.objects.bulk_create(__new_transactions)
What would be the best way to approach this?
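One way to get down to a fixed number of queries is to load the existing categories into a dict keyed by code, bulk-create the missing ones, and only then build the transactions. A sketch of that idea, assuming item['category_code'] carries the code and that name can default to the code (adjust to your real fields):

# Map existing category codes to instances in one query.
categories_by_code = {c.code: c for c in Category.objects.all()}

# Create any categories we haven't seen yet in a single bulk insert.
missing_codes = {item['category_code'] for item in response.json} - set(categories_by_code)
for category in Category.objects.bulk_create(
    Category(code=code, name=code) for code in missing_codes
):
    categories_by_code[category.code] = category

# Build all transactions in memory, then insert them in bulk.
Transaction.objects.bulk_create(
    Transaction(category=categories_by_code[item['category_code']])
    for item in response.json
)

Note that bulk_create only returns instances with primary keys set on backends such as PostgreSQL; on other databases you may need to re-query the newly created categories.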
I have the following model in Django:

class task(models.Model):
    admin = models.BooleanField()
    date = models.DateField()
I am trying to write a filter which gives me a queryset that prioritizes rows where admin=True.
So assume I have these 3 rows:
admin = True , date = 01-01-2019
admin = False , date = 01-01-2019
admin = False , date = 02-02-2019
The output of the queryset should be:
admin = True , date = 01-01-2019
admin = False , date = 02-02-2019
It should filter out the 01-01-2019 row with admin=False, because there is already an admin=True row for that date, which takes priority.
I could do it by fetching everything and removing rows from the queryset myself, but I want to make sure there is no better way of doing it first.
Rather than looping through the QuerySet and removing them yourself, one thing you could do is:
Fetch all the dates where admin is True
Fetch all the objects where either:
i. admin is True
ii. the date is not in part 1 (i.e. admin is False)
This can be achieved with the following:
from django.db.models import Q
true_dates = task.objects.filter(admin=True).values_list("date", flat=True)
results = task.objects.filter(Q(admin=True)|~Q(date__in=true_dates))
This will most likely be more efficient than looping through your results yourself.
Note that since querysets are 'lazy' (meaning they are only evaluated when they absolutely have to be), this will result in just one db hit.
Tim's answer is close but incomplete, because it doesn't use Subquery().
This answer provides the same results, without having an additional query hit the database:
from django.db.models import Subquery, Q
dates = Task.objects.filter(admin=True)
tasks = Task.objects.filter(Q(admin=True) | ~Q(date__in=Subquery(dates.values('date'))))
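One way to confirm that the dates end up as a nested SELECT rather than a separate query is to inspect the SQL Django generates:

# The admin=True dates appear as a subquery inside this single statement,
# so evaluating `tasks` hits the database only once.
print(tasks.query)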
I have Users who take Surveys periodically. The system has multiple surveys which it issues at set intervals from the submitted date of the last issued survey of that particular type.
class Survey(Model):
    name = CharField()
    description = TextField()
    interval = DurationField()
    users = ManyToManyField(User, related_name='registered_surveys')
    ...

class SurveyRun(Model):
    ''' A user's answers for 1 taken survey '''
    user = ForeignKey(User, related_name='runs')
    survey = ForeignKey(Survey, related_name='runs')
    created = models.DateTimeField(auto_now_add=True)
    submitted = models.DateTimeField(null=True, blank=True)
    # answers = ReverseForeignKey...
So with the models above a user should be alerted to take survey A next on this date:
A.interval + SurveyRun.objects.filter(
    user=user,
    survey=A
).latest('submitted').submitted
I want to run a daily periodic task which queries all users and creates new runs for every user who has a survey due, according to these criteria:
For each survey the user is registered:
if no runs exist for that user-survey combo then create the first run for that user-survey combination and alert the user
if there are runs for that survey and none are open (an open run has been created but not submitted, so submitted=None), and the latest one's submitted date plus the survey's interval is <= today, create a new run for that user-survey combo and alert the user
Ideally I could create a manager method which would annotate with a surveys_due field like:
users_with_surveys_due = User.objects.with_surveys_due().filter(surveys_due__isnull=False)
Where the annotated field would be a queryset of Survey objects for which the user needs to submit a new round of answers.
And I could issue alerts like this:
for user in users_with_surveys_due.all():
    for survey in user.surveys_due:
        new_run = SurveyRun.objects.create(
            user=user,
            survey=survey
        )
        alert_user(user, new_run)
However, I would settle for a boolean flag annotation on the User object indicating that one of the registered_surveys needs a new run.
How would I go about implementing something like this with_surveys_due() manager method so Postgres does all the heavy lifting? Is it possible to annotate with a collection of objects, like a reverse FK?
UPDATE:
For clarity, here is my current task in Python:
def make_new_runs_and_alert_users():
    runs = []
    Srun = apps.get_model('surveys', 'SurveyRun')
    for user in get_user_model().objects.prefetch_related('registered_surveys', 'runs').all():
        for srvy in user.registered_surveys.all():
            runs_for_srvy = user.runs.filter(survey=srvy)
            # no runs exist for this registered survey, create the first run
            if not runs_for_srvy.exists():
                runs.append(Srun(user=user, survey=srvy))
                ...
            # check this survey has no open runs
            elif not runs_for_srvy.filter(submitted=None).exists():
                latest = runs_for_srvy.latest('submitted')
                if (latest.submitted + srvy.interval) <= timezone.now():
                    runs.append(Srun(user=user, survey=srvy))
    Srun.objects.bulk_create(runs)
UPDATE #2:
In attempting to use Dirk's solution I have this simple example:
In [1]: test_user.runs.values_list('survey__name', 'submitted')
Out[1]: <SurveyRunQuerySet [('Test', None)]>
In [2]: test_user.registered_surveys.values_list('name', flat=True)
Out[2]: <SurveyQuerySet ['Test']>
The user has one open run (submitted=None) for the Test survey and is registered to one survey (Test). He/she should not be flagged for a new run, since there is an un-submitted run outstanding for the only survey he/she is registered for. So I create a function encapsulating Dirk's solution, called get_users_with_runs_due:
In [10]: get_users_with_runs_due()
Out[10]: <UserQuerySet [<User: test@gmail.com>]>  # <-- should be an empty queryset

In [107]: for user in _:
     ...:     print(user.email, user.has_survey_due)
test@gmail.com True  # <-- should be False
UPDATE #3:
In my previous update I had made some changes to the logic to properly match what I wanted, but neglected to mention or show them. Here is the query function, with comments beside the changes:
def get_users_with_runs_due():
    today = timezone.now()
    survey_runs = SurveyRun.objects.filter(
        survey=OuterRef('pk'),
        user=OuterRef(OuterRef('pk'))
    ).order_by('-submitted')
    pending_survey_runs = survey_runs.filter(submitted__isnull=True)
    surveys = Survey.objects.filter(
        users=OuterRef('pk')
    ).annotate(
        latest_submission_date=Subquery(
            survey_runs.filter(submitted__isnull=False).values('submitted')[:1]
        )
    ).annotate(
        has_survey_runs=Exists(survey_runs)
    ).annotate(
        has_pending_runs=Exists(pending_survey_runs)
    ).filter(
        Q(has_survey_runs=False) |  # either has no runs for this survey or
        (  # has no pending runs and submission date meets criteria
            Q(has_pending_runs=False, latest_submission_date__lte=today - F('interval'))
        )
    )
    return User.objects.annotate(has_survey_due=Exists(surveys)).filter(has_survey_due=True)
UPDATE #4:
I tried to isolate the issue by creating a function which makes most of the annotations on the Surveys for a given user, to check the annotations at that level before querying the User model with them:
def annotate_surveys_for_user(user):
    today = timezone.now()
    survey_runs = SurveyRun.objects.filter(
        survey=OuterRef('pk'),
        user=user
    ).order_by('-submitted')
    pending_survey_runs = survey_runs.filter(submitted=None)
    return Survey.objects.filter(
        users=user
    ).annotate(
        latest_submission_date=Subquery(
            survey_runs.filter(submitted__isnull=False).values('submitted')[:1]
        )
    ).annotate(
        has_survey_runs=Exists(survey_runs)
    ).annotate(
        has_pending_runs=Exists(pending_survey_runs)
    )
This worked as expected: the annotations were accurate, and filtering with:
result.filter(
    Q(has_survey_runs=False) |
    (
        Q(has_pending_runs=False) &
        Q(latest_submission_date__lte=today - F('interval'))
    )
)
produced the desired results: an empty queryset where the user should not have any runs due, and vice versa. Why is this not working when it becomes the subquery and I query from the User model?
To annotate users with whether or not they have a survey due, I'd suggest using a Subquery expression:
from django.db.models import Q, F, OuterRef, Subquery, Exists
from django.utils import timezone

today = timezone.now()
survey_runs = SurveyRun.objects.filter(
    survey=OuterRef('pk'),
    user=OuterRef(OuterRef('pk'))
).order_by('-submitted')
pending_survey_runs = survey_runs.filter(submitted__isnull=True)
surveys = Survey.objects.filter(
    users=OuterRef('pk')
).annotate(
    latest_submission_date=Subquery(
        survey_runs.filter(submitted__isnull=False).values('submitted')[:1]
    )
).annotate(
    has_survey_runs=Exists(survey_runs)
).annotate(
    has_pending_runs=Exists(pending_survey_runs)
).filter(
    Q(has_survey_runs=False) |
    (Q(latest_submission_date__lte=today - F('interval')) & Q(has_pending_runs=False))
)
User.objects.annotate(has_survey_due=Exists(surveys)).filter(has_survey_due=True)
I'm still trying to figure out how to do the other one. You cannot annotate a queryset with another queryset; annotation values must be field equivalents. Unfortunately, you also cannot use a Subquery as the queryset parameter of Prefetch. But since you're using PostgreSQL, you could in principle use an ArrayField to list the ids of the surveys in a single annotated value. I haven't found a way to do that, though, as you can't use an aggregate inside a Subquery.
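For the record, one PostgreSQL-specific alternative I can think of (not a drop-in replacement for the Subquery approach above) is ArrayAgg from django.contrib.postgres, which collapses the related ids into a single array annotation; the due/not-due logic would still have to be applied per survey:

from django.contrib.postgres.aggregates import ArrayAgg

# Annotate each user with the ids of the surveys they are registered to.
users = User.objects.annotate(
    registered_survey_ids=ArrayAgg('registered_surveys__id', distinct=True)
)
for user in users:
    print(user.pk, user.registered_survey_ids)  # e.g. [1, 3, 7]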
Is it possible to annotate/count against a prefetched query?
My initial query, below, is based on circuits; I then realised that if a site does not have any circuits, I won't get a 'None' category showing that site as Down.
conn_data = Circuits.objects.all() \
.values('circuit_type__circuit_type') \
.exclude(active_link=False) \
.annotate(total=Count('circuit_type__circuit_type')) \
.order_by('circuit_type__monitor_priority')
So I changed to querying sites and using prefetch, which now gives an empty circuits_set for any site that does not have an active link. Is there a Django way of computing the new totals against that circuits_set within conn_data? I was going to loop through all the sites manually and add the totals that way, but wanted to know if this can be done within the QuerySet instead.
My end result should look something like:
[
{'circuit_type__circuit_type': 'Fibre', 'total': 63},
{'circuit_type__circuit_type': 'DSL', 'total': 29},
{'circuit_type__circuit_type': 'None', 'total': 2}
]
prefetch query:
conn_data = SiteData.objects.prefetch_related(
    Prefetch(
        'circuits_set',
        queryset=Circuits.objects.exclude(
            active_link=False).select_related('circuit_type'),
    )
)
I don't think this will work, and it's debatable whether it should. Let's refer to what prefetch_related does:
Returns a QuerySet that will automatically retrieve, in a single batch, related objects for each of the specified lookups.
So what happens here is that two queries are dispatched and two lists are realized. These lists are then partitioned in memory and grouped to the correct parent records.
Count() and annotate() are directives to the DBMS that resolve to SQL:
SELECT COUNT(id) FROM conn_data
Because of the way annotate and prefetch_related work, I think it's unlikely they will play nicely together. prefetch_related is just a convenience, though. From a practical perspective, running two separate ORM queries and assigning the results to your SiteData records yourself is effectively the same thing. So something like...
# Gets all Circuits counted and grouped by SiteData
Circuits.objects.values('sitedata_id').exclude(active_link=False).annotate(total=Count('sitedata_id'))
Then you just loop over your SiteData records and assign the counts, along these lines.
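A sketch of that mapping step, assuming the FK column from Circuits to SiteData is sitedata_id as in the query above:

# One query: counts of active circuits grouped by site.
totals_by_site = {
    row['sitedata_id']: row['total']
    for row in Circuits.objects.exclude(active_link=False)
                               .values('sitedata_id')
                               .annotate(total=Count('sitedata_id'))
}

# Attach the count to each site; sites with no active link get 0.
for site in SiteData.objects.all():
    site.circuit_count = totals_by_site.get(site.pk, 0)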
OK, I got what I wanted with this. There's probably a better way of doing it, but it works nevertheless:
from collections import Counter
import operator

class ConnData(object):
    def __init__(self, priority='', c_type='', count=0):
        self.priority = priority
        self.c_type = c_type
        self.count = count

    def __repr__(self):
        return '{} {}'.format(self.__class__.__name__, self.c_type)
# get all the site data
conn_data = SiteData.objects.exclude(
    Q(site_type__site_type='Data Centre') | Q(site_type__site_type='Factory')
).prefetch_related(
    Prefetch(
        'circuits_set',
        queryset=Circuits.objects.exclude(active_link=False).select_related('circuit_type'),
    )
)

# create a list for the conns
conns = []

# add items to the list of dictionaries with all required fields
for conn in conn_data:
    try:
        conn_type = conn.circuits_set.all()[0].circuit_type.circuit_type
        priority = conn.circuits_set.all()[0].circuit_type.monitor_priority
        conns.append({'circuit_type': conn_type, 'priority': priority})
    except IndexError:
        # no active circuits: create a category for down sites
        conns.append({'circuit_type': 'Down', 'priority': 10})

# create a new list for the class instances
conn_counts = []

# create counter data
conn_count_data = Counter((d['circuit_type'], d['priority']) for d in conns)

# loop through the counter data and add class instances to the list
for val, count in conn_count_data.items():
    cc = ConnData()
    cc.priority = val[1]
    cc.c_type = val[0]
    cc.count = count
    conn_counts.append(cc)

# sort the instances by priority
conn_counts = sorted(conn_counts, key=operator.attrgetter('priority'))
I'm working on a project with two GenericRelations in a model. I discovered that the relations are useless and, moreover, there are 3 million records that we don't need anymore. Is there any way to delete them fast?
Removing the field in a migration has no effect, because the relation is generic.
So I tried:
import time

from django.contrib.contenttypes.models import ContentType
from app.core import models as m

# UserInformation has a GenericRelation with Address
c = m.UserInformation.objects.first()
c_type = ContentType.objects.get_for_model(c)

# get all the Address records generically related to UserInformation
query = m.Address.objects.filter(content_type_id=c_type.id)

start = time.time()
i = 0
stop_iteration = 10
for address in query:
    i += 1
    address.delete()
    if i == stop_iteration:
        break
end = time.time()

seconds = end - start
print('Execution of %s deletes: %3d seconds' % (stop_iteration, seconds))
The result:
Execution of 10 deletes: 34 seconds
At that rate it would take roughly 39 days to delete ~1 million records.
Is there any way to do that quicker?
A generic relation is defined by a content_type and an object_id. If you know the content_type, you can find all object_id values and delete them in one query. I don't know the fields in your model, but it should be something like this:
# get all related object ids
object_ids = m.Address.objects.filter(content_type_id=c_type.id)\
                              .values_list('object_id', flat=True)

# delete them in one query
YourModel.objects.filter(id__in=object_ids).delete()
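If the goal is simply to remove the Address rows themselves, a single bulk delete on the filtered queryset may be all that's needed; note that QuerySet.delete() performs a bulk delete and skips any custom per-object delete() methods:

# one bulk delete of all generically-related Address rows for that content type
m.Address.objects.filter(content_type_id=c_type.id).delete()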