I've run into an interesting situation in a new app I've added to an existing project. My goal is to (using a Celery task) update many rows at once with a value that includes annotated aggregated values from foreign keyed objects. Here are some example models that I've used in previous questions:
class Book(models.Model):
    author = models.CharField(max_length=255)
    num_pages = models.IntegerField()
    num_chapters = models.IntegerField()

class UserBookRead(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    user_book_stats = models.ForeignKey('UserBookStats', on_delete=models.CASCADE)
    book = models.ForeignKey(Book, on_delete=models.CASCADE)
    complete = models.BooleanField(default=False)
    pages_read = models.IntegerField()

class UserBookStats(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    total_pages_read = models.IntegerField()
I'm attempting to:
1. Use the post_save signal from Book instances to update pages_read on related UserBookRead objects when a Book page count is updated.
2. At the end of the signal, launch a background Celery task to roll up the pages_read from each UserBookRead which was updated, and update the total_pages_read on each related UserBookStats (this is where the problem occurs).
I'm trying to be as lean as possible in terms of the number of queries. Step 1 is complete and only requires a few queries for my actual use case, which seems acceptable for a signal handler as long as those queries are optimized properly.
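For context, a minimal sketch of what that step-1 handler might look like (the question doesn't show it; the handler name, the sync rule, and the update_book_stats task are assumptions for illustration):

from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Book)
def book_post_save(sender, instance, **kwargs):
    # illustrative rule: completed reads track the book's current page count
    UserBookRead.objects.filter(book=instance, complete=True).update(
        pages_read=instance.num_pages
    )
    # hand the heavier roll-up off to Celery (update_book_stats is hypothetical)
    update_book_stats.delay(instance.pk)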
Step 2 is more involved, hence the delegation to a background task. I've managed to accomplish most of it in a fairly clean manner (well, for me at least).
The problem I run into is that when annotating the UserBookStats queryset with a total_pages aggregation (the Sum() of all pages_read for related UserBookRead objects), I can't follow that with a straight update of the queryset to set the total_pages_read field.
Here's the code (the Book instance is passed to the task as book):
from django.db.models import F, Sum

# use the provided book instance to get the stats which need to be updated
book_read_objects = UserBookRead.objects.filter(book=book)
book_stat_objects = UserBookStats.objects.filter(
    id__in=book_read_objects.values_list('user_book_stats__id', flat=True).distinct()
)
# annotate top-level stats objects with the summed page count
book_stat_objects = book_stat_objects.annotate(total_pages=Sum(F('userbookread__pages_read')))
# update the objects with that sum
book_stat_objects.update(total_pages_read=F('total_pages'))
On executing the last line, this error is thrown:
django.core.exceptions.FieldError: Aggregate functions are not allowed in this query
After some research, I found an existing Django ticket for this use case here, on which the last comment mentions 2 new features in 1.11 that could make it possible.
Is there any known/accepted way to accomplish this use case, perhaps using Subquery or OuterRef? I haven't had any success trying to fold in the aggregation as a Subquery. The fallback here is:
for obj in book_stat_objects:
    obj.total_pages_read = obj.total_pages
    obj.save()
But with potentially tens of thousands of records in book_stat_objects, I'm really trying to avoid issuing an UPDATE for each one individually.
I ended up figuring out how to do this with Subquery and OuterRef, but had to take a different approach than I originally expected.
I was able to quickly get a Subquery working; however, when I used it to annotate the parent query, I noticed that every annotated value was the first result of the subquery. That was when I realized I needed OuterRef, because the generated SQL wasn't restricting the subquery by anything in the parent query.
This part of the Django docs was super helpful, as was this StackOverflow question. What this process boils down to is that you have to use Subquery to create the aggregation, and OuterRef to ensure the subquery restricts aggregated rows by the parent query PK. At that point, you can annotate with the aggregated value and directly make use of it in a queryset update().
As I mentioned in the question, the code examples are made up; I've adapted them here to reflect the changes from my actual use case:
from django.db.models import F, OuterRef, Subquery, Sum
from django.db.models.functions import Coalesce

# create the queryset to use as the subquery, restricted to the `book_stat_objects` queryset
book_reads = UserBookRead.objects.filter(
    user_book_stats__in=book_stat_objects, user_book_stats=OuterRef('pk')
).values('user_book_stats')
# annotate the future subquery with the aggregation of pages_read from each UserBookRead,
# keeping only that single column so the subquery returns one value per row
total_pages = book_reads.annotate(total=Sum(F('pages_read'))).values('total')
# annotate each stat object with the subquery total
book_stat_objects = book_stat_objects.annotate(total=Coalesce(Subquery(total_pages), 0))
# update each row with the new total pages count
book_stat_objects.update(total_pages_read=F('total'))
It felt odd to create a queryset that can't be used on its own (trying to evaluate book_reads will throw an error due to the inclusion of OuterRef), but once you examine the final SQL generated for book_stat_objects, it makes sense.
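If you want to see that SQL for yourself, printing the queryset's query attribute before calling update() is a quick sanity check (not part of the original answer):

# show the SELECT with the correlated subquery produced by the annotation
print(book_stat_objects.query)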
EDIT
I ended up running into a bug with this code a week or two after figuring out this answer. It turned out to be due to a default ordering for the UserBookRead model. As the Django docs state, default ordering is incorporated into any aggregate GROUP BY clauses, so all of my aggregates were off. The solution to that is to clear the default ordering with a blank order_by() when creating the base subquery:
book_reads = UserBookRead.objects.filter(
    user_book_stats__in=book_stat_objects, user_book_stats=OuterRef('pk')
).values('user_book_stats').order_by()
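For reference, the kind of default ordering that bites here is just a Meta.ordering on the related model; the exact ordering below is an assumption, since the real model isn't shown above:

class UserBookRead(models.Model):
    ...
    class Meta:
        # any default ordering is folded into the aggregate's GROUP BY
        # unless it is cleared with an empty order_by() on the subquery
        ordering = ['-id']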
Related
Suppose we have the following model:
class Publications(models.Model):
    author = ..........
    post = ..........
and we don't want duplicate records to be stored in the database.
This could be done with unique_together on the model:
class Meta:
    unique_together = ('author', 'post')
or it could be done in the view with something like:
register_exist = Publications.objects.filter(...).exists()
if register_exist == False:
    # Code to save the info
What are the advantages or disadvantages of using these methods?
class Meta:
    unique_together = ('author', 'post')
This is a constraint at the database level. It keeps the data consistent no matter which view inputs the data.
But the other one:
register_exist = Publications.objects.filter(...).exists()
if register_exist == False:
    # Code to save the info
This is a constraint at the application level. There is a cost to querying and checking whether the record already exists, and the data can become inconsistent if somebody adds a new record without going through this check (by accident or otherwise).
In a nutshell, the unique_together attribute creates a UNIQUE constraint in the database, whereas the .filter(...).exists() check only filters the QuerySet with the given conditions at the application level.
In other words, if you apply unique_together on your model, you can't break that constraint from application code even if you try to (short of altering the database itself).
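Here is a minimal sketch of how the database-level constraint surfaces in code (the concrete field types are assumptions, since the original fields are elided):

from django.db import IntegrityError, models

class Publications(models.Model):
    author = models.CharField(max_length=100)
    post = models.CharField(max_length=100)

    class Meta:
        unique_together = ('author', 'post')

# the second create() violates the UNIQUE constraint no matter which view runs it
Publications.objects.create(author='alice', post='hello')
try:
    Publications.objects.create(author='alice', post='hello')
except IntegrityError:
    pass  # duplicate rejected by the database, not by application code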
I have an SQL query like the following:
select * from results_table order by case
when place = 0 then 1 else 0 end, place
This query sorts positive place values first and zeros after them. How can I write this in Django? Better yet, how can I write it in the following way:
Result.objects.filter(...).order_by('positive_place', 'place')
where 'positive_place' exists for certain models. I am reading about annotate but I am not quite sure how it works yet. I don't want to have to write the annotation for every query. Is there a way to define the annotation once per queryset?
An annotation adds an attribute to each object in a queryset, and annotated attributes can be filtered and ordered on. You can build the annotation with conditional expressions, and you can make it reusable by exposing it as a custom queryset method that is reachable from the model's manager.
I'm having a hard time understanding your desired ordering but here's an example of how it could be put together.
from django.db import models
from django.db.models import Case, Value as V, When

class ResultQuerySet(models.QuerySet):
    def annotate_positive_place(self):
        return self.annotate(
            positive_place=Case(
                When(place=0, then=V(1)),
                default=V(0),
                output_field=models.IntegerField(),
            )
        )

class Result(models.Model):
    place = models.IntegerField()

    objects = ResultQuerySet.as_manager()

Result.objects.annotate_positive_place().order_by('positive_place', 'place')
Suppose I have two models:
class Task(models.Model):
    duration = models.IntegerField(default=100)

class Record(models.Model):
    minutes_planned = models.IntegerField(default=0)
    task = models.ForeignKey(Task, related_name='records', on_delete=models.CASCADE)
I would like to get hold of all the Task objects whose total minutes_planned across all related Records is lower than the task's duration. I've been having trouble finding a solution in the docs. Could someone point me to it? So far I have tried:
Task.objects.filter(duration__gt=F('records__minutes_planned'))
Task.objects.filter(duration__gt=Sum('records__minutes_planned'))
Task.objects.filter(duration__gt=Sum(F('records__minutes_planned')))
But so far nothing has worked. The first one ran successfully, but from what I can tell it compared duration to each record's minutes_planned one by one instead of to the total across all records.
It seems Sum is restricted to usage only in .aggregate(). However, I would like to retrieve the objects themselves, rather than a set of values, which is what .aggregate() would give me.
UPDATE:
Found this portion of the docs, which looks promising.
Try using annotate(). You can annotate a field which holds the sum of the minutes_planned of all the Records and then use this value for filtering out the needed Tasks. The query will look something like:
from django.db.models import F, Sum

Task.objects.annotate(
    total_minutes_planned=Sum('records__minutes_planned')
).filter(duration__gt=F('total_minutes_planned'))
Hope this helps.
Here is the final solution written as a model Manager:
from django.db.models import Manager, OuterRef, Subquery, Sum
from django.db.models.functions import Coalesce

class TaskManager(Manager):
    def eligible_for_planning(self, user):
        from .models import Record

        records = Record.objects.filter(task=OuterRef('pk')).order_by().values('task')
        minutes_planned = records.annotate(total=Sum('minutes_planned')).values('total')
        qs = self.model.objects.filter(
            user=user,
            status=ACTIONABLE,  # user/status come from my actual Task model, not shown above
            duration__gt=Coalesce(Subquery(minutes_planned), 0),
        )
        return qs
We're basically constructing a second query to grab the value needed for the first query.
In this case, records is the second query (the Subquery), and it filters the Records by the pk of the Task being queried in this manager.
Then, minutes_planned returns the actual total that will be compared to the Task's duration.
Finally, the whole thing is plugged into the queryset as a Subquery object. Wrap it in a Coalesce and add the default value, should there be no Record objects found. In my case, this is zero.
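Wiring it up and calling it might look like this (a sketch; the manager attribute name and the surrounding fields are assumptions based on the snippet above):

class Task(models.Model):
    duration = models.IntegerField(default=100)
    # user, status and the ACTIONABLE constant exist on the real model but are omitted here

    objects = TaskManager()

# all of this user's actionable tasks that still have planning time available
tasks = Task.objects.eligible_for_planning(user)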
I have two models. One is the Task model and the other is the Reward model.
class Task(models.Model):
    assigned_by = models.CharField(max_length=100)

class Reward(models.Model):
    task = models.ForeignKey(Task, on_delete=models.CASCADE)
Now I want to return a queryset of Task along with the reward field in it. I tried this query.
search_res = Task.objects.annotate(reward='reward')
I got this error: The annotation 'reward' conflicts with a field on the model.
How can I solve this? I want a reward field on each Task object.
To reach your goal with the current models, I would simply follow the relation from the task.
Let's say you have a task (or a queryset of tasks):
t = Task.objects.get(pk=1)
or
for t in Task.objects.all():
you can get the reward like this:
t.reward_set.first()
Take care to handle the case where there's no Reward actually linked to the task (first() will return None).
That incurs quite a lot of queries for large datasets, so you could optimize the requests to the database with select_related or prefetch_related, depending on your needs. Look at the Django docs for those.
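For example, a rough sketch of the prefetch_related variant (assuming the default reward_set reverse accessor from the models above):

# one query for the tasks plus one query for all of their rewards
tasks = Task.objects.prefetch_related('reward_set')
for t in tasks:
    rewards = t.reward_set.all()              # served from the prefetch cache
    reward = rewards[0] if rewards else None  # indexing the cached results avoids an extra query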
I have two simple Django models:
class PhotoStream(models.Model):
    cover = models.ForeignKey('links.Photo')
    creation_time = models.DateTimeField(auto_now_add=True)

class Photo(models.Model):
    owner = models.ForeignKey(User)
    which_stream = models.ManyToManyField(PhotoStream)
    image_file = models.ImageField(upload_to=upload_photo_to_location, storage=OverwriteStorage())
Currently the only data I have is 6 photos, which all belong to 1 photostream. I'm trying the following to prefetch all related photos when building a photostream queryset:
queryset = PhotoStream.objects.order_by('-creation_time').prefetch_related('photo_set')
for obj in queryset:
    print obj.photo_set.all()
    # print connection.queries
Checking via the debug toolbar, I've found that the above does exactly the same number of queries it would have done if I removed the prefetch_related part of the statement. It's clearly not working. I've tried prefetch_related('cover') as well; that doesn't work either.
Can anyone point out what I'm doing wrong, and how to fix it? My goal is to get all related photos for every photostream in the queryset. How can I possibly do this?
Printing connection.queries after running the for loop includes, among other things:
SELECT ("links_photo_which_stream"."photostream_id") AS "_prefetch_related_val", "links_photo"."id", "links_photo"."owner_id", "links_photo"."image_file" FROM "links_photo" INNER JOIN "links_photo_which_stream" ON ("links_photo"."id" = "links_photo_which_stream"."photo_id") WHERE "links_photo_which_stream"."photostream_id" IN (1)
Note: I've simplified the models posted in the question, so the query above omits some fields that appear in the actual output but are unrelated to this question.
Here are some extracts from the prefetch_related documentation:
**prefetch_related**, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python.
And, some more:
>>> Pizza.objects.all().prefetch_related('toppings')
This implies a self.toppings.all() for each Pizza; now each time self.toppings.all() is called, instead of having to go to the database for the items, it will find them in a prefetched QuerySet cache that was populated in a single query.
So in this case the number of queries you see stays the same, but with prefetch_related, instead of hitting the database for each photostream, Django hits the prefetched QuerySet cache it already built and gets each photo_set from there. With more than one photostream, the difference in query counts would become obvious.
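A small way to see this from a shell (a sketch, assuming DEBUG=True so that connection.queries is populated):

from django.db import connection, reset_queries

# with prefetch_related: one query for the streams plus one for all their photos
reset_queries()
for obj in PhotoStream.objects.order_by('-creation_time').prefetch_related('photo_set'):
    list(obj.photo_set.all())       # served from the prefetch cache
print(len(connection.queries))      # 2

# without prefetch_related: one query for the streams plus one per stream
reset_queries()
for obj in PhotoStream.objects.order_by('-creation_time'):
    list(obj.photo_set.all())       # hits the database each time
print(len(connection.queries))      # 1 + number of streams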