I have two simple Django models:
class PhotoStream(models.Model):
    cover = models.ForeignKey('links.Photo')
    creation_time = models.DateTimeField(auto_now_add=True)

class Photo(models.Model):
    owner = models.ForeignKey(User)
    which_stream = models.ManyToManyField(PhotoStream)
    image_file = models.ImageField(upload_to=upload_photo_to_location, storage=OverwriteStorage())
Currently the only data I have is 6 photos, that all belong to 1 photostream. I'm trying the following to prefetch all related photos when forming a photostream queryset:
queryset = PhotoStream.objects.order_by('-creation_time').prefetch_related('photo_set')
for obj in queryset:
    print obj.photo_set.all()
#print connection.queries
Checking via the debug toolbar, I've found that the above does exactly the same number of queries it would have done if I remove the prefetch_related part of the statement. It's clearly not working. I've tried prefetch_related('cover') as well - that doesn't work either.
Can anyone point out what I'm doing wrong, and how to fix it? My goal is to get all related photos for every photostream in the queryset. How can I possibly do this?
Printing connection.queries after running the for loop includes, among other things:
SELECT ("links_photo_which_stream"."photostream_id") AS "_prefetch_related_val", "links_photo"."id", "links_photo"."owner_id", "links_photo"."image_file" FROM "links_photo" INNER JOIN "links_photo_which_stream" ON ("links_photo"."id" = "links_photo_which_stream"."photo_id") WHERE "links_photo_which_stream"."photostream_id" IN (1)
Note: I've simplified my models posted in the question, hence the query above doesn't include some fields that actually appear in the output, but are unrelated to this question.
Here are some of the extracts from prefetch_related:
**prefetch_related**, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python.
And, some more:
>>> Pizza.objects.all().prefetch_related('toppings')
This implies a self.toppings.all() for each Pizza; now each time self.toppings.all() is called, instead of having to go to the database for the items, it will find them in a prefetched QuerySet cache that was populated in a single query.
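With only one photostream in the table, the number of queries is the same with or without prefetch_related: one query for the photostreams and one for their photos. The difference appears once there are more photostreams. Without prefetch_related, each obj.photo_set.all() call hits the database, so the number of photo queries grows with the number of photostreams. With prefetch_related, a single IN (...) query fetches the photos for every photostream, and each obj.photo_set.all() reads from the prefetched QuerySet cache instead. The query you posted from connection.queries, with its _prefetch_related_val column, is that prefetch query, so prefetching is working.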
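To make the "joining in Python" idea concrete, here is a minimal sketch using the standard-library sqlite3 module rather than Django. The table names, sample rows and the run() helper are assumptions made up for illustration; this is not how Django implements prefetch_related internally, only the same query pattern.

```python
import sqlite3

# Hypothetical, simplified schema; Django's real table and column names differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stream (id INTEGER PRIMARY KEY);
    CREATE TABLE photo_stream (photo_id INTEGER, stream_id INTEGER);
    INSERT INTO stream (id) VALUES (1), (2), (3);
    INSERT INTO photo_stream VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")

query_count = 0

def run(sql, params=()):
    """Execute one SELECT and count it, much like the debug toolbar does."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

def photos_without_prefetch():
    # One query for the streams, then one more per stream (N + 1 in total).
    result = {}
    for (stream_id,) in run("SELECT id FROM stream ORDER BY id"):
        rows = run(
            "SELECT photo_id FROM photo_stream WHERE stream_id = ? ORDER BY photo_id",
            (stream_id,),
        )
        result[stream_id] = [photo_id for (photo_id,) in rows]
    return result

def photos_with_prefetch():
    # One query for the streams, one IN (...) query for all their photos,
    # then the "join" happens in Python by grouping the rows per stream.
    stream_ids = [sid for (sid,) in run("SELECT id FROM stream ORDER BY id")]
    placeholders = ", ".join("?" for _ in stream_ids)
    sql = (
        "SELECT stream_id, photo_id FROM photo_stream "
        "WHERE stream_id IN ({}) ORDER BY stream_id, photo_id"
    ).format(placeholders)
    result = {sid: [] for sid in stream_ids}
    for stream_id, photo_id in run(sql, stream_ids):
        result[stream_id].append(photo_id)
    return result

query_count = 0
naive = photos_without_prefetch()
naive_queries = query_count

query_count = 0
prefetched = photos_with_prefetch()
prefetch_queries = query_count
```

Both functions return the same mapping, but with three streams the first issues four queries and the second issues two. With a single stream, as in the question, both would issue two, which is why the debug toolbar showed no difference.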
I'm trying to track Foreign Keys using django-field-history, but when I add it, it does additional queries on every page using the Model
For example
from field_history.tracker import FieldHistoryTracker
class Author(models.Model):
    user = models.ForeignKey('auth.user')
    field_history = FieldHistoryTracker(['user'])
will always give more queries on pages using Author, like so
SELECT ••• FROM "auth_user" WHERE "auth_user"."id" = '2'
1239 similar queries. Duplicated 1235 times.
I've tried using user_id instead of user in Field History Tracker, but it will always return None. Using user.id or anything like it just returns an error.
I really need to keep that history data, but not at the cost of thousands of additional queries.
I would also really like to keep django-field-history, since my whole DB uses it, but I'm aware I might have to switch packages. If so, which one would you advise?
As far as I understand, you are trying to log which user made each update. For this, you should use _field_history_user, as described in the documentation.
For example:
class Pizza(models.Model):
    name = models.CharField(max_length=255)
    updated_by = models.ForeignKey('auth.User')
    field_history = FieldHistoryTracker(['name'])

    @property
    def _field_history_user(self):
        return self.updated_by
This way, every tracked change records which user updated the row in this table.
I've run into an interesting situation in a new app I've added to an existing project. My goal is to (using a Celery task) update many rows at once with a value that includes annotated aggregated values from foreign keyed objects. Here are some example models that I've used in previous questions:
class Book(models.Model):
    author = models.CharField(max_length=255)
    num_pages = models.IntegerField()
    num_chapters = models.IntegerField()

class UserBookRead(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    # string reference, because UserBookStats is defined further down
    user_book_stats = models.ForeignKey('UserBookStats')
    book = models.ForeignKey(Book)
    complete = models.BooleanField(default=False)
    pages_read = models.IntegerField()

class UserBookStats(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    total_pages_read = models.IntegerField()
I'm attempting to:
1. Use the post_save signal from Book instances to update pages_read on related UserBookRead objects when a Book page count is updated.
2. At the end of the signal, launch a background Celery task to roll up the pages_read from each UserBookRead which was updated, and update the total_pages_read on each related UserBookStats. (This is where the problem occurs.)
I'm trying to be as lean as possible in terms of the number of queries. Step 1 is complete and only requires a few queries for my actual use case, which seems acceptable for a signal handler, as long as those queries are optimized properly.
Step 2 is more involved, hence the delegation to a background task. I've managed to accomplish most of it in a fairly clean manner (well, for me at least).
The problem I run into is that when annotating the UserBookStats queryset with a total_pages aggregation (the Sum() of all pages_read for related UserBookRead objects), I can't follow that with a straight update of the queryset to set the total_pages_read field.
Here's the code (the Book instance is passed to the task as book):
# use the provided book instance to get the stats which need to be updated
book_read_objects = UserBookRead.objects.filter(book=book)
book_stat_objects = UserBookStats.objects.filter(id__in=book_read_objects.values_list('user_book_stats__id', flat=True).distinct())
# annotate top level stats objects with summed page count
book_stat_objects = book_stat_objects.annotate(total_pages=Sum(F('user_book_read__pages_read')))
# update the objects with that sum
book_stat_objects.update(total_pages_read=F('total_pages'))
On executing the last line, this error is thrown:
django.core.exceptions.FieldError: Aggregate functions are not allowed in this query
After some research, I found an existing Django ticket for this use case here, on which the last comment mentions 2 new features in 1.11 that could make it possible.
Is there any known/accepted way to accomplish this use case, perhaps using Subquery or OuterRef? I haven't had any success trying to fold in the aggregation as a Subquery. The fallback here is:
for obj in book_stat_objects:
    obj.total_pages_read = obj.total_pages
    obj.save()
But with potentially tens of thousands of records in book_stat_objects, I'm really trying to avoid issuing an UPDATE for each one individually.
I ended up figuring out how to do this with Subquery and OuterRef, but had to take a different approach than I originally expected.
I was able to quickly get a Subquery working. However, when I used it to annotate the parent query, I noticed that every annotated value was the first result of the subquery. That was when I realized I needed OuterRef, because the generated SQL wasn't restricting the subquery by anything in the parent query.
This part of the Django docs was super helpful, as was this StackOverflow question. What this process boils down to is that you have to use Subquery to create the aggregation, and OuterRef to ensure the subquery restricts aggregated rows by the parent query PK. At that point, you can annotate with the aggregated value and directly make use of it in a queryset update().
As I mentioned in the question, the code examples are made up. I've tried to adapt them to my actual use case with my changes:
from django.db.models import F, OuterRef, Subquery, Sum
from django.db.models.functions import Coalesce
# create the queryset to use as the subquery, restrict based on the `book_stat_objects` queryset
book_reads = UserBookRead.objects.filter(user_book_stats__in=book_stat_objects, user_book_stats=OuterRef('pk')).values('user_book_stats')
# annotate the future subquery with the aggregation of pages_read from each UserBookRead;
# the trailing values('total') keeps it to the single column a Subquery must return
total_pages = book_reads.annotate(total=Sum(F('pages_read'))).values('total')
# annotate each stat object with the subquery total
book_stats = book_stat_objects.annotate(total=Coalesce(Subquery(total_pages), 0))
# update each row with the new total pages count
book_stats.update(total_pages_read=F('total'))
It felt odd to create a queryset that can't be used on its own (trying to evaluate book_reads will throw an error due to the inclusion of OuterRef), but once you examine the final SQL generated for book_stats, it makes sense.
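For readers who want to see the shape of that SQL, here is a rough sketch using the standard-library sqlite3 module. The table names (stats, book_read) and the sample rows are made up for illustration, and the statement is a hand-written approximation, not the SQL Django emits verbatim.

```python
import sqlite3

# Hypothetical tables standing in for UserBookStats and UserBookRead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stats (
        id INTEGER PRIMARY KEY,
        total_pages_read INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE book_read (
        id INTEGER PRIMARY KEY,
        stats_id INTEGER,
        pages_read INTEGER
    );
    INSERT INTO stats (id) VALUES (1), (2), (3);
    INSERT INTO book_read (stats_id, pages_read) VALUES (1, 100), (1, 50), (2, 30);
""")

# One UPDATE for all rows. The WHERE inside the subquery plays the role of
# OuterRef('pk'); COALESCE plays the role of Coalesce(..., 0) for stats rows
# that have no related book_read rows.
conn.execute("""
    UPDATE stats
    SET total_pages_read = COALESCE(
        (SELECT SUM(book_read.pages_read)
         FROM book_read
         WHERE book_read.stats_id = stats.id),
        0)
""")

totals = dict(conn.execute("SELECT id, total_pages_read FROM stats ORDER BY id"))
```

Without the correlation on stats.id, the subquery would compute the same value for every row, which matches the symptom described above before OuterRef was added.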
EDIT
I ended up running into a bug with this code a week or two after figuring out this answer. It turned out to be due to a default ordering for the UserBookRead model. As the Django docs state, default ordering is incorporated into any aggregate GROUP BY clauses, so all of my aggregates were off. The solution to that is to clear the default ordering with a blank order_by() when creating the base subquery:
book_reads = UserBookRead.objects.filter(user_book_stats__in=book_stat_objects, user_book_stats=OuterRef('pk')).values('user_book_stats').order_by()
I have been mulling over this for a while, looking at many Stack Overflow questions and going through the aggregation docs.
I need to get a dataset of PropertyImpressions grouped by date. Here is the PropertyImpression model:
#models.py
class PropertyImpression(models.Model):
    '''
    Impression data for Property Items
    '''
    property = models.ForeignKey(Property, db_index=True)
    imp_date = models.DateField(auto_now_add=True)
I have tried many variations of the view code, but I'm posting this version because I consider it the most logical and simple one, and according to the documentation and examples it should do what I'm trying to do.
#views.py
def admin_home(request):
    '''
    this is the home dashboard for admins, which currently just means staff.
    Other users that try to access this page will be redirected to login.
    '''
    prop_imps = PropertyImpression.objects.values('imp_date').annotate(count=Count('id'))
    return render(request, 'reportcontent/admin_home.html', {'prop_imps': prop_imps})
Then in the template, the {{ prop_imps }} variable gives me a list of PropertyImpressions, but they are grouped by both imp_date and property. I need them grouped by imp_date only. According to the values() docs, shouldn't adding .values('imp_date') group by just that field?
When I leave off the .annotate in the prop_imps variable, I get a list of all the imp_dates, which is really close, but as soon as I group by the date field it for some reason groups by both imp_date and property.
Maybe you have defined a default ordering in your PropertyImpression model?
In that case, add order_by() before annotate() to reset it:
prop_imps = PropertyImpression.objects.values('imp_date').order_by() \
    .annotate(count=Count('id'))
It's explained in Django documentation here:
Fields that are mentioned in the order_by() part of a queryset (or which are used in the default ordering on a model) are used when selecting the output data, even if they are not otherwise specified in the values() call. These extra fields are used to group “like” results together and they can make otherwise identical result rows appear to be separate. This shows up, particularly, when counting things.
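The effect described in that passage can be reproduced at the SQL level. Below is a small sketch with the standard-library sqlite3 module; the impression table and its rows are invented sample data, and the two statements are hand-written approximations of what the ORM would generate with and without a default ordering on property.

```python
import sqlite3

# Hypothetical table standing in for PropertyImpression, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE impression (
        id INTEGER PRIMARY KEY,
        property_id INTEGER,
        imp_date TEXT
    );
    INSERT INTO impression (property_id, imp_date) VALUES
        (1, '2015-01-01'),
        (2, '2015-01-01'),
        (2, '2015-01-01'),
        (1, '2015-01-02');
""")

# Grouping by the date alone: one row per date.
by_date = conn.execute(
    "SELECT imp_date, COUNT(id) FROM impression "
    "GROUP BY imp_date ORDER BY imp_date"
).fetchall()

# An ordering column pulled into GROUP BY splits each date per property.
by_date_and_property = conn.execute(
    "SELECT imp_date, COUNT(id) FROM impression "
    "GROUP BY imp_date, property_id ORDER BY imp_date, property_id"
).fetchall()
```

The second result repeats '2015-01-01' once per property, which is the "grouped by both imp_date and property" symptom from the question. Clearing the ordering with an empty order_by() removes the extra column from the GROUP BY.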
I'm trying to optimise my queries but prefetch_related insists on joining the tables and selecting all the fields even though I only need the list of ids from the relations table.
You can ignore the 4th query. It's not related to the question.
Related Code:
class Contact(models.Model):
    ...
    Groups = models.ManyToManyField(ContactGroup, related_name='contacts')
    ...

queryset = Contact.objects.all().prefetch_related('Groups')
Django 1.7 added Prefetch objects which let you customise the queryset used when prefetching.
In particular, see only().
In this case, you'd want something like:
from django.db.models import Prefetch

queryset = Contact.objects.all().prefetch_related(
    Prefetch('Groups', queryset=ContactGroup.objects.all().only('id')))
Suppose I have following models:
class Thing(models.Model):
    name = models.CharField(max_length=100)
    ratings = models.ManyToManyField('auth.User', through='Rating')

class Rating(models.Model):
    user = models.ForeignKey('auth.User')
    thing = models.ForeignKey('Thing')
    rating = models.IntegerField()
So I have a lot of things, and every user can rate every thing. I also have a view showing a list of all things (and there are a huge number of them) with the rating that the user assigned to each of them. I need a way to retrieve all the data from the database: Thing objects with an additional field user_rating taken from at most one (because we have a fixed User) related Rating object.
Trivial solution looks like that:
things = Thing.objects.all()
for thing in things:
    try:
        thing.user_rating = Rating.objects.get(thing=thing, user=request.user).rating
    except Rating.DoesNotExist:
        thing.user_rating = None
But the flaw of this approach is obvious: if we have 500 things, we'll do 501 requests to database. Per one page. Per user. And this is the most viewed page of the site. This task is easily solvable with SQL JOINs but in practice I have more complicated schema and I will certainly benefit from Django model framework. So the question is: is it possible to do this Django-way? It would be really strange if it isn't, considering that such tasks are very common.
As I understood, neither annotate(), nor select_related() will help me here.
I guess you should try this:
https://docs.djangoproject.com/en/1.3/ref/models/querysets/#extra
Example
result = Thing.objects.all().extra(select={'rating': 'select rating from ratings where thing_id = id'})
Your result set gets a new field 'rating' for each 'thing' object.
I use this approach in one of my recent projects. It produces one complex query instead of n+1 queries.
Hope this helps :)
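To show the single-query shape this answer is aiming for, here is a sketch with the standard-library sqlite3 module. The thing and rating table names, the sample rows and the things_with_user_rating helper are assumptions for illustration; in a real Django project the tables are prefixed with the app label, so the raw SQL passed to extra() would need the actual names.

```python
import sqlite3

# Hypothetical tables standing in for Thing and Rating, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE thing (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE rating (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        thing_id INTEGER,
        rating INTEGER
    );
    INSERT INTO thing (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO rating (user_id, thing_id, rating) VALUES (7, 1, 5), (7, 3, 2), (8, 1, 1);
""")

def things_with_user_rating(user_id):
    # One query: a correlated subselect per row, restricted to one user so it
    # yields at most one rating. Column names are qualified with their table
    # so that "id" is unambiguous.
    return conn.execute(
        "SELECT thing.name, "
        "       (SELECT rating.rating FROM rating "
        "        WHERE rating.thing_id = thing.id AND rating.user_id = ?) "
        "FROM thing ORDER BY thing.id",
        (user_id,),
    ).fetchall()

rows = things_with_user_rating(7)
```

Two details matter when adapting this to extra(): the subselect has to filter on the current user, otherwise it can match several ratings per thing, and the id column should be qualified with its table name.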
Since you are planning to display everything on one page, I can think of this approach. You can give it a try:
Get all the ratings given by the current user, and get all the Things.
Now try to create a dictionary like this:
thing_dict = {}
for thing in Thing.objects.all():
    thing_dict[thing] = None
for rating in Rating.objects.filter(user=request.user):
    thing_dict[rating.thing] = rating
Now thing_dict has every Thing as a key, with the current user's Rating (or None if there is none) as its value.
May not be the best way. I am keen on seeing what others answer.
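A variation on the dictionary idea, sketched in plain Python. The Thing and Rating namedtuples are stand-ins for model instances, and ratings_by_thing is a hypothetical helper; in Django the two input lists would come from Thing.objects.all() and Rating.objects.filter(user=request.user), which is two queries in total.

```python
from collections import namedtuple

# Stand-ins for model instances, not real Django models.
Thing = namedtuple("Thing", "id name")
Rating = namedtuple("Rating", "user_id thing_id rating")

def ratings_by_thing(things, user_ratings):
    """Map each thing id to the user's rating value, or None if unrated."""
    rating_for = {r.thing_id: r.rating for r in user_ratings}
    return {t.id: rating_for.get(t.id) for t in things}

things = [Thing(1, "a"), Thing(2, "b"), Thing(3, "c")]
user_ratings = [Rating(7, 1, 5), Rating(7, 3, 2)]
merged = ratings_by_thing(things, user_ratings)
```

Keying on thing_id rather than on rating.thing is deliberate: in Django, reading rating.thing loads the related Thing lazily, which costs one query per rating unless select_related('thing') is used, whereas rating.thing_id is already on the row.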