Optimizing Django queryset related comparisons - django

I have a Django app where users upload photos, and leave comments under them. The data models to reflect these objects are Photo and PhotoComment respectively.
There's a third data model called PhotoThreadSubscription. Whenever a user comments under a photo, the user is subscribed to that particular thread via creating an object in PhotoThreadSubscription. This way, he/she can be apprised of comments left in the same thread by other users subsequently.
class PhotoThreadSubscription(models.Model):
    viewer = models.ForeignKey(User)
    viewed_at = models.DateTimeField(db_index=True)
    which_photo = models.ForeignKey(Photo)
Every time a user comments under a photo, I update the viewed_at attribute of the user's PhotoThreadSubscription object for that particular photo. Any comments by other users that have a submission time of greater than viewed_at for that particular thread are therefore new.
Suppose I have a queryset of comments, all belonging to unique photos that never repeat. I want to traverse through this queryset and find the latest unseen comment.
Currently, I'm trying this in a very DB heavy way:
latest_unseen_comment = PhotoComment(id=1)  # i.e. a very old comment
for comment in comments:
    if comment.submitted_on > PhotoThreadSubscription.objects.get(viewer=user, which_photo_id=comment.which_photo_id).viewed_at and comment.submitted_on > latest_unseen_comment.submitted_on:
        latest_unseen_comment = comment
This is obviously not a good way to do it. For one, I don't want to do DB calls in a for loop. How do I manage the above in one call? Specifically, how do I get the relevant PhotoThreadSubscription queryset in one call, and next, how do I use that to calculate the max_unseen_comment? I'm highly confused right now.
class Photo(models.Model):
    owner = models.ForeignKey(User)
    image_file = models.ImageField(upload_to=upload_photo_to_location, storage=OverwriteStorage())
    upload_time = models.DateTimeField(auto_now_add=True, db_index=True)
    latest_comment = models.ForeignKey(blank=True, null=True, on_delete=models.CASCADE)
class PhotoComment(models.Model):
    which_photo = models.ForeignKey(Photo)
    text = models.TextField(validators=[MaxLengthValidator(250)])
    submitted_by = models.ForeignKey(User)
    submitted_on = models.DateTimeField(auto_now_add=True)
Please ask for clarification if the question seemed hazy.

I think this will do it in a single query:
from django.db.models import F
latest_unseen_comment = (
    comments.filter(
        which_photo__photothreadsubscription__viewer=user,
        which_photo__photothreadsubscription__viewed_at__lt=F("submitted_on"),
    )
    .order_by("-submitted_on")
    .first()
)
The key here is using F expressions so that the comparison is done against each comment's own submitted_on value, rather than against a single date hardcoded in the query. After filtering the queryset down to the unseen comments, we order by submission date, newest first, and take the first result.
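To make the column-to-column comparison concrete, here is a minimal sketch of roughly the SQL such a filter corresponds to, run against an in-memory SQLite database. The table names, column names and sample rows are assumptions made up for illustration, and the query is simplified (it joins comments to subscriptions directly on the photo id); it is not the exact SQL Django generates.

```python
import sqlite3

# Hypothetical, simplified tables standing in for PhotoComment and
# PhotoThreadSubscription; ISO-formatted strings compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photocomment "
    "(id INTEGER PRIMARY KEY, which_photo_id INTEGER, submitted_on TEXT)"
)
conn.execute(
    "CREATE TABLE photothreadsubscription "
    "(id INTEGER PRIMARY KEY, viewer_id INTEGER, which_photo_id INTEGER, viewed_at TEXT)"
)
conn.executemany("INSERT INTO photocomment VALUES (?, ?, ?)", [
    (1, 10, "2016-01-01 10:00"),
    (2, 11, "2016-01-03 10:00"),
    (3, 12, "2016-01-05 10:00"),
])
conn.executemany("INSERT INTO photothreadsubscription VALUES (?, ?, ?, ?)", [
    (1, 7, 10, "2016-01-02 00:00"),  # viewed after comment 1 was posted
    (2, 7, 11, "2016-01-02 00:00"),  # viewed before comment 2 was posted
    (3, 7, 12, "2016-01-06 00:00"),  # viewed after comment 3 was posted
])


def latest_unseen_comment_id(conn, viewer_id):
    """Return the id of the newest comment posted after the viewer's viewed_at."""
    row = conn.execute(
        """
        SELECT c.id
        FROM photocomment AS c
        JOIN photothreadsubscription AS s
          ON s.which_photo_id = c.which_photo_id
        WHERE s.viewer_id = ?
          AND s.viewed_at < c.submitted_on  -- per-row comparison, like F("submitted_on")
        ORDER BY c.submitted_on DESC
        LIMIT 1
        """,
        (viewer_id,),
    ).fetchone()
    return row[0] if row else None


result = latest_unseen_comment_id(conn, 7)
print(result)
```

The point of the sketch is that the database compares viewed_at with submitted_on row by row inside one query, so no per-comment lookup is needed in Python.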

Related

How to get the first record of a 1-N relationship from the main table with Django ORM?

I have a User table that is referenced by a foreign key from a table called Post. How can I get only the last Post that each user registered? The intention is to return a list of users with their last registered post, but when obtaining the users, if a user has 3 posts, the user is repeated 3 times. I'm interested in only having each user once. Is there an alternative that is not unique?
class User(models.Model):
    name = models.CharField(max_length=50)
class Post(models.Model):
    title = models.CharField(max_length=50)
    user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='posts', related_query_name='posts')
    created = models.DateTimeField(default=timezone.now)
    class Meta:
        get_latest_by = 'created'
        ordering = ['-created']
I already tried select_related and prefetch_related, but I keep getting multiple rows per user when a user has multiple Posts.
user = User.objects.select_related('posts').all().values_list('id', 'name', 'posts__title', 'posts__created')
The following does give me one row per user, but when I try to sort on the created field I don't get the newest record; I always get the oldest:
user = User.objects.select_related('posts').all().values_list('id', 'name', 'posts__title', 'posts__created').distinct('id')
I'm trying to do it without resorting to a record-by-record for loop that fetches the most recent Post for each user. I know that this is an alternative, but I'm trying to find a way to do it directly with the Django ORM, since there are thousands of records and a loop is less than optimal.
In that case your Django ORM query would first filter posts by user then order by created in descending order and get the first element of the queryset.
last_user_post = Post.objects.filter(user__id=1).order_by('-created').first()
Alternatively, you can use a user instance:
user = User.objects.get(id=1)
last_user_post = Post.objects.filter(user=user).order_by('-created').first()
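As a rough illustration of what "filter by user, order by created descending, take the first" amounts to, here is a minimal sketch using an in-memory SQLite database. The table names (app_user, post) and the sample rows are assumptions for illustration only. The second function is an extension that is not part of the answer above: it fetches the latest post for every user in one query through a correlated subquery, which in the Django ORM is commonly expressed with Subquery and OuterRef.

```python
import sqlite3

# Hypothetical tables standing in for the User and Post models.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_user (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE post "
    "(id INTEGER PRIMARY KEY, title TEXT, user_id INTEGER, created TEXT)"
)
conn.executemany("INSERT INTO app_user VALUES (?, ?)", [(1, "ana"), (2, "ben")])
conn.executemany("INSERT INTO post VALUES (?, ?, ?, ?)", [
    (1, "first", 1, "2020-01-01"),
    (2, "second", 1, "2020-02-01"),
    (3, "third", 1, "2020-03-01"),
    (4, "only", 2, "2020-01-15"),
])


def last_post_title(conn, user_id):
    """Filter by user, order newest first, take the first row."""
    row = conn.execute(
        "SELECT title FROM post WHERE user_id = ? ORDER BY created DESC LIMIT 1",
        (user_id,),
    ).fetchone()
    return row[0] if row else None


def last_post_per_user(conn):
    """One row per user, joined to that user's newest post (assumed extension)."""
    return conn.execute(
        """
        SELECT u.id, u.name, p.title
        FROM app_user AS u
        JOIN post AS p
          ON p.id = (
              SELECT p2.id FROM post AS p2
              WHERE p2.user_id = u.id
              ORDER BY p2.created DESC
              LIMIT 1
          )
        ORDER BY u.id
        """
    ).fetchall()


print(last_post_title(conn, 1))
print(last_post_per_user(conn))
```

Because each user is joined to exactly one post id, the user is not repeated even when they have several posts.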

Django project architecture advice

I have a Django project with a Post model which looks like this:
class BasicPost(models.Model):
    author = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    published = models.BooleanField(default=False)
    created_date = models.DateTimeField(auto_now_add=True)
    title = models.CharField(max_length=100, blank=False)
    body = models.TextField(max_length=999)
    media = models.ImageField(blank=True)
    def get_absolute_url(self):
        return reverse('basic_post', args=[str(self.pk)])
    def __str__(self):
        return self.title
Also, I use the default User model that ships with Django.
I want to save which posts each user has read so I can send them posts they haven't read.
My question is what is the best way to do so. If I use a many-to-many field, should I put it on the User model and save all the posts the user has read, or should I do it in the other direction: put the many-to-many field on the Post model and save, for each post, which users have read it?
There are going to be more than 1 million posts in the Post model and about 50,000 users, and I want the most efficient filters for returning unread posts to the user.
If I should use the first option, how do I extend the User model?
Thanks!
On your first question (which way to go): I believe that ManyToMany by default creates indices in the DB for both foreign keys. Therefore, wherever you put the relation, in User or in BasicPost, you'll have the direct and reverse relationships working through an index. Django will create a pivot table for you with three columns like: (id, user_id, basic_post_id). Every access to this table will go through the index on user_id or basic_post_id, and each (user_id, basic_post_id) pair appears at most once. So the choice mostly affects your application code: whether you start a filter from the set of about 1 million posts or from the set of about 50k users.
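To picture the pivot table described above, here is a minimal sketch in plain SQLite. The table and index names are assumptions chosen to resemble what Django might generate; they are not copied from a real migration. It shows the unique (post, user) pair, an index on each foreign key, and an "unread posts" query that works the same regardless of which model declares the relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE basicpost (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE basicpost_readers (
        id INTEGER PRIMARY KEY,
        basicpost_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        UNIQUE (basicpost_id, user_id)
    );
    CREATE INDEX readers_post_idx ON basicpost_readers (basicpost_id);
    CREATE INDEX readers_user_idx ON basicpost_readers (user_id);
    """
)
conn.executemany("INSERT INTO basicpost VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany(
    "INSERT INTO basicpost_readers (basicpost_id, user_id) VALUES (?, ?)",
    [(1, 42), (3, 42), (2, 7)],
)


def unread_post_ids(conn, user_id):
    """Ids of posts that have no pivot row for the given user."""
    rows = conn.execute(
        """
        SELECT p.id
        FROM basicpost AS p
        WHERE NOT EXISTS (
            SELECT 1 FROM basicpost_readers AS r
            WHERE r.basicpost_id = p.id AND r.user_id = ?
        )
        ORDER BY p.id
        """,
        (user_id,),
    ).fetchall()
    return [r[0] for r in rows]


print(unread_post_ids(conn, 42))
```

With 1 million posts the unread set for a new user is the whole table, so in practice such a query would be paginated or limited.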
On your second question (how to extend User), it's generally recommended to subclass User from the very beginning. If it's too late for that because your project is too far advanced, you can do this in your models.py:
from django.contrib.auth.models import User
from django.db import models
class BasicPost(models.Model):
    # your code
    readers = models.ManyToManyField(to='auth.User', related_name="posts_already_read")
# "manually" add a method to the User class
def _unread_posts(user):
    # exclude every post that has this user among its readers
    return BasicPost.objects.exclude(readers=user)
User.unread_posts = _unread_posts
Haven't run this code though! Hope this helps.
Could you have a separate ReadPost model instead of a potentially large m2m, which you could save when a user reads a post? That way you can just query the ReadPost models to get the data, instead of storing it all in the blog post.
Maybe something like this:
from django.utils import timezone
class UserReadPost(models.Model):
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE, related_name="read_posts")
    seen_at = models.DateTimeField(default=timezone.now)
    post = models.ForeignKey(BasicPost, on_delete=models.CASCADE, related_name="read_by_users")
You could add a unique_together constraint to make sure that only one UserReadPost object is created for each user and post (to make sure you don't count any twice), and use get_or_create() when creating new records.
Then finding the posts a user has read is:
posts = UserReadPost.objects.filter(user=current_user).values_list("post", flat=True)
This could also be extended relatively easily. For example, if your BasicPost objects can be edited, you could add an updated_at field to the post. Then you could compare the seen_at of the UserReadPost field to the updated_at field of the BasicPost to check if they've seen the updated version.
Downside is you'd be creating a lot of rows in the DB for this table.
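The unique_together plus get_or_create() idea can be sketched outside Django as follows. This is an illustration under assumptions: the table name, the mark_read helper and the sample values are hypothetical, and the helper only imitates the "fetch or insert" behaviour; it is not Django's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE userreadpost (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        post_id INTEGER NOT NULL,
        seen_at TEXT NOT NULL,
        UNIQUE (user_id, post_id)  -- plays the role of unique_together
    )
    """
)


def mark_read(conn, user_id, post_id, seen_at):
    """Return (row_id, created), similar in spirit to get_or_create()."""
    row = conn.execute(
        "SELECT id FROM userreadpost WHERE user_id = ? AND post_id = ?",
        (user_id, post_id),
    ).fetchone()
    if row:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO userreadpost (user_id, post_id, seen_at) VALUES (?, ?, ?)",
        (user_id, post_id, seen_at),
    )
    return cur.lastrowid, True


first = mark_read(conn, 1, 5, "2020-01-01T10:00:00")
second = mark_read(conn, 1, 5, "2020-01-02T10:00:00")  # same pair, no new row

# The constraint still protects against inserts that bypass the helper.
try:
    conn.execute(
        "INSERT INTO userreadpost (user_id, post_id, seen_at) VALUES (1, 5, 'x')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(first, second, duplicate_rejected)
```

The helper keeps the common path cheap, while the database constraint is what actually guarantees one row per user and post.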
If you present your posts in chronological order (by created_date, for example), another option is to extend the user model with a latest_read_post_id field.
In that case:
class BasicPost(models.Model):
    # your code
    def is_read_by(self, user):
        # the latest read post itself counts as read
        return self.id <= user.latest_read_post_id
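The "high-water mark" idea can be tried out in plain Python. This is only a sketch under assumptions: the Reader class and the helper functions are hypothetical stand-ins for the extended user model, post ids are assumed to increase with creation time, the latest read post itself is treated as read, and users are assumed to read posts in order.

```python
from dataclasses import dataclass


@dataclass
class Reader:
    # Hypothetical stand-in for a user with a latest_read_post_id field;
    # 0 means nothing has been read yet.
    latest_read_post_id: int = 0


def is_read_by(post_id, reader):
    """A post counts as read if its id is at or below the reader's mark."""
    return post_id <= reader.latest_read_post_id


def mark_read_up_to(reader, post_id):
    """Advance the mark; it never moves backwards."""
    reader.latest_read_post_id = max(reader.latest_read_post_id, post_id)


def unread(post_ids, reader):
    return [pid for pid in post_ids if not is_read_by(pid, reader)]


reader = Reader()
mark_read_up_to(reader, 3)
mark_read_up_to(reader, 2)  # older post, mark stays at 3
print(unread([1, 2, 3, 4, 5], reader))
```

The trade-off is that a single integer per user cannot represent gaps: if a user skips post 2 and reads post 3, post 2 is reported as read as well.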

Working with annotation in Django queryset

I need help in a Django annotation.
I have a Django data model called Photo, and another called PhotoStream (one PhotoStream can have many Photos - detailed models at the end). I get the most recent 200 photos simply by: Photo.objects.order_by('-id')[:200]
To every object in the above queryset, I want to annotate the count of all related photos. A related photo is one which is (i) from the same PhotoStream, (ii) whose timestamp is less than or equal to the time stamp of the object in question.
In other words:
for obj in context["object_list"]:
    count = Photo.objects.filter(which_stream=obj.which_stream).order_by('-upload_time').exclude(upload_time__gt=obj.upload_time).count()
I'm new to this, and can't seem to translate the for loop above into a queryset annotation. Any help?
Here's the photo and photostream data models with relevant fields:
class Photo(models.Model):
    owner = models.ForeignKey(User)
    which_stream = models.ForeignKey(PhotoStream)
    image_file = models.ImageField(upload_to=upload_photo_to_location, storage=OverwriteStorage())
    upload_time = models.DateTimeField(auto_now_add=True, db_index=True)
class PhotoStream(models.Model):
    stream_cover = models.ForeignKey(Photo)
    children_count = models.IntegerField(default=1)
    creation_time = models.DateTimeField(auto_now_add=True)
So far my attempt has been:
Photo.objects.order_by('-id').annotate(num_related_photos=Count('which_stream__photo__upload_time__lte=F('upload_time')))[:200]
Gives an invalid syntax error.
Whereas the following works, but doesn't cater to my timestamp related requirement in (ii) above:
Photo.objects.order_by('-id').annotate(num_related_photos=Count('which_stream__photo'))[:200]
I broke the query down as follows:
from django.db.models import Count, F
Photo.objects.filter(which_stream__photo__upload_time__lte=F('upload_time')).annotate(num_related_photos=Count('which_stream__photo')).order_by('-id')[:200]
It looks counter-intuitive, but works because of the way this creates the underlying SQL. Good explanation here: https://stackoverflow.com/a/7001419/4936905
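To see why filtering before annotating gives the desired count, here is a minimal sketch of a comparable SQL query in an in-memory SQLite database. The table name and rows are assumptions for illustration, and the sketch joins photo to photo directly on which_stream_id rather than going through the PhotoStream table as the ORM query does. The filter becomes part of the join condition, so only photos from the same stream with an earlier or equal upload_time are counted, including the photo itself (as in the original for loop).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photo "
    "(id INTEGER PRIMARY KEY, which_stream_id INTEGER, upload_time TEXT)"
)
conn.executemany("INSERT INTO photo VALUES (?, ?, ?)", [
    (1, 100, "2016-01-01"),
    (2, 100, "2016-01-02"),
    (3, 200, "2016-01-03"),
    (4, 100, "2016-01-04"),
])


def related_counts(conn, limit=200):
    """For each photo, count same-stream photos uploaded at or before it."""
    return conn.execute(
        """
        SELECT p.id, COUNT(other.id) AS num_related_photos
        FROM photo AS p
        JOIN photo AS other
          ON other.which_stream_id = p.which_stream_id
         AND other.upload_time <= p.upload_time
        GROUP BY p.id
        ORDER BY p.id DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


print(related_counts(conn))
```

Each result row pairs a photo id with its count, newest id first, which mirrors the order_by('-id')[:200] slice.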

Filter and count with django

Suppose I have Post and Vote tables.
Each post can be either liked or disliked (this is the post_type).
class Post(models.Model):
    author = models.ForeignKey(User)
    title = models.CharField(verbose_name=_("title"), max_length=100, null=True, blank=True)
    content = models.TextField(verbose_name=_("content"), unique=True)
    ip = models.CharField(verbose_name=_("ip"), max_length=15)
class Vote(models.Model):
    user = models.ForeignKey(User)
    post = models.ForeignKey(Post)
    post_type = models.PositiveSmallIntegerField(_('post_type'))
I want to get posts and annotate each post with number of likes.
What is the best way to do this?
You could add a method to the Post model and call it whenever you need the count.
class Post(models.Model):
    ...
    def likes_count(self):
        # assumes post_type 1 means "like"
        return self.vote_set.filter(post_type=1).count()
Use it like this:
p = Post.objects.get(pk=1)
print(p.likes_count())
One approach is to add a method to the Post class that fetches this count, as shown by @sachin-gupta. However, this will generate one extra query for every post that you fetch. If you are fetching posts and their counts in bulk, this is not desirable.
You could annotate the posts in bulk, but older Django versions cannot filter within an annotation (aggregates only gained a filter argument in Django 2.0), so with your current model structure this may not be possible. You could consider changing your structure as follows:
class Vote(models.Model):
    """
    An abstract vote model.
    """
    user = models.ForeignKey(User)
    post = models.ForeignKey(Post)
    class Meta:
        abstract = True
class LikeVote(Vote):
    pass
class DislikeVote(Vote):
    pass
i.e., instead of storing likes and dislikes in one model, you have a separate model for each. Now, you can annotate your posts in bulk, in a single query:
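from django.db.models import Count
# the reverse lookup name for LikeVote is 'likevote'; the default annotation alias is 'likevote__count'
posts = Post.objects.all().annotate(Count('likevote'))
for post in posts:
    print(post.likevote__count)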
Of course, whether or not this is feasible depends on the architecture of the rest of your app, and how many "vote types" you are planning to have. However if you are going to be querying the vote counts of posts frequently then you will need to try and avoid a large number of database queries.
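For comparison, counting only the likes per post in a single query can also be written as a conditional aggregate while keeping one Vote table. The sketch below shows the idea in plain SQLite; the table names, the sample rows and the assumption that post_type 1 means "like" are illustrative. On Django 2.0 and later the ORM counterpart would be along the lines of Count('vote', filter=Q(vote__post_type=1)); whether that is available depends on the Django version in use.

```python
import sqlite3

LIKE = 1  # assumption: post_type 1 means "like"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "CREATE TABLE vote "
    "(id INTEGER PRIMARY KEY, user_id INTEGER, post_id INTEGER, post_type INTEGER)"
)
conn.executemany("INSERT INTO post VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany(
    "INSERT INTO vote (user_id, post_id, post_type) VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 1), (3, 1, 2), (1, 2, 2)],
)


def likes_per_post(conn):
    """One row per post with the number of 'like' votes, zero if none."""
    return conn.execute(
        """
        SELECT p.id,
               COUNT(CASE WHEN v.post_type = ? THEN 1 END) AS likes
        FROM post AS p
        LEFT JOIN vote AS v ON v.post_id = p.id
        GROUP BY p.id
        ORDER BY p.id
        """,
        (LIKE,),
    ).fetchall()


print(likes_per_post(conn))
```

The LEFT JOIN keeps posts without any votes in the result, and COUNT ignores the NULLs produced by the CASE expression for dislikes.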

Choosing best Django architecture for scalability / performance

I have a question about how to design the model.
I want to let some entities be voted on, in this case a paper. I came up with these two possibilities:
Option 1:
Link the entity as a relationship
class Vote(models.Model):
    author = models.ForeignKey(User)
    created = models.DateField(auto_now=True)
    value = models.IntegerField(default=1)
class Paper(models.Model):
    author = models.ForeignKey(User)
    edition = models.ForeignKey(ConferenceEdition)
    # Django has no OneToMany field; ManyToManyField attaches a set of votes to a paper
    votes = models.ManyToManyField(Vote)
Advantages:
It's easier to work with the model (ORM).
I can use this vote entity with other models.
I may need this information when rendering the HTML, to show which papers the user has already voted on.
Disadvantages:
I'm afraid that the larger the database gets, the slower it will become.
Option 2:
Not to link the class
class Vote(models.Model):
    author = models.ForeignKey(User)
    created = models.DateField(auto_now=True)
    value = models.IntegerField(default=1)
    entity_id = models.IntegerField()
    entity_type = models.CharField(max_length=255, default='Paper')
class Paper(models.Model):
    author = models.ForeignKey(User)
    edition = models.ForeignKey(ConferenceEdition)
    num_votes = models.IntegerField(default=0)
Advantages:
It's a kind of lazy loading: I have a counter, and if I need the detailed information I can go and fetch it.
It's faster (I think).
Disadvantages:
You must rely on extra logic to keep the counter updated for every new vote.
Option 3:
I'm listening
Thanks!
Django loads many to many fields only if you explicitly call them.
So in your 1st case:
paper.votes.all()
If you want to load all the votes when doing your query, since Django 1.4 you can use prefetch_related:
paper = Paper.objects.prefetch_related('votes').get(pk=1)
By the way, instead of .all() you can use .count(), which generates a different database query that is much faster since it only has to count the rows instead of retrieving them into Django/Python.
There is also a third approach:
You could have an extra field in your model, votes_count, that you update in a pre_save signal handler, and it would hold that value for you. This way you get both: you can query for all votes, but you can also just grab a number.
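The counter approach can be sketched without Django as follows. This is an illustration under assumptions: the table layout and the add_vote, cached_count and real_count helpers are hypothetical, and the counter is updated in the same transaction as the insert rather than from a signal handler.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paper "
    "(id INTEGER PRIMARY KEY, votes_count INTEGER NOT NULL DEFAULT 0)"
)
conn.execute(
    "CREATE TABLE vote (id INTEGER PRIMARY KEY, paper_id INTEGER NOT NULL, "
    "author_id INTEGER NOT NULL, value INTEGER NOT NULL DEFAULT 1)"
)
conn.execute("INSERT INTO paper (id) VALUES (1)")
conn.commit()


def add_vote(conn, paper_id, author_id, value=1):
    """Insert the vote and bump the cached counter in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO vote (paper_id, author_id, value) VALUES (?, ?, ?)",
            (paper_id, author_id, value),
        )
        conn.execute(
            "UPDATE paper SET votes_count = votes_count + 1 WHERE id = ?",
            (paper_id,),
        )


def cached_count(conn, paper_id):
    """Cheap read of the denormalized counter."""
    return conn.execute(
        "SELECT votes_count FROM paper WHERE id = ?", (paper_id,)
    ).fetchone()[0]


def real_count(conn, paper_id):
    """What .count() would compute: the database counts rows, nothing is loaded."""
    return conn.execute(
        "SELECT COUNT(*) FROM vote WHERE paper_id = ?", (paper_id,)
    ).fetchone()[0]


add_vote(conn, 1, author_id=10)
add_vote(conn, 1, author_id=11)
print(cached_count(conn, 1), real_count(conn, 1))
```

Doing the increment inside the database (votes_count = votes_count + 1) avoids lost updates when two votes arrive at the same time; in the Django ORM the usual counterpart is an update() with an F expression.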