How to annotate a distinct Count over multiple relationships in Django?

Given a model that has more than one kind of connection to a related model (which I will call the "parent" model), how can I annotate a queryset with a count of the parent model objects that are linked through either connection, without counting duplicates?
Example model definitions
Consider an Article model that has 2 links to a parent Publication model that are very similar in meaning.
from django.db import models

class Publication(models.Model):
    pass

class Article(models.Model):
    publication = models.ForeignKey(
        Publication, related_name='publications', on_delete=models.CASCADE)
    owner = models.ForeignKey(
        Publication, related_name='owned_articles', on_delete=models.CASCADE)
Objective
I want to serve a page that lists publications. A business requirement is that each publication's count of articles it wishes to take credit for is shown (these publications prefer a generous metric for counting). An article is considered part of a publication if either its "owner" or its "publication" field points to that publication, but no article should be counted more than once for a single publication. An article may be included in the counts of 2 publications if its publication field points to a different object than its owner field.
I don't want to execute a query for every publication in the list.
The problem with Count annotations here
Publication.objects.annotate(Count('publications'), Count('owned_articles')) would be trivial. That gives me publications__count and owned_articles__count.
My problem is that I can't tell how many of the articles counted in publications__count were also counted in owned_articles__count. Django doesn't allow me to cram a full queryset into Count, so in this general case of needing extra control over what is counted, a special mechanism is needed.
Similar questions
I have found this situation most similar to the question here:
Django annotate count with a distinct field
You could contrive this same general situation by intensifying that question's request: add another related model in addition to InformationUnit, and ask for a count of unique usernames across both related models.

(initial answer, answering my own question with a so-so solution)
The preferable approach would be to start with a Publication queryset; however, I can manage to squeeze a solution out of the Django ORM by pivoting around an Article queryset instead.
Consider as a solution to this problem:
from django.db.models import Count, F

exclusive_owners_qs = Article.objects.exclude(
    publication=F('owner')
).values('owner').annotate(Count('owner')).order_by('owner')
publications_qs = Article.objects.values('publication').annotate(
    Count('publication')
).order_by('publication')
With this, I can loop over the two querysets and add up the two counts locally in Python to get the correct totals.
This satisfies the requirements, but it's not an elegant solution; eliminating the need for a Python loop would be ideal.

I believe the correct answer is using Count("publications", distinct=True), as described here:
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#combining-multiple-aggregations
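Applied to the models above, that looks like this (a minimal sketch; the annotation names are my own):

from django.db.models import Count

publications = Publication.objects.annotate(
    published_count=Count('publications', distinct=True),
    owned_count=Count('owned_articles', distinct=True),
)

Without distinct=True, the two reverse-relation joins multiply each other's rows and both counts come out inflated, which is exactly the pitfall the linked documentation warns about. Note this still yields two separate counts per publication rather than a single deduplicated union.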

Related

Django bulk update with data two tables over

I want to bulk update a table with data from two tables over. A solution has already been given for the simpler case mentioned in the documentation:
Entry.objects.update(headline=F('blog__name'))
For that solution, see
https://stackoverflow.com/a/50561753/1092940
Expanding from the example, imagine that Entry has a Foreign Key reference to Blog via a field named blog, and that Blog has a Foreign Key reference to User via a field named author. I want the equivalent of:
Entry.objects.update(author_name=F('blog__author__username'))
As in the prior solution, this one is expected to employ Subquery and OuterRef.
The reason I ask here is that I lack confidence once a problem starts to employ multiple OuterRefs, and confusion arises about which outer query each one refers to.
It does not require multiple outer references; you can update with:
from django.db.models import OuterRef, Subquery

author_name = Author.objects.filter(
    blogs__id=OuterRef('blog_id')
).values_list(
    'username'
)[:1]
Entry.objects.update(
    author_name=Subquery(author_name)
)
In other words, we look for the Author whose related Blog has an id equal to the blog_id of the Entry.
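Equivalently, the subquery can be anchored on Blog, with the hop to the author spelled inside it; this sketch (my own variation, assuming the author ForeignKey from the question) shows why a single OuterRef suffices:

author_name = Blog.objects.filter(
    id=OuterRef('blog_id')
).values_list(
    'author__username'
)[:1]
Entry.objects.update(
    author_name=Subquery(author_name)
)

Either way, OuterRef('blog_id') is the only reference to the outer Entry queryset; the join from Blog to User happens entirely inside the subquery, so no nested OuterRef is needed.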

What are the alternatives to using django signals?

I am working on a project where I need to recalculate values based on whether fields changed or not. Here is an example:
class Model1(models.Model):
    field_a = models.DateTimeField()
    calculated_field_1 = models.ForeignKey('Model2', on_delete=models.CASCADE)

class Model2(models.Model):
    field_j = models.DateTimeField()
If field_a changes on Model1, I have to recalculate the value of calculated_field_1 to see whether it needs to change as well. The calculations that are done require querying the database to check values of other models and then determining whether the value of the calculated field needs to change.
Example: if field_a changes, then I would have to do this calculation:

result = Model2.objects.filter(field_j__gte=model1.field_a)
if result.exists():
    model1.calculated_field_1 = result.first()
    model1.save(update_fields=('calculated_field_1',))
This is the most basic example I could think of and the queries can be much more complicated than this.
The project started out with one calculation when a field changed, so I decided the best approach was to use Django signals. Months later, the requirements have changed and I have had to implement several other calculations very similar to the example above. My post_save handler is getting out of hand, and I am wondering what alternatives there are to using signals. Although the post_save calculations currently take far less than half a second, for the sake of my question pretend they take a second or more.
A valid answer cannot include doing these calculations on the fly when I pull the values from the database. We use a validation framework that requires me to set these values on the model; querying on the fly is an approach we attempted, but it was not viable for performance reasons. Also, on field change, one of the requirements is that the user sees the result of the calculated field immediately, so this has to happen synchronously.
What are some alternative approaches to using this pattern?
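One commonly suggested alternative is to drop the signal and make the recalculation an explicit part of save(), detecting changes by comparing against the persisted state. A minimal sketch using the models above; the recalculate() helper is hypothetical:

class Model1(models.Model):
    field_a = models.DateTimeField()
    calculated_field_1 = models.ForeignKey('Model2', on_delete=models.CASCADE)

    def save(self, *args, **kwargs):
        # Compare against the stored row to detect a change to field_a.
        if self.pk is not None:
            old_field_a = (
                Model1.objects.filter(pk=self.pk)
                .values_list('field_a', flat=True)
                .first()
            )
            if old_field_a != self.field_a:
                self.recalculate()
        super().save(*args, **kwargs)

    def recalculate(self):
        # Hypothetical helper mirroring the calculation from the question.
        result = Model2.objects.filter(field_j__gte=self.field_a).first()
        if result is not None:
            self.calculated_field_1 = result

This keeps the logic discoverable on the model itself and avoids the hidden control flow of signals; the extra query per save is the trade-off.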

Performance: Store likes in PostgreSQL ArrayField (Django example)

I have 2 models, Post and Comment, each of which can be liked by a User.
Of course, total likes should be rendered somewhere near each Post or Comment.
But each User should also have a page listing all the content they liked.
The most obvious way is to do this with an m2m field, but it seems like that will lead to lots of problems down the road.
So what about this instead?
The Post and Comment models would have a field like:
users_liked_ids = ArrayField(models.IntegerField())
The User model would have fields like:
posts_liked_ids = ArrayField(models.IntegerField())
comments_liked_ids = ArrayField(models.IntegerField())
Each time a User likes something, two actions are performed:
the User's id is added to the Post's/Comment's users_liked_ids field
the Post's/Comment's id is added to the User's posts_liked_ids/comments_liked_ids field
The questions are:
1) Is this a good plan?
2) Will lookups be efficient in this approach, e.g. to answer "was this Post/Comment liked by the current user"?
3) Would it be better to store likes in a separate table, rather than on the liked model, but still in an ArrayField?
4) Or is it better to stay with the obvious m2m?
1) No.
2) Definitely not.
3) Absolutely, incredibly not. Don't split your data up even further.
4) Yes.
Here are some of the problems:
no referential integrity, since you can't create foreign keys on array elements, meaning you could easily have garbage values in an ID array
data duplication with posts having user ids and users having post ids means it's possible for information to get out of sync (what happens when a user or post is deleted?)
inefficient lookups when matching against array elements (your #2)
Don't, under any circumstances, do this. You may want to combine your "post" and "comment" models to simplify the relationship, but this is what junction tables are for. Arrays are good for use cases that don't involve foreign keys or the potential for extreme length.
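For concreteness, a minimal sketch of the conventional junction-table approach the answer recommends (field and related names are illustrative):

from django.conf import settings
from django.db import models

class Post(models.Model):
    liked_by = models.ManyToManyField(
        settings.AUTH_USER_MODEL, related_name='liked_posts', blank=True)

class Comment(models.Model):
    liked_by = models.ManyToManyField(
        settings.AUTH_USER_MODEL, related_name='liked_comments', blank=True)

post.liked_by.add(user) records a like, post.liked_by.count() renders the total, and user.liked_posts.all() backs the "all liked content" page, with referential integrity enforced by the database.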

Django .order_by() with .distinct() using postgres

I have a Read model that is related to an Article model. What I would like to do is make a queryset where articles are unique and ordered by date_added. Since I'm using postgres, I'd prefer to use the .distinct() method and specify the article field. Like so:
articles = Read.objects.order_by('article', 'date_added').distinct('article')
However, this doesn't give the desired effect; the queryset comes out ordered by the order the records were created. I am aware of the note about .distinct() and .order_by() in Django's documentation, but I don't see that it applies here, since the side effect it mentions is duplicate results and I'm not seeing any.
# To actually sort by date added I end up doing this
articles = sorted(articles, key=lambda x: x.date_added, reverse=True)
This executes the entire query before I actually need it and could potentially get very slow if there are lots of records. I've already optimized using select_related().
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
UPDATE
The output would ideally be a queryset of Read instances where each related article is unique within the queryset, using only the Django ORM (i.e. no sorting in Python).
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
Possibly. It's hard to say without the full picture, but my assumption is that you are using Read to track which articles have and have not been read, probably tied to a User instance to determine whether a particular user has read an article. If that's the case, your approach is flawed. Instead, you should do something like:
class Article(models.Model):
    ...
    read_by = models.ManyToManyField(User, related_name='read_articles')
Then, to get a particular user's read articles, you can just do:
user_instance.read_articles.order_by('date_added')
That takes the need to use distinct out of the equation, since there will not be any duplicates now.
UPDATE
To get all articles that are read by at least one user:
Article.objects.filter(read_by__isnull=False)
Or, if you want to set a threshold for popularity, you can use annotations:
from django.db.models import Count
Article.objects.annotate(read_count=Count('read_by')).filter(read_count__gte=10)
Which would give you only articles that have been read by at least 10 users.
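If restructuring the models isn't an option, one ORM-only pattern (my own sketch, not from the original answer) keeps the distinct-on query but uses it only to select ids, then re-orders in an outer queryset:

latest_reads = Read.objects.order_by('article', '-date_added').distinct('article')
articles = Read.objects.filter(id__in=latest_reads.values('id')).order_by('-date_added')

The inner queryset satisfies PostgreSQL's rule that DISTINCT ON expressions must match the leading ORDER BY expressions, while the outer queryset is free to order by date_added alone.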

Django: limit queryset to a condition in the latest instance of a related set

I have a hierarchy of models that consists of four levels, all for various good reasons whose explanation, I assume, would be beyond the scope of this question.
So here it goes in pseudo python:
class Base(models.Model):
    ...

class Top(models.Model):
    base = FK(Base)

class Middle(models.Model):
    top = FK(Top)
    created_at = DateTime(...)
    flag = BooleanField(...)

class Bottom(models.Model):
    middle = FK(Middle)
    stored_at = DateTime(...)
    title = CharField(...)
Given a title, how do I efficiently find all instances of Base for which that title is matched only by the latest (by stored_at) Bottom instance of the latest (by created_at) Middle instance that has flag set to True?
I couldn't find a way to do this using the ORM, and as far as I can see, .latest() isn't useful on the model I want to query; the same holds for any convenience methods on the Base model. As I'm no SQL expert, I'd like to make use of the ORM and avoid denormalization as much as possible.
Thanks!
So, apparently, short of dropping into some very unwieldy SQL and finding no alternative solution, I saw myself forced to resort to denormalized fields on the Base model, just as many as were required for efficiently getting the wanted (filtered) querysets of said model.
These fields were then updated at creation/modification time of the respective Bottom instances.
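For reference, a chain of subqueries can express this in the ORM on reasonably recent Django versions. A sketch of my own, untested, assuming the pseudo models above map to concrete fields, with 'wanted title' standing in for the searched title:

from django.db.models import OuterRef, Subquery

# For each Top, the pk of its latest flagged Middle (by created_at).
latest_middle = Middle.objects.filter(
    top=OuterRef('pk'), flag=True
).order_by('-created_at').values('pk')[:1]

# The title of the latest Bottom (by stored_at) of that Middle; the
# OuterRef points at the annotation added below.
latest_title = Bottom.objects.filter(
    middle=OuterRef('latest_middle_id')
).order_by('-stored_at').values('title')[:1]

base_ids = (
    Top.objects.annotate(latest_middle_id=Subquery(latest_middle))
    .annotate(latest_title=Subquery(latest_title))
    .filter(latest_title='wanted title')
    .values('base_id')
)
bases = Base.objects.filter(pk__in=base_ids)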