I have a Read model that is related to an Article model. What I would like to do is make a queryset where articles are unique and ordered by date_added. Since I'm using postgres, I'd prefer to use the .distinct() method and specify the article field. Like so:
articles = Read.objects.order_by('article', 'date_added').distinct('article')
However, this doesn't give the desired effect: the queryset comes back ordered by article (which in practice is the order they were created) rather than by date_added. I am aware of the note about .distinct() and .order_by() in Django's documentation, but I don't see how it applies here, since the side effect it mentions is duplicates, and I'm not seeing any.
# To actually sort by date added I end up doing this
articles = sorted(articles, key=lambda x: x.date_added, reverse=True)
This executes the entire query before I actually need it and could potentially get very slow if there are lots of records. I've already optimized using select_related().
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
UPDATE
The output would ideally be a queryset of Read instances where each related article is unique, produced using only the Django ORM (i.e. without sorting in Python).
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
Possibly. It's hard to say without the full picture, but my assumption is that you are using Read to track which articles have and have not been read, probably tied to a User instance to determine whether a particular user has read an article. If that's the case, your approach is flawed. Instead, you should do something like:
class Article(models.Model):
    ...
    read_by = models.ManyToManyField(User, related_name='read_articles')
Then, to get a particular user's read articles, you can just do:
user_instance.read_articles.order_by('date_added')
That takes the need to use distinct out of the equation, since there will not be any duplicates now.
UPDATE
To get all articles that are read by at least one user:
Article.objects.filter(read_by__isnull=False).distinct()  # distinct() collapses the duplicate rows the m2m join produces
Or, if you want to set a threshold for popularity, you can use annotations:
from django.db.models import Count
Article.objects.annotate(read_count=Count('read_by')).filter(read_count__gte=10)
Which would give you only articles that have been read by at least 10 users.
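If you do need to keep the original Read model, a hedged sketch of staying entirely in the ORM: let PostgreSQL's DISTINCT ON pick one Read per article, then wrap that in a second query so the outer ordering can differ (assumes the models from the question):

# one row per article; DISTINCT ON requires a matching leading order_by
unique_reads = Read.objects.order_by('article', '-date_added').distinct('article')
# re-wrap so the final ordering is by date_added instead of article
articles = Read.objects.filter(pk__in=unique_reads.values('pk')).order_by('-date_added')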
I have a model that kinda looks like this:
class Person(models.Model):
    data = JSONField()
The data field has 2 properties, name and age. Now, let's say I want to get a paginated queryset (each page containing 20 people), filtered so that age is greater than 25 and ordered by age in descending order. In a usual setup, that is, a normalized database, I can write this query like so:
person_list_page_1 = Person.objects.filter(age__gt=25).order_by('-age')[:20]
Now, what is the equivalent of the above when filtering and ordering using keys stored in the JSONField? I have researched this, and it seems it was meant to be a feature in 2.1, but I can't seem to find anything relevant.
Link to the ticket about it being implemented in the future
I also have another question. Let's say we filter and order using keys in the JSONField. Will the ORM have to fetch all the objects, then filter and order them, before sending the first 20? That is, will performance be legitimately slower?
Obviously, I know a normalized database is far better for these things, but my hands are kinda tied.
You can use PostgreSQL's SQL syntax to extract subfields. They can then be used just like any other field on the model in queryset filters.
from django.db.models.expressions import RawSQL

Person.objects.annotate(
    age=RawSQL("(data->>'age')::int", [])
).filter(age__gt=25).order_by('-age')[:20]
See the PostgreSQL docs for other operators and functions: https://www.postgresql.org/docs/current/static/functions-json.html
In some cases, you might have to add explicit typecasts (::int, for example).
Performance will be slower than with a proper field, but it's not bad.
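On newer Django versions you can often avoid RawSQL; a hedged sketch using Cast with a key transform (assumes Django 3.1+, where KeyTextTransform lives in django.db.models.fields.json; on older versions it was in django.contrib.postgres.fields.jsonb):

from django.db.models import IntegerField
from django.db.models.fields.json import KeyTextTransform
from django.db.models.functions import Cast

# annotate a real integer from the JSON key, then filter/order as usual
Person.objects.annotate(
    age_int=Cast(KeyTextTransform('age', 'data'), IntegerField())
).filter(age_int__gt=25).order_by('-age_int')[:20]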
I have 2 models, Post and Comment, each of which can be liked by a User.
Total likes should of course be rendered somewhere near each Post or Comment,
but each User should also have a page with all of their liked content.
So the most obvious way is to just do it with an m2m field, which it seems will lead to lots of problems down the road.
And what about this?
The Post and Comment models would have something like:
users_liked_ids = ArrayField(models.IntegerField())
The User model would also have such fields:
posts_liked_ids = ArrayField(models.IntegerField())
comments_liked_ids = ArrayField(models.IntegerField())
And each time a User likes something, two actions are performed:
the User's id is added to the Post's/Comment's users_liked_ids field
the Post's/Comment's id is added to the User's posts_liked_ids/comments_liked_ids field
The questions are:
1) Is it a good plan?
2) Will lookups with this approach be efficient, e.g. to answer "was this Post/Comment liked by the current user"?
3) Would it be better to store the likes in a separate table, rather than on the liked model, but still in an ArrayField?
4) Or is it better to stay with the obvious m2m?
1) No.
2) Definitely not.
3) Absolutely, incredibly not. Don't split your data up even further.
4) Yes.
Here are some of the problems:
no referential integrity, since you can't create foreign keys on array elements, meaning you could easily have garbage values in an ID array
data duplication with posts having user ids and users having post ids means it's possible for information to get out of sync (what happens when a user or post is deleted?)
inefficient lookups when matching against array contents (your #2)
Don't, under any circumstances, do this. You may want to combine your "post" and "comment" models to simplify the relationship, but this is what junction tables are for. Arrays are good for use cases that don't involve foreign keys or the potential for extreme length.
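For reference, a minimal sketch of the m2m version (option 4); the field and related names here are assumptions, not from the question:

from django.conf import settings
from django.db import models

class Post(models.Model):
    liked_by = models.ManyToManyField(settings.AUTH_USER_MODEL, related_name='liked_posts', blank=True)

class Comment(models.Model):
    liked_by = models.ManyToManyField(settings.AUTH_USER_MODEL, related_name='liked_comments', blank=True)

With that, post.liked_by.count() renders the total, post.liked_by.filter(pk=user.pk).exists() answers "liked by the current user?", and user.liked_posts.all() / user.liked_comments.all() build the per-user page.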
Given a model that has more than one kind of connection to a related model (which I will call the "parent" model), how could I annotate a queryset of the parent model with a count of the objects linked to each parent through either connection, without counting duplicates?
Example model definitions
Consider an Article model that has 2 links to a parent Publication model that are very similar in meaning.
from django.db import models

class Publication(models.Model):
    pass

class Article(models.Model):
    publication = models.ForeignKey(Publication, on_delete=models.CASCADE, related_name='publications')
    owner = models.ForeignKey(Publication, on_delete=models.CASCADE, related_name='owned_articles')
Objective
I want to serve a page that lists publications. A business requirement is to show the number of articles each publication wishes to take credit for (these publications prefer a generous metric for counting). An article is considered part of a publication if either its "owner" or "publication" field points to that publication, but no article should be counted more than once for a single publication. An article may be included in the counts of 2 publications if its publication field points to a different object than its owner field.
I don't want to execute a query for every publication in the list.
The problem with Count annotations here
Publication.objects.annotate(Count('publications'), Count('owned_articles')) would be trivial. Then I would have publications__count and owned_articles__count.
My problem is that I can't tell how many articles in publications__count were also counted in owned_articles__count. Django doesn't let me cram a full queryset into Count, so in this general case of needing extra control over what is counted, a special mechanism is needed.
Similar questions
I have found this situation most similar to the question here:
Django annotate count with a distinct field
You could contrive the same general situation by intensifying that question's request: add another related model alongside InformationUnit and ask for a count of unique usernames across both related models.
(initial answer, answering my own question with a so-so solution)
The preferable approach would be to start with a Publication queryset; however, I can manage to squeeze a solution out of the Django ORM by pivoting around an Article queryset instead.
Consider as a solution to this problem:
from django.db.models import Count, F

# articles whose publication differs from their owner, counted per publication
exclusive_owners_qs = Article.objects.exclude(
    publication=F('owner')
).values('publication').annotate(Count('publication')).order_by('publication')

# every article, counted per owner
publications_qs = Article.objects.values('owner').annotate(Count('owner')).order_by('owner')
With this, I can loop over the two querysets and add up the 2 numbers locally inside of python to get the correct counts.
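A sketch of that Python-side combination, assuming the values() grouping above (each row is a dict):

from collections import Counter

totals = Counter()
for row in exclusive_owners_qs:
    totals[row['publication']] += row['publication__count']
for row in publications_qs:
    totals[row['owner']] += row['owner__count']
# totals now maps publication pk -> article count with no double counting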
This satisfies the requirements, but it's not an elegant solution; eliminating the need for a Python loop would be ideal.
I believe the correct answer is using Count("publications", distinct=True), as described here:
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#combining-multiple-aggregations
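A hedged sketch of that with the models above (the aliases are mine); without distinct=True, the two joins multiply each other's rows and both counts come back inflated:

from django.db.models import Count

publications = Publication.objects.annotate(
    pub_count=Count('publications', distinct=True),
    owned_count=Count('owned_articles', distinct=True),
)
# each count is now correct on its own; pub_count + owned_count still
# double-counts articles whose two fields point at the same publication,
# which is what the exclude(publication=F('owner')) trick above corrects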
I have a concern with Django subqueries when using the Django ORM. When I fetch a queryset or perform a DB operation, I can bypass any assumptions Django makes about which database to use by forcing it to use the specific database I want:
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
The above disregards any database routers I might have set and goes straight to 'some_db'.
But suppose my models look approximately like this:
class Author(models.Model):
    author_name = models.CharField(max_length=255)
    author_address = models.CharField(max_length=255)

class Book(models.Model):
    book_name = models.CharField(max_length=255)
    author = models.ForeignKey(Author, on_delete=models.SET_NULL, null=True)
And I fetch a QuerySet representing all books called Mark like so:
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
Then later, if somewhere in the code I trigger a subquery by doing something like:
if b_det:
    auth_address = b_det[0].author.author_address
Then this does not use the original database 'some_db' that I specified for the main query. It goes through the routers again and (possibly) picks up the wrong database.
Why does Django do this? IMHO, if I forced a specific database for the original query, the same database should be used for the subquery as well. Why must the database routers come into the picture here at all?
This is not a subquery in the strict SQL sense of the word. What you are actually doing here is executing one query and using its result to find related items.
You can chain filters and do lots of other operations on a queryset, but it will not be executed until you slice or iterate it, and here you are actually taking a slice:
auth_address = b_det[0].author.author_address
So you have a materialized query, and finding the address of the related author requires another query; since you are not using using() this time, Django is free to choose which database to run it against. You can overcome this by using select_related:
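A minimal sketch of that fix, reusing the query from the question; select_related('author') fetches the author in the same SQL query, so everything comes from 'some_db' and the routers never get involved for the follow-up attribute access:

b_det = Book.objects.using('some_db').select_related('author').filter(book_name='Mark')
if b_det:
    auth_address = b_det[0].author.author_address  # no second query, no router decision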
I upgraded to Django 1.7 so I could get Prefetch objects, but I'm having a hard time getting them to behave as expected.
I have an Employee model like this:
class Employee(Human):
    # ... additional Employee fields ...

    def get_last_activity_date(self):
        try:
            return self.activity_set.all().order_by('-when')[0:1].get().when
        except Activity.DoesNotExist:
            return None
and activities like this
class Activity(models.Model):
    when = models.DateTimeField()
    employee = models.ForeignKey(Employee, on_delete=models.CASCADE, related_name='activity_set')
I want to use prefetch_related to get the last activity date for each employee. I've tried to express this many ways, but no matter how I do it, it ends up generating another query per employee. My other two prefetch_related parts work as expected, but this one never seems to save me any queries.
I'm using this with Django Rest Framework, so I really need the prefetch_related part to work since I have no way of reaching inside DRF to do the mapping outside of the queryset.
Here is one of the ways that DOES NOT WORK
def get_queryset(self):
    return super(EmployeeViewSet, self).get_queryset()\
        .prefetch_related('phone_number_set', 'email_address_set')\
        .prefetch_related(Prefetch('activity_set', Activity.objects.all().order_by('-when')))\
        .order_by('last_name', 'first_name')
Notice that on the activity_set prefetch query I also can't slice it to get only the latest entry, which is a concern in terms of how much memory this is going to eat up.
I do actually see the prefetch query take place, but then each employee gets a separate query for that piece of information, meaning I have one bigger wasted query and still the ~200 queries I'm trying to prevent.
How do I get prefetch_related to work in this case?
I suspect you are missing the point of prefetch_related. The docs state that this is the expected behavior: many queries, with the 'joining' done in Python. If you want fewer queries you should use select_related. Also, I'm not sure it would work with your specific models (not fully stated in the question), since select_related does not work with many-to-many relations.
UPDATE - the docs:
prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python
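One common workaround for the original problem (not from this answer, so treat it as a hedged sketch): prefetch the ordered activities into a list with to_attr and take the first element in Python, instead of chaining .order_by() onto the cached manager, which is what triggers the per-employee query:

from django.db.models import Prefetch

employees = Employee.objects.prefetch_related(
    Prefetch('activity_set',
             queryset=Activity.objects.order_by('-when'),
             to_attr='ordered_activities')
)

# on Employee, read the prefetched list (only valid when the prefetch ran)
def get_last_activity_date(self):
    return self.ordered_activities[0].when if self.ordered_activities else None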