Django distinct related querying - django

I have two models:
Model A is an AbstractUserModel and Model B
class ModelB:
user = ForeignKey(User, related_name='modelsb')
timestamp = DateTimeField(auto_now_add=True)
What I want to find is how many users have at least one ModelB object created at least in 3 of the 7 past days.
So far, I have found a way to do it but I know for sure there is a better one and that is why I am posting this question.
I basically split the query into 2 parts.
Part1:
I added a foo method inside the User Model that checks if a user meets the above conditions
def foo(self):
past_limit = starting_date - timedelta(days=7)
return self.modelsb.filter(timestamp__gte=past_limit).order_by('timestamp__day').distinct('timestamp__day').count() > 2
Part 2:
In the Custom User Manager, I find the users that have more than 2 modelsb objects in the last 7 days and iterate through them applying the foo method for each one of them.
By doing this I narrow down the iterations of the required for loop. (basically its a filter function but you get the point)
def boo(self):
past_limit = timezone.now() - timedelta(days=7)
candidates = super().get_queryset().annotate(rc=Count('modelsb', filter=Q(modelsb__timestamp__gte=past_limit))).filter(rc__gt=2)
return list(filter(lambda x: x.foo(), candidates))
However, I want to know if there is a more efficient way to do this, that is without the for loop.

You can use conditional annotation.
I haven't been able to test this query, but something like this should work:
from django.db.models import Q, Count
past_limit = starting_date - timedelta(days=7)
users = User.objects.annotate(
modelsb_in_last_seven_days=Count('modelsb__timestap__day',
filter=Q(modelsb__timestamp__gte=past_limit),
distinct=True))
.filter(modelsb_in_last_seven_days__gte = 3)
EDIT:
This solution did not work, because the distinct option does specify what field makes an entry distinct.
I did some experimenting on my own Django instance, and found a way to make this work using SubQuery. The way this works is that we generate a subquery where we make the distinction ourself.
counted_modelb = ModelB.objects
.filter(user=OuterRef('pk'), timestamp__gte=past_limit)
.values('timestamp__day')
.distinct()
.annotate(count=Count('timestamp__day'))
.values('count')
query = User.objects
.annotate(modelsb_in_last_seven_days=Subquery(counted_modelb, output_field=IntegerField()))
.filter(modelsb_in_last_seven_days__gt = 2)
This annotates each row in the queryset with the count of all distinct days in modelb for the user, with a date greater than the selected day.
In the subquery I use values('timestamp__day') to make sure I can do distinct() (Because a combination of distinct('timestamp__day') and annotate() is unsupported.)

Related

Django query to return percentage of a users with a post

Two models Users (built-in) and Posts:
class Post(models.Model):
post_date = models.DateTimeField(default=timezone.now)
user = models.ForeignKey(User, on_delete=models.CASCADE, null=True, related_name='user_post')
post = models.CharField(max_length=100)
I want to have an API endpoint that returns the percentage of users that have posted. Basically I want SUM(unique users who have posted) / total_users
I have been trying to play around with annotate and aggregate, but I am getting the sum of posts for each users, or the sum of users per post (which is one...). How can I get the sum of posts returned with unique users, divide that by user.count and return?
I feel like I am missing something silly but my brain has gone to mush staring at this.
class PostParticipationAPIView(generics.ListAPIView):
queryset = Post.objects.all()
serializer_class = PostSerializer
def get_queryset(self):
start_date = self.request.query_params.get('start_date')
end_date = self.request.query_params.get('end_date')
# How can I take something like this, divide it by User.objects.all().count() * 100, and assign it to something to return as the queryset?
queryset = Post.objects.filter(post_date__gte=start_date, post_date__lte=end_date).distinct('user').count()
return queryset
My goal is to end up with the endpoint like:
{
total_participation: 97.3
}
Thanks for any guidance.
BCBB
EDIT
OK, I am still struggling a bit. I tried to create a serializer that just had a decimal field for participation_percentage like:
percentage_participation = serializers.DecimalField(max_digits=5, decimal_places=2, max_value=100, min_value=0)
Then I calculate in the view, but I get an error:
Got AttributeError when attempting to get a value for field percentage_participation on serializer ParticipationSerializer.
The serializer field might be named incorrectly and not match any attribute or key on the str instance.
Original exception text was: 'str' object has no attribute 'percentage_participation'.
Error was the same if I made it a CharField (in case there was some string coercion?).
So then I tried to move it to a Serializer Method and put all the calculation logic in there. This calculated fine, but if I had to provide a query_set in the view. If provided a model object, it just returned the percentage as many times as the query (say Posts.objects.all() had a total of 100 posts, it returned the percentage 100 times).
So then I tried to override the get_queryset in the view, but I HAVE to return something. If I just return { "meh", "hello" } then I return the percentage from the SerializerMethodField one time and the end result is exactly what I want.
I just have no idea as to WHY or how to do this correctly.
Thanks for your help.
EDIT #2
OK so I realized why I was only getting one, it was iterating over the string I returned, which was one character. When I returned "meh" it gave me three of the percentage, iterating over each character in the string...
I am not understanding from playing around, reading the docs, or using GoogleFu how to do this properly. I just want to be able to perform some kind of summary logic on records from the DB - how can I do this properly?!?!
Thank you for all your time.
BCBB
something like this should work
# get total user count
total_users = User.objects.count()
# get unique set of users with post
total_users_who_posted = Post.objects.filter(...).distinct("user").count()
# calculate_percentage
percentage = {
"total_participation": (total_users_who_posted*100)/ total_users
}
# take caution of divion by zero
I don't think it is possible to use djangos orm to do this completely but you can use the orm to get the user counts (with posts and total):
from django.db.models import BooleanField, Case, Count, When, Value
counts = (User
.objects
.annotate(posted=Case(When(user_post__isnull=False,
then=Value(True)),
default=Value(False),
output_field=BooleanField()))
.values('posted')
.aggregate(posted_users=Count('pk', filter=Q(posted=True)),
total_users=Count('pk', filter=Q(posted__isnull=False)))
# This will result in a dict containing the following:
# counts = {'posted_users': ...,
# 'total_users': ....}

Django ORM: Get all records and the respective last log for each record

I have two models, the simple version would be this:
class Users:
name = models.CharField()
birthdate = models.CharField()
# other fields that play no role in calculations or filters, but I simply need to display
class UserLogs:
user_id = models.ForeignKey(to='Users', related_name='user_daily_logs', on_delete=models.CASCADE)
reference_date = models.DateField()
hours_spent_in_chats = models.DecimalField()
hours_spent_in_p_channels = models.DecimalField()
hours_spent_in_challenges = models.DecimalField()
# other fields that play no role in calculations or filters, but I simply need to display
What I need to write is a query that will return all the fields of all users, with the latest log (reference_date) for each user. So for n users and m logs, the query should return n records. It is guaranteed that each user has at least one log record.
Restrictions:
the query needs to be written in django orm
the query needs to start from the user model. So Anything that goes like Users.objects... is ok. Anything that goes like UserLogs.objects... is not. That's because of filters and logic in the viewset, which is beyond my control
It has to be a single query, and no iterations in python, pandas or itertools are allowed. The Queryset will be directly processed by a serializer.
I shouldn't have to specify the names of the columns that need to be returned, one by one. The query must return all the columns from both models
Attempt no. 1 returns only user id and the log date (for obvious reasons). However, it is the right date, but I just need to get the other columns:
test = User.objects.select_related("user_daily_logs").values("user_daily_logs__user_id").annotate(
max_date=Max("user_daily_logs__reference_date"))
Attempt no. 2 generates as error (Cannot resolve expression type, unknown output_field):
logs = UserLogs.objects.filter(user_id=OuterRef('pk')).order_by('-reference_date')[:1]
users = Users.objects.annotate(latest_log = Subquery(logs))
This seems impossible taking into account all the restrictions.
One approach would be to use prefetch_related
users = User.objects.all().prefetch_related(
models.Prefetch(
'user_daily_logs',
queryset=UserLogs.objects.filter().order_by('-reference_date'),
to_attr="latest_log"
)
)
This will do two db queries and return all logs for every user which may or not be a problem depending on the number of records. If you need only logs for the current day as the name suggest, you can add that to filter and reduce the number of UserLogs records. Of course you need to get the first element from the list.
users.daily_logs[0]
For that you can create a #property on the User model which could look roughly like this
#property
def latest_log(self):
if not hasattr('daily_logs'):
return None
return self.daily_logs[0]
user.latest_log
You can also go a step further and try the following SubQuery inside Prefetch to limit the queryset to one element but I am not sure on the performance with this one (credits Django prefetch_related with limit).
users = User.objects.all().prefetch_related(
models.Prefetch(
'user_daily_logs',
queryset=UserLogs.objects.filter(id__in=Subquery(UserLogs.objects.filter(user_id=OuterRef('user_id')).order_by('-reference_date').values_list('id', flat=True)[:1] ) ),
to_attr="latest_log"
)
)

How to get a Count based on a subquery?

I am suffering to get a query working despite all I have been trying based on my web search, and I think I need some help before becoming crazy.
I have four models:
class Series(models.Model):
puzzles = models.ManyToManyField(Puzzle, through='SeriesElement', related_name='series')
...
class Puzzle(models.Model):
puzzles = models.ManyToManyField(Puzzle, through='SeriesElement', related_name='series')
...
class SeriesElement(models.Model):
puzzle = models.ForeignKey(Puzzle,on_delete=models.CASCADE,verbose_name='Puzzle',)
series = models.ForeignKey(Series,on_delete=models.CASCADE,verbose_name='Series',)
puzzle_index = models.PositiveIntegerField(verbose_name='Order',default=0,editable=True,)
class Play(models.Model):
puzzle = models.ForeignKey(Puzzle, on_delete=models.CASCADE, related_name='plays')
user = models.ForeignKey(settings.AUTH_USER_MODEL, blank=True,null=True, on_delete=models.SET_NULL, related_name='plays')
series = models.ForeignKey(Series, blank=True, null=True, on_delete=models.SET_NULL, related_name='plays')
puzzle_completed = models.BooleanField(default=None, blank=False, null=False)
...
each user can play any puzzle several times, each time creating a Play record.
that means that for a given set of (user,series,puzzle) we can have several Play records,
some with puzzle_completed = True, some with puzzle_completed = False
What I am trying (unsuccesfully) to achieve, is to calculate for each series, through an annotation, the number of puzzles nb_completed_by_user and nb_not_completed_by_user.
For nb_completed_by_user, I have something which works in almost all cases (I have one glitch in one of my test that I cannot explain so far):
Series.objects.annotate(nb_completed_by_user=Count('puzzles',
filter=Q(puzzles__plays__puzzle_completed=True,
puzzles__plays__series_id=F('id'),puzzles__plays__user=user), distinct=True))
For nb_not_completed_by_user, I was able to make a query on Puzzle that gives me the good answer, but I am not able to transform it into a Subquery expression that works without throwing up an error, or to get a Count expression to give me the proper answer.
This one works:
puzzles = Puzzle.objects.filter(~Q(plays__puzzle_completed=True,
plays__series_id=1, plays__user=user),series=s)
but when trying to move to a subquery, I cannot find the way to use the following expression not to throw the error:ValueError: This queryset contains a reference to an outer query and may only be used in a subquery.
pzl_completed_by_user = Puzzle.objects.filter(plays__series_id=OuterRef('id')).exclude(
plays__puzzle_completed=True,plays__series_id=OuterRef('id'), plays__user=user)
and the following Count expression doesn't give me the right result:
Series.objects.annotate(nb_not_completed_by_user=Count('puzzles', filter=~Q(
puzzle__plays__puzzle_completed=True, puzzle__plays__series_id=F('id'),
puzzle__plays__user=user))
Could anybody explain me how I could obtain both values ?
and eventually to propose me a link which explains clearly how to use subqueries for less-obvious cases than those in the official documentation
Thanks in advance
Edit March 2021:
I recently found two posts which guided me through one potential solution to this specific issue:
Django Count and Sum annotations interfere with each other
and
Django 1.11 Annotating a Subquery Aggregate
I implemented the proposed solution from https://stackoverflow.com/users/188/matthew-schinckel and https://stackoverflow.com/users/1164966/benoit-blanchon
having help classes: class SubqueryCount(Subquery) and class SubquerySum(Subquery)
class SubqueryCount(Subquery):
template = "(SELECT count(*) FROM (%(subquery)s) _count)"
output_field = PositiveIntegerField()
class SubquerySum(Subquery):
template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'
def __init__(self, queryset, column, output_field=None, **extra):
if output_field is None:
output_field = queryset.model._meta.get_field(column)
super().__init__(queryset, output_field, column=column, **extra)
It works extremely well ! and is far quicker than the conventional Django Count annotation.
... at least in SQlite, and probably PostgreSQL as stated by others.
But when I tried in a MariaDB environnement ... it crashed !
MariaDB is apparently not able / not willing to handle correlated subqueries as those are considered sub-optimal.
In my case, as I try to get from the database multiple Count/distinct annotations for each record at the same time, I really see a tremendous gain in performance (in SQLite)
that I would like to replicate in MariaDB.
Would anyone be able to help me figure out a way to implement those helper functions for MariaDB ?
What should template be in this environnement?
matthew-schinckel ?
benoit-blanchon ?
rktavi ?
Going a bit deeper and analysis the Django docs a bit more in details, I was finally able to produce a satisfying way to produce a Count or Sum based on subquery.
For simplifying the process, I defined the following helper functions:
To generate the subquery:
def get_subquery(app_label, model_name, reference_to_model_object, filter_parameters={}):
"""
Return a subquery from a given model (work with both FK & M2M)
can add extra filter parameters as dictionary:
Use:
subquery = get_subquery(
app_label='puzzles', model_name='Puzzle',
reference_to_model_object='puzzle_family__target'
)
or directly:
qs.annotate(nb_puzzles=subquery_count(get_subquery(
'puzzles', 'Puzzle','puzzle_family__target')),)
"""
model = apps.get_model(app_label, model_name)
# we need to declare a local dictionary to prevent the external dictionary to be changed by the update method:
parameters = {f'{reference_to_model_object}__id': OuterRef('id')}
parameters.update(filter_parameters)
# putting '__id' instead of '_id' to work with both FK & M2M
return model.objects.filter(**parameters).order_by().values(f'{reference_to_model_object}__id')
To count the subquery generated through get_subquery:
def subquery_count(subquery):
"""
Use:
qs.annotate(nb_puzzles=subquery_count(get_subquery(
'puzzles', 'Puzzle','puzzle_family__target')),)
"""
return Coalesce(Subquery(subquery.annotate(count=Count('pk', distinct=True)).order_by().values('count'), output_field=PositiveIntegerField()), 0)
To sum the subquery generated through get_subquery on the field field_to_sum:
def subquery_sum(subquery, field_to_sum, output_field=None):
"""
Use:
qs.annotate(total_points=subquery_sum(get_subquery(
'puzzles', 'Puzzle','puzzle_family__target'),'points'),)
"""
if output_field is None:
output_field = queryset.model._meta.get_field(column)
return Coalesce(Subquery(subquery.annotate(result=Sum(field_to_sum, output_field=output_field)).order_by().values('result'), output_field=output_field), 0)
The required imports:
from django.db.models import Count, Subquery, PositiveIntegerField, DecimalField, Sum
from django.db.models.functions import Coalesce
I spent so many hours on solving this ...
I hope that this will save many of you all the frustration I went through figuring out the right way to proceed.

Django queryset get max id's for a filter

I want to get a list of max ids for a filter I have in Django
class Foo(models.Model):
name = models.CharField()
poo = models.CharField()
Foo.objects.filter(name__in=['foo','koo','too']).latest_by_id()
End result a queryset having only the latest objects by id for each name. How can I do that in Django?
Edit: I want multiple objects in the end result. Not just one object.
Edit1: Added __in. Once again I need only latest( as a result distinct) objects for each name.
Something like this.
my_id_list = [Foo.objects.filter(name=name).latest('id').id for name in ['foo','koo','too']]
Foo.objects.filter(id__in=my_id_list)
The above works. But I want a more concise way of doing it. Is it possible to do this in a single query/filter annotate combination?
you can try:
qs = Foo.objects.filter(name__in=['foo','koo','too'])
# Get list of max == last pk for your filter objects
max_pks = qs.annotate(mpk=Max('pk')).order_by().values_list('mpk', flat=True)
# after it filter your queryset by last pk
result = qs.filter(pk__in=max_pks)
If you are using PostgreSQL you can do the following
Foo.objects.order_by('name', '-id').distinct('name')
MySQL is more complicated since is lacks a DISTINCT ON clause. Here is the raw query that is very hard to force Django to generate from ORM function calls:
Foo.objects.raw("""
SELECT
*
FROM
`foo`
GROUP BY `foo`.`name`
ORDER BY `foo`.`name` ASC , `foo`.`id` DESC
""")

Filter models by checking one datetime's month against another

Using Django's ORM, I am trying to find instances of myModel based on two of its datetime variables; specifically, where the months of these two datetimes are not equal. I understand to filter by the value of a modelfield, you can use Django's F( ) expressions, so I thought I'd try something like this:
myModel.objects.filter(fixed_date__month=F('closed_date__month'))
I know this wouldn't find instances where they aren't equal, but I thought it'd be a good first step since I've never used the F expressions before. However, it doesn't work as I thought it should. I expected it to give me a queryset of objects where the value of the fixed_date month was equal to the value of the closed_date month, but instead I get an error:
FieldError: Join on field 'closed_date' not permitted. Did you misspell 'month' for the lookup type?
I'm not sure if what I'm trying to do isn't possible or straightforward with the ORM, or if I'm just making a simple mistake.
It doesn't look like django F objects currently support extracting the month inside a DateTimeField, the error message seems to be stating that the F object is trying to convert the '__' inside the string 'closed_date__month' as a Foreignkey between different objects, which are usually stored as joins inside an sql database.
You could carry out the same query by iterating across the objects:
result = []
for obj in myModel.objects.all():
if obj.fixed_date.month != obj.closed_date.month:
result.append(obj)
or as a list comprehension:
result = [obj for obj in myModel.objects.all() if obj.fixed_date.month != obj.closed_date.month]
Alternatively, if this is not efficient enough, the months for the two dates could be cached as IntegerFields within the model, something like:
class myModel(models.Model):
....other fields....
fixed_date = models.DateTimeField()
closed_date = models.DateTimeField()
fixed_month = models.IntegerField()
closed_month = models.IntegerField()
store the two integers when the relevant dates are updated:
myModel.fixed_month = myModel.fixed_date.month
myModel.save()
Then use an F object to compare the two integer fields:
myModel.objects.filter(fixed_month__ne=F('closed_month'))
The ne modifier will do the not equal test.
Edit - using raw sql
If you are using an sql based database, then most efficient method is to use the .raw() method to manually specify the sql:
myModel.objects.raw('SELECT * FROM stuff_mymodel WHERE MONTH(fixed_date) != MONTH(close_date)')
Where 'stuff_mymodel' is the correct name of the table in the database. This uses the SQL MONTH() function to extract the values from the month fields, and compare their values. It will return a collection of objects.
There is some nay-saying about the django query system, for example: http://charlesleifer.com/blog/shortcomings-in-the-django-orm-and-a-look-at-peewee-a-lightweight-alternative/. This example could be taken as demonstrating another inconsistency in it's query api.
My thinking is this:
class myModel(models.Model):
fixed_date = models.DateTimeField()
closed_date = models.DateTimeField()
def has_diff_months(self):
if self.fixed_date.month != self.closed_date.month:
return True
return False
Then:
[x for x in myModel.objects.all() if x.has_diff_months()]
However, for a truly efficient solution you'd have to use another column. It makes sense to me that it'd be a computed boolean field that is created when you save, like so:
class myModel(models.Model):
fixed_date = models.DateTimeField()
closed_date = models.DateTimeField()
diff_months = models.BooleanField()
#overriding save method
def save(self, *args, **kwargs):
#calculating the value for diff_months
self.diff_months = (self.fixed_date.month != self.closed_date.month)
#aaand... saving:
super(Blog, self).save(*args, **kwargs)
Then filtering would simply be:
myModel.objects.filter(diff_months=True)