Django 4: how to do annotation over a boolean field queryset?

I found this discussion about annotation over querysets. I wonder whether that ever got implemented.
When I run the following annotation:
>>> x = Isolate.objects.all().annotate(sum_shipped=Sum('shipped'))
>>> x[0].sum_shipped
True
shipped is a boolean field, so I would expect the number of instances where shipped is set to True. Instead, I get True or 1. That is pretty inconvenient behaviour.
There is also this discussion on stackoverflow. However, this only covers django 1.
I would expect something like Sum([True, True, False]) -> 2.
Instead, True is returned.
It seems this has not been touched since then. Is the discussion in the second link still the state of the art?
Is there a better way to do statistics with the content of the database than this, now that Django is in version 4?
My current database is sqlite3 for development and testing, but will be Postgres in production. I would very much like to get database independent results.
SUM(bool_field) on Oracle will return 1 also, because bools in Oracle are just a bit (1 or 0). Postgres has a specific bool type, and you can't sum a True/False without an implicit conversion to int. This is why I say SUM(bool) is only accidentally supported for a subset of databases. The field type that is returned is based on the backend get_db_converters and the actual values that come back.
Example model
class Isolate(models.Model):
    isolate_nbr = models.CharField(max_length=10, primary_key=True)
    growing = models.BooleanField(default=True)
    shipped = models.BooleanField(default=True)
    ....

I think you have a few options here; annotate is not one of them at this time.
from django.db.models import Sum, Count
# Use aggregate with a filter
print(Isolate.objects.filter(shipped=True).aggregate(Sum('shipped')))
# Just filter, then get the count of the queryset
print(Isolate.objects.filter(shipped=True).count())
# Just use aggregate without a filter (this will only aggregate/Sum the True values)
print(Isolate.objects.aggregate(Sum('shipped')))
# WON'T WORK: annotate only annotates each row with that row's own value, not the aggregate across the table
print(Isolate.objects.annotate(sum_shipped=Count('shipped')).first().sum_shipped)
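Since SUM over a boolean column depends on the backend, one backend-neutral idea is to turn each row into an integer before summing. The sketch below shows that idea in plain SQL through the standard library's sqlite3 module rather than through the ORM. The table layout and sample rows are assumptions made up for illustration. In ORM terms this corresponds to conditional aggregation, for example Count('pk', filter=Q(shipped=True)) inside aggregate(), which Django has supported since 2.0.

```python
import sqlite3

# In-memory stand-in for the Isolate table (hypothetical sample data).
# SQLite stores booleans as the integers 0 and 1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE isolate ("
    "isolate_nbr TEXT PRIMARY KEY, growing INTEGER NOT NULL, shipped INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO isolate VALUES (?, ?, ?)",
    [("A1", 1, 1), ("A2", 1, 1), ("A3", 0, 0)],
)

# CASE WHEN maps every row to 1 or 0 first, so the database sums integers
# and never has to add up boolean values directly.
n_shipped, n_growing = conn.execute(
    "SELECT SUM(CASE WHEN shipped THEN 1 ELSE 0 END), "
    "       SUM(CASE WHEN growing THEN 1 ELSE 0 END) "
    "FROM isolate"
).fetchone()

print(n_shipped, n_growing)
```

Both counts come back from a single statement, which is what makes this shape convenient for simple statistics over several boolean columns.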

Related

Return object when aggregating grouped fields in Django

Assuming the following example model:
# models.py
class event(models.Model):
    location = models.CharField(max_length=10)
    type = models.CharField(max_length=10)
    date = models.DateTimeField()
    attendance = models.IntegerField()
I want to get the attendance number for the latest date of each event location and type combination, using Django ORM. According to the Django Aggregation documentation, we can achieve something close to this, using values preceding the annotation.
... the original results are grouped according to the unique combinations of the fields specified in the values() clause. An annotation is then provided for each unique group; the annotation is computed over all members of the group.
So using the example model, we can write:
event.objects.values('location', 'type').annotate(latest_date=Max('date'))
which does indeed group events by location and type, but does not return the attendance field, which is what I actually need.
Another approach I tried was to use distinct i.e.:
event.objects.distinct('location', 'type').annotate(latest_date=Max('date'))
but I get an error
NotImplementedError: annotate() + distinct(fields) is not implemented.
I found some answers which rely on database specific features of Django, but I would like to find a solution which is agnostic to the underlying relational database.
Alright, I think this one might actually work for you. It is based upon an assumption, which I think is correct.
When you create your model objects, they should all be unique. It seems highly unlikely that you would have two events on the same date, in the same location, and of the same type. So with that assumption, let's begin. (As a formatting note, class names tend to start with capital letters to differentiate classes from variables or instances.)
from django.db.models import Max
# First you get your desired events with your criteria.
results = Event.objects.values('location', 'type').annotate(latest_date=Max('date'))
# Make an empty 'list' to store the values you want.
results_list = []
# Then iterate through your 'results' looking up objects
# you want and populating the list.
for r in results:
    result = Event.objects.get(location=r['location'], type=r['type'], date=r['latest_date'])
    results_list.append(result)
# Now you have a list of objects that you can do whatever you want with.
You might have to look up the exact output of Max('date'), but this should get you on the right path.
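The loop above issues one extra lookup per group. As a rough sketch of how the same result could come from a single statement, the example below joins the table against its own grouped maximum, using plain SQL through the standard library's sqlite3 module. The table layout and sample rows are invented for illustration. It relies on the same assumption as the answer: at most one event per location, type and date. Within the Django ORM, a comparable correlated lookup can be built with Subquery and OuterRef (available since Django 1.11).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id INTEGER PRIMARY KEY, location TEXT, type TEXT, "
    "date TEXT, attendance INTEGER)"
)
# Hypothetical sample data: two talks in Paris, so only the later one should survive.
conn.executemany(
    "INSERT INTO event (location, type, date, attendance) VALUES (?, ?, ?, ?)",
    [
        ("Paris", "talk", "2020-01-01", 10),
        ("Paris", "talk", "2020-03-01", 25),
        ("Paris", "demo", "2020-02-01", 7),
        ("Oslo", "talk", "2020-01-15", 12),
    ],
)

# Inner query: latest date per (location, type) group.
# Join: keep only the full rows whose date equals that maximum,
# which brings the attendance column along.
rows = conn.execute(
    "SELECT e.location, e.type, e.date, e.attendance "
    "FROM event AS e "
    "JOIN (SELECT location, type, MAX(date) AS latest_date "
    "      FROM event GROUP BY location, type) AS g "
    "  ON e.location = g.location AND e.type = g.type AND e.date = g.latest_date "
    "ORDER BY e.location, e.type"
).fetchall()

for row in rows:
    print(row)
```

The join uses only GROUP BY and MAX, so it does not depend on PostgreSQL's DISTINCT ON.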

Modern methods for filtering a Django annotation?

I'd like to filter an annotation using the Django ORM. A lot of the articles I've found here at SO are fairly dated, targeting Django back in the 1.2 to 1.4 days:
Filtering only on Annotations in Django - This question from 2010 suggests using an extra clause, which isn't recommended by the official Django docs
Django annotation with nested filter - Similar suggestions are provided in this question from 2011.
Django 1.8 adds conditional aggregation, which seems like what I might want, but I can't quite figure out the syntax that I'll eventually need. Here are my models and the scenario I'm trying to reach (I've simplified the models for brevity's sake):
class Project(models.Model):
    name = models.CharField()
    ... snip ...
class Milestone_meta(models.Model):
    name = models.CharField()
    is_cycle = models.BooleanField()
class Milestone(models.Model):
    project = models.ForeignKey('Project')
    meta = models.ForeignKey('Milestone_meta')
    entry_date = models.DateField()
I want to get each Project (with all its fields), along with the Max(entry_date) and Min(entry_date) for each associated Milestone, but only for those Milestone records whose associated Milestone_meta has the is_cycle flag set to True. In other words:
For every Project record, give me the maximum and minimum Milestone entry_dates, but only when the associated Milestone_meta has a given flag set to True.
At the moment, I'm getting a list of projects, then getting the Max and Min Milestones in a loop, resulting in N+1 database hits (which gets slow, as you'd expect):
pqs = Project.objects.all()
for p in pqs:
    (theMin, theMax) = getMilestoneBounds(p)
    # Use values from p and theMin and theMax
    ...
def getMilestoneBounds(pid):
    # Restrict to the given project's milestones whose meta has is_cycle set
    mqs = Milestone.objects.filter(project=pid, meta__is_cycle=True)
    theData = mqs.aggregate(min_entry=Min('entry_date'), max_entry=Max('entry_date'))
    return (theData['min_entry'], theData['max_entry'])
How can I reduce this to one or two queries?
As far as I know, you cannot get all the required project objects in one query.
However, if you don't need the objects and can work with just their ids, one way would be:
Milestone.objects.filter(meta__is_cycle=True).values('project').annotate(min_entry=Min('entry_date')).annotate(max_entry=Max('entry_date'))
It will give a list of dicts, one per distinct project. Each dict holds the project's id under the 'project' key, which you can then use to look up the objects when needed.
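The conditional aggregation mentioned in the question can be pictured as a join in which the condition sits inside the aggregate instead of in the WHERE clause, so projects without matching milestones still appear. The sketch below shows that shape in plain SQL through the standard library's sqlite3 module. The table names, columns and sample rows are assumptions modelled loosely on the question's models. On Django 2.0 or later the ORM counterpart would look roughly like Project.objects.annotate(min_entry=Min('milestone__entry_date', filter=Q(milestone__meta__is_cycle=True))), with a matching Max for max_entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE milestone_meta (id INTEGER PRIMARY KEY, name TEXT, is_cycle INTEGER);
    CREATE TABLE milestone (
        id INTEGER PRIMARY KEY,
        project_id INTEGER REFERENCES project(id),
        meta_id INTEGER REFERENCES milestone_meta(id),
        entry_date TEXT
    );
    -- Hypothetical sample data: project 2 has no cycle milestones at all.
    INSERT INTO project VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO milestone_meta VALUES (1, 'cycle', 1), (2, 'other', 0);
    INSERT INTO milestone (project_id, meta_id, entry_date) VALUES
        (1, 1, '2015-01-10'), (1, 1, '2015-06-30'), (1, 2, '2014-01-01'),
        (2, 2, '2015-03-03');
    """
)

# The CASE expression yields NULL for non-cycle milestones, and MIN/MAX skip
# NULLs, so only cycle milestones contribute to the bounds. The LEFT JOINs
# keep projects that have no qualifying milestone (their bounds are NULL).
rows = conn.execute(
    """
    SELECT p.id, p.name,
           MIN(CASE WHEN mm.is_cycle THEN m.entry_date END) AS min_entry,
           MAX(CASE WHEN mm.is_cycle THEN m.entry_date END) AS max_entry
    FROM project AS p
    LEFT JOIN milestone AS m ON m.project_id = p.id
    LEFT JOIN milestone_meta AS mm ON mm.id = m.meta_id
    GROUP BY p.id, p.name
    ORDER BY p.id
    """
).fetchall()

for row in rows:
    print(row)
```

A project with no cycle milestones comes back with None for both bounds, which the calling code would need to handle.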

Django QuerySet update performance

Which one would be better for performance?
We take a slice of products, which makes a bulk update impossible.
products = Product.objects.filter(featured=True).order_by("-modified_on")[3:]
for product in products:
    product.featured = False
    product.save()
or (invalid)
for product in products.iterator():
    product.update(featured=False)
I have also tried the QuerySet in lookup, as follows.
Product.objects.filter(pk__in=products).update(featured=False)
This line works fine on SQLite, but it raises the following exception on MySQL, so I couldn't use it.
DatabaseError: (1235, "This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'")
Edit: The iterator() method also causes the query to be re-evaluated, so it is bad for performance.
As @Chris Pratt pointed out in the comments, the second example is invalid because model instances don't have an update method. Your first example will require a number of queries equal to results+1, since it has to update each object individually. That might be really costly if you have 1000 products. Ideally you want to reduce this to a more fixed expense if possible.
This is a similar situation to another question:
Django: Cannot update a query once a slice has been taken
That being said, you would have to do it in at least 2 queries, but you have to be a bit sneaky on how to construct the LIMIT...
Using Q objects for complex queries:
# get the IDs we want to exclude
products = Product.objects.filter(featured=True).order_by("-modified_on")[:3]
# flatten them into just a list of ids
ids = products.values_list('id', flat=True)
# Now use the Q object to construct a complex query
from django.db.models import Q
# This builds a list of "AND id NOT EQUAL TO i"
limits = [~Q(id=i) for i in ids]
Product.objects.filter(featured=True, *limits).update(featured=False)
In some cases it's acceptable to cache the QuerySet in a list:
products = list(products)
Product.objects.filter(pk__in=products).update(featured=False)
A small optimization with values_list:
products_id = list(products.values_list('id', flat=True))
Product.objects.filter(pk__in=products_id).update(featured=False)
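The answers above share one idea: read the ids to keep in a first query, then run a single UPDATE that receives those ids as plain values, so no LIMIT ends up inside an IN subquery. A minimal sketch of that two-query pattern follows, written against the standard library's sqlite3 module instead of the ORM. The table layout and the six sample rows are assumptions for illustration. In ORM terms the second step would be along the lines of Product.objects.filter(featured=True).exclude(pk__in=list(ids)).update(featured=False).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, featured INTEGER, modified_on TEXT)"
)
# Hypothetical sample data: six featured products, ids 1..6, newest last.
conn.executemany(
    "INSERT INTO product (featured, modified_on) VALUES (?, ?)",
    [(1, "2013-01-0%d" % day) for day in range(1, 7)],
)

# Query 1: materialise the ids of the three most recently modified featured rows.
keep_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM product WHERE featured = 1 "
        "ORDER BY modified_on DESC LIMIT 3"
    )
]

# Query 2: one UPDATE for every other featured row. The ids are bound as
# ordinary parameters, so the statement contains no subquery at all.
placeholders = ", ".join("?" for _ in keep_ids)
cursor = conn.execute(
    "UPDATE product SET featured = 0 "
    "WHERE featured = 1 AND id NOT IN (%s)" % placeholders,
    keep_ids,
)
conn.commit()

still_featured = [
    row[0]
    for row in conn.execute("SELECT id FROM product WHERE featured = 1 ORDER BY id")
]
print(cursor.rowcount, still_featured)
```

The cost stays at two statements regardless of how many products lose the flag, unlike the per-object save() loop.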

Django views - optimum query set in a ForeignKey model

Having the model:
class Notebook(models.Model):
    n_id = models.AutoField(primary_key = True)
class Note(models.Model):
    b_nbook = models.ForeignKey(Notebook)
the URL pattern passing one parameter:
(r'^(?P<n_id>\d+)/$', 'notebook_notes')
and the following view:
def notebook_notes(request, n_id):
    nbook = get_object_or_404(Notebook, pk=n_id)
    ...
which of the following is the optimum query set, and why? (They both work and pass on the notes that belong to the notebook selected by the URL.)
notes = nbook.note_set.filter(b_nbook = n_id)
notes = Note.objects.select_related().filter(b_nbook = n_id)
Well you're comparing apples and oranges a bit there. They may return virtually the same, but you're doing different things on both.
Let's take the relational version first. That query is saying get all the notes that belong to nbook. You're then filtering that queryset by only notes that belong to nbook. You're filtering it twice on the same criteria, in effect. Since Django's querysets are lazy, it doesn't really do anything bad, like hit the database multiple times, but it's still unnecessary.
Now, the second version. Here, you're starting with all notes and filtering to just those that belong to the particular notebook. There's only one filter this time, but it's bad form to do it this way. Since it's a relation, you should look it up through the relational format, i.e. nbook.note_set.all(). On this version, though, you're also using select_related(), which wasn't used on the other version.
select_related will attempt to join in any other relations on the model, in this case Note. However, since the only relation on Note is Notebook and you already have the notebook, it's redundant.
Taking out all the redundancy in those two versions leaves you with just:
notes = nbook.note_set.all()
That, too, will return exactly the same results as the other two versions, but is much cleaner and more standard.

Select DISTINCT individual columns in django?

I'm curious if there's any way to do a query in Django that's not a "SELECT * FROM..." underneath. I'm trying to do a "SELECT DISTINCT columnName FROM ..." instead.
Specifically I have a model that looks like:
class ProductOrder(models.Model):
    Product = models.CharField(max_length=20, primary_key=True)
    Category = models.CharField(max_length=30)
    Rank = models.IntegerField()
where the Rank is a rank within a Category. I'd like to be able to iterate over all the Categories doing some operation on each rank within that category.
I'd like to first get a list of all the categories in the system and then query for all products in that category and repeat until every category is processed.
I'd rather avoid raw SQL, but if I have to go there, that'd be fine. Though I've never coded raw SQL in Django/Python before.
One way to get the list of distinct values of a column from the database is to use distinct() in conjunction with values().
In your case you can do the following to get the names of distinct categories:
q = ProductOrder.objects.values('Category').distinct()
print(q.query) # See for yourself.
# The query would look something like
# SELECT DISTINCT "app_productorder"."category" FROM "app_productorder"
There are a couple of things to remember here. First, this will return a ValuesQuerySet which behaves differently from a QuerySet. When you access say, the first element of q (above) you'll get a dictionary, NOT an instance of ProductOrder.
Second, it would be a good idea to read the warning note in the docs about using distinct(). The above example will work, but not every combination of distinct() and values() will.
PS: it is a good idea to use lower case names for fields in a model. In your case this would mean rewriting your model as shown below:
class ProductOrder(models.Model):
    product = models.CharField(max_length=20, primary_key=True)
    category = models.CharField(max_length=30)
    rank = models.IntegerField()
It's quite simple actually if you're using PostgreSQL, just use distinct(columns) (documentation).
ProductOrder.objects.all().distinct('category')
Note that this feature has been included in Django since 1.4.
Use order_by with that field, and then do distinct.
ProductOrder.objects.order_by('category').values_list('category', flat=True).distinct()
The other answers are fine, but this is a little cleaner, in that it only gives the values like you would get from a DISTINCT query, without any cruft from Django.
>>> set(ProductOrder.objects.values_list('category', flat=True))
{u'category1', u'category2', u'category3', u'category4'}
or
>>> list(set(ProductOrder.objects.values_list('category', flat=True)))
[u'category1', u'category2', u'category3', u'category4']
And, it works without PostgreSQL.
This is less efficient than using a .distinct(), presuming that DISTINCT in your database is faster than a python set, but it's great for noodling around the shell.
Update:
This answer is great for making queries in the Django shell during development. DO NOT use this solution in production unless you are absolutely certain that you will always have a trivially small number of results before set is applied. Otherwise, it's a terrible idea from a performance standpoint.
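To make the performance warning concrete, the sketch below compares the two approaches at the SQL level using the standard library's sqlite3 module. The table and the 1000 generated rows are assumptions for illustration only. With DISTINCT the database hands back just the unique values, while the set() approach first fetches every row and deduplicates afterwards in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productorder (product TEXT PRIMARY KEY, category TEXT, rank INTEGER)"
)
# Hypothetical sample data: 1000 products spread over four categories.
conn.executemany(
    "INSERT INTO productorder VALUES (?, ?, ?)",
    [("p%d" % i, "category%d" % (i % 4 + 1), i) for i in range(1000)],
)

# Deduplicate in the database: only the distinct values are returned.
distinct_rows = conn.execute(
    "SELECT DISTINCT category FROM productorder ORDER BY category"
).fetchall()

# Deduplicate in Python: every row is fetched first, then collapsed by set().
all_rows = conn.execute("SELECT category FROM productorder").fetchall()
via_set = sorted(set(row[0] for row in all_rows))

print(len(distinct_rows), len(all_rows), via_set)
```

Both routes produce the same four categories, but the second one transfers all 1000 rows to get there, and that gap grows with the table.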