Sum aggregation over list items in Django JSONField - django

I'd like to calculate the sum of all elements in a list inside a JSONField via Django's ORM. The objects basically look like this:
[
{"score": 10},
{"score": 0},
{"score": 40},
...
]
There are several problems that made me use a Raw Query in the end (see SQL query below) but I'd like to know if it is possible with Django's ORM.
SELECT id,
SUM(elements.score) AS total_score
FROM my_table,
LATERAL (SELECT
(jsonb_array_elements('results')->'score')::integer AS score
) AS elements
GROUP BY id
ORDER BY total_score DESC
The main problems I faced is that the list in the JSONField needs to be turned into a set via jsonb_array_elements. Afterwards it is impossible to run an aggregate function over the results. Postgres complains:
aggregate function calls cannot contain set-returning function calls
Using a LATERAL FROM -- as widely suggested -- is not possible with the ORM. Not even with Django's .extra() queryset method because it is not possible to specify an additional table that is not quoted in the final query:
Model.objects.annotate(...).extra(
tables="LATERAL (SELECT (jsonb_array_elements('results')->'score')::integer AS score) AS elements"
)
# ERROR: no relation "LATERAL (SELECT ..."

You can annotate the queryset with the score value from the JSONField, Cast it to an integer, retrieve the distinct values, and get the sum of whatever is left. I think the following query should do the trick:
from django.db.models import IntegerField
from django.db.models import Sum
from django.db.models.fields.json import KeyTextTransform
from django.db.models.functions import Cast
Model.objects.annotate(
score=Cast(
KeyTextTransform("score", "JSONField_name"),
IntegerField(),
)
).values("score").distinct().aggregate(Sum("score"))["score__sum"]
Note that you will still have to change the JSONField_name according to your model

Related

annotate group by in django

I'm trying to perform a query in django that is equivalent to this:
SELECT SUM(quantity * price) from Sales GROUP BY date.
My django query looks like this:
Sales.objects.values('date').annotate(total_sum=Sum('price * quantity'))
The above one throws error:
Cannot resolve keyword 'price * quantity' into field
Then I came up with another query after checking this https://stackoverflow.com/a/18220269/12113049
Sales.objects.values('date').annotate(total_sum=Sum('price', field='price*quantity'))
Unfortunately, this is not helping me much. It gives me SUM(price) GROUP BY date instead of SUM(quantity*price) GROUP BY date.
How do I query this in django?
You should be using F expressions to perform operations on fields:
from django.db.models import F
Sales.objects.values('date').annotate(total_sum=Sum(F('price') * F('quantity')))
Edit: assuming that price is a DecimalField and quantity is a IntegerField (of different types) you would need to specify output_field in Sum:
from django.db.models import DecimalField, F
Sales.objects.values('date').annotate(total_sum=Sum(F('price') * F('quantity'), output_field=DecimalField()))

Annotation with a subquery with multiple result in Django

I use postgresql database in my project and I use below example from django documentation.
from django.db.models import OuterRef, Subquery
newest = Comment.objects.filter(post=OuterRef('pk')).order_by('-created_at')
Post.objects.annotate(newest_commenter_email=Subquery(newest.values('email')[:1]))
but instead of newest commenter email, i need last two commenters emails. i changed [:1] to [:2] but this exception raised: ProgrammingError: more than one row returned by a subquery used as an expression.
You'll need to aggregate the subquery results in some way: perhaps by using an ARRAY() construct.
You can create a subclass of Subquery to do this:
class Array(Subquery):
template = 'ARRAY(%(subquery)s)`
output_field = ArrayField(base_field=models.TextField())
(You can do a more automatic method of getting the output field, but this should work for you for now: see https://schinckel.net/2019/07/30/subquery-and-subclasses/ for more details).
Then you can use:
posts = Post.objects.annotate(
newest_commenters=Array(newest.values('email')[:2]),
)
The reason this is happening is because a correlated subquery in postgres may only return one row, with one column. You can use this mechanism to deal with multiple rows, and perhaps use JSONB construction if you need multiple columns.

How to query tuples of columns in Django database queries?

I have some table ports(switch_ip, slot_number, port_number, many, more, columns) and would like to achieve the following PostgreSQL query using Django:
SELECT switch_ip, array_agg((slot_number, port_number, many, more, columns) ORDER BY slot_number, port_number) info
FROM ports
GROUP BY switch_ip
ORDER BY switch_ip
Using django.contrib.postgres.aggregates here's what I got so far:
Port.objects \
.values('switch_ip') \
.annotate(
info=ArrayAgg('slot_number', ordering=('slot_number', 'port_number'))
) \
.order_by('switch_ip')
I am unable to include more than one column in the ArrayAgg. None of ArrayAgg(a, b, c), ArrayAgg((a, b, c)), ArrayAgg([a, b, c]) seem to work. A workaround could involve separate ArrayAggs for each column and each with the same ordering. I would despise this because I have many columns. Is there any nicer workaround, possibly more low-level?
I suspect this is no issue with ArrayAgg itself but rather with tuple expressions in general. Is there any way to have tuples at all in Django queries? For example, what would be the corresponding Django of:
SELECT switch_ip, (slot_number, port_number, many, more, columns) info
FROM ports
If this is not yet possible in Django, how feasible would it be to implement?
I have spent lot of time searching for a working solution and here is a full recipe with code example.
You need to define Array "function" with square brackets in template
from django.db.models.expressions import Func
class Array(Func):
template = '%(function)s[%(expressions)s]'
function = 'ARRAY'
You need to define output field format (it must be array of some django field). For example an array of strings
from django.contrib.postgres.fields import ArrayField
from django.db.models.fields import CharField
out_format = ArrayField(CharField(max_length=200))
Finally make an ArrayAgg expression
from django.db.models import F
annotate = {'2-fields': ArrayAgg(Array(F('field1'), F('field2'), output_field=out_format), distinct=True) }
model.objects.all().annotate(**annotate)
(Optional) If field1 or field2 are not CharFields, you may include Cast as an argument of Array
from django.db.models.functions import Cast
annotate = {'2-fields': ArrayAgg(Array(Cast(F('field1'), output_field=CharField(max_length=200)), F('field2'), output_field=out_format), distinct=True) }
Having done a bit more research I guess one could add the missing tuple functionality as follows:
Create a new model field type named TupleField. The implementation might look kind of similar to django.contrib.postgres.fields.ArrayField. TupleField would be rather awkward because I don't think any RDBMS allows for composite types to be used as column types so usage of TupleField would be limited to (possibly intermediate?) query results.
Create a new subclass of django.db.models.Expression which wraps multiple expressions on its own (like Func in general, so looking at Func's implementation might be worthwile) and evaluates to a TupleField. Name this subclass TupleExpression for example.
Then I could simply annotate with ArrayAgg(TupleExpression('slot_number', 'port_number', 'many', 'more', 'columns'), ordering=('slot_number', 'port_number')) to solve my original problem. This would annotate each switch_ip with correctly-ordered arrays of tuples where each tuple represents one switch port.

Alternative nullif in Django ORM

Use Postgres as db and Django 1.9
I have some model with field 'price'. 'Price' blank=True.
On ListView, I get query set. Next, I want to sort by price with price=0 at end.
How I can write in SQL it:
'ORDER BY NULLIF('price', 0) NULLS LAST'
How write it on Django ORM? Or on rawsql?
Ok. I found alternative. Write own NullIf with django func.
from django.db.models import Func
class NullIf(Func):
template = 'NULLIF(%(expressions)s, 0)'
And use it for queryset:
queryset.annotate(new_price=NullIf('price')).order_by('new_price')
Edit : Django 2.2 and above have this implemented out of the box. The equivalent code will be
from django.db.models.functions import NullIf
from django.db.models import Value
queryset.annotate(new_price=NullIf('price', Value(0)).order_by('new_price')
You can still ORDER BY PRICE NULLS LAST if in your select you select the price as SELECT NULLIF('price', 0). That way you get the ordering you want, but the data is returned in the way you want. In django ORM you would select the price with annotate eg TableName.objects.annotate(price=NullIf('price', 0) and for the order by NULLS LAST and for the order by I'd follow the recommendations here Django: Adding "NULLS LAST" to query
Otherwise you could also ORDER BY NULLIF('price', 0) DESC but that will reorder the other numeric values. You can also obviously exclude null prices from the query entirely if you don't require them.

Django Postgres ArrayField aggregation and filtering

Following on from this question: Django Postgresql ArrayField aggregation
I have an ArrayField of Categories and I would like to retrieve all unique values it has - however the results should be filtered so that only values starting with the supplied string are returned.
What's the "most Django" way of doing this?
Given an Animal model that looks like this:
class Animal(models.Model):
# ...
categories = ArrayField(
models.CharField(max_length=255, blank=True),
default=list,
)
# ...
Then, as per the other question's answer, this works for finding all categories, unfiltered.
all_categories = (
Animal.objects
.annotate(categories_element=Func(F('categories'), function='unnest'))
.values_list('categories_element', flat=True)
.distinct()
)
However, now, when I attempt to filter the result I get failure, not just with __startswith but all types of filter:
all_categories.filter(categories_element__startswith('ga'))
all_categories.filter(categories_element='dog')
Bottom of stacktrace is:
DataError: malformed array literal: "dog"
...
DETAIL: Array value must start with "{" or dimension information.
... and it appears that it's because Django tries to do a second UNNEST - this is the SQL it generates:
...) WHERE unnest("animal"."categories") = dog::text[]
If I write the query in PSQL then it appears to require a subquery as a result of the UNNEST:
SELECT categories_element
FROM (
SELECT UNNEST(animal.categories) as categories_element
) ul
WHERE ul.categories_element like 'Ga%';
Is there a way to get Django ORM to make a working query? Or should I just give up on the ORM and use raw SQL?
You probably have the wrong database design.
Tip: Arrays are not sets; searching for specific array elements can be
a sign of database misdesign. Consider using a separate table with a
row for each item that would be an array element. This will be easier
to search, and is likely to scale better for a large number of
elements.
http://www.postgresql.org/docs/9.1/static/arrays.html