Django ORM: is it possible to inject subqueries? - django

I have a Django model that looks something like this:
class Result(models.Model):
    date = models.DateTimeField()
    subject = models.ForeignKey('myapp.Subject')
    test_type = models.ForeignKey('myapp.TestType')
    summary = models.PositiveSmallIntegerField()
    # more fields about the result, such as its location, tester ID and so on
Sometimes we want to retrieve all the test results, other times we only want the most recent result of a particular test type for each subject. This answer has some great options for SQL that will find the most recent result.
Also, we sometimes want to bucket the results into different chunks of time so that we can graph the number of results per day / week / month.
We also want to filter on various fields, and for elegance I'd like a QuerySet that I can then make all the filter() calls on, and annotate for the counts, rather than making raw SQL calls.
I have got this far:
qs = Result.objects.extra(select={
    'date_range': "date_trunc('{0}', time)".format("day"),  # chunking into time buckets
    'rn': "ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)"})
qs = qs.values('date_range', 'result_summary', 'rn')
qs = qs.order_by('-date_range')
which results in the following SQL:
SELECT (ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)) AS "rn", (date_trunc('day', time)) AS "date_range", "myapp_result"."result_summary" FROM "myapp_result" ORDER BY "date_range" DESC
which is kind of approaching what I'd like, but now I need to somehow filter to only get the rows where rn = 1. I tried using the 'where' field in extra(), which gives me the following SQL and error:
SELECT (ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)) AS "rn", (date_trunc('day', time)) AS "date_range", "myapp_result"."result_summary" FROM "myapp_result" WHERE "rn"=1 ORDER BY "date_range" DESC ;
ERROR: column "rn" does not exist
So I think the query that finds "rn" needs to be a subquery - but is it possible to do that somehow, perhaps using extra()?
I know I could do this with raw SQL but it just looks ugly! I'd love to find a nice neat way where I have a filterable QuerySet.
I guess the other option is to have a field in the model that indicates whether it is actually the most recent result of that test type for that subject...

I've found a way!
qs = Result.objects.extra(where=["NOT EXISTS(SELECT * FROM myapp_result AS T2 WHERE (T2.test_type_id = myapp_result.test_type_id AND T2.subject_id = myapp_result.subject_id AND T2.time > myapp_result.time))"])
This is based on a different option from the answer I referenced earlier. I can filter or annotate qs with whatever I want.
As an aside, on the way to this solution I tried this:
qq = Result.objects.extra(where=["NOT EXISTS(SELECT * FROM myapp_result AS T2 WHERE (T2.test_type_id = myapp_result.test_type_id AND T2.subject_id = myapp_result.subject_id AND T2.time > myapp_result.time))"])
qs = Result.objects.filter(id__in=qq)
Django embeds the subquery just as you want it to:
SELECT ...some fields... FROM "myapp_result"
WHERE ("myapp_result"."id" IN (SELECT "myapp_result"."id" FROM "myapp_result"
WHERE (NOT EXISTS(SELECT * FROM myapp_result as T2
WHERE (T2.subject_id = myapp_result.subject_id AND T2.test_type_id = myapp_result.test_type_id AND T2.time > myapp_result.time)))))
I realised this had more subqueries than I need, but I note it here as I can imagine it being useful to know that you can filter one queryset with another and Django does exactly what you'd hope for in terms of embedding the subquery (rather than, say, executing it and embedding the returned values, which would be horrid.)
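As a footnote to the approach above: the rn = 1 filter the asker wanted does work in plain SQL once the window expression is wrapped in a subquery, which is exactly what extra() cannot express. A minimal sqlite3 sketch with invented data (window functions need SQLite ≥ 3.25):

```python
import sqlite3

# Invented minimal table standing in for myapp_result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myapp_result (
    id INTEGER PRIMARY KEY, subject_id INT, test_type_id INT,
    time TEXT, summary INT);
INSERT INTO myapp_result (subject_id, test_type_id, time, summary) VALUES
    (1, 1, '2020-01-01', 10),
    (1, 1, '2020-01-02', 20),   -- newer result for the same subject/test
    (2, 1, '2020-01-01', 30);
""")
# Wrap the window expression in a subquery so the outer WHERE can see rn.
rows = conn.execute("""
SELECT subject_id, test_type_id, summary FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY subject_id, test_type_id
                                 ORDER BY time DESC) AS rn
    FROM myapp_result) sub
WHERE rn = 1
ORDER BY subject_id
""").fetchall()
print(rows)  # [(1, 1, 20), (2, 1, 30)]
```

On Django 2.0+ the Window expression exists in the ORM, but it still cannot be referenced in filter(), so the NOT EXISTS trick above remains the practical ORM-only route.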

Related

Django ORM: Select items where latest status is `success`

I have this model.
class Item(models.Model):
    name = models.CharField(max_length=128)
An Item gets transferred several times. A transfer can be successful or not.
class TransferLog(models.Model):
    item = models.ForeignKey(Item)
    timestamp = models.DateTimeField()
    success = models.BooleanField(default=False)
How can I query for all Items whose latest TransferLog was successful?
With "latest" I mean ordered by timestamp.
TransferLog Table
Here is a data sample. Here item1 should not be included, since the last transfer was not successful:
ID | item_id | timestamp           | success
--------------------------------------------
 1 | item1   | 2014-11-15 12:00:00 | False
 2 | item1   | 2014-11-15 14:00:00 | True
 3 | item1   | 2014-11-15 16:00:00 | False
I know how to solve this with a loop in python, but I would like to do the query in the database.
An efficient trick is possible if the timestamps in the log are increasing, that is, the end of a transfer is logged as the timestamp (not the start), or if you can at least expect that an older transfer ended before a newer one started. Then you can use the TransferLog row with the highest id instead of the one with the highest timestamp.
from django.db.models import Max
qs = TransferLog.objects.filter(
    id__in=TransferLog.objects.values('item')
        .annotate(max_id=Max('id')).values('max_id'),
    success=True)
It groups by item_id in the subquery and passes the highest id of every group to the main query, which filters on the success of that latest row.
You can see that Django compiles it directly into a single, near-optimal query.
Verified how it compiles to SQL: print(qs.query.get_compiler('default').as_sql())
SELECT L.id, L.item_id, L.timestamp, L.success FROM app_transferlog L
WHERE L.success = true AND L.id IN
( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0 GROUP BY U0.item_id )
(I edited the compiled SQL above for readability: replaced the many "app_transferlog"."field" references with a short alias L.field, substituted the True parameter directly into the SQL, and adjusted whitespace and parentheses.)
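The MAX(id) grouping trick is easy to sanity-check outside Django. A small sqlite3 sketch (simplified table and invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transferlog (id INTEGER PRIMARY KEY, item_id TEXT, success INT);
INSERT INTO transferlog (item_id, success) VALUES
    ('item1', 0), ('item1', 1), ('item1', 0),  -- latest attempt failed
    ('item2', 0), ('item2', 1);                -- latest attempt succeeded
""")
# The highest id per item stands in for "latest", then filter on success.
rows = conn.execute("""
SELECT item_id FROM transferlog
WHERE success = 1 AND id IN
    (SELECT MAX(id) FROM transferlog GROUP BY item_id)
""").fetchall()
print(rows)  # [('item2',)]
```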
It can be improved by adding some example filter and by selecting the related Item in the same query:
kwargs = {} # e.g. filter: kwargs = {'timestamp__gte': ..., 'timestamp__lt':...}
qs = TransferLog.objects.filter(
id__in=TransferLog.objects.filter(**kwargs).values('item')
.annotate(max_id=Max('id')).values('max_id'),
success=True).select_related('item')
Verified how it compiles to SQL: print(qs.query.get_compiler('default').as_sql()[0])
SELECT L.id, L.item_id, L.timestamp, L.success, I.id, I.name
FROM app_transferlog L INNER JOIN app_item I ON ( L.item_id = I.id )
WHERE L.success = %s AND L.id IN
( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0
WHERE U0.timestamp >= %s AND U0.timestamp < %s
GROUP BY U0.item_id )
print(qs.query.get_compiler('default').as_sql()[1])
# result
(True, <timestamp_start>, <timestamp_end>)
Useful fields of the latest TransferLog and the related Item are fetched in one query:
for logitem in qs:
    item = logitem.item  # the item is still cached on the logitem
    ...
The query can be optimized further depending on circumstances, e.g. if you are no longer interested in the timestamp and you work with big data...
Without the assumption of increasing timestamps it really is more complicated than plain Django ORM. My solutions can be found here.
EDIT after it has been accepted:
An exact solution for a non-increasing dataset is possible with two queries:
Get the set of ids of the last failed transfers. (I used the fail list because it is much smaller than the list of successful transfers.)
Iterate over the list of all last transfers, excluding items found in the list of failed transfers.
This way you can efficiently simulate queries that would otherwise require custom SQL:
SELECT * FROM (
    SELECT a_boolean_field_or_expression,
           rank() OVER (PARTITION BY parent_id ORDER BY the_maximized_field DESC) AS rnk
    FROM ...
) sub
WHERE rnk = 1
I'm thinking about implementing an aggregation function (e.g. Rank(maximized_field)) as an extension for Django with PostgreSQL, as it would be useful.
try this
# your query
items_with_good_translogs = Item.objects.filter(
    id__in=(x.item.id for x in TransferLog.objects.filter(success=True)))
since you said "How can I query for all Items whose latest TransferLog was successful?", it is logically easy to follow if you start the query with the Item model.
I used the Q Object which can be useful in places like this. (negation, or, ...)
(x.item.id for x in TransferLog.objects.filter(success=True))
gives the ids of the Items that have a TransferLog with success=True.
You will probably have an easier time approaching this from the TransferLog side, thusly:
dataset = TransferLog.objects.order_by('item', '-timestamp').distinct('item')
Sadly that does not weed out the False items, and I can't find a way to apply the filter AFTER the distinct. You can, however, filter it after the fact with a Python list comprehension:
dataset = [d.item for d in dataset if d.success]
If you are doing this for logs within a given time period, it's best to filter that before ordering and distinct-ing:
dataset = TransferLog.objects.filter(
    timestamp__gt=start,
    timestamp__lt=end
).order_by(
    'item', '-timestamp'
).distinct('item')
If you can modify your models, I actually think you'll have an easier time using ManyToMany relationship instead of explicit ForeignKey -- Django has some built-in convenience methods that will make your querying easier. Docs on ManyToMany are here. I suggest the following model:
class TransferLog(models.Model):
    item = models.ManyToManyField(Item)
    timestamp = models.DateTimeField()
    success = models.BooleanField(default=False)
Then you could do (I know, not a nice, single-line of code, but I'm trying to be explicit to be clearer):
results = []
for item in Item.objects.all():
    if item.transferlog_set.all().order_by('-timestamp')[0].success:
        results.append(item)
Then your results array will have all the items whose latest transfer was successful. I know, it's still a loop in Python...but perhaps a cleaner loop.

Django query aggregate upvotes in backward relation

I have two models:
class Base_Activity(models.Model):
    ...  # some fields

class User_Activity(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    activity = models.ForeignKey(Base_Activity)
    rating = models.IntegerField(default=0)  # will be -1, 0, or 1
Now I want to query Base_Activity, and sort the items that have the most corresponding user activities with rating=1 on top. I want to do something like the query below, but the =1 part is obviously not working.
activities = Base_Activity.objects.all().annotate(
    up_votes=Count('user_activity__rating'=1),
).order_by(
    'up_votes'
)
How can I solve this?
You cannot use Count like that, as the error message says:
SyntaxError: keyword can't be an expression
The argument of Count must be a simple string, like user_activity__rating.
I think a good alternative can be to use Avg and Count together:
activities = Base_Activity.objects.all().annotate(
    a=Avg('user_activity__rating'), c=Count('user_activity__rating')
).order_by(
    '-a', '-c'
)
The items with the most rating=1 activities should have the highest average, and among the users with the same average the ones with the most activities will be listed higher.
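The resulting order is easy to check with plain SQL; a sqlite3 sketch with invented data (note that the average dominates the sort, so an item with a perfect average from fewer votes sorts first):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE r (activity_id INT, rating INT);
INSERT INTO r VALUES (1, 1), (1, 1), (1, 0),  -- avg 0.67 over 3 votes
                     (2, 1), (2, 1);          -- avg 1.00 over 2 votes
""")
rows = conn.execute("""
SELECT activity_id, AVG(rating) AS a, COUNT(rating) AS c
FROM r GROUP BY activity_id ORDER BY a DESC, c DESC
""").fetchall()
print([row[0] for row in rows])  # [2, 1]
```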
If you want to exclude items that have downvotes, make sure to add the appropriate filter or exclude operations after annotate, for example:
activities = Base_Activity.objects.all().annotate(
    a=Avg('user_activity__rating'), c=Count('user_activity__rating')
).filter(user_activity__rating__gt=0).order_by(
    '-a', '-c'
)
UPDATE
To get all the items, ordered by their upvotes, disregarding downvotes, I think the only way is to use raw queries, like this:
from django.db import connection
sql = '''
SELECT o.id, SUM(v.rating > 0) s
FROM user_activity o
JOIN rating v ON o.id = v.user_activity_id
GROUP BY o.id ORDER BY s DESC
'''
cursor = connection.cursor()
cursor.execute(sql)
rows = cursor.fetchall()
Note: instead of hard-coding the table names of your models, get the table names from the models, for example if your model is called Rating, then you can get its table name with Rating._meta.db_table.
I tested this query on an SQLite database; I'm not sure the SUM expression works in all DBMSs. By the way, I had a perfect Django site to test on, where I also use upvotes and downvotes. I use a very similar model for counting them, but I order by the sum value, Stack Overflow style. The site is open-source, if you're interested.
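The SUM(v.rating > 0) trick can be sanity-checked with sqlite3 (invented tables and data; in SQLite a comparison evaluates to 0 or 1, which is what makes the SUM count upvotes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (id INTEGER PRIMARY KEY);
CREATE TABLE rating (activity_id INT, rating INT);
INSERT INTO activity (id) VALUES (1), (2);
INSERT INTO rating (activity_id, rating) VALUES
    (1, 1), (1, -1), (1, 1),   -- two upvotes, one downvote
    (2, 1), (2, 0);            -- one upvote
""")
# Downvotes are disregarded: only rating > 0 contributes to the sum.
rows = conn.execute("""
SELECT a.id, SUM(r.rating > 0) AS up_votes
FROM activity a JOIN rating r ON a.id = r.activity_id
GROUP BY a.id ORDER BY up_votes DESC
""").fetchall()
print(rows)  # [(1, 2), (2, 1)]
```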

Mutliple fields in the where clause of a QuerySet

In an e-shop application I have a model for storing the history of order processing:
class OrderStatusHistory(models.Model):
    order = models.ForeignKey(Order, related_name='status_history')
    status = models.IntegerField()
    date = models.DateTimeField(auto_now_add=True)
Now I want to get most recent status for each order. In pure SQL I can do this in a single query:
SELECT * FROM order_orderstatushistory
WHERE (order_id, date) IN
    (SELECT order_id, MAX(date) FROM order_orderstatushistory GROUP BY order_id)
What is the best way do this in Django?
I have two options. The first is:
# Get the most recent status change date for each order
v = Order.objects.annotate(Max('status_history__date')).values_list('id', 'status_history__date__max')
# Get OrderStatusHistory objects
hist = OrderStatusHistory.objects.extra(where=['(order_id, date) IN %s'], params=[tuple(v)])
And the second:
hist = OrderStatusHistory.objects.extra(where=['(order_id, date) IN (select order_id, max(date) from order_orderstatushistory group by order_id)'])
The first option is pure Django, but results in 2 database queries and a large list of parameters passed from Django to database engine.
The second option requires putting SQL code directly into my application, which I'd like to avoid.
Is there equivalent of where (order_id, date) in (select ...) in Django?
Have you tried using this query (note that passing field names to distinct() is supported only on PostgreSQL):
OrderStatusHistory.objects.order_by('order', '-date').distinct('order')
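The asker's row-value IN subquery is also portable beyond the ORM; a sqlite3 sketch with invented data (SQLite supports row values since 3.15):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hist (id INTEGER PRIMARY KEY, order_id INT, status INT, date TEXT);
INSERT INTO hist (order_id, status, date) VALUES
    (1, 10, '2020-01-01'),
    (1, 20, '2020-01-02'),   -- most recent status for order 1
    (2, 10, '2020-01-01');
""")
# (order_id, date) pairs are matched against each group's max date.
rows = conn.execute("""
SELECT order_id, status FROM hist
WHERE (order_id, date) IN
    (SELECT order_id, MAX(date) FROM hist GROUP BY order_id)
ORDER BY order_id
""").fetchall()
print(rows)  # [(1, 20), (2, 10)]
```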

chain filter and exclude on django model with field lookups that span relationships

I have the following models:
class Order_type(models.Model):
    description = models.CharField()

class Order(models.Model):
    type = models.ForeignKey(Order_type)
    order_date = models.DateField(default=datetime.date.today)
    status = models.CharField()
    processed_time = models.TimeField()
I want a list of the order types that have orders that meet this criteria: (order_date <= today AND processed_time is empty AND status is not blank)
I tried:
qs = Order_type.objects.filter(order__order_date__lte=datetime.date.today(),
                               order__processed_time__isnull=True).exclude(order__status='')
This works for the original list of orders:
orders_qs = Order.objects.filter(order_date__lte=datetime.date.today(), processed_time__isnull=True)
orders_qs = orders_qs.exclude(status='')
But qs isn't the right queryset. I think it's actually applying a narrower filter (since no records are returned), but I'm not sure what. According to this (Django reference), because I'm referencing a related model, the exclude() works on the original queryset (not the one from the filter()), but I don't get exactly how.
OK, I just thought of this, which I think works, but feels sloppy (Is there a better way?):
qs = Order_type.objects.filter(order__id__in=[o.id for o in orders_qs])
What's happening is that the exclude() query is messing things up for you. Basically, it's excluding any Order_type that has at least one Order without a status, which is almost certainly not what you want to happen.
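That pitfall is easy to reproduce in plain SQL. A sqlite3 sketch (invented schema): one order type with one open and one blank-status order survives a joined filter, but an exclude-style NOT EXISTS drops it entirely:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE otype (id INTEGER PRIMARY KEY);
CREATE TABLE orders (id INTEGER PRIMARY KEY, type_id INT, status TEXT);
INSERT INTO otype (id) VALUES (1);
-- one "real" order and one with a blank status
INSERT INTO orders (type_id, status) VALUES (1, 'open'), (1, '');
""")
# Joined filter: type 1 survives because SOME order has a non-blank status.
keep = conn.execute("""
SELECT DISTINCT t.id FROM otype t JOIN orders o ON o.type_id = t.id
WHERE o.status > ''""").fetchall()
# Exclude-style subquery: type 1 is dropped because SOME order is blank.
drop = conn.execute("""
SELECT t.id FROM otype t WHERE NOT EXISTS
    (SELECT 1 FROM orders o WHERE o.type_id = t.id AND o.status = '')""").fetchall()
print(keep, drop)  # [(1,)] []
```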
The simplest solution in your case is to use order__status__gt='' in your filter() arguments. However, you will also need to append distinct() to the end of your query, because otherwise you'd get a QuerySet with multiple instances of the same Order_type if it has more than one Order that matches the query. This should work:
qs = Order_type.objects.filter(
    order__order_date__lte=datetime.date.today(),
    order__processed_time__isnull=True,
    order__status__gt='').distinct()
On a side note, in the qs query you gave at the end of the question, you don't have to say order__id__in=[o.id for o in orders_qs], you can simply use order__in=orders_qs (you still also need the distinct()). So this will also work:
qs = Order_type.objects.filter(order__in=Order.objects.filter(
    order_date__lte=datetime.date.today(),
    processed_time__isnull=True).exclude(status='')).distinct()
Addendum (edit):
Here's the actual SQL that Django issues for the above querysets:
SELECT DISTINCT "testapp_order_type"."id", "testapp_order_type"."description"
FROM "testapp_order_type"
LEFT OUTER JOIN "testapp_order"
ON ("testapp_order_type"."id" = "testapp_order"."type_id")
WHERE ("testapp_order"."order_date" <= E'2010-07-18'
AND "testapp_order"."processed_time" IS NULL
AND "testapp_order"."status" > E'' );
SELECT DISTINCT "testapp_order_type"."id", "testapp_order_type"."description"
FROM "testapp_order_type"
INNER JOIN "testapp_order"
ON ("testapp_order_type"."id" = "testapp_order"."type_id")
WHERE "testapp_order"."id" IN
(SELECT U0."id" FROM "testapp_order" U0
WHERE (U0."order_date" <= E'2010-07-18'
AND U0."processed_time" IS NULL
AND NOT (U0."status" = E'' )));
EXPLAIN reveals that the second query is ever so slightly more expensive (cost of 28.99 versus 28.64 with a very small dataset).

Self join with django ORM

I have a model:
class Trades(models.Model):
    userid = models.PositiveIntegerField(null=True, db_index=True)
    positionid = models.PositiveIntegerField(db_index=True)
    tradeid = models.PositiveIntegerField(db_index=True)
    orderid = models.PositiveIntegerField(db_index=True)
    ...
and I want to execute next query:
SELECT *
FROM trades t1
INNER JOIN trades t2
    ON t2.tradeid = t1.positionid AND t1.tradeid = t2.positionid
can it be done without hacks using Django ORM?
Thx!
select * will take more work. It helps if you can trim back the columns you want from the right-hand side:
table = SomeModel._meta.db_table
join_column_1 = SomeModel._meta.get_field('field1').column
join_column_2 = SomeModel._meta.get_field('field2').column
join_queryset = SomeModel.objects.filter()
# Force evaluation of the query
querystr = join_queryset.query.__str__()
# Add promote=True and nullable=True for a left outer join
rh_alias = join_queryset.query.join((table, table, join_column_1, join_column_2))
# Add the second join condition and the extra columns
join_queryset = join_queryset.extra(
    select=dict(rhs_col1='%s.%s' % (rh_alias, join_column_2)),
    where=['%s.%s = %s.%s' % (table, join_column_2, rh_alias, join_column_1)])
Add any additional columns you want available to the select dict.
The additional constraints are put together in a WHERE after the ON (), which your SQL engine may optimize poorly.
I believe Django's ORM doesn't support doing a join on anything that isn't specified as a ForeignKey (at least, last time I looked into it, that was a limitation. They're always adding features though, so maybe it snuck in).
So your options are to either re-structure your tables so you can use proper foreign keys, or just do a raw SQL Query.
I wouldn't consider a raw SQL query a "hack". Django has good documentation on how to do raw SQL Queries.
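Since raw SQL is the practical route here, the self-join itself is easy to verify with sqlite3 (invented data); in Django it could then be run with something like Trades.objects.raw(...):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trades (id INTEGER PRIMARY KEY, tradeid INT, positionid INT);
INSERT INTO trades (tradeid, positionid) VALUES
    (100, 200),   -- this row and the next reference each other
    (200, 100),
    (300, 999);   -- no matching opposite row
""")
# Each row joins to the row whose tradeid/positionid mirror its own.
rows = conn.execute("""
SELECT t1.tradeid, t2.tradeid
FROM trades t1 INNER JOIN trades t2
    ON t2.tradeid = t1.positionid AND t1.tradeid = t2.positionid
ORDER BY t1.tradeid
""").fetchall()
print(rows)  # [(100, 200), (200, 100)]
```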