See Filtering unique values for the problem description, sample data and postgres query. I'd like to convert the SQL to a queryset. I feel like I'm close but not quite.
SELECT Column_A, Column_B, Column_C, 0 as RN
FROM TABLE
WHERE COLUMN_C is null and Column_B in (UserA, UserB, UserC)
UNION ALL
SELECT Column_A, Column_B, Column_C, RN
FROM (
SELECT A.*, ROW_NUMBER() over (partition by A.column_C Order by case A.column_B when 'UserA' then 0 else 1 end, U.Time_Created) rn
FROM Table A
INNER JOIN user U
on U.Column_B = A.Column_B
WHERE A.Column_C is not null and ColumnB in (userA, userB, UserC)) B
WHERE RN = 1
This is what I have so far:
qs1 = Table.objects.filter(Column_C__isnull=True).annotate(rn=Value(0))
qs2 = Table.objects.annotate(rn=Window(
expression=RowNumber(),
partition_by=[Column_C],
order_by=[Case(When(Column_B=UserA, then=0), default=1), 'Table_for_Column_B__time_created']
)).filter(Column_C__isnull=False, rn=1)
return qs2.union(qs1)
This doesn't quite work.
django.db.utils.NotSupportedError: Window is disallowed in the filter clause.
Next, I tried pulling the intermediate result in a subquery, to allow for filtering in the outer query, since I only really need rows with row number = 1.
qs1 = Table.objects.filter(Column_C__isnull=True).annotate(rn=Value(0))
qs2 = Table.objects.annotate(rn=Window(
expression=RowNumber(),
partition_by=[Column_C],
order_by=[Case(When(Column_B=UserA, then=0), default=1), 'Table_for_Column_B__time_created']
)).filter(pk=OuterRef('pk'))
qs3 = Table.objects.annotate(rn=Subquery(qs2.values('rn'))).filter(Column_C__isnull=False, rn=1)
return qs3.union(q1)
No exceptions this time, but this doesn't work. Every row in the table gets row_number=1 annotated. From the original example, the queryset returns all 7 rows instead of filtering to 5.
Is it possible to filter on window expressions?
What's the best practices to keep in mind when converting window queries to subqueries?
Is there a better way to structure the queryset?
You should be able to do this without a window expression, using a SubQuery
First create a queryset for the subquery that orders by the Column_B=UserA match and then time_created
from django.db.models import Case, When, Q, Subquery, OuterRef
tables_ordered = Table.objects.filter(
Column_C=OuterRef('Column_C')
).annotate(
user_match=Case(When(Column_B=UserA, then=0), default=1)
).order_by('user_match', 'time_created')
Then this subquery returns the first pk for the matched Column_C from the OuterRef, similar to selecting the first row from your window function
first_pk_for_each_column_c = Subquery(tables_ordered.values('pk')[:1])
Then use two Q objects to create an OR that selects the row if Column_C is NULL or the pk matches the first pk from the subquery
Table.objects.filter(
Q(Column_C__isnull=True) | Q(pk=first_pk_for_each_column_c)
)
Related
I am trying to figure out how to reproduce the following using Django - anyone help?
INSERT INTO table1 (table2_id, a_field)
SELECT table2.id as table2_id, table3.a_field
FROM table2
INNER JOIN table3 ON
table3.table2_id == table2.id
WHERE table2.id = 123
If I've got this correct (not my original query ;-) ), this is doing the following:
Creating an entry in table1 where...
a field named table2_id will match the id of a row in table2 and
a field named a_field will match the same named field in a_field in a row of table3 and
the table2/table3 objects from which these values are read are identified by a shared table2.id/table3.table_id2 relationship and also the table2 id being 123.
I don't see how such "calculated" field values can be passed to a create() or get_or_create() style command. It this perhaps possible using Q() objects?
Django model is an ORM framework.
In ORM way, you need
get table2Entity
construct a new table1Entity with table2Entity and related table3Entity values
save the table1Entity
def batch_save_entity2():
entity2 = Table2Entity.objects.get('123')
entity1 = Table1Entity()
entity1.table2_id = entity2.id
entity1.a_field = entity2.entity3.a_field
entity1.save()
or just execute sql directly without ORM
from django.db import connection
def my_custom_sql(self):
with connection.cursor() as cursor:
cursor.execute('''
INSERT INTO table1 (table2_id, a_field)
SELECT table2.id as table2_id, table3.a_field
FROM table2
INNER JOIN table3 ON
table3.table2_id == table2.id
WHERE table2.id = 123''')
I ran into a surprising conundrum with Window Functions on Filtered QuerySets.
Consider two models: mymodel and relatedmodel where there is a one to many relationship (i.e. relatedmodel has a ForeignKey into mymodel).
I am using something like this:
window_lag = Window(expression=Lag("pk"), order_by=order_by)
window_lead = Window(expression=Lead("pk"), order_by=order_by)
window_rownnum = Window(expression=RowNumber(), order_by=order_by)
qs1 = mymodel.objects.filter(relatedmodel__field=XXX)
qs2 = qs1.annotate(row=window_rownnum, prior=window_lag, next=window_lead)
qs3 = qs2.filter(pk=myid)
which returns a lovely result for the object pk=myid. I now know its position in the filtered list and its prior and next and I use this to great effect in browsing filtered lists.
Significantly len(qs1) = len(qs2) is the size of the list and len(qs3)=1
Alas I just discovered this breaks badly when the filter is less specific:
window_lag = Window(expression=Lag("pk"), order_by=order_by)
window_lead = Window(expression=Lead("pk"), order_by=order_by)
window_rownnum = Window(expression=RowNumber(), order_by=order_by)
qs1 = mymodel.objects.filter(relatedmodel__field__contains=X)
qs2 = qs1.annotate(row=window_rownnum, prior=window_lag, next=window_lead)
qs3 = qs2.filter(pk=myid)
In this instance, qs2 suddenly has more rows in it that qs1! And len(qs2)>len(qs1).
Which totally breaks the browser in a sense (as prior and next are not reliable any more). The extra rows are duplicate mymodel objects, wherever more than one relatedmodel object matches the criterion.
I've traced this to the generated SQL.
This is the form of qs1's SQL:
SELECT DISTINCT
"mymodel"."id", "mymodel"."order_by" ....
FROM "mymodel"
INNER JOIN "relatedmodel" ON ("mymodel"."id" = "relatedmodel"."mymodel_id")
WHERE ("related_model"."field"::text LIKE '%X%')
ORDER BY "mymodel"."order_by" ASC
And this runs fine in my database engine as a SQL query, and returns the same number of rows that Django sees. All good.
Then the SQL that qs2 produces resembles:
SELECT DISTINCT
ROW_NUMBER() OVER (ORDER BY "mymodel"."order_by" ASC) AS "row",
LAG("mymodel"."id", 1) OVER (ORDER BY "mymodel"."order_by" ASC) AS "prior,
LEAD("mymodel"."id", 1) OVER (ORDER BY "mymodel"."order_by" ASC) AS "next",
"mymodel"."id", "mymodel"."order_by" ....
FROM "mymodel"
INNER JOIN "relatedmodel" ON ("mymodel"."id" = "relatedmodel"."mymodel_id")
WHERE ("related_model"."field"::text LIKE '%X%')
ORDER BY "mymodel"."order_by" ASC
And again this produces the same number of rows as I see in Django but that's more than qs1 when the relatedmodel matches more than once.
I can doctor the SQL and get what I want, namely by windowing after filtering:
SELECT
ROW_NUMBER() OVER (ORDER BY "mymodel"."order_by" ASC) AS "row"
LAG("mymodel"."id", 1) OVER (ORDER BY "mymodel"."order_by" ASC) AS "prior,
LEAD("mymodel"."id", 1) OVER (ORDER BY "mymodel"."order_by" ASC) AS "next",
"id", "order_by" ....
FROM (
SELECT DISTINCT
"mymodel"."id", "mymodel"."order_by" ....
FROM "mymodel"
INNER JOIN "relatedmodel" ON ("mymodel"."id" = "relatedmodel"."mymodel_id")
WHERE ("related_model"."field"::text LIKE '%X%')
ORDER BY "mymodel"."order_by" ASC
) AS QS
Which works beautifully and returns the same number of rows as qs1 again.
Adding just one window function inside the SELECT causes the DISTINCT to fail for some reason. DISTINCT works fine without a window function (returning only unique mymodel rows) but adding a window function breaks this.
Using the filter as a subquery on the window function works.
And Django supports subqueries, but I can't find a way to apply them here.
So I wonder if there is a way to do this. Namely, to apply the annotation as a wrapper around the QuerySet rather than as additional columns in the queryset.
With model relations like this
Risk <-- RiskGroup --> Group
I am trying to achieve a query resembling this
SELECT risk.id,
(
SELECT array_agg(bg.name) as names
FROM hazards_riskgroup rgroup
JOIN (SELECT * FROM base_group ORDER BY name) as bg
on rgroup.group_id = bg.id
WHERE risk_id = risk.id
GROUP BY risk_id
ORDER BY risk_id
) AS conf
FROM hazards_risk risk
ORDER BY conf NULLS LAST
;
And I have gotten as far as
group_names = (
RiskGroup.objects
.filter(risk_id=OuterRef('pk'))
.annotate(names=ArrayAgg('group__name'))
.order_by()
.order_by("group__name")
.values('names')
)
qs = Risk.objects.annotate(
conf=Subquery(group_names)
).order_by(F('conf').asc(nulls_last=True))
But that wil produce something along the lines of this query
SELECT risk.id,
(SELECT ARRAY_AGG(U2."name") AS "names"
FROM "hazards_riskgroup" U0
INNER JOIN "base_group" U2 ON (U0."group_id" = U2."id")
WHERE U0."risk_id" = (risk.id)
GROUP BY U0."id", U2."name"
ORDER BY U2."name" ASC) AS "conf"
FROM hazards_risk risk
ORDER BY conf NULLS LAST
Notice that the generated GROUP BY becomes GROUP BY U0."id", U2."name", where I want just GROUP BY U2."name".
Best I've been able to gather, is that it is related to some default ordering on models, but as you can tell, I've already accounted for that by inserting .order_by().
So I'm a little lost. I tried annotating with Raw sql instead, but in that case, I can't seem to be able to reference risk.id which is important to get the right group names for each risk.
I have a model named Brand with a field named "name", so this works fine:
brands = (Brand
.objects
.annotate(as_string=models.functions.Concat('name',models.Value("''"))))
I have a model named Item with a foreign key to Brand. The following does NOT work:
annotated_brands = brands.filter(pk=models.OuterRef('brand'))
(Item
.objects
.annotate(brand_string=models.Subquery(annotated_brands.values('as_string'))))
Specifically, I get a ProgrammingError:
missing FROM-clause entry for table "policeinventory_item"
LINE 1: ... FROM "PoliceInventory_brand" U0 WHERE U0."id" = (PoliceInve...
This is born out when I inspect the SQL query. Here is the broken query:
'SELECT "PoliceInventory_item"."id", "PoliceInventory_item"."_created", "PoliceInventory_item"."_created_by_id", "PoliceInventory_item"."_last_updated", "PoliceInventory_item"."_last_updated_by_id", "PoliceInventory_item"."brand_id", "PoliceInventory_item"."type", (SELECT CONCAT(U0."name", \'\') AS "as_string" FROM "PoliceInventory_brand" U0 WHERE U0."id" = (PoliceInventory_item."brand_id") ORDER BY U0."_created" DESC) AS "brand_string" FROM "PoliceInventory_item" ORDER BY "PoliceInventory_item"."_created" DESC'
Note how the nested id comparison is performed with PoliceInventory, NOT "PoliceInventory" as it is referred to everywhere else. The following query works as expected:
'SELECT "PoliceInventory_item"."id", "PoliceInventory_item"."_created", "PoliceInventory_item"."_created_by_id", "PoliceInventory_item"."_last_updated", "PoliceInventory_item"."_last_updated_by_id", "PoliceInventory_item"."brand_id", "PoliceInventory_item"."type", (SELECT CONCAT(U0."name", \'\') AS "as_string" FROM "PoliceInventory_brand" U0 WHERE U0."id" = ("PoliceInventory_item"."brand_id") ORDER BY U0."_created" DESC) AS "brand_string" FROM "PoliceInventory_item" ORDER BY "PoliceInventory_item"."_created" DESC'
The problem seems pretty clearly to be OuterRef incorrectly failing to use the same table reference as the outer query, resulting in a mismatch. Does anyone know how I can persuade OuterRef to behave correctly?
I want to get a list of the latest purchase of each customer, sorted by the date.
The following query does what I want except for the date:
(Purchase.objects
.all()
.distinct('customer')
.order_by('customer', '-date'))
It produces a query like:
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
I am forced to use customer_id as the first ORDER BY expression because of DISTINCT ON.
I want to sort by the date, so what the query I really need should look like this:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
)
AS result
ORDER BY date DESC;
I don't want to sort using python because I still got to page limit the query. There can be tens of thousands of rows in the database.
In fact it is currently sorted by in python now and is causing very long page load times, so that's why I'm trying to fix this.
Basically I want something like this https://stackoverflow.com/a/9796104/242969. Is it possible to express it with django querysets instead of writing raw SQL?
The actual models and methods are several pages long, but here is the set of models required for the queryset above.
class Customer(models.Model):
user = models.OneToOneField(User)
class Purchase(models.Model):
customer = models.ForeignKey(Customer)
date = models.DateField(auto_now_add=True)
item = models.CharField(max_length=255)
If I have data like:
Customer A -
Purchase(item=Chair, date=January),
Purchase(item=Table, date=February)
Customer B -
Purchase(item=Speakers, date=January),
Purchase(item=Monitor, date=May)
Customer C -
Purchase(item=Laptop, date=March),
Purchase(item=Printer, date=April)
I want to be able to extract the following:
Purchase(item=Monitor, date=May)
Purchase(item=Printer, date=April)
Purchase(item=Table, date=February)
There is at most one purchase in the list per customer. The purchase is each customer's latest. It is sorted by latest date.
This query will be able to extract that:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
)
AS result
ORDER BY date DESC;
I'm trying to find a way not to have to use raw SQL to achieve this result.
This may not be exactly what you're looking for, but it might get you closer. Take a look at Django's annotate.
Here is an example of something that may help:
from django.db.models import Max
Customer.objects.all().annotate(most_recent_purchase=Max('purchase__date'))
This will give you a list of your customer models each one of which will have a new attribute called "most_recent_purchase" and will contain the date on which they made their last purchase. The sql produced looks like this:
SELECT "demo_customer"."id",
"demo_customer"."user_id",
MAX("demo_purchase"."date") AS "most_recent_purchase"
FROM "demo_customer"
LEFT OUTER JOIN "demo_purchase" ON ("demo_customer"."id" = "demo_purchase"."customer_id")
GROUP BY "demo_customer"."id",
"demo_customer"."user_id"
Another option, would be adding a property to your customer model that would look something like this:
#property
def latest_purchase(self):
return self.purchase_set.order_by('-date')[0]
You would obviously need to handle the case where there aren't any purchases in this property, and this would potentially not perform very well (since you would be running one query for each customer to get their latest purchase).
I've used both of these techniques in the past and they've both worked fine in different situations. I hope this helps. Best of luck!
Whenever there is a difficult query to write using Django ORM, I first try the query in psql(or whatever client you use). The SQL that you want is not this:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id" "shop_purchase.id" "shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC, "shop_purchase.date" DESC;
) AS result
ORDER BY date DESC;
In the above SQL, the inner SQL is looking for distinct on a combination of (customer_id, id, and date) and since id will be unique for all, you will get all records from the table. I am assuming id is the primary key as per convention.
If you need to find the last purchase of every customer, you need to do something like:
SELECT "shop_purchase.customer_id", max("shop_purchase.date")
FROM shop_purchase
GROUP BY 1
But the problem with the above query is that it will give you only the customer name and date. Using that will not help you in finding the records when you use these results in a subquery.
To use IN you need a list of unique parameters to identify a record, e.g., id
If in your records id is a serial key, then you can leverage the fact that the latest date will be the maximum id as well. So your SQL becomes:
SELECT max("shop_purchase.id")
FROM shop_purchase
GROUP BY "shop_purchase.customer_id";
Note that I kept only one field (id) in the selected clause to use it in a subquery using IN.
The complete SQL will now be:
SELECT *
FROM shop_customer
WHERE "shop_customer.id" IN
(SELECT max("shop_purchase.id")
FROM shop_purchase
GROUP BY "shop_purchase.customer_id");
and using the Django ORM it looks like:
(Purchase.objects.filter(
id__in=Purchase.objects
.values('customer_id')
.annotate(latest=Max('id'))
.values_list('latest', flat=True)))
Hope it helps!
I have a similar situation and this is how I'm planning to go about it:
query = Purchase.objects.distinct('customer').order_by('customer').query
query = 'SELECT * FROM ({}) AS result ORDER BY sent DESC'.format(query)
return Purchase.objects.raw(query)
Upside it gives me the query I want. Downside is that it is raw query and I can't append any other queryset filters.
This is my approach if I need some subset of data (N items) along with the Django query. This is example using PostgreSQL and handy json_build_object() function (Postgres 9.4+), but same way you can use other aggregate function in other database system. For older PostgreSQL versions you can use combination of array_agg() and array_to_string() functions.
Imagine you have Article and Comment models and along with every article in the list you want to select 3 recent comments (change LIMIT 3 to adjust size of subset or ORDER BY c.id DESC to change sorting of subset).
qs = Article.objects.all()
qs = qs.extra(select = {
'recent_comments': """
SELECT
json_build_object('comments',
array_agg(
json_build_object('id', id, 'user_id', user_id, 'body', body)
)
)
FROM (
SELECT
c.id,
c.user_id,
c.body
FROM app_comment c
WHERE c.article_id = app_article.id
ORDER BY c.id DESC
LIMIT 3
) sub
"""
})
for article in qs:
print(article.recent_comments)
# Output:
# {u'comments': [{u'user_id': 1, u'id': 3, u'body': u'foo'}, {u'user_id': 1, u'id': 2, u'body': u'bar'}, {u'user_id': 1, u'id': 1, u'body': u'joe'}]}
# ....