Raw query with rank over subquery / params not quoted - django

My Goal
I need PostgreSQL's rank() window function applied to an annotated queryset from Django's ORM. Django's sql query has to be a subquery in order to apply the window function and this is what I'm doing so far:
queryset = Item.objects.annotate(…)
queryset_with_rank = Items.objects.raw("""
select rank() over (order by points), *
from (%(subquery)s)""", { 'subquery': queryset.query }
)
The problem
Unfortunately, the query returned by queryset.query does not quote the parameters used for annotation correctly although the query itself is executed perfectly fine.
Example of returned query
The query returned by queryset_with_rank.query or queryset.query returns the following
"participation"."category" = )
"participation"."category" = amateur)
which I rather expected to be
"participation"."category" = '')
"participation"."category" = 'amateur')
Question
I noticed that the Django documentation states the following about Query.__str__()
Parameter values won't necessarily be quoted correctly, since that is done by the database interface at execution time.
As long as I fix the quotation manually and pass it to Postgres myself, everything works as expected. Is there a way to receive the needed subquery with correct quotation? Or is there an alternative and better approach to applying a window function to a Django ORM queryset altoghether?

As Django core developer Aymeric Augustin said, there's no way to get the exact query that is executed by the database backend beforehand.
I still managed to build the query the way I hoped to, although a bit cumbersome:
# Obtain query and parameters separately
query, params = item_queryset.query.sql_with_params()
# Put additional quotes around string. I guess this is what
# the database adapter does as well.
params = [
'\'{}\''.format(p)
if isinstance(p, basestring) else p
for p in params
]
# Cast list of parameters to tuple because I got
# "not enough format characters" otherwise. Dunno why.
params = tuple(params)
participations = Item.objects.raw("""
select *,
rank() over (order by points DESC) as rank
from ({subquery}
""".format(subquery=query.format(params)), []
)

Related

How to get and update Django object in one query?

To optimize a lot my database I would like to make as less as possible any query.
I'm trying to get an object, increment the field "count_limit" and make an If statement after on the Customer instance.
To achieve it I've made this query who worked well.
Customer.objects.filter(user=user).update(count_limit=F('count_limit') + 1)
So after this query, count_limit has been incremented by 1 as I wanted.
When I'm trying to get the Customer instance as a result of this query, it returns "1".
Is it possible to make both, update the instance and get it as a return object ?
Thanks a lot
The update() method will return the number of updated rows. If you are using Postgres, then you can use the returning clause with the raw query.
query = 'UPDATE customer SET count_limit=(customer.count_limit + 1) WHERE customer.user_id=%s returning *'
updated_obj = Customer.objects.raw(query, [user.id])
I don't know if this can be achieved by ORM, but suggestions will be appreciated.
Make sure that the table name in raw query is correct. If you haven't definer db_table in the meta class of your model, then by default it will be myapp_model.
And to prevent SQL injection, from the Docs:
Do not use string formatting on raw queries or quote placeholders in
your SQL strings!
Follow Docs on raw()
You are looking for F functions: https://docs.djangoproject.com/en/3.0/ref/models/expressions/#f-expressions
Example from their documentation how to increase a counter
from django.db.models import F
reporter = Reporters.objects.get(name='Tintin')
reporter.stories_filed = F('stories_filed') + 1
reporter.save()

How do you unit test Django RawQuerySets

I'm using raw() and annotate() on a Django query and need some way to test it.
Here's what my code looks like (this is simplified significantly)
query = """SELECT
table1.id, table1.column1, table1.column2,
table2.other_column1, table2.other_column2
FROM myapp_mymodel as table1
JOIN otherapp_othermodel as table2 ON table1.othermodel_id = table2.id"""
return MyModel.objects.annotate(
other_column1=models.Value('other_column1', models.IntegerField()),
other_column2=models.Value('other_column2', models.DateField())
).raw(query)
It's relatively straightforward to fill the database with sample data, but what's the best way to check that the data is returned by this code?
There are a lot of options when dealing with standard querysets that seem to go out the window when dealing with RawQuerySets.
Usually the approach is to have the test set up a relatively small data set that contains some things the query should find, and some things it shouldn't find. Then inspect the returned QuerySet and verify:
That it contains the expected number of results
That the set of primary key values returned matches what you expected to return
That the annotated values on the returned objects are what they're expected to be.
So, for example, you might do something like:
def test_custom_query(self):
# Put whatever code you need here to insert the test data
results = run_your_query_here()
self.assertEqual(results.count(), expected_number_of_results)
self.assertEqual({obj.pk for obj in results}, set_of_expected_primary_keys)
self.assertEqual(
[obj.annotated_value for obj in results],
list_of_expected_annotated_values
)

Django Postgres ArrayField aggregation and filtering

Following on from this question: Django Postgresql ArrayField aggregation
I have an ArrayField of Categories and I would like to retrieve all unique values it has - however the results should be filtered so that only values starting with the supplied string are returned.
What's the "most Django" way of doing this?
Given an Animal model that looks like this:
class Animal(models.Model):
# ...
categories = ArrayField(
models.CharField(max_length=255, blank=True),
default=list,
)
# ...
Then, as per the other question's answer, this works for finding all categories, unfiltered.
all_categories = (
Animal.objects
.annotate(categories_element=Func(F('categories'), function='unnest'))
.values_list('categories_element', flat=True)
.distinct()
)
However, now, when I attempt to filter the result I get failure, not just with __startswith but all types of filter:
all_categories.filter(categories_element__startswith('ga'))
all_categories.filter(categories_element='dog')
Bottom of stacktrace is:
DataError: malformed array literal: "dog"
...
DETAIL: Array value must start with "{" or dimension information.
... and it appears that it's because Django tries to do a second UNNEST - this is the SQL it generates:
...) WHERE unnest("animal"."categories") = dog::text[]
If I write the query in PSQL then it appears to require a subquery as a result of the UNNEST:
SELECT categories_element
FROM (
SELECT UNNEST(animal.categories) as categories_element
) ul
WHERE ul.categories_element like 'Ga%';
Is there a way to get Django ORM to make a working query? Or should I just give up on the ORM and use raw SQL?
You probably have the wrong database design.
Tip: Arrays are not sets; searching for specific array elements can be
a sign of database misdesign. Consider using a separate table with a
row for each item that would be an array element. This will be easier
to search, and is likely to scale better for a large number of
elements.
http://www.postgresql.org/docs/9.1/static/arrays.html

Annotating a Django queryset with a left outer join?

Say I have a model:
class Foo(models.Model):
...
and another model that basically gives per-user information about Foo:
class UserFoo(models.Model):
user = models.ForeignKey(User)
foo = models.ForeignKey(Foo)
...
class Meta:
unique_together = ("user", "foo")
I'd like to generate a queryset of Foos but annotated with the (optional) related UserFoo based on user=request.user.
So it's effectively a LEFT OUTER JOIN on (foo.id = userfoo.foo_id AND userfoo.user_id = ...)
A solution with raw might look like
foos = Foo.objects.raw("SELECT foo.* FROM foo LEFT OUTER JOIN userfoo ON (foo.id = userfoo.foo_id AND foo.user_id = %s)", [request.user.id])
You'll need to modify the SELECT to include extra fields from userfoo which will be annotated to the resulting Foo instances in the queryset.
This answer might not be exactly what you are looking for but since its the first result in google when searching for "django annotate outer join" so I will post it here.
Note: tested on Djang 1.7
Suppose you have the following models
class User(models.Model):
name = models.CharField()
class EarnedPoints(models.Model):
points = models.PositiveIntegerField()
user = models.ForeignKey(User)
To get total user points you might do something like that
User.objects.annotate(points=Sum("earned_points__points"))
this will work but it will not return users who have no points, here we need outer join without any direct hacks or raw sql
You can achieve that by doing this
users_with_points = User.objects.annotate(points=Sum("earned_points__points"))
result = users_with_points | User.objects.exclude(pk__in=users_with_points)
This will be translated into OUTER LEFT JOIN and all users will be returned. users who has no points will have None value in their points attribute.
Hope that helps
Notice: This method does not work in Django 1.6+. As explained in tcarobruce's comment below, the promote argument was removed as part of ticket #19849: ORM Cleanup.
Django doesn't provide an entirely built-in way to do this, but it's not neccessary to construct an entirely raw query. (This method doesn't work for selecting * from UserFoo, so I'm using .comment as an example field to include from UserFoo.)
The QuerySet.extra() method allows us to add terms to the SELECT and WHERE clauses of our query. We use this to include the fields from UserFoo table in our results, and limit our UserFoo matches to the current user.
results = Foo.objects.extra(
select={"user_comment": "UserFoo.comment"},
where=["(UserFoo.user_id IS NULL OR UserFoo.user_id = %s)"],
params=[request.user.id]
)
This query still needs the UserFoo table. It would be possible to use .extras(tables=...) to get an implicit INNER JOIN, but for an OUTER JOIN we need to modify the internal query object ourself.
connection = (
UserFoo._meta.db_table, User._meta.db_table, # JOIN these tables
"user_id", "id", # on these fields
)
results.query.join( # modify the query
connection, # with this table connection
promote=True, # as LEFT OUTER JOIN
)
We can now evaluate the results. Each instance will have a .user_comment property containing the value from UserFoo, or None if it doesn't exist.
print results[0].user_comment
(Credit to this blog post by Colin Copeland for showing me how to do OUTER JOINs.)
I stumbled upon this problem I was unable to solve without resorting to raw SQL, but I did not want to rewrite the entire query.
Following is a description on how you can augment a queryset with an external raw sql, without having to care about the actual query that generates the queryset.
Here's a typical scenario: You have a reddit like site with a LinkPost model and a UserPostVote mode, like this:
class LinkPost(models.Model):
some fields....
class UserPostVote(models.Model):
user = models.ForeignKey(User,related_name="post_votes")
post = models.ForeignKey(LinkPost,related_name="user_votes")
value = models.IntegerField(null=False, default=0)
where the userpostvote table collect's the votes of users on posts.
Now you're trying to display the front page for a user with a pagination app, but you want the arrows to be red for posts the user has voted on.
First you get the posts for the page:
post_list = LinkPost.objects.all()
paginator = Paginator(post_list,25)
posts_page = paginator.page(request.GET.get('page'))
so now you have a QuerySet posts_page generated by the django paginator that selects the posts to display. How do we now add the annotation of the user's vote on each post before rendering it in a template?
Here's where it get's tricky and I was unable to find a clean ORM solution. select_related won't allow you to only get votes corresponding to the logged in user and looping over the posts would do bunch queries instead of one and doing it all raw mean's we can't use the queryset from the pagination app.
So here's how I do it:
q1 = posts_page.object_list.query # The query object of the queryset
q1_alias = q1.get_initial_alias() # This forces the query object to generate it's sql
(q1str, q1param) = q1.sql_with_params() #This gets the sql for the query along with
#parameters, which are none in this example
we now have the query for the queryset, and just wrap it, alias and left outer join to it:
q2_augment = "SELECT B.value as uservote, A.*
from ("+q1str+") A LEFT OUTER JOIN reddit_userpostvote B
ON A.id = B.post_id AND B.user_id = %s"
q2param = (request.user.id,)
posts_augmented = LinkPost.objects.raw(q2_augment,q1param+q2param)
voila! Now we can access post.uservote for a post in the augmented queryset.
And we just hit the database with a single query.
The two queries you suggest are as good as you're going to get (without using raw()), this type of query isn't representable in the ORM at present time.
You could do this using simonw's django-queryset-transform to avoid hard-coding a raw SQL query - the code would look something like this:
def userfoo_retriever(qs):
userfoos = dict((i.pk, i) for i in UserFoo.objects.filter(foo__in=qs))
for i in qs:
i.userfoo = userfoos.get(i.pk, None)
for foo in Foo.objects.filter(…).tranform(userfoo_retriever):
print foo.userfoo
This approach has been quite successful for this need and to efficiently retrieve M2M values; your query count won't be quite as low but on certain databases (cough MySQL cough) doing two simpler queries can often be faster than one with complex JOINs and many of the cases where I've most needed it had additional complexity which would have been even harder to hack into an ORM expression.
As for outerjoins:
Once you have a queryset qs from foo that includes a reference to columns from userfoo, you can promote the inner join to an outer join with
qs.query.promote_joins(["userfoo"])
You shouldn't have to resort to extra or raw for this.
The following should work.
Foo.objects.filter(
Q(userfoo_set__user=request.user) |
Q(userfoo_set=None) # This forces the use of LOUTER JOIN.
).annotate(
comment=F('userfoo_set__comment'),
# ... annotate all the fields you'd like to see added here.
)
The only way I see to do this without using raw etc. is something like this:
Foo.objects.filter(
Q(userfoo_set__isnull=True)|Q(userfoo_set__isnull=False)
).annotate(bar=Case(
When(userfoo_set__user_id=request.user, then='userfoo_set__bar')
))
The double Q trick ensures that you get your left outer join.
Unfortunately you can't set your request.user condition in the filter() since it may filter out successful joins on UserFoo instances with the wrong user, hence filtering out rows of Foo that you wanted to keep (which is why you ideally want the condition in the ON join clause instead of in the WHERE clause).
Because you can't filter out the rows that have an unwanted user value, you have to select rows from UserFoo with a CASE.
Note also that one Foo may join to many UserFoo records, so you may want to consider some way to retrieve distinct Foos from the output.
maparent's comment put me on the right way:
from django.db.models.sql.datastructures import Join
for alias in qs.query.alias_map.values():
if isinstance(alias, Join):
alias.nullable = True
qs.query.promote_joins(qs.query.tables)

django's .extra(where= clauses are clobbered by table-renaming .filter(foo__in=... subselects

The short of it is, the table names of all queries that are inside a filter get renamed to u0, u1, ..., so my extra where clauses won't know what table to point to. I would love to not have to hand-make all the queries for every way I might subselect on this data, and my current workaround is to turn my extra'd queries into pk values_lists, but those are really slow and something of an abomination.
Here's what this all looks like. You can mostly ignore the details of what goes in the extra of this manager method, except the first sql line which points to products_product.id:
def by_status(self, *statii):
return self.extra(where=["""products_product.id IN
(SELECT recent.product_id
FROM (
SELECT product_id, MAX(start_date) AS latest
FROM products_productstatus
GROUP BY product_id
) AS recent
JOIN products_productstatus AS ps ON ps.product_id = recent.product_id
WHERE ps.start_date = recent.latest
AND ps.status IN (%s))""" % (', '.join([str(stat) for stat in statii]),)])
Which works wonderfully for all the situations involving only the products_product table.
When I want these products as a subselect, i do:
Piece.objects.filter(
product__in=Product.objects.filter(
pk__in=list(
Product.objects.by_status(FEATURED).values_list('id', flat=True))))
How can I keep the generalized abilities of a query set, yet still use an extra where clause?
At first: the issue is not totally clear to me. Is the second code block in your question the actual code you want to execute? If this is the case the query should work as expected since there is no subselect performed.
I assume so that you want to use the second code block without the list() around the subselect to prevent a second query being performed.
The django documentation refers to this issue in the documentation about the extra method. However its not very easy to overcome this issue.
The easiest but most "hakish" solution is to observe which table alias is produced by django for the table you want to query in the extra method. You can rely on the persistent naming of this alias as long as you construct the query always in the same fashion (you don't change the order of multiple extra methods or filter calls that cause a join).
You can inspect a query that will be execute in the DB queryset by using:
print Model.objects.filter(...).query
This will reveal the aliases that are used for the tables you want to query.
As of Django 1.11, you should be able to use Subquery and OuterRef to generate an equivalent query to your extra (using a correlated subquery rather than a join):
def by_status(self, *statii):
return self.filter(
id__in=Subquery(ProductStatus.values("product_id").filter(
status__in=statii,
product__in=Subquery(ProductStatus.objects.values(
"product_id",
).annotate(
latest=Max("start_date"),
).filter(
latest=OuterRef("start_date"),
).values("product_id"),
),
)
You could probably do it with Window expressions as well (as of Django 2.0).
Note that this is untested, so may need some tweaks.