Doctrine 2: translate a joined count SQL query to DQL

I have two entities, Video and Vote. Vote has a many-to-one relationship with Video, but Video has no relationship with Vote. I'm trying to retrieve a list of Videos sorted by vote count.
The following SQL works to get me what I want:
SELECT video.*, COUNT(video_id) AS vote_count
FROM video
LEFT JOIN vote ON vote.video_id = video.id
GROUP BY video.id
ORDER BY vote_count DESC;
I'm trying to achieve something similar with DQL, but no luck so far:
SELECT vid.name, COUNT(vote.video_id) as vote_count
FROM VideoVote\Video\Video vid
JOIN vote.video vid
GROUP BY video.id
ORDER BY vote_count DESC

It can't be done because Video does not have a relation to Vote. So your options are:
1. Make the relationship bidirectional (I would go for this one). There is no penalty on the database and only a little extra overhead in PHP (test performance to see if it works for you).
2. Use your native SQL query with result set mapping.

Thank you Flip. Once I added the one-to-many association from Video to Vote, the following DQL works:
SELECT vid, COUNT(vote.id) AS HIDDEN vote_count
FROM VideoVote\Video\Video vid
LEFT JOIN vid.votes vote
GROUP BY vid.id
ORDER BY vote_count DESC
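As a rough illustration of what the working query computes, here is a minimal sketch that runs the equivalent SQL against an in-memory SQLite database. The sample rows are made up, and SQLite is only a stand-in for whatever database Doctrine talks to. The point it shows is that LEFT JOIN together with COUNT(vote.id) keeps videos that have no votes and reports them with a count of 0.

```python
import sqlite3

# Hypothetical sample data: three videos, one of which has no votes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE video (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE vote (
        id INTEGER PRIMARY KEY,
        video_id INTEGER REFERENCES video(id)
    );
    INSERT INTO video (id, name) VALUES (1, 'cats'), (2, 'dogs'), (3, 'birds');
    INSERT INTO vote (video_id) VALUES (1), (2), (2), (2), (1);
""")

# COUNT(vote.id) ignores the NULLs produced by the LEFT JOIN,
# so a video without votes gets 0 instead of disappearing.
rows = conn.execute("""
    SELECT video.name, COUNT(vote.id) AS vote_count
    FROM video
    LEFT JOIN vote ON vote.video_id = video.id
    GROUP BY video.id
    ORDER BY vote_count DESC
""").fetchall()

for name, vote_count in rows:
    print(name, vote_count)
```

With an inner JOIN instead, the video without votes would be missing from the result.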

Solving a slow query with a ForeignKey that has isnull=False and order_by in a Django ListView

I have a Django ListView that paginates through 'active' People.
The (simplified) models:
class Person(models.Model):
    name = models.CharField()
    # ...
    active_schedule = models.ForeignKey('Schedule', related_name='+', null=True, on_delete=models.SET_NULL)
class Schedule(models.Model):
    field = models.PositiveIntegerField(default=0)
    # ...
    person = models.ForeignKey(Person, related_name='schedules', on_delete=models.CASCADE)
The Person table contains almost 700.000 rows and the Schedule table contains just over 2.000.000 rows (on average every Person has 2-3 Schedule records, although many have none and a lot have more). For an 'active' Person, the active_schedule ForeignKey is set, of which there are about 5.000 at any time.
The ListView is supposed to show all active Persons, sorted by field on Schedule (and some other conditions that don't seem to matter for this case).
The query then becomes:
(Person.objects
    .filter(active_schedule__isnull=False)
    .select_related('active_schedule')
    .order_by('active_schedule__field'))
Specifically the order_by on the related field makes this query terribly slow (that is: it takes about a second, which is too slow for a web app).
I was hoping the filter condition would select the 5000 records, which then become relatively easily sortable. But when I run explain on this query, it shows that the (Postgres) database is messing with many more rows:
Gather Merge (cost=224316.51..290280.48 rows=565366 width=227)
  Workers Planned: 2
  -> Sort (cost=223316.49..224023.19 rows=282683 width=227)
       Sort Key: exampledb_schedule.field
       -> Parallel Hash Join (cost=89795.12..135883.20 rows=282683 width=227)
            Hash Cond: (exampledb_person.active_schedule_id = exampledb_schedule.id)
            -> Parallel Seq Scan on exampledb_person (cost=0.00..21263.03 rows=282683 width=161)
                 Filter: (active_schedule_id IS NOT NULL)
            -> Parallel Hash (cost=67411.27..67411.27 rows=924228 width=66)
                 -> Parallel Seq Scan on exampledb_schedule (cost=0.00..67411.27 rows=924228 width=66)
I recently changed the models to be this way. In a previous version I had a model with just the ~5.000 active Persons in it. Doing the order_by on this small table was considerably faster! I am hoping to achieve the same speed with the current models.
I tried retrieving just the fields needed for the ListView (using values), which does help a little, but not much. I also tried setting the related_name on active_schedule and approaching the problem from Schedule, but that makes no difference. I tried putting a db_index on Schedule.field, but that only seems to make things slower. Conditional queries also did not help (although I probably have not tried all possibilities). I'm at a loss.
The SQL statement generated by the ORM query:
SELECT
"exampledb_person"."id",
"exampledb_person"."name",
...
"exampledb_person"."active_schedule_id",
"exampledb_person"."created",
"exampledb_person"."updated",
"exampledb_schedule"."id",
"exampledb_schedule"."person_id",
"exampledb_schedule"."field",
...
"exampledb_schedule"."created",
"exampledb_schedule"."updated"
FROM
"exampledb_person"
INNER JOIN
"exampledb_schedule"
ON ("exampledb_person"."active_schedule_id" = "exampledb_schedule"."id")
WHERE
"exampledb_person"."active_schedule_id" IS NOT NULL
ORDER BY
"exampledb_schedule"."field" ASC
(Some fields were left out, for simplicity.)
Is it possible to speed up this query, or should I revert back to using a special Model for the active Persons?
EDIT: When I change the query, just for comparison/testing, to sort on an UNindexed field on Person, the query is equally slow. However, if I then add an index to that field, the query is fast! I had to try this, as the SQL statement indeed shows that it's ordering on "exampledb_schedule"."field" - a field without index, but like I said: adding an index on the field makes no difference.
EDIT: I suppose it's also worth noting that when trying a much simpler sort query directly on Schedule, either on an indexed field or not, it's MUCH faster. For instance, for this test I've added an index to Schedule.field, then the following query is blazing fast:
Schedule.objects.order_by('field')
Somewhere in here lies the solution...
The comments by @guarav and my edits pointed me in the direction of the solution, which had been staring me in the face for a while...
The filter clause in my question - filter(active_schedule__isnull=False) - seems to prevent the database from using its indexes. I wasn't aware of this, and had hoped a database expert would point me in this direction.
The solution is to filter on Schedule.field, which is 0 for inactive Person records and >0 for active ones:
(Person.objects
    .select_related('active_schedule')
    .filter(active_schedule__field__gte=1)
    .order_by('active_schedule__field'))
This query properly uses the indexes and is fast (20 ms as opposed to ~1000 ms).
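The rewrite only returns the same rows if every active schedule really has field > 0, as stated above. A minimal sketch of that equivalence, using SQLite from the Python standard library as a stand-in for Postgres and a handful of made-up rows (table names are shortened and are assumptions for illustration). It says nothing about speed; on the real data, comparing the EXPLAIN output of both queries is what shows which plan the database picks.

```python
import sqlite3

# Hypothetical miniature of the Person/Schedule tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schedule (
        id INTEGER PRIMARY KEY,
        person_id INTEGER,
        field INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT,
        active_schedule_id INTEGER REFERENCES schedule(id)
    );
    CREATE INDEX schedule_field_idx ON schedule (field);
    INSERT INTO schedule (id, person_id, field)
        VALUES (1, 1, 0), (2, 1, 7), (3, 2, 3), (4, 3, 0);
    INSERT INTO person (id, name, active_schedule_id)
        VALUES (1, 'ann', 2), (2, 'bob', 3), (3, 'cy', NULL);
""")

BASE = """
    SELECT person.name, schedule.field
    FROM person
    INNER JOIN schedule ON person.active_schedule_id = schedule.id
    WHERE {condition}
    ORDER BY schedule.field ASC
"""

# Original filter: the foreign key is set.
by_null_check = conn.execute(
    BASE.format(condition="person.active_schedule_id IS NOT NULL")
).fetchall()

# Rewritten filter: relies on active schedules having field >= 1.
by_field = conn.execute(
    BASE.format(condition="schedule.field >= 1")
).fetchall()

print(by_null_check)
print(by_field)
```

If an active schedule could ever have field = 0, the second query would silently drop that person, so the assumption is worth guarding with a test or a database constraint.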

How to join with sub-table using Django's ORM?

I have this query, which orders posts by the date of their last comment. It works well with small tables. However, after filling the database with random data (approximately 2M rows in the comment table), I analyzed the query with EXPLAIN and saw that a sequential scan is performed on the Post table.
Post.objects.extra(select={'last_update': 'select max(c.create_date) from comment_comment c where c.post_id = post_post.id'}).order_by('-last_update')
I've rewritten the same query in SQL, and the rewritten version is faster than the current one. But I could not find a way to express it with Django's ORM. How can I rewrite it? If possible, I would like to avoid a raw query.
Regards. Thanks for any help.
select
p.*,
t.last_update
from
post_post p
join
( select c.post_id as pid, max(c.create_date) as last_update from comment_comment c group by pid) t
on p.id = t.pid
order by t.last_update desc
limit 50;
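If I make some assumptions about your Django model, it will look something like this:
from django.db.models import Max
# Assumes Comment has a ForeignKey to Post with related_name='comments'.
posts = (Post.objects
    .annotate(last_update=Max('comments__create_date'))
    .order_by('-last_update')[:50])
In Django, annotate is your friend.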
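To compare the two query shapes, here is a small sketch using SQLite from the Python standard library with made-up rows. The first query is the hand-written derived-table join from the question; the second is roughly the shape the ORM produces for an annotate over a reverse relation (a LEFT OUTER JOIN with GROUP BY). One difference worth knowing: the inner join drops posts without comments, while the outer join keeps them with a NULL last_update.

```python
import sqlite3

# Hypothetical sample data: one post has no comments at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_post (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comment_comment (
        id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES post_post(id),
        create_date TEXT
    );
    INSERT INTO post_post (id, title)
        VALUES (1, 'first'), (2, 'second'), (3, 'no comments');
    INSERT INTO comment_comment (post_id, create_date)
        VALUES (1, '2020-01-01'), (1, '2020-03-01'), (2, '2020-02-01');
""")

# Shape of the hand-written query: join against a grouped subquery.
derived = conn.execute("""
    SELECT p.title, t.last_update
    FROM post_post p
    JOIN (
        SELECT c.post_id AS pid, MAX(c.create_date) AS last_update
        FROM comment_comment c
        GROUP BY c.post_id
    ) t ON p.id = t.pid
    ORDER BY t.last_update DESC
    LIMIT 50
""").fetchall()

# Approximate shape of the annotate() version: outer join plus GROUP BY.
grouped = conn.execute("""
    SELECT p.title, MAX(c.create_date) AS last_update
    FROM post_post p
    LEFT OUTER JOIN comment_comment c ON c.post_id = p.id
    GROUP BY p.id
    ORDER BY last_update DESC
    LIMIT 50
""").fetchall()

print(derived)
print(grouped)
```

SQLite sorts NULL last in descending order, whereas Postgres sorts it first by default, so posts without comments may surface at the top there unless they are filtered out or the ordering specifies where NULLs go.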

Count only published videos

I have a Category model and a Video model:
Category:
    name = CharField()
Video:
    name = CharField()
    category = ManyToManyField()
    is_live = BooleanField()
I want to get all categories with a video count, but I want to exclude videos that are not live.
This is my starting point:
Category.objects.annotate(video_count=Count('video'))
# I tried this but I'm not sure if this is the right way
Category.objects.exclude(video__is_live=False)
Any ideas?
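If you want to filter the field you are annotating, you need to use raw SQL, as you can't do it through the ORM yet (that was the case when this was written; since Django 2.0, aggregates such as Count accept a filter argument). I wrote a blog post about this: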
http://timmyomahony.com/blog/filtering-annotations-django/
Your situation is a little more complicated, as you have a M2M relationship which uses an intermediate table. You need something like the following, which joins the intermediate table to the video table and counts, per category, only the videos that are marked is_live=True (this is totally untested, so you will need to play around with it; the table names are guesses):
categories = Category.objects.all().extra(select={
    "live_video_count": """
        SELECT COUNT(*)
        FROM myapp_videocategory
        JOIN myapp_video ON myapp_videocategory.video_id = myapp_video.id
        WHERE myapp_videocategory.category_id = myapp_category.id
        AND myapp_video.is_live = True
    """
}).order_by("-live_video_count")
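The idea behind the subquery is a per-category count that is correlated with the outer category row. A minimal sketch of that SQL on its own, run against SQLite with made-up rows; the table names follow the guesses used in the answer and are assumptions, not the names Django would necessarily generate.

```python
import sqlite3

# Hypothetical data: video 'b' is not live and belongs to two categories.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myapp_category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE myapp_video (id INTEGER PRIMARY KEY, name TEXT, is_live INTEGER);
    CREATE TABLE myapp_videocategory (video_id INTEGER, category_id INTEGER);
    INSERT INTO myapp_category VALUES (1, 'music'), (2, 'sports'), (3, 'empty');
    INSERT INTO myapp_video VALUES (1, 'a', 1), (2, 'b', 0), (3, 'c', 1), (4, 'd', 1);
    INSERT INTO myapp_videocategory VALUES (1, 1), (2, 1), (3, 2), (4, 2), (2, 2);
""")

# The inner WHERE ties the count to the current outer category row
# and keeps only live videos.
rows = conn.execute("""
    SELECT c.name,
           (SELECT COUNT(*)
            FROM myapp_videocategory vc
            JOIN myapp_video v ON vc.video_id = v.id
            WHERE vc.category_id = c.id
              AND v.is_live = 1) AS live_video_count
    FROM myapp_category c
    ORDER BY live_video_count DESC, c.name
""").fetchall()

for name, live_video_count in rows:
    print(name, live_video_count)
```

Categories without any live video still appear, with a count of 0, which a plain exclude(video__is_live=False) on the queryset would not guarantee.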

Filter for elements using exists through a reverse foreign key relationship

A relevant image of my model is here: http://i.stack.imgur.com/xzsVU.png
I need to make a queryset that contains all cats who have an associated person with a role of "owner" and a name of "bob".
The SQL for this is shown below.
select * from cat where exists
(select 1 from person inner join role where
person.name="bob" and role.name="owner");
This problem can be solved in two sql queries with the following django filters.
people = Person.objects.filter(name="bob", role__name="owner")
ids = [p.id for p in people]
cats = Cat.objects.filter(id__in=ids)
My actual setup is more complex than this and is dealing with a large dataset. Is there a way to do this with one query? If it is impossible, what is the efficient alternative?
I'm pretty sure this is your query:
cats = Cat.objects.filter(person__name='bob', person__role__name='owner')
Read the Django documentation on lookups that span relationships.
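Since the model diagram is not reproduced here, the following is only a sketch under an assumed schema: person has a cat_id and a role_id foreign key. It shows a correlated EXISTS in plain SQL, run against SQLite with made-up rows, which is the kind of single query the spanning lookup is meant to replace the two-step approach with.

```python
import sqlite3

# Assumed schema (hypothetical): person.cat_id -> cat, person.role_id -> role.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cat (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE role (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT,
        cat_id INTEGER REFERENCES cat(id),
        role_id INTEGER REFERENCES role(id)
    );
    INSERT INTO cat VALUES (1, 'tom'), (2, 'felix'), (3, 'misty');
    INSERT INTO role VALUES (1, 'owner'), (2, 'vet');
    INSERT INTO person VALUES
        (1, 'bob', 1, 1),    -- bob owns tom
        (2, 'bob', 2, 2),    -- a bob who is felix's vet, not owner
        (3, 'alice', 3, 1);  -- alice owns misty
""")

# The subquery is correlated on person.cat_id = cat.id, so each cat is
# tested against its own people only.
cats = conn.execute("""
    SELECT cat.name
    FROM cat
    WHERE EXISTS (
        SELECT 1
        FROM person
        INNER JOIN role ON role.id = person.role_id
        WHERE person.cat_id = cat.id
          AND person.name = 'bob'
          AND role.name = 'owner'
    )
    ORDER BY cat.id
""").fetchall()

print(cats)
```

Both conditions have to hold for the same person row, which is why felix (whose bob is only a vet) is not returned.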

What is the internal function in django to add new tables to a queryset in a sensible way?

In django 1.2:
I have a queryset with an extra parameter which refers to a table which is not currently included in the query django generates for this queryset.
If I add an order_by to the queryset which refers to the other table, django adds joins to the other table in the proper way and the extra works. But without the order_by, the extra parameter is failing. I could just add a useless secondary order_by to something in the other table, but I think there should be a better way to do it.
What is the django function to add joins in a sensible way? I know this must be getting called somewhere.
Here is some sample code. It selects all readings for a given user, and annotates the results with the rating (if any) given by another user stored in 'friend'.
class Book(models.Model):
    name = models.CharField(max_length=200)
    urlname = models.CharField(max_length=200)
    entrydate = models.DateTimeField(auto_now_add=True)
class Reading(models.Model):
    book = models.ForeignKey(Book, related_name='readings')
    user = models.ForeignKey(User)
    rating = models.IntegerField()
    entrydate = models.DateTimeField(auto_now_add=True)
readings = Reading.objects.filter(user=user).order_by('entrydate')
friendrating = '(select rating from proj_reading where user_id=%d and \
book_id=proj_book.id and rating in (1,2,3,4,5,6))' % friend.id
readings = readings.extra(select={'friendrating': friendrating})
At the moment, readings won't work because the join to the book table is not set up. However, if I add an order_by such as:
.order_by('entrydate','book__entrydate')
django magically knows to add an inner join through the foreign key and I get what I want.
additional information:
print readings.query ==>
select ((select rating from proj_reading where user_id=2 and book_id=proj_book.id and rating in (1,2,3,4,5,6)) as 'hisrating', proj_reading.id, proj_reading.user_id, proj_reading.rating, proj_reading.entrydate from proj_reading where proj_reading.user_id=1;
assuming
user.id=1
friend.id=2
the error is:
OperationalError: Unknown column proj_book.id in 'where clause'
and it happens because the table proj_book is not included in the query. To restate what I said above - if I now do readings2=readings.order_by('book__entrydate') I can see the proper join is set up and the query works.
Ideally I'd just like to figure out what the name of the qs.query function is that looks at two tables and figures out how they are joined by foreign keys, and just call that manually.
Your generated query:
select ((select rating from proj_reading where user_id=2 and book_id=proj_book.id and rating in (1,2,3,4,5,6)) as 'hisrating', proj_reading.id, proj_reading.user_id, proj_reading.rating, proj_reading.entrydate from proj_reading where proj_reading.user_id=1;
The database has no way to resolve proj_book, since that table appears in neither the FROM clause nor a join.
You get the expected results when you add order_by, because that order_by adds an inner join between proj_book and proj_reading.
As far as I understand, if you refer to any other column of Book, not just in order_by, you will get similar results.
Q1 = Reading.objects.filter(user=user).exclude(book__name='')  # referencing a Book field forces the JOIN
Q2 = "Select rating from proj_reading where user_id=%d" % user.id
Result = Q1.extra(select={"foo": Q2})
This way, at step Q1, you are forcing Django to add a join on the Book table, which it does not do by default unless you access a field of the Book table.
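The failure and the fix can be reproduced outside Django. This is a minimal sketch using SQLite with made-up rows; the subquery alias r and the sample data are assumptions for illustration, and the exact error text differs between databases (the question shows the MySQL wording).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE proj_book (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE proj_reading (
        id INTEGER PRIMARY KEY,
        book_id INTEGER REFERENCES proj_book(id),
        user_id INTEGER,
        rating INTEGER
    );
    INSERT INTO proj_book VALUES (1, 'dune'), (2, 'emma');
    INSERT INTO proj_reading (book_id, user_id, rating)
        VALUES (1, 1, 4), (2, 1, 5), (1, 2, 3);
""")

# Correlated subquery that refers to proj_book in the outer query.
SUBQUERY = """(SELECT r.rating FROM proj_reading r
               WHERE r.user_id = 2 AND r.book_id = proj_book.id)"""

# Without proj_book in the outer FROM/JOIN, the reference cannot be resolved.
try:
    conn.execute(
        f"SELECT {SUBQUERY} AS friendrating, proj_reading.id "
        "FROM proj_reading WHERE proj_reading.user_id = 1"
    )
    missing_table_error = None
except sqlite3.OperationalError as exc:
    missing_table_error = str(exc)

# With the join in place, the same subquery works.
rows = conn.execute(
    f"SELECT proj_book.name, proj_reading.rating, {SUBQUERY} AS friendrating "
    "FROM proj_reading "
    "INNER JOIN proj_book ON proj_book.id = proj_reading.book_id "
    "WHERE proj_reading.user_id = 1 "
    "ORDER BY proj_reading.id"
).fetchall()

print(missing_table_error)
print(rows)
```

The friend has not rated the second book, so its friendrating comes back as NULL (None in Python).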
You mean:
class SomeModel(models.Model):
    id = models.IntegerField(primary_key=True)
    ...
class SomeOtherModel(models.Model):
    otherfield = models.ForeignKey(SomeModel)
qrst = SomeOtherModel.objects.filter(otherfield__id=1)
You can use "__" to create table joins.
EDIT:
It won't work because you do not define the table join correctly.
myrating = '(select rating from proj_reading inner join proj_book on (proj_book.id=proj_reading.book_id) where proj_reading.user_id=%d and rating in (1,2,3,4,5,6))' % user.id
This is pseudocode and it is not tested.
But I advise you to use Django filters instead of writing SQL queries.
read = Reading.objects.filter(book__urlname__icontains="smith", user_id=user.id, rating__in=(1,2,3,4,5,6)).values('rating')
See the Django documentation for more details.