Can Django do nested queries and exclusions - django

I need some help putting together this query in Django. I've simplified the example here to just cut right to the point.
MyModel(models.Model):
created = models.DateTimeField()
user = models.ForeignKey(User)
data = models.BooleanField()
The query I'd like to create in English would sound like:
Give me every record that was created yesterday for which data is False where in that same range data never appears as True for the given user
Here's an example input/output in case that wasn't clear.
Table Values
ID Created User Data
1 1/1/2010 admin False
2 1/1/2010 joe True
3 1/1/2010 admin False
4 1/1/2010 joe False
5 1/2/2010 joe False
Output Queryset
1 1/1/2010 admin False
3 1/1/2010 admin False
What I'm looking to do is to exclude record #4. The reason for this is because in the given range "yesterday", data appears as True once for the user in record #2, therefore that would exclude record #4.
In a sense, it almost seems like there are 2 queries taking place. One to determine the records in the given range, and one to exclude records which intersect with the "True" records.
How can I do this query with the Django ORM?

You don't need a nested query. You can generate a list of bad users' PKs and then exclude records containing those PKs in the next query.
bad = list(set(MyModel.obejcts.filter(data=True).values_list('user', flat=True)))
# list(set(list_object)) will remove duplicates
# not needed but might save the DB some work
rs = MyModel.objects.filter(datequery).exclude(user__pk__in=bad)
# might not need the pk in user__pk__in - try it
You could condense that down into one line but I think that's as neat as you'll get. 2 queries isn't so bad.
Edit: You might wan to read the docs on this:
http://docs.djangoproject.com/en/dev/ref/models/querysets/#in
It makes it sound like it auto-nests the query (so only one query fires in the database) if it's like this:
bad = MyModel.objects.filter(data=True).values('pk')
rs = MyModel.objects.filter(datequery).exclude(user__pk__in=bad)
But MySQL doesn't optimise this well so my code above (2 full queries) can actually end up running a lot faster.
Try both and race them!

looks like you could use:
from django.db.models import F
MyModel.objects.filter(datequery).filter(data=False).filter(data = F('data'))
F object available from version 1.0
Please, test it, I'm not sure.

Thanks to lazy evaluation, you can break your query up into a few different variables to make it easier to read. Here is some ./manage.py shell play time in the style that Oli already presented.
> from django.db import connection
> connection.queries = []
> target_day_qs = MyModel.objects.filter(created='2010-1-1')
> bad_users = target_day_qs.filter(data=True).values('user')
> result = target_day_qs.exclude(user__in=bad_users)
> [r.id for r in result]
[1, 3]
> len(connection.queries)
1
You could also say result.select_related() if you wanted to pull in the user objects in the same query.

Related

how does django query work?

my models are designed like so
class Warehouse:
name = ...
sublocation = FK(Sublocation)
class Sublocation:
name = ...
city = FK(City)
class City:
name = ..
state = Fk(State)
Now if i throw a query.
wh = Warehouse.objects.value_list(['name', 'sublocation__name',
'sublocation__city__name']).first()
it returns correct result but internally how many query is it throwing? is django fetching the data in one request?
Django makes only one query to the database for getting the data you described.
When you do:
wh = Warehouse.objects.values_list(
'name', 'sublocation__name', 'sublocation__city__name').first()
It translates in to this query:
SELECT "myapp_warehouse"."name", "myapp_sublocation"."name", "myapp_city"."name"
FROM "myapp_warehouse" INNER JOIN "myapp_sublocation"
ON ("myapp_warehouse"."sublocation_id" = "myapp_sublocation"."id")
INNER JOIN "myapp_city" ON ("myapp_sublocation"."city_id" = "myapp_city"."id")'
It gets the result in a single query. You can count number of queries in your shell like this:
from django.db import connection as c, reset_queries as rq
In [42]: rq()
In [43]: len(c.queries)
Out[43]: 0
In [44]: wh = Warehouse.objects.values_list('name', 'sublocation__name', 'sublocation__city__name').first()
In [45]: len(c.queries)
Out[45]: 1
My suggestion would be to write a test for this using assertNumQueries (docs here).
from django.test import TestCase
from yourproject.models import Warehouse
class TestQueries(TestCase):
def test_query_num(self):
"""
Assert values_list query executes 1 database query
"""
values = ['name', 'sublocation__name', 'sublocation__city__name']
with self.assertNumQueries(1):
Warehouse.objects.value_list(values).first()
FYI I'm not sure how many queries are indeed sent to the database, 1 is my current best guess. Adjust the number of queries expected to get this to pass in your project and pin the requirement.
There is extensive documentation on how and when querysets are evaluated in Django docs: QuerySet API Reference.
The pretty much standard way to have a good insight of how many and which queries are taken place during a page render is to use the Django Debug Toolbar. This could tell you precisely how many times this recordset is evaluated.
You can use django-debug-toolbar to see real queries to db

Django: Ordering a QuerySet based on a latest child models field

Lets assume I want to show a list of runners ordered by their latest sprint time.
class Runner(models.Model):
name = models.CharField(max_length=255)
class Sprint(models.Model):
runner = models.ForeignKey(Runner)
time = models.PositiveIntegerField()
created = models.DateTimeField(auto_now_add=True)
This is a quick sketch of what I would do in SQL:
SELECT runner.id, runner.name, sprint.time
FROM runner
LEFT JOIN sprint ON (sprint.runner_id = runner.id)
WHERE
sprint.id = (
SELECT sprint_inner.id
FROM sprint as sprint_inner
WHERE sprint_inner.runner_id = runner.id
ORDER BY sprint_inner.created DESC
LIMIT 1
)
OR sprint.id = NULL
ORDER BY sprint.time ASC
The Django QuerySet documentation states:
It is permissible to specify a multi-valued field to order the results
by (for example, a ManyToManyField field). Normally this won’t be a
sensible thing to do and it’s really an advanced usage feature.
However, if you know that your queryset’s filtering or available data
implies that there will only be one ordering piece of data for each of
the main items you are selecting, the ordering may well be exactly
what you want to do. Use ordering on multi-valued fields with care and
make sure the results are what you expect.
I guess I need to apply some filter here, but I'm not sure what exactly Django expects...
One note because it is not obvious in this example: the Runner table will have several hundred entries, the sprints will also have several hundreds and in some later days probably several thousand entries. The data will be displayed in a paginated view, so sorting in Python is not an option.
The only other possibility I see is writing the SQL myself, but I'd like to avoid this at all cost.
I don't think there's a way to do this via the ORM with only one query, you could grab a list of runners and use annotate to add their latest sprint id's -- then filter and order those sprints.
>>> from django.db.models import Max
# all runners now have a `last_race` attribute,
# which is the `id` of the last sprint they ran
>>> runners = Runner.objects.annotate(last_race=Max("sprint__id"))
# a list of each runner's last sprint ordered by the the sprint's time,
# we use `select_related` to limit lookup queries later on
>>> results = Sprint.objects.filter(id__in=[runner.last_race for runner in runners])
... .order_by("time")
... .select_related("runner")
# grab the first result
>>> first_result = results[0]
# you can access the runner's details via `.runner`, e.g. `first_result.runner.name`
>>> isinstance(first_result.runner, Runner)
True
# this should only ever execute 2 queries, no matter what you do with the results
>>> from django.db import connection
>>> len(connection.queries)
2
This is pretty fast and will still utilize the databases's indices and caching.
A few thousand records isn't all that much, this should work pretty well for those kinds of numbers. If you start running into problems, I suggest you bite the bullet and use raw SQL.
def view_name(request):
spr = Sprint.objects.values('runner', flat=True).order_by(-created).distinct()
runners = []
for s in spr:
latest_sprint = Sprint.objects.filter(runner=s.runner).order_by(-created)[:1]
for latest in latest_sprint:
runners.append({'runner': s.runner, 'time': latest.time})
return render(request, 'page.html', {
'runners': runners,
})
{% for runner in runners %}
{{runner.runner}} - {{runner.time}}
{% endfor %}

Django: ManyToMany filter matching on ALL items in a list

I have such a Book model:
class Book(models.Model):
authors = models.ManyToManyField(Author, ...)
...
In short:
I'd like to retrieve the books whose authors are strictly equal to a given set of authors. I'm not sure if there is a single query that does it, but any suggestions will be helpful.
In long:
Here is what I tried, (that failed to run getting an AttributeError)
# A sample set of authors
target_authors = set((author_1, author_2))
# To reduce the search space,
# first retrieve those books with just 2 authors.
candidate_books = Book.objects.annotate(c=Count('authors')).filter(c=len(target_authors))
final_books = QuerySet()
for author in target_authors:
temp_books = candidate_books.filter(authors__in=[author])
final_books = final_books and temp_books
... and here is what I got:
AttributeError: 'NoneType' object has no attribute '_meta'
In general, how should I query a model with the constraint that its ManyToMany field contains a set of given objects as in my case?
ps: I found some relevant SO questions but couldn't get a clear answer. Any good pointer will be helpful as well. Thanks.
Similar to #goliney's approach, I found a solution. However, I think the efficiency could be improved.
# A sample set of authors
target_authors = set((author_1, author_2))
# To reduce the search space, first retrieve those books with just 2 authors.
candidate_books = Book.objects.annotate(c=Count('authors')).filter(c=len(target_authors))
# In each iteration, we filter out those books which don't contain one of the
# required authors - the instance on the iteration.
for author in target_authors:
candidate_books = candidate_books.filter(authors=author)
final_books = candidate_books
You can use complex lookups with Q objects
from django.db.models import Q
...
target_authors = set((author_1, author_2))
q = Q()
for author in target_authors:
q &= Q(authors=author)
Books.objects.annotate(c=Count('authors')).filter(c=len(target_authors)).filter(q)
Q() & Q() is not equal to .filter().filter(). Their raw SQLs are different where by using Q with &, its SQL just add a condition like WHERE "book"."author" = "author_1" and "book"."author" = "author_2". it should return empty result.
The only solution is just by chaining filter to form a SQL with inner join on same table: ... ON ("author"."id" = "author_book"."author_id") INNER JOIN "author_book" T4 ON ("author"."id" = T4."author_id") WHERE ("author_book"."author_id" = "author_1" AND T4."author_id" = "author_1")
I came across the same problem and came to the same conclusion as iuysal,
untill i had to do a medium sized search (with 1000 records with 150 filters my request would time out).
In my particular case the search would result in no records since the chance that a single record will align with ALL 150 filters is very rare, you can get around the performance issues by verifying that there are records in the QuerySet before applying more filters to save time.
# In each iteration, we filter out those books which don't contain one of the
# required authors - the instance on the iteration.
for author in target_authors:
if candidate_books.count() > 0:
candidate_books = candidate_books.filter(authors=author)
For some reason Django applies filters to empty QuerySets.
But if optimization is to be applied correctly however, using a prepared QuerySet and correctly applied indexes are necessary.

Latest post by different users

Post has user and date attributes.
How can I turn this
posts = Post.objects.order_by('-date')[:30]
to give me 30 or less posts consisting of the last post by every user?
For example if I have 4 posts stored, 3 are from post.user="Benny" and 1 from post.user="Catherine" it should return 1 post from Benny and 1 from Catherine ordered by date.
I would probably use annotate, you might be able to use extra to get down to a single query.
posts = []
for u in User.objects.annotate(last_post=Max('post__date')).order_by('-last_post')[:30]:
posts.append(u.post_set.latest('date'))
You could also use raw, which would let you write a SQL query, but still return model instances. For instance:
sql = """
SELECT * FROM app_post
WHERE app_post.date IN
(SELECT MAX(app_post.date) FROM app_post
GROUP BY app_post.user_id)
ORDER BY app_post.date DESC
"""
posts = list(Post.objects.raw(sql))
untested but you might be able to do this
from django.db.models import Max
Post.objects.all().annotate(user=Max('date'))

fast lookup for the last element in a Django QuerySet?

I've a model called Valor. Valor has a Robot. I'm querying like this:
Valor.objects.filter(robot=r).reverse()[0]
to get the last Valor the the r robot. Valor.objects.filter(robot=r).count() is about 200000 and getting the last items takes about 4 seconds in my PC.
How can I speed it up? I'm querying the wrong way?
The optimal mysql syntax for this problem would be something along the lines of:
SELECT * FROM table WHERE x=y ORDER BY z DESC LIMIT 1
The django equivalent of this would be:
Valor.objects.filter(robot=r).order_by('-id')[:1][0]
Notice how this solution utilizes django's slicing method to limit the queryset before compiling the list of objects.
If none of the earlier suggestions are working, I'd suggest taking Django out of the equation and run this raw sql against your database. I'm guessing at your table names, so you may have to adjust accordingly:
SELECT * FROM valor v WHERE v.robot_id = [robot_id] ORDER BY id DESC LIMIT 1;
Is that slow? If so, make your RDBMS (MySQL?) explain the query plan to you. This will tell you if it's doing any full table scans, which you obviously don't want with a table that large. You might also edit your question and include the schema for the valor table for us to see.
Also, you can see the SQL that Django is generating by doing this (using the query set provided by Peter Rowell):
qs = Valor.objects.filter(robot=r).order_by('-id')[0]
print qs.query
Make sure that SQL is similar to the 'raw' query I posted above. You can also make your RDBMS explain that query plan to you.
It sounds like your data set is going to be big enough that you may want to denormalize things a little bit. Have you tried keeping track of the last Valor object in the Robot object?
class Robot(models.Model):
# ...
last_valor = models.ForeignKey('Valor', null=True, blank=True)
And then use a post_save signal to make the update.
from django.db.models.signals import post_save
def record_last_valor(sender, **kwargs):
if kwargs.get('created', False):
instance = kwargs.get('instance')
instance.robot.last_valor = instance
post_save.connect(record_last_valor, sender=Valor)
You will pay the cost of an extra db transaction when you create the Valor objects but the last_valor lookup will be blazing fast. Play with it and see if the tradeoff is worth it for your app.
Well, there's no order_by clause so I'm wondering about what you mean by 'last'. Assuming you meant 'last added',
Valor.objects.filter(robot=r).order_by('-id')[0]
might do the job for you.
django 1.6 introduces .first() and .last():
https://docs.djangoproject.com/en/1.6/ref/models/querysets/#last
So you could simply do:
Valor.objects.filter(robot=r).last()
Quite fast should also be:
qs = Valor.objects.filter(robot=r) # <-- it doesn't hit the database
count = qs.count() # <-- first hit the database, compute a count
last_item = qs[ count-1 ] # <-- second hit the database, get specified rownum
So, in practice you execute only 2 SQL queries ;)
Model_Name.objects.first()
//To get the first element
Model_name.objects.last()
//For get last()
in my case, the last is not work because there is only one row in the database
maybe help full for you too :)
Is there a limit clause in django? This way you can have the db, simply return a single record.
mysql
select * from table where x = y limit 1
sql server
select top 1 * from table where x = y
oracle
select * from table where x = y and rownum = 1
I realize this isn't translated into django, but someone can come back and clean this up.
The correct way of doing this, is to use the built-in QuerySet method latest() and feeding it whichever column (field name) it should sort by. The drawback is that it can only sort by a single db column.
The current implementation looks like this and is optimized in the same sense as #Aaron's suggestion.
def latest(self, field_name=None):
"""
Returns the latest object, according to the model's 'get_latest_by'
option or optional given field_name.
"""
latest_by = field_name or self.model._meta.get_latest_by
assert bool(latest_by), "latest() requires either a field_name parameter or 'get_latest_by' in the model"
assert self.query.can_filter(), \
"Cannot change a query once a slice has been taken."
obj = self._clone()
obj.query.set_limits(high=1)
obj.query.clear_ordering()
obj.query.add_ordering('-%s' % latest_by)
return obj.get()