how does django query work? - django

my models are designed like so
class Warehouse:
name = ...
sublocation = FK(Sublocation)
class Sublocation:
name = ...
city = FK(City)
class City:
name = ..
state = Fk(State)
Now if i throw a query.
wh = Warehouse.objects.value_list(['name', 'sublocation__name',
'sublocation__city__name']).first()
it returns correct result but internally how many query is it throwing? is django fetching the data in one request?

Django makes only one query to the database for getting the data you described.
When you do:
wh = Warehouse.objects.values_list(
'name', 'sublocation__name', 'sublocation__city__name').first()
It translates in to this query:
SELECT "myapp_warehouse"."name", "myapp_sublocation"."name", "myapp_city"."name"
FROM "myapp_warehouse" INNER JOIN "myapp_sublocation"
ON ("myapp_warehouse"."sublocation_id" = "myapp_sublocation"."id")
INNER JOIN "myapp_city" ON ("myapp_sublocation"."city_id" = "myapp_city"."id")'
It gets the result in a single query. You can count number of queries in your shell like this:
from django.db import connection as c, reset_queries as rq
In [42]: rq()
In [43]: len(c.queries)
Out[43]: 0
In [44]: wh = Warehouse.objects.values_list('name', 'sublocation__name', 'sublocation__city__name').first()
In [45]: len(c.queries)
Out[45]: 1

My suggestion would be to write a test for this using assertNumQueries (docs here).
from django.test import TestCase
from yourproject.models import Warehouse
class TestQueries(TestCase):
def test_query_num(self):
"""
Assert values_list query executes 1 database query
"""
values = ['name', 'sublocation__name', 'sublocation__city__name']
with self.assertNumQueries(1):
Warehouse.objects.value_list(values).first()
FYI I'm not sure how many queries are indeed sent to the database, 1 is my current best guess. Adjust the number of queries expected to get this to pass in your project and pin the requirement.

There is extensive documentation on how and when querysets are evaluated in Django docs: QuerySet API Reference.
The pretty much standard way to have a good insight of how many and which queries are taken place during a page render is to use the Django Debug Toolbar. This could tell you precisely how many times this recordset is evaluated.

You can use django-debug-toolbar to see real queries to db

Related

Django count group by date from datetime

I'm trying to count the dates users register from a DateTime field. In the database this is stored as '2016-10-31 20:49:38' but I'm only interested in the date '2016-10-31'.
The raw SQL query is:
select DATE(registered_at) registered_date,count(registered_at) from User
where course='Course 1' group by registered_date;
It is possible using 'extra' but I've read this is deprecated and should not be done. It works like this though:
User.objects.all()
.filter(course='Course 1')
.extra(select={'registered_date': "DATE(registered_at)"})
.values('registered_date')
.annotate(**{'total': Count('registered_at')})
Is it possible to do without using extra?
I read that TruncDate can be used and I think this is the correct queryset however it does not work:
User.objects.all()
.filter(course='Course 1')
.annotate(registered_date=TruncDate('registered_at'))
.values('registered_date')
.annotate(**{'total': Count('registered_at')})
I get <QuerySet [{'total': 508346, 'registered_date': None}]> so there is something going wrong with TruncDate.
If anyone understands this better than me and can point me in the right direction that would be much appreciated.
Thanks for your help.
I was trying to do something very similar and was having the same problems as you. I managed to get my problem working by adding in an order_by clause after applying the TruncDate annotation. So I imagine that this should work for you too:
User.objects.all()
.filter(course='Course 1')
.annotate(registered_date=TruncDate('registered_at'))
.order_by('registered_date')
.values('registered_date')
.annotate(**{'total': Count('registered_at')})
Hope this helps?!
This is an alternative to using TruncDate by using `registered_at__date' and Django does the truncate for you.
from django.db.models import Count
from django.contrib.auth import get_user_model
metrics = {
'total': Count('registered_at__date')
}
get_user_model().objects.all()
.filter(course='Course 1')
.values('registered_at__date')
.annotate(**metrics)
.order_by('registered_at__date')
For Postgresql this transforms to the DB query:
SELECT
("auth_user"."registered_at" AT TIME ZONE 'Asia/Kolkata')::date,
COUNT("auth_user"."registered_at") AS "total"
FROM
"auth_user"
GROUP BY
("auth_user"."registered_at" AT TIME ZONE 'Asia/Kolkata')::date
ORDER BY
("auth_user"."registered_at" AT TIME ZONE 'Asia/Kolkata')::date ASC;
From the above example you can see that Django ORM reverses SELECT and GROUP_BY arguments. In Django ORM .values() roughly controls the GROUP_BY argument while .annotate() controls the SELECT columns and what aggregations needs to be done. This feels a little odd but is simple when you get the hang of it.

Django: Ordering a QuerySet based on a latest child models field

Lets assume I want to show a list of runners ordered by their latest sprint time.
class Runner(models.Model):
name = models.CharField(max_length=255)
class Sprint(models.Model):
runner = models.ForeignKey(Runner)
time = models.PositiveIntegerField()
created = models.DateTimeField(auto_now_add=True)
This is a quick sketch of what I would do in SQL:
SELECT runner.id, runner.name, sprint.time
FROM runner
LEFT JOIN sprint ON (sprint.runner_id = runner.id)
WHERE
sprint.id = (
SELECT sprint_inner.id
FROM sprint as sprint_inner
WHERE sprint_inner.runner_id = runner.id
ORDER BY sprint_inner.created DESC
LIMIT 1
)
OR sprint.id = NULL
ORDER BY sprint.time ASC
The Django QuerySet documentation states:
It is permissible to specify a multi-valued field to order the results
by (for example, a ManyToManyField field). Normally this won’t be a
sensible thing to do and it’s really an advanced usage feature.
However, if you know that your queryset’s filtering or available data
implies that there will only be one ordering piece of data for each of
the main items you are selecting, the ordering may well be exactly
what you want to do. Use ordering on multi-valued fields with care and
make sure the results are what you expect.
I guess I need to apply some filter here, but I'm not sure what exactly Django expects...
One note because it is not obvious in this example: the Runner table will have several hundred entries, the sprints will also have several hundreds and in some later days probably several thousand entries. The data will be displayed in a paginated view, so sorting in Python is not an option.
The only other possibility I see is writing the SQL myself, but I'd like to avoid this at all cost.
I don't think there's a way to do this via the ORM with only one query, you could grab a list of runners and use annotate to add their latest sprint id's -- then filter and order those sprints.
>>> from django.db.models import Max
# all runners now have a `last_race` attribute,
# which is the `id` of the last sprint they ran
>>> runners = Runner.objects.annotate(last_race=Max("sprint__id"))
# a list of each runner's last sprint ordered by the the sprint's time,
# we use `select_related` to limit lookup queries later on
>>> results = Sprint.objects.filter(id__in=[runner.last_race for runner in runners])
... .order_by("time")
... .select_related("runner")
# grab the first result
>>> first_result = results[0]
# you can access the runner's details via `.runner`, e.g. `first_result.runner.name`
>>> isinstance(first_result.runner, Runner)
True
# this should only ever execute 2 queries, no matter what you do with the results
>>> from django.db import connection
>>> len(connection.queries)
2
This is pretty fast and will still utilize the databases's indices and caching.
A few thousand records isn't all that much, this should work pretty well for those kinds of numbers. If you start running into problems, I suggest you bite the bullet and use raw SQL.
def view_name(request):
spr = Sprint.objects.values('runner', flat=True).order_by(-created).distinct()
runners = []
for s in spr:
latest_sprint = Sprint.objects.filter(runner=s.runner).order_by(-created)[:1]
for latest in latest_sprint:
runners.append({'runner': s.runner, 'time': latest.time})
return render(request, 'page.html', {
'runners': runners,
})
{% for runner in runners %}
{{runner.runner}} - {{runner.time}}
{% endfor %}

How to 'bulk update' with Django?

I'd like to update a table with Django - something like this in raw SQL:
update tbl_name set name = 'foo' where name = 'bar'
My first result is something like this - but that's nasty, isn't it?
list = ModelClass.objects.filter(name = 'bar')
for obj in list:
obj.name = 'foo'
obj.save()
Is there a more elegant way?
Update:
Django 2.2 version now has a bulk_update.
Old answer:
Refer to the following django documentation section
Updating multiple objects at once
In short you should be able to use:
ModelClass.objects.filter(name='bar').update(name="foo")
You can also use F objects to do things like incrementing rows:
from django.db.models import F
Entry.objects.all().update(n_pingbacks=F('n_pingbacks') + 1)
See the documentation.
However, note that:
This won't use ModelClass.save method (so if you have some logic inside it won't be triggered).
No django signals will be emitted.
You can't perform an .update() on a sliced QuerySet, it must be on an original QuerySet so you'll need to lean on the .filter() and .exclude() methods.
Consider using django-bulk-update found here on GitHub.
Install: pip install django-bulk-update
Implement: (code taken directly from projects ReadMe file)
from bulk_update.helper import bulk_update
random_names = ['Walter', 'The Dude', 'Donny', 'Jesus']
people = Person.objects.all()
for person in people:
r = random.randrange(4)
person.name = random_names[r]
bulk_update(people) # updates all columns using the default db
Update: As Marc points out in the comments this is not suitable for updating thousands of rows at once. Though it is suitable for smaller batches 10's to 100's. The size of the batch that is right for you depends on your CPU and query complexity. This tool is more like a wheel barrow than a dump truck.
Django 2.2 version now has a bulk_update method (release notes).
https://docs.djangoproject.com/en/stable/ref/models/querysets/#bulk-update
Example:
# get a pk: record dictionary of existing records
updates = YourModel.objects.filter(...).in_bulk()
....
# do something with the updates dict
....
if hasattr(YourModel.objects, 'bulk_update') and updates:
# Use the new method
YourModel.objects.bulk_update(updates.values(), [list the fields to update], batch_size=100)
else:
# The old & slow way
with transaction.atomic():
for obj in updates.values():
obj.save(update_fields=[list the fields to update])
If you want to set the same value on a collection of rows, you can use the update() method combined with any query term to update all rows in one query:
some_list = ModelClass.objects.filter(some condition).values('id')
ModelClass.objects.filter(pk__in=some_list).update(foo=bar)
If you want to update a collection of rows with different values depending on some condition, you can in best case batch the updates according to values. Let's say you have 1000 rows where you want to set a column to one of X values, then you could prepare the batches beforehand and then only run X update-queries (each essentially having the form of the first example above) + the initial SELECT-query.
If every row requires a unique value there is no way to avoid one query per update. Perhaps look into other architectures like CQRS/Event sourcing if you need performance in this latter case.
Here is a useful content which i found in internet regarding the above question
https://www.sankalpjonna.com/learn-django/running-a-bulk-update-with-django
The inefficient way
model_qs= ModelClass.objects.filter(name = 'bar')
for obj in model_qs:
obj.name = 'foo'
obj.save()
The efficient way
ModelClass.objects.filter(name = 'bar').update(name="foo") # for single value 'foo' or add loop
Using bulk_update
update_list = []
model_qs= ModelClass.objects.filter(name = 'bar')
for model_obj in model_qs:
model_obj.name = "foo" # Or what ever the value is for simplicty im providing foo only
update_list.append(model_obj)
ModelClass.objects.bulk_update(update_list,['name'])
Using an atomic transaction
from django.db import transaction
with transaction.atomic():
model_qs = ModelClass.objects.filter(name = 'bar')
for obj in model_qs:
ModelClass.objects.filter(name = 'bar').update(name="foo")
Any Up Votes ? Thanks in advance : Thank you for keep an attention ;)
To update with same value we can simply use this
ModelClass.objects.filter(name = 'bar').update(name='foo')
To update with different values
ob_list = ModelClass.objects.filter(name = 'bar')
obj_to_be_update = []
for obj in obj_list:
obj.name = "Dear "+obj.name
obj_to_be_update.append(obj)
ModelClass.objects.bulk_update(obj_to_be_update, ['name'], batch_size=1000)
It won't trigger save signal every time instead we keep all the objects to be updated on the list and trigger update signal at once.
IT returns number of objects are updated in table.
update_counts = ModelClass.objects.filter(name='bar').update(name="foo")
You can refer this link to get more information on bulk update and create.
Bulk update and Create

Aggregate difference between DateTime fields in Django

I have a table containing a series of entries which relate to time periods (specifically, time worked for a client):
task_time:
id | start_time | end_time | client (fk)
1 08/12/2011 14:48 08/12/2011 14:50 2
I am trying to aggregate all the time worked for a given client, from my Django app:
time_worked_aggregate = models.TaskTime.objects.\
filter(client = some_client_id).\
extra(select = {'elapsed': 'SUM(task_time.end_time - task_time.start_time)'}).\
values('elapsed')
if len(time_worked_aggregate) > 0:
time_worked = time_worked_aggregate[0]['elapsed'].total_seconds()
else:
time_worked = 0
This seems inelegant, but it does work. Or at least so I thought: it turns out that it works fine on a PostgreSQL database, but when I move over to SQLite, everything dies.
A bit of digging suggests that the reason for this is that DateTimes aren't first-class data in SQLite. The following raw SQLite query will do my job:
SELECT SUM(strftime('%s', end_time) - strftime('%s', start_time)) FROM task_time WHERE ...;
My question is as follows:
The Python sample above seems roundabout. Can we do this more elegantly?
More importantly at this stage, can we do it in a way that will work on both Postgres and SQLite? Ideally, I'd like not to be writing raw SQL queries and switching on the database backend that happens to be in place; in general, Django is extremely good at protecting us from this. Does Django have a reasonable abstraction for this operation? If not, what's a sensible way for me to do a conditional switch on the backend?
I should mention for context that the dataset is many thousands of entries; the following is not really practical:
sum([task_time.end_date - task_time.start_date for task_time in models.TaskTime.objects.filter(...)])
Almost the same solution as #andri proposed. In the final result you will get the same data.
ExpressionWrapper - New in Django 1.8.
from datetime import timedelta
from django.db.models import ExpressionWrapper, F, fields
from app.models import MyModel
duration = ExpressionWrapper(F('closed_at') - F('opened_at'), output_field=fields.DurationField())
objects = MyModel.objects.closed().annotate(duration=duration).filter(duration__gt=timedelta(seconds=2))
for obj in objects:
print obj.id, obj.duration, obj.duration.seconds
# sample output
# 807 0:00:57.114017 57
# 800 0:01:23.879478 83
# 804 3:40:06.797188 13206
# 801 0:02:06.786300 126
I think since Django 1.8 we can do better:
I would like just to draw the part with annotation, the further part with aggregation should be straightforward:
from django.db.models import F, Func
SomeModel.objects.annotate(
duration = Func(F('end_date'), F('start_date'), function='age')
)
[more about postgres age function here: http://www.postgresql.org/docs/8.4/static/functions-datetime.html ]
each instance of SomeModel will be anotated with duration field containg time difference, which in python will be a datetime.timedelta() object [more about datetime timedelta here: https://docs.python.org/2/library/datetime.html#timedelta-objects ]
I will do it step by step:
first step:annotate the timedelta
group by and sum timedelta
the code like this:
from django.db.models import Count, Sum, F
times_obj_list = models.TaskTime.objects.annotate(times=F("end_time")-F("start_time"))
groupby_obj_list = times_obj_list.values("client").annotate(cnt=Count("id"),seconds=Sum(times)).order_by()
Django currently only supports aggregates for Min, Max, Avg and Count, so using raw SQL is the only way to achieve what you want. When you use raw SQL, database-independence is out the window, so unfortunately, you're out of luck. You'll have to just detect the database and alter the SQL appropriately.

Can Django do nested queries and exclusions

I need some help putting together this query in Django. I've simplified the example here to just cut right to the point.
MyModel(models.Model):
created = models.DateTimeField()
user = models.ForeignKey(User)
data = models.BooleanField()
The query I'd like to create in English would sound like:
Give me every record that was created yesterday for which data is False where in that same range data never appears as True for the given user
Here's an example input/output in case that wasn't clear.
Table Values
ID Created User Data
1 1/1/2010 admin False
2 1/1/2010 joe True
3 1/1/2010 admin False
4 1/1/2010 joe False
5 1/2/2010 joe False
Output Queryset
1 1/1/2010 admin False
3 1/1/2010 admin False
What I'm looking to do is to exclude record #4. The reason for this is because in the given range "yesterday", data appears as True once for the user in record #2, therefore that would exclude record #4.
In a sense, it almost seems like there are 2 queries taking place. One to determine the records in the given range, and one to exclude records which intersect with the "True" records.
How can I do this query with the Django ORM?
You don't need a nested query. You can generate a list of bad users' PKs and then exclude records containing those PKs in the next query.
bad = list(set(MyModel.obejcts.filter(data=True).values_list('user', flat=True)))
# list(set(list_object)) will remove duplicates
# not needed but might save the DB some work
rs = MyModel.objects.filter(datequery).exclude(user__pk__in=bad)
# might not need the pk in user__pk__in - try it
You could condense that down into one line but I think that's as neat as you'll get. 2 queries isn't so bad.
Edit: You might wan to read the docs on this:
http://docs.djangoproject.com/en/dev/ref/models/querysets/#in
It makes it sound like it auto-nests the query (so only one query fires in the database) if it's like this:
bad = MyModel.objects.filter(data=True).values('pk')
rs = MyModel.objects.filter(datequery).exclude(user__pk__in=bad)
But MySQL doesn't optimise this well so my code above (2 full queries) can actually end up running a lot faster.
Try both and race them!
looks like you could use:
from django.db.models import F
MyModel.objects.filter(datequery).filter(data=False).filter(data = F('data'))
F object available from version 1.0
Please, test it, I'm not sure.
Thanks to lazy evaluation, you can break your query up into a few different variables to make it easier to read. Here is some ./manage.py shell play time in the style that Oli already presented.
> from django.db import connection
> connection.queries = []
> target_day_qs = MyModel.objects.filter(created='2010-1-1')
> bad_users = target_day_qs.filter(data=True).values('user')
> result = target_day_qs.exclude(user__in=bad_users)
> [r.id for r in result]
[1, 3]
> len(connection.queries)
1
You could also say result.select_related() if you wanted to pull in the user objects in the same query.