Django: Ordering a QuerySet based on a latest child models field


Let's assume I want to show a list of runners ordered by their latest sprint time.
class Runner(models.Model):
    name = models.CharField(max_length=255)

class Sprint(models.Model):
    runner = models.ForeignKey(Runner)
    time = models.PositiveIntegerField()
    created = models.DateTimeField(auto_now_add=True)
This is a quick sketch of what I would do in SQL:
SELECT runner.id, runner.name, sprint.time
FROM runner
LEFT JOIN sprint ON (sprint.runner_id = runner.id)
WHERE
    sprint.id = (
        SELECT sprint_inner.id
        FROM sprint AS sprint_inner
        WHERE sprint_inner.runner_id = runner.id
        ORDER BY sprint_inner.created DESC
        LIMIT 1
    )
    OR sprint.id IS NULL
ORDER BY sprint.time ASC
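A minimal sketch for trying the query above without a Django project, using Python's built-in sqlite3 module. The sample rows are hypothetical, and the no-sprint case is written as `IS NULL` because a comparison with `= NULL` never matches in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runner (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sprint (
        id INTEGER PRIMARY KEY,
        runner_id INTEGER REFERENCES runner(id),
        time INTEGER,
        created TEXT
    );
    -- Hypothetical sample data, not from the original post.
    INSERT INTO runner VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO sprint VALUES
        (1, 1, 12, '2024-01-01'),
        (2, 1, 15, '2024-02-01'),
        (3, 2, 11, '2024-01-15');
""")

LATEST_SPRINT_SQL = """
    SELECT runner.id, runner.name, sprint.time
    FROM runner
    LEFT JOIN sprint ON (sprint.runner_id = runner.id)
    WHERE sprint.id = (
            SELECT sprint_inner.id
            FROM sprint AS sprint_inner
            WHERE sprint_inner.runner_id = runner.id
            ORDER BY sprint_inner.created DESC
            LIMIT 1
        )
        OR sprint.id IS NULL
    ORDER BY sprint.time ASC
"""

# One row per runner: Ann's older sprint (time 12) is filtered out,
# and Cy, who has no sprint at all, is kept with a NULL time.
rows = conn.execute(LATEST_SPRINT_SQL).fetchall()
for row in rows:
    print(row)
```

Note that SQLite sorts NULL before every other value in ascending order, so the runner without a sprint comes first here; other databases may place it last.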
The Django QuerySet documentation states:
It is permissible to specify a multi-valued field to order the results
by (for example, a ManyToManyField field). Normally this won’t be a
sensible thing to do and it’s really an advanced usage feature.
However, if you know that your queryset’s filtering or available data
implies that there will only be one ordering piece of data for each of
the main items you are selecting, the ordering may well be exactly
what you want to do. Use ordering on multi-valued fields with care and
make sure the results are what you expect.
I guess I need to apply some filter here, but I'm not sure what exactly Django expects...
One note, because it is not obvious in this example: the Runner table will have several hundred entries, and the sprint table will start with several hundred and probably grow to several thousand. The data will be displayed in a paginated view, so sorting in Python is not an option.
The only other possibility I see is writing the SQL myself, but I'd like to avoid this at all costs.

I don't think there's a way to do this via the ORM with only one query. You could grab a list of runners, use annotate to add their latest sprint ids, then filter and order those sprints.
>>> from django.db.models import Max
# all runners now have a `last_race` attribute,
# which is the `id` of the last sprint they ran
>>> runners = Runner.objects.annotate(last_race=Max("sprint__id"))
# a list of each runner's last sprint ordered by the sprint's time,
# we use `select_related` to limit lookup queries later on
>>> results = (Sprint.objects.filter(id__in=[runner.last_race for runner in runners])
...     .order_by("time")
...     .select_related("runner"))
# grab the first result
>>> first_result = results[0]
# you can access the runner's details via `.runner`, e.g. `first_result.runner.name`
>>> isinstance(first_result.runner, Runner)
True
# this should only ever execute 2 queries, no matter what you do with the results
>>> from django.db import connection
>>> len(connection.queries)
2
This is pretty fast and will still utilize the database's indices and caching.
A few thousand records isn't all that much, so this should work pretty well for those kinds of numbers. If you start running into problems, I suggest you bite the bullet and use raw SQL.

from django.shortcuts import render

def view_name(request):
    # Distinct ids of the runners that have at least one sprint.
    runner_ids = Sprint.objects.values_list('runner', flat=True).distinct()
    runners = []
    for runner_id in runner_ids:
        # One extra query per runner: the latest sprint by creation time.
        latest_sprint = (Sprint.objects.filter(runner=runner_id)
                         .select_related('runner')
                         .order_by('-created')[:1])
        for latest in latest_sprint:
            runners.append({'runner': latest.runner, 'time': latest.time})
    return render(request, 'page.html', {
        'runners': runners,
    })
{% for runner in runners %}
    {{ runner.runner }} - {{ runner.time }}
{% endfor %}

Related

How can I make this queryset more tolerable?

I have the following class:
class Instance(models.Model):
    products = models.ManyToManyField(Product, blank=True)

class Product(models.Model):
    description = HTMLField(blank=True, null=True)
    short_description = HTMLField(blank=True, null=True)
And this form that I use to update Instances
class InstanceModelForm(InstanceValidatorMixin, UpdateInstanceLastUpdatedMixin, forms.ModelForm):
    class Meta:
        model = Instance

    products = forms.ModelMultipleChoiceField(required=False, queryset=Product.objects.annotate(i_count=Count('instance')).order_by('i_count'))
My instance-product table is sizable (~1000 rows), and ever since I added the queryset for products I have been seeing web requests time out due to Heroku's 30-second request limit.
My goal is to change this queryset so that my users no longer hit timeouts.
I have the following insights:
Accuracy doesn't matter much - Yes, I would like to sort products by the number of instances each one is linked to, but if the count is off by 5 or 10 it doesn't really matter.
Limited number of products - When my users select products to link to an instance, they are primarily interested in products with fewer than 10 total links to instances. I don't know if a partial query would be accurate, but if this is possible I am open to trying it.
Effort - I know there are frameworks I can install to cache many things. I am looking for something lightweight that takes less than 1 hour to get up and running.

First I would want to ensure that the performance issue actually comes from the query. I've tried to reproduce your problem:
>>> Instance.objects.count()
102499
>>> Product.objects.count()
1000
>>> sum(p.instance_set.count() for p in Product.objects.all())/Product.objects.count()
273.084
>>> list(Product.objects.annotate(i_count=Count('instance')).order_by('i_count'))
[...]
>>> from django.db import connection
>>> connection.queries[-1]
{'sql': 'SELECT "products_product"."id", "products_product"."description", "products_product"."short_description", COUNT("products_instance_products"."instance_id") AS "i_count" FROM "products_product" LEFT OUTER JOIN "products_instance_products" ON ("products_product"."id" = "products_instance_products"."product_id") GROUP BY "products_product"."id", "products_product"."description", "products_product"."short_description" ORDER BY "i_count" ASC', 'time': '0.189'}
By accident, I created a dataset that is probably quite a bit bigger than yours. As you can see, I have 1000 Products with an average of ~273 related Instances, but the query still takes less than a second (both on SQLite and PostgreSQL).
Use a one-off dyno with heroku run bash and check if you get the same numbers.
My guess is that your performance issues are caused by one of the following:
an n+1 query, where an extra query is made for each Product, e.g. in your Product.__str__ method.
the actual rendering of the MultipleChoiceField. By default, it renders as a <select> with an <option> for each Product. This can be quite slow, and even if it weren't, it would be pretty inconvenient to use. You might want to use a different widget, like django-select2.
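To make the n+1 pattern concrete, here is a small sketch that counts statements with the standard-library sqlite3 module instead of Django. The `category` table and the label format are assumptions made up for the illustration (the models in the question have no such relation); the point is only the difference between one query per row and one joined query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: each product points at one category.
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        category_id INTEGER REFERENCES category(id)
    );
    INSERT INTO category VALUES (1, 'tools'), (2, 'toys');
    INSERT INTO product VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 2);
""")

executed = []
# Record every statement SQLite runs from here on.
conn.set_trace_callback(executed.append)

def select_count():
    return sum(1 for sql in executed if sql.lstrip().upper().startswith("SELECT"))

# n+1 pattern: one query for the products, then one more per product,
# which is what a __str__ that touches a related object would trigger.
labels_slow = []
for product_id, category_id in conn.execute(
        "SELECT id, category_id FROM product ORDER BY id").fetchall():
    (name,) = conn.execute(
        "SELECT name FROM category WHERE id = ?", (category_id,)).fetchone()
    labels_slow.append(f"{product_id}: {name}")
n_plus_one_queries = select_count()

executed.clear()
# Single joined query, roughly what select_related() produces in the ORM.
labels_fast = [
    f"{product_id}: {name}"
    for product_id, name in conn.execute(
        "SELECT product.id, category.name FROM product "
        "JOIN category ON category.id = product.category_id "
        "ORDER BY product.id")
]
joined_queries = select_count()

print(n_plus_one_queries, joined_queries)
```

In a Django project the same check can be done by comparing `len(connection.queries)` before and after rendering the form, as in the session above.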

Django ORM: django aggregate over filtered reverse relation

The question is remotely related to Django ORM: filter primary model based on chronological fields from related model, by further limiting the resulting queryset.
The models
Assuming we have the following models:
class Patient(models.Model):
    name = models.CharField()
    # other fields following

class MedicalFile(models.Model):
    patient = models.ForeignKey(Patient, related_name='files')
    issuing_date = models.DateField()
    expiring_date = models.DateField()
    diagnostic = models.CharField()
The query
I need to select all the files which are valid at a specified date, most likely from the past. The problem that I have here is that for every patient, there will be a small overlapping period where a patient will have 2 valid files. If we're querying for a date from that small timeframe, I need to select only the most recent file.
More to the point: consider patient John Doe. He will have a string of "uninterrupted" files starting in 2012, like this:
+---+------------+-------------+
|ID |issuing_date|expiring_date|
+---+------------+-------------+
|1 |2012-03-06 |2013-03-06 |
+---+------------+-------------+
|2 |2013-03-04 |2014-03-04 |
+---+------------+-------------+
|3 |2014-03-04 |2015-03-04 |
+---+------------+-------------+
As one can easily observe, the validity periods of these files overlap by a couple of days. For instance, on 2013-03-05 files 1 and 2 are both valid, but we consider only file 2 (the most recent one). I'm guessing that the use case isn't special: it is the same as managing subscriptions, where you renew early in order to keep the subscription continuous.
Now, in my application I need to query historical data, e.g. give me all the files which were valid on 2013-03-05, considering only the "most recent" ones. I was able to solve this by using RawSQL, but I would like to have a solution without raw SQL. In the previous question, we were able to filter the "latest" file by aggregation over the reverse relation, something like:
qs = MedicalFile.objects.annotate(latest_file_date=Max('patient__files__issuing_date'))
qs = qs.filter(issuing_date=F('latest_file_date')).select_related('patient')
The problem is that we need to limit the range over which latest_file_date is computed, by filtering against 2013-03-05. But aggregate functions don't run over filtered querysets ...
The "poor" solution
I'm currently doing this via a RawSQL annotation (substitute "app" with your concrete application):
reference_date = datetime.date(year=2013, month=3, day=5)
annotation_latest_issuing_date = {
    'latest_issuing_date': RawSQL('SELECT max(file.issuing_date) '
                                  'FROM <app>_medicalfile file '
                                  'WHERE file.patient_id = <app>_medicalfile.patient_id '
                                  '  AND file.issuing_date <= %s', (reference_date, ))
}
qs = MedicalFile.objects.filter(expiring_date__gt=reference_date, issuing_date__lte=reference_date)
qs = qs.annotate(**annotation_latest_issuing_date).filter(issuing_date=F('latest_issuing_date'))
Written as such, the queryset returns the correct number of records.
Question: how can it be achieved without RawSQL and (already implied) with the same performance level?

You can use id__in and provide your nested filtered queryset (like all files that are valid at the given date).
qs = (MedicalFile.objects
      .filter(id__in=MedicalFile.objects.filter(expiring_date__gt=reference_date, issuing_date__lte=reference_date))
      .order_by('patient__pk', '-issuing_date')
      .distinct('patient__pk'))  # field_name parameter only supported by Postgres
The order_by groups the files by patient, with the latest issuing date first. distinct then retrieves that first file for each patient. However, general care is required when combining order_by and distinct: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.query.QuerySet.distinct
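What the order_by plus distinct combination selects can be sketched in plain Python, without a database. Rows 1 to 3 below mirror the John Doe table from the question; the fourth row and the use of the patient name in place of the patient pk are assumptions for the illustration.

```python
from datetime import date
from itertools import groupby

# (id, patient, issuing_date, expiring_date)
files = [
    (1, "John Doe", date(2012, 3, 6), date(2013, 3, 6)),
    (2, "John Doe", date(2013, 3, 4), date(2014, 3, 4)),
    (3, "John Doe", date(2014, 3, 4), date(2015, 3, 4)),
    (4, "Jane Roe", date(2012, 9, 1), date(2013, 9, 1)),  # hypothetical
]

def latest_valid_files(files, reference_date):
    # filter(expiring_date__gt=reference_date, issuing_date__lte=reference_date)
    valid = [f for f in files if f[2] <= reference_date < f[3]]
    # order_by('patient__pk', '-issuing_date'), done as two stable sorts
    # with the secondary key first.
    valid.sort(key=lambda f: f[2], reverse=True)
    valid.sort(key=lambda f: f[1])
    # distinct('patient__pk'): keep only the first row of each patient group.
    return [next(group) for _, group in groupby(valid, key=lambda f: f[1])]

selected_ids = [f[0] for f in latest_valid_files(files, date(2013, 3, 5))]
print(selected_ids)
```

On 2013-03-05 both file 1 and file 2 are valid for John Doe, and only file 2 survives because it sorts first within his group.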
Edit: Removed single patient dependence from first filter and changed latest to combination of order_by and distinct

Suppose p is a Patient instance.
I think you can do something like:
p.files.filter(issuing_date__lt=some_date, expiring_date__gt=some_date)
See https://docs.djangoproject.com/en/1.9/topics/db/queries/#backwards-related-objects
Or maybe with the Q magic query object...

how to get latest foreign key value in models.py

I have a little problem with getting latest foreign key value in my django app. Here are my two models:
class Stock(models.Model):
    ...

class Dividend(models.Model):
    date = models.DateField('pay date')
    stock = models.ForeignKey(Stock, related_name="dividends")

    class Meta:
        ordering = ["date"]
I would like to get the latest dividend from a stock object, so basically stock.dividends.latest('date'). However, every time I call stock.dividends.latest('date'), it fires an SQL query to get the latest dividend. I call latest() in a for loop over every stock I have, and I would like to avoid these SQL queries. Can I somehow define a new method on the Stock class that would get the latest dividend within the SQL query for the stock object?
I cannot change default ordering from "date" to "-date".
Using select_related('dividends') loads dividends objects with stock, but latest probably uses order_by and it requires sql query anyway. :(
EDIT1: To make more clear what I want, here is an example. Let's say I have 100 symbols in shares.keys():
for stock in Stock.objects.filter(symbol__in=shares.keys()):  # 1 sql query
    latest_dividend = stock.dividends.latest('date')  # 100 sql queries
    ...  # do something with latest dividend
Well and in some cases I might have 500 symbols in shares.keys(). That is why I need to avoid making sql queries on getting latest dividend for stock.

I had the same problem as you, so I tested many Django queries. Finally, I found that we can use this:
Stock.objects.all().annotate(latest_date=Max('dividends__date')).filter(dividends__date=F('latest_date')).values('dividends')

I'm not sure my solution is the best, but here it is (works only with PostgreSQL):
stocks = list(Stock.objects.filter(**something))
dividends = Dividend.objects.filter(
    stock__in=stocks,
).order_by(
    'stock_id',
    '-date'
).distinct(
    'stock_id',
)
dividends_dict = {d.stock_id: d for d in dividends}
for stock in stocks:
    stock.latest_dividend = dividends_dict.get(stock.id)

I'm a little confused by your question. I'm assuming you are trying to access the dividends from your stock object in order to limit your queries to the database. I believe this is the smallest possible number of queries.
stock_options = Stock.objects.get(pk=your_query)
order_options = stock_options.dividends.order_by('-date')[:5]

likeon: Thanks for your answer. But I think I can avoid initializing that large dictionary (I have 5000 stocks and 280 000 dividends). But your list gave me an idea. Your code requires 2 sql queries. Here is my example (EDIT1).
for stock in Stock.objects.filter(symbol__in=shares.keys())\
        .prefetch_related('dividends'):  # 2 sql queries
    latest_dividend = list(stock.dividends.all())[-1]  # 0 sql queries
    ...  # do something with latest_dividend
My code also requires 2 SQL queries, but I do not have to reorder anything or build a list from the stocks and all 280 000 dividends (I only build a list from the current stock's dividends in every iteration). Maybe creating one dict is quicker than creating len(shares.keys()) lists, I'm not sure.
I thought there would be easier solution (avoid creating list/dictionary from dividends), but this is good enough for now. Thanks for answers!

As far as I understand, you can do it this way:
stock.dividends.last()
since first() is implemented in Django like this (last() is the mirror image, using the reversed ordering):
def first(self):
    """Return the first object of a query or None if no match is found."""
    for obj in (self if self.ordered else self.order_by('pk'))[:1]:
        return obj
Also, you can use .latest(*fields, field_name=None) too.

Django ORM: Select items where latest status is `success`

I have this model.
class Item(models.Model):
    name = models.CharField(max_length=128)
An Item gets transferred several times. A transfer can be successful or not.
class TransferLog(models.Model):
    item = models.ForeignKey(Item)
    timestamp = models.DateTimeField()
    success = models.BooleanField(default=False)
How can I query for all Items whose latest TransferLog was successful?
With "latest" I mean ordered by timestamp.
TransferLog Table
Here is a data sample. Here item1 should not be included, since the last transfer was not successful:
ID|item_id|timestamp |success
---------------------------------------
1 | item1 |2014-11-15 12:00:00 | False
2 | item1 |2014-11-15 14:00:00 | True
3 | item1 |2014-11-15 16:00:00 | False
I know how to solve this with a loop in python, but I would like to do the query in the database.

An efficient trick is possible if timestamps in the log are increasing, that is, if the end of a transfer is logged as the timestamp (not the start of the transfer), or if you can at least expect that an older transfer has ended before a newer one starts. Then you can use the TransferLog object with the highest id instead of the one with the highest timestamp.
from django.db.models import Max
qs = TransferLog.objects.filter(
    id__in=TransferLog.objects.values('item')
        .annotate(max_id=Max('id')).values('max_id'),
    success=True)
It makes groups by item_id in the subquery and sends the highest id for every group to the main query, where it is filtered by success of the latest row in the group.
Django compiles this directly into a single query, which is about as good as it gets.
The compiled SQL can be checked with: print(qs.query.get_compiler('default').as_sql())
SELECT L.id, L.item_id, L.timestamp, L.success FROM app_transferlog L
WHERE L.success = true AND L.id IN
( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0 GROUP BY U0.item_id )
(I edited the example result compiled SQL for better readability by replacing many "app_transferlog"."field" by a short alias L.field, by substituting the True parameter directly into SQL and by editing whitespace and parentheses.)
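As a quick check of that SQL, the sketch below runs it against the sample rows from the question using the standard-library sqlite3 module. The `item2` rows are hypothetical additions so that at least one item qualifies, `item_id` is stored as text only to match the sample table, and `success` is compared with 1 because SQLite has no separate boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_transferlog (
        id INTEGER PRIMARY KEY,
        item_id TEXT,
        timestamp TEXT,
        success INTEGER
    );
    INSERT INTO app_transferlog VALUES
        (1, 'item1', '2014-11-15 12:00:00', 0),
        (2, 'item1', '2014-11-15 14:00:00', 1),
        (3, 'item1', '2014-11-15 16:00:00', 0),
        (4, 'item2', '2014-11-15 13:00:00', 0),
        (5, 'item2', '2014-11-15 17:00:00', 1);
""")

# Latest row per item is the one with the highest id; keep it only
# if that row is a success.
rows = conn.execute("""
    SELECT L.id, L.item_id FROM app_transferlog L
    WHERE L.success = 1 AND L.id IN
        (SELECT MAX(U0.id) FROM app_transferlog U0 GROUP BY U0.item_id)
""").fetchall()
print(rows)
```

item1 is dropped because its highest-id row (id 3) failed, even though an earlier transfer (id 2) succeeded.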
It can be improved by adding some example filter and by selecting the related Item in the same query:
kwargs = {} # e.g. filter: kwargs = {'timestamp__gte': ..., 'timestamp__lt':...}
qs = TransferLog.objects.filter(
    id__in=TransferLog.objects.filter(**kwargs).values('item')
        .annotate(max_id=Max('id')).values('max_id'),
    success=True).select_related('item')
Verified how compiled to SQL: print(qs.query.get_compiler('default').as_sql()[0])
SELECT L.id, L.item_id, L.timestamp, L.success, I.id, I.name
FROM app_transferlog L INNER JOIN app_item I ON ( L.item_id = I.id )
WHERE L.success = %s AND L.id IN
( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0
WHERE U0.timestamp >= %s AND U0.timestamp < %s
GROUP BY U0.item_id )
print(qs.query.get_compiler('default').as_sql()[1])
# result
(True, <timestamp_start>, <timestamp_end>)
Useful fields of latest TransferLog and the related Items are acquired by one query:
for logitem in qs:
    item = logitem.item  # the item is still cached in the logitem
    ...
The query can be more optimized according to circumstances, e.g. if you are not interested in the timestamp any more and you work with big data...
Without assumption of increasing timestamps it is really more complicated than a plain Django ORM. My solutions can be found here.
EDIT after it has been accepted:
An exact solution for a dataset with non-increasing timestamps is possible with two queries:
Get the set of ids of the last failed transfers. (A list of failures is used because it is much smaller than the list of successful transfers.)
Iterate over the list of all last transfers. Exclude items found in the list of failed transfers.
This way, queries that would otherwise require custom SQL can be simulated efficiently:
SELECT a_boolean_field_or_expression,
rank() OVER (PARTITION BY parent_id ORDER BY the_maximized_field DESC)
FROM ...
WHERE rank = 1 GROUP BY parent_object_id
I'm thinking about implementing an aggregation function (e.g. Rank(maximized_field)) as an extension for Django with PostgreSQL, if it would be useful.
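For reference, a runnable form of the rank() idea, sketched with the standard-library sqlite3 module (window functions need SQLite 3.25 or newer). The window result cannot be filtered in the same SELECT's WHERE clause, so the ranking goes into a subquery. The table name and rows are hypothetical and deliberately have ids that do not follow timestamp order, which is the case where the MAX(id) shortcut gives a different answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transferlog (
        id INTEGER PRIMARY KEY,
        item_id TEXT,
        timestamp TEXT,
        success INTEGER
    );
    -- Hypothetical rows: ids are NOT in timestamp order.
    INSERT INTO transferlog VALUES
        (1, 'item1', '2014-11-15 12:00:00', 0),
        (2, 'item1', '2014-11-15 16:00:00', 0),
        (3, 'item1', '2014-11-15 14:00:00', 1),
        (4, 'item2', '2014-11-15 17:00:00', 1),
        (5, 'item2', '2014-11-15 13:00:00', 0);
""")

# Exact answer: rank rows per item by timestamp, keep rank 1 if successful.
by_rank = [row[0] for row in conn.execute("""
    SELECT item_id FROM (
        SELECT item_id, success,
               rank() OVER (PARTITION BY item_id ORDER BY timestamp DESC) AS rnk
        FROM transferlog
    ) AS ranked
    WHERE rnk = 1 AND success = 1
    ORDER BY item_id
""")]

# Shortcut from above: treat the highest id as the latest row.
by_max_id = [row[0] for row in conn.execute("""
    SELECT item_id FROM transferlog
    WHERE success = 1
      AND id IN (SELECT MAX(id) FROM transferlog GROUP BY item_id)
    ORDER BY item_id
""")]

print(by_rank, by_max_id)
```

With this data the two approaches disagree, which illustrates why the shortcut is only safe when ids and timestamps increase together.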

Try this:
# your query
items_with_good_translogs = Item.objects.filter(id__in=
    (x.item.id for x in TransferLog.objects.filter(success=True)))
Since you asked "How can I query for all Items whose latest TransferLog was successful?", it is logically easier to follow if you start the query from the Item model.
I used the Q Object which can be useful in places like this. (negation, or, ...)
(x.item.id for x in TransferLog.objects.filter(success=True))
gives the ids of the Items referenced by TransferLogs where success is True.

You will probably have an easier time approaching this from the TransferLog side:
dataset = TransferLog.objects.order_by('item', '-timestamp').distinct('item')
Sadly, that does not weed out the False items, and I can't find a way to apply the filter AFTER the distinct. You can, however, filter afterwards with a Python list comprehension:
dataset = [d.item for d in dataset if d.success]
If you are doing this for log entries within a given time period, it's best to filter before ordering and applying distinct:
dataset = TransferLog.objects.filter(
    timestamp__gt=start,
    timestamp__lt=end
).order_by(
    'item', '-timestamp'
).distinct('item')

If you can modify your models, I actually think you'll have an easier time using ManyToMany relationship instead of explicit ForeignKey -- Django has some built-in convenience methods that will make your querying easier. Docs on ManyToMany are here. I suggest the following model:
class TransferLog(models.Model):
    item = models.ManyToManyField(Item)
    timestamp = models.DateTimeField()
    success = models.BooleanField(default=False)
Then you could do (I know, not a nice single line of code, but I'm trying to be explicit to be clearer):
results = []
for item in Item.objects.all():
    if item.transferlog_set.all().order_by('-timestamp')[0].success:
        results.append(item)
Then your results list will have all the items whose latest transfer was successful. I know, it's still a loop in Python... but perhaps a cleaner loop.

Django QuerySet update performance

Which one would be better for performance?
We take a slice of products, which makes a bulk update impossible.
products = Product.objects.filter(featured=True).order_by("-modified_on")[3:]
for product in products:
    product.featured = False
    product.save()
or (invalid)
for product in products.iterator():
    product.update(featured=False)
I have also tried QuerySet's in lookup, as follows.
Product.objects.filter(pk__in=products).update(featured=False)
This line works fine on SQLite, but it raises the following exception on MySQL, so I couldn't use it.
DatabaseError: (1235, "This version of MySQL doesn't yet support
'LIMIT & IN/ALL/ANY/SOME subquery'")
Edit: Also, the iterator() method causes the query to be re-evaluated, so it is bad for performance.

As @Chris Pratt pointed out in the comments, the second example is invalid because the objects don't have update methods. Your first example will require results+1 queries, since it has to update each object individually. That could be really costly if you have 1000 products. Ideally you want to reduce this to a fixed number of queries if possible.
This is a similar situation to another question:
Django: Cannot update a query once a slice has been taken
That being said, you would have to do it in at least 2 queries, but you have to be a bit sneaky on how to construct the LIMIT...
Using Q objects for complex queries:
# get the IDs we want to exclude
products = Product.objects.filter(featured=True).order_by("-modified_on")[:3]
# flatten them into just a list of ids
ids = products.values_list('id', flat=True)
# Now use the Q object to construct a complex query
from django.db.models import Q
# This builds a list of "AND id NOT EQUAL TO i"
limits = [~Q(id=i) for i in ids]
Product.objects.filter(featured=True, *limits).update(featured=False)
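The two-query shape of this approach can be tried outside Django with the standard-library sqlite3 module. This is only a sketch with a hypothetical `product` table and made-up rows; in the ORM, the same effect as the list of negated Q objects can also be had with `Product.objects.filter(featured=True).exclude(pk__in=list(ids)).update(featured=False)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        featured INTEGER,
        modified_on TEXT
    );
    -- Hypothetical rows: five featured products and one that is not.
    INSERT INTO product VALUES
        (1, 1, '2024-01-01'), (2, 1, '2024-01-02'), (3, 1, '2024-01-03'),
        (4, 1, '2024-01-04'), (5, 1, '2024-01-05'), (6, 0, '2024-01-06');
""")

# Query 1: ids of the three most recently modified featured products.
keep_ids = [row[0] for row in conn.execute(
    "SELECT id FROM product WHERE featured = 1 "
    "ORDER BY modified_on DESC LIMIT 3")]

# Query 2: a single UPDATE for every other featured product.
# The id list is passed as literal parameters, so no LIMIT appears
# inside a subquery (the construct MySQL rejected above).
placeholders = ", ".join("?" for _ in keep_ids)
cursor = conn.execute(
    "UPDATE product SET featured = 0 "
    f"WHERE featured = 1 AND id NOT IN ({placeholders})", keep_ids)
updated = cursor.rowcount

still_featured = [row[0] for row in conn.execute(
    "SELECT id FROM product WHERE featured = 1 ORDER BY id")]
print(keep_ids, updated, still_featured)
```

The cost stays at two statements no matter how many products have to be un-featured.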

In some cases it's acceptable to cache the QuerySet in a list:
products = list(products)
Product.objects.filter(pk__in=products).update(featured=False)
A small optimization with values_list:
products_id = list(products.values_list('id', flat=True))
Product.objects.filter(pk__in=products_id).update(featured=False)
