I have the following classes:

class Instance(models.Model):
    products = models.ManyToManyField(Product, blank=True)

class Product(models.Model):
    description = HTMLField(blank=True, null=True)
    short_description = HTMLField(blank=True, null=True)
And this is the form that I use to update Instances:

class InstanceModelForm(InstanceValidatorMixin, UpdateInstanceLastUpdatedMixin, forms.ModelForm):
    class Meta:
        model = Instance

    products = forms.ModelMultipleChoiceField(
        required=False,
        queryset=Product.objects.annotate(i_count=Count('instance')).order_by('i_count'),
    )
My instance-product table is sizable (~1000 rows), and ever since I added the queryset for products I am seeing web requests time out against Heroku's 30-second request limit.
My goal is to do something to this queryset such that my users are no longer timing out.
I have the following insights:
Accuracy doesn't matter as much to me - I would like to sort products by the count of instances each product is linked to, but if that count is off by 5 or 10 it doesn't really matter.
Limited number of products - when my users are selecting products to link to an instance, they are primarily interested in products with fewer than 10 total linkages to instances. I don't know if a partial query would be accurate, but if it is possible I am open to trying it.
Effort - I know there are frameworks out there that I can install to cache many things. I am looking for something lightweight that takes less than an hour to get up and running.
First I would want to ensure that the performance issue actually comes from the query. I've tried to reproduce your problem:
>>> Instance.objects.count()
102499
>>> Product.objects.count()
1000
>>> sum(p.instance_set.count() for p in Product.objects.all())/Product.objects.count()
273.084
>>> list(Product.objects.annotate(i_count=Count('instance')).order_by('i_count'))
[...]
>>> from django.db import connection
>>> connection.queries[-1]
{'sql': 'SELECT "products_product"."id", "products_product"."description", "products_product"."short_description", COUNT("products_instance_products"."instance_id") AS "i_count" FROM "products_product" LEFT OUTER JOIN "products_instance_products" ON ("products_product"."id" = "products_instance_products"."product_id") GROUP BY "products_product"."id", "products_product"."description", "products_product"."short_description" ORDER BY "i_count" ASC', 'time': '0.189'}
By accident, I created a dataset that is probably quite a bit bigger than yours. As you can see, I have 1000 Products with an average of ~273 related Instances, but the query still takes less than a second (both on SQLite and PostgreSQL).
Use a one-off dyno with heroku run bash and check if you get the same numbers.
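For example, something along these lines (exact prompts and module paths will vary; timing is done with time.monotonic() here, since connection.queries is only recorded when DEBUG is enabled):

$ heroku run bash
~ $ python manage.py shell
>>> import time
>>> from django.db.models import Count
>>> from yourapp.models import Product  # replace 'yourapp' with your app's module path
>>> start = time.monotonic()
>>> _ = list(Product.objects.annotate(i_count=Count('instance')).order_by('i_count'))
>>> time.monotonic() - start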
My guess is that your performance issues are either caused by
an n+1 query, where an extra query is made for each Product, e.g. in your Product.__str__ method.
the actual rendering of the MultipleChoiceField. By default, it renders as a <select> with an <option> for each Product. This can be quite slow, and even if it weren't, it would be pretty inconvenient to use. You might want to use a different widget, like django-select2.
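If the widget turns out to be the problem, the form could look roughly like this. This is only a sketch: it assumes django-select2 is installed and its URLs and cache are configured as described in its documentation, the two mixins from the original form are omitted for brevity, and 'description__icontains' is just an example search lookup.

from django import forms
from django.db.models import Count
from django_select2.forms import ModelSelect2MultipleWidget
# Instance and Product are the models from the question; import them from your app.

class InstanceModelForm(forms.ModelForm):
    class Meta:
        model = Instance
        fields = ['products']

    products = forms.ModelMultipleChoiceField(
        required=False,
        queryset=Product.objects.annotate(i_count=Count('instance')).order_by('i_count'),
        # Renders an autocomplete that fetches matching products over AJAX,
        # instead of building 1000 <option> elements on every page load.
        widget=ModelSelect2MultipleWidget(
            model=Product,
            search_fields=['description__icontains'],
        ),
    )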
I have a model that has some fields like:
current_datetime = models.TimeField(auto_now_add=True)
new_datetime = models.DateTimeField(null=True, db_index=True)
and the data would look like:
current_datetime = 2023-01-22T09:42:00+0330, new_datetime = 2023-01-22T09:00:00+0330
current_datetime = 2023-01-22T09:52:00+0330, new_datetime = 2023-01-22T09:00:00+0330
current_datetime = 2023-01-22T10:02:00+0330, new_datetime = 2023-01-22T10:00:00+0330
Is it possible for new_datetime to have db_index=True?
The reason I want this index is that there are many rows (more than 200,000, with more added every day), and there is a page where the user can choose a datetime range and see the results (it's a statistical website). I want to run a query filtered on that datetime range, so it should be fast. By the way, I am using PostgreSQL.
Also, if you have tips for handling data for websites like this, I would be glad to hear them.
Thanks.
Yes, it is possible to set db_index=True on a DateTimeField, and it can improve the performance of queries that sort or filter by that field.
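For illustration, a minimal sketch (the model name is hypothetical, since the question doesn't give one, and start/end are assumed to be timezone-aware datetimes taken from the user's chosen range):

from django.db import models

class Record(models.Model):  # hypothetical model name
    new_datetime = models.DateTimeField(null=True, db_index=True)

# The user-selected range filter; with db_index=True, PostgreSQL can serve this
# with an index scan instead of reading every row.
results = Record.objects.filter(new_datetime__range=(start, end))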
Beyond the index itself, a few other things can help:
Use your database's EXPLAIN command to evaluate the query plan and spot slow steps or missing indexes.
Use limit and offset in your queries so that you only fetch the data you actually need.
Use select_related and prefetch_related in your Django queries to retrieve related data in a single query rather than many.
Use a caching system such as Redis or Memcached to store the results of expensive queries instead of running the same query over and over (see the sketch after this list).
Finally, if there are many rows and old data is rarely needed, consider archiving it in another table or database.
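As an example of the caching point, here is a minimal sketch using Django's cache framework, reusing the hypothetical Record model from the sketch above (it assumes a cache backend such as Redis or Memcached is configured in settings.CACHES; the five-minute timeout is arbitrary):

from django.core.cache import cache

def stats_for_range(start, end):
    # The key is derived from the requested range, so each distinct range is
    # cached separately; repeated requests within five minutes skip the database.
    key = f"stats:{start.isoformat()}:{end.isoformat()}"
    return cache.get_or_set(
        key,
        lambda: list(Record.objects.filter(new_datetime__range=(start, end)).values()),
        timeout=300,
    )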
I have a problem with a deeper understanding of indexing and its benefits. Let's assume such a model:

class SupportTicket(models.Model):
    content = models.TextField()
    closed_at = models.DateTimeField(null=True, default=None)
To keep it clean I do not add an is_closed boolean field, as it would be redundant (since closed_at == None implies that the ticket is open). As you can imagine, open tickets will be looked up far more frequently, which is why I would like to optimize for this at the database level. I am using Postgres, and my desired effect is to speed up this filter:
active_tickets = SupportTicket.objects.filter(closed_at__isnull=True)
I know that Postgres supports indexing DateTime columns, but I have no knowledge of or experience with null/not-null speed-ups. My guess looks like this:
class SupportTicket(models.Model):
    class Meta:
        indexes = [
            models.Index(
                name='ticket_open_condition',
                fields=['closed_at'],
                condition=Q(closed_at__isnull=True),
            )
        ]

    content = models.TextField()
    closed_at = models.DateTimeField(null=True, default=None)
But I have no idea whether it will speed up the query at all. The db will grow by about 200 tickets a day and will be queried around 10,000 times a day. I know that's not much, but UX (speed) really matters here. I would be grateful for any suggestions on how to improve this model definition.
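One lightweight way to check whether a partial index like this actually gets used is QuerySet.explain(), available since Django 2.1; a sketch (the exact plan depends on your PostgreSQL version and data):

# Run in a Django shell: prints PostgreSQL's plan for the "open tickets" query.
# With the partial index in place you would hope to see an Index Scan (or Bitmap
# Index Scan) on ticket_open_condition; note that on a very small table Postgres
# may still prefer a sequential scan.
print(SupportTicket.objects.filter(closed_at__isnull=True).explain())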
I have a Django ListView that allows paginating through 'active' People.
The (simplified) models:
class Person(models.Model):
    name = models.CharField()
    # ...
    active_schedule = models.ForeignKey('Schedule', related_name='+', null=True, on_delete=models.SET_NULL)

class Schedule(models.Model):
    field = models.PositiveIntegerField(default=0)
    # ...
    person = models.ForeignKey(Person, related_name='schedules', on_delete=models.CASCADE)
The Person table contains almost 700,000 rows and the Schedule table just over 2,000,000 (on average each Person has 2-3 Schedule records, although many have none and some have a lot more). For an 'active' Person, the active_schedule ForeignKey is set; there are about 5,000 of those at any time.
The ListView is supposed to show all active Persons, sorted by field on Schedule (and some other conditions that don't seem to matter for this case).
The query then becomes:
Person.objects
.filter(active_schedule__isnull=False)
.select_related('active_schedule')
.order_by('active_schedule__field')
Specifically the order_by on the related field makes this query terribly slow (that is: it takes about a second, which is too slow for a web app).
I was hoping the filter condition would select the 5,000 records, which would then be relatively easy to sort. But when I run explain on this query, it shows that the (Postgres) database is working through many more rows:
Gather Merge  (cost=224316.51..290280.48 rows=565366 width=227)
  Workers Planned: 2
  ->  Sort  (cost=223316.49..224023.19 rows=282683 width=227)
        Sort Key: exampledb_schedule.field
        ->  Parallel Hash Join  (cost=89795.12..135883.20 rows=282683 width=227)
              Hash Cond: (exampledb_person.active_schedule_id = exampledb_schedule.id)
              ->  Parallel Seq Scan on exampledb_person  (cost=0.00..21263.03 rows=282683 width=161)
                    Filter: (active_schedule_id IS NOT NULL)
              ->  Parallel Hash  (cost=67411.27..67411.27 rows=924228 width=66)
                    ->  Parallel Seq Scan on exampledb_schedule  (cost=0.00..67411.27 rows=924228 width=66)
I recently changed the models to be this way. In a previous version I had a model containing just the ~5,000 active Persons, and doing the order_by on that small table was considerably faster! I am hoping to achieve the same speed with the current models.
I tried retrieving just the fields needed for the ListView (using values()), which helps a little, but not much. I also tried setting the related_name on active_schedule and approaching the problem from Schedule, but that makes no difference. I tried putting a db_index on Schedule.field, but that only seems to make things slower. Conditional queries did not help either (although I probably have not tried all possibilities). I'm at a loss.
The SQL statement generated by the ORM query:
SELECT
    "exampledb_person"."id",
    "exampledb_person"."name",
    ...
    "exampledb_person"."active_schedule_id",
    "exampledb_person"."created",
    "exampledb_person"."updated",
    "exampledb_schedule"."id",
    "exampledb_schedule"."person_id",
    "exampledb_schedule"."field",
    ...
    "exampledb_schedule"."created",
    "exampledb_schedule"."updated"
FROM "exampledb_person"
INNER JOIN "exampledb_schedule"
    ON ("exampledb_person"."active_schedule_id" = "exampledb_schedule"."id")
WHERE "exampledb_person"."active_schedule_id" IS NOT NULL
ORDER BY "exampledb_schedule"."field" ASC
(Some fields were left out, for simplicity.)
Is it possible to speed up this query, or should I revert to using a special Model for the active Persons?
EDIT: When I change the query, just for comparison/testing, to sort on an UNindexed field on Person, the query is equally slow. However, if I then add an index to that field, the query is fast! I had to try this, as the SQL statement indeed shows that it's ordering on "exampledb_schedule"."field", a field without an index; but like I said, adding an index on that field makes no difference.
EDIT: I suppose it's also worth noting that a much simpler sort query directly on Schedule, whether on an indexed field or not, is MUCH faster. For instance, for this test I added an index to Schedule.field, and then the following query is blazing fast:
Schedule.objects.order_by('field')
Somewhere in here lies the solution...
The comments by @guarav and my edits pointed me in the direction of the solution, which had been staring me in the face for a while...
The filter clause in my question, filter(active_schedule__isnull=False), seems to invalidate the database indexes. I wasn't aware of this, and had hoped a database expert would point me in this direction.
The solution is to filter on Schedule.field, which is 0 for inactive Person records and >0 for active ones:
Person.objects
.select_related('active_schedule')
.filter(active_schedule__field__gte=1)
.order_by('active_schedule__field')
This query properly uses the indexes and is fast (20ms opposed to ~1000ms).
I have a Django search app with a Postgres back-end that is populated with cars. My scripts load on a brand-by-brand basis: let's say a typical mix is 50 Chryslers, 100 Chevys, and 1500 Fords, loaded in that order.
The default ordering is by creation date:
class Car(models.Model):
    name = models.CharField(max_length=500)
    brand = models.ForeignKey(Brand, null=True, blank=True)
    transmission = models.CharField(max_length=50, choices=TRANSMISSIONS)
    created = models.DateField(auto_now_add=True)

    class Meta:
        ordering = ['-created']
My problem is this: typically when the user does a search, say for a red automatic, and let's say that returns 10% of all cars:
results = Cars.objects.filter(transmission="automatic", color="red")
the user typically gets hundreds of Fords first, before any other brand (because the results are ordered by creation date), which is not a good experience.
I'd like to make sure the brands are as evenly distributed as possible among the early results, without big "runs" of one brand. Any clever suggestions for how to achieve this?
The only idea I have is to use the ? operator with order_by:
results = Cars.objects.filter(transmission="automatic", color="red").order_by("?")
This isn't ideal. It's expensive, and it doesn't guarantee a good mix of results if some brands are much more common than others: here, where Chrysler and Chevy are in the minority, the user is still likely to see lots of Fords first. Ideally I'd show all the Chryslers and Chevys in the first 50 results, nicely mixed in with the Fords.
Any ideas on how to achieve a user-friendly ordering? I'm stuck.
What I ended up doing was adding a priority field on the model, and assigning a semi-random integer to it.
By semi-random, I mean random within a particular range: so 1-100 for Chevy and Chrysler, and 1-500 for Ford.
Then I used that field to order_by.
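A minimal sketch of that approach (the per-brand ranges, the bulk assignment pass, and the db_index are assumptions about how it could be wired up, not code from the answer):

import random

# Added to the Car model, alongside the existing fields:
#     priority = models.PositiveIntegerField(default=0, db_index=True)

# Hypothetical one-off pass after each brand load: common brands get a wider
# range, so rarer brands stay interleaved near the front of the ordering.
RANGES = {"Chrysler": 100, "Chevy": 100, "Ford": 500}

for car in Car.objects.select_related("brand"):
    brand_name = car.brand.name if car.brand else ""
    car.priority = random.randint(1, RANGES.get(brand_name, 100))
    car.save(update_fields=["priority"])

# The search from the question then orders by the semi-random priority:
results = Car.objects.filter(transmission="automatic", color="red").order_by("priority")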
Let's assume I want to show a list of runners ordered by their latest sprint time.

class Runner(models.Model):
    name = models.CharField(max_length=255)

class Sprint(models.Model):
    runner = models.ForeignKey(Runner)
    time = models.PositiveIntegerField()
    created = models.DateTimeField(auto_now_add=True)
This is a quick sketch of what I would do in SQL:
SELECT runner.id, runner.name, sprint.time
FROM runner
LEFT JOIN sprint ON (sprint.runner_id = runner.id)
WHERE
    sprint.id = (
        SELECT sprint_inner.id
        FROM sprint AS sprint_inner
        WHERE sprint_inner.runner_id = runner.id
        ORDER BY sprint_inner.created DESC
        LIMIT 1
    )
    OR sprint.id IS NULL
ORDER BY sprint.time ASC
The Django QuerySet documentation states:
It is permissible to specify a multi-valued field to order the results
by (for example, a ManyToManyField field). Normally this won’t be a
sensible thing to do and it’s really an advanced usage feature.
However, if you know that your queryset’s filtering or available data
implies that there will only be one ordering piece of data for each of
the main items you are selecting, the ordering may well be exactly
what you want to do. Use ordering on multi-valued fields with care and
make sure the results are what you expect.
I guess I need to apply some filter here, but I'm not sure what exactly Django expects...
One note, because it is not obvious in this example: the Runner table will have several hundred entries, the sprints will also number several hundred, and in some later days probably several thousand. The data will be displayed in a paginated view, so sorting in Python is not an option.
The only other possibility I see is writing the SQL myself, but I'd like to avoid that at all costs.
I don't think there's a way to do this via the ORM with only one query; you could grab a list of runners and use annotate to add their latest sprint ids, then filter and order those sprints.
>>> from django.db.models import Max
# all runners now have a `last_race` attribute,
# which is the `id` of the last sprint they ran
>>> runners = Runner.objects.annotate(last_race=Max("sprint__id"))
# a list of each runner's last sprint ordered by the sprint's time,
# we use `select_related` to limit lookup queries later on
>>> results = Sprint.objects.filter(id__in=[runner.last_race for runner in runners])
... .order_by("time")
... .select_related("runner")
# grab the first result
>>> first_result = results[0]
# you can access the runner's details via `.runner`, e.g. `first_result.runner.name`
>>> isinstance(first_result.runner, Runner)
True
# this should only ever execute 2 queries, no matter what you do with the results
>>> from django.db import connection
>>> len(connection.queries)
2
This is pretty fast and will still utilize the database's indices and caching.
A few thousand records isn't all that much, so this should work pretty well for those kinds of numbers. If you start running into problems, I suggest you bite the bullet and use raw SQL.
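If it does come to that, the SQL from the question can be fed through the ORM's raw() method so the results still come back as Runner instances. A sketch only: the table names below are the shorthand ones from the question's SQL; in a real project they would be the Django-generated names (e.g. appname_runner, appname_sprint) unless db_table says otherwise.

raw_sql = """
    SELECT runner.id, runner.name, sprint.time AS last_time
    FROM runner
    LEFT JOIN sprint ON (sprint.runner_id = runner.id)
    WHERE sprint.id = (
            SELECT sprint_inner.id
            FROM sprint AS sprint_inner
            WHERE sprint_inner.runner_id = runner.id
            ORDER BY sprint_inner.created DESC
            LIMIT 1
        )
        OR sprint.id IS NULL
    ORDER BY sprint.time ASC
"""

# Each Runner in the result gets an extra .last_time attribute from the SELECT list.
runners = Runner.objects.raw(raw_sql)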
from django.shortcuts import render

def view_name(request):
    # Distinct ids of runners that have at least one sprint
    runner_ids = Sprint.objects.values_list('runner', flat=True).distinct()
    runners = []
    for runner_id in runner_ids:
        latest = (Sprint.objects.filter(runner_id=runner_id)
                  .select_related('runner')
                  .order_by('-created')
                  .first())
        if latest is not None:
            runners.append({'runner': latest.runner, 'time': latest.time})
    return render(request, 'page.html', {
        'runners': runners,
    })
{% for runner in runners %}
{{runner.runner}} - {{runner.time}}
{% endfor %}