I have a Django search app with a Postgres back-end that is populated with cars. My scripts load on a brand-by-brand basis: let's say a typical mix is 50 Chryslers, 100 Chevys, and 1500 Fords, loaded in that order.
The default ordering is by creation date:
class Car(models.Model):
    name = models.CharField(max_length=500)
    brand = models.ForeignKey(Brand, null=True, blank=True)
    transmission = models.CharField(max_length=50, choices=TRANSMISSIONS)
    created = models.DateField(auto_now_add=True)

    class Meta:
        ordering = ['-created']
My problem is this: typically when the user does a search, say for a red automatic, and let's say that returns 10% of all cars:
results = Car.objects.filter(transmission="automatic", color="red")
the user typically gets hundreds of Fords before any other brand (because the results are ordered by created), which is not a good experience.
I'd like to make sure the brands are as evenly distributed as possible among the early results, without big "runs" of one brand. Any clever suggestions for how to achieve this?
The only idea I have is to use the ? operator with order_by:
results = Car.objects.filter(transmission="automatic", color="red").order_by("?")
This isn't ideal. It's expensive. And it doesn't guarantee a good mix of results, if some brands are much more common than others - so here where Chrysler and Chevy are in the minority, the user is still likely to see lots of Fords first. Ideally I'd show all the Chryslers and Chevys in the first 50 results, nicely mixed in with Fords.
Any ideas on how to achieve a user-friendly ordering? I'm stuck.
What I ended up doing was adding a priority field to the model and assigning a semi-random integer to it.
By semi-random, I mean random within a particular range: 1-100 for Chevy and Chrysler, and 1-500 for Ford.
Then I used that field in order_by.
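A minimal sketch of that approach (assuming Brand has a name field; the priority field name and per-brand ranges below just mirror the description above):

import random

class Car(models.Model):
    name = models.CharField(max_length=500)
    brand = models.ForeignKey(Brand, null=True, blank=True)
    transmission = models.CharField(max_length=50, choices=TRANSMISSIONS)
    created = models.DateField(auto_now_add=True)
    priority = models.IntegerField(default=0, db_index=True)

    class Meta:
        ordering = ['priority']

# In the per-brand load script (ranges are assumptions based on the brand mix above):
PRIORITY_RANGES = {'Chevy': 100, 'Chrysler': 100, 'Ford': 500}

def assign_priorities(brand_name):
    upper = PRIORITY_RANGES.get(brand_name, 500)
    for car in Car.objects.filter(brand__name=brand_name):
        car.priority = random.randint(1, upper)
        car.save(update_fields=['priority'])

Searches such as Car.objects.filter(transmission="automatic", color="red") then pick up the Meta ordering, so the minority brands are spread across the early results instead of being pushed behind the Fords.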
I'm having trouble getting a deeper understanding of indexing and its benefits. Let's assume this model:
class SupportTicket(models.Model):
    content = models.TextField()
    closed_at = models.DateTimeField(null=True, default=None)
To keep it clean, I do not add an is_closed boolean field, as it would be redundant (since closed_at == None implies that the ticket is open). As you can imagine, open tickets will be looked up and queried more frequently, and that's why I would like to optimize this at the database level. I am using Postgres, and my desired effect is to speed up this filter:
active_tickets = SupportTicket.objects.filter(closed_at__isnull=True)
I know that Postgres supports DateTime indexing, but I have no knowledge or experience with null/not-null speed-ups. My guess looks like this:
from django.db.models import Q

class SupportTicket(models.Model):
    class Meta:
        indexes = [
            models.Index(
                name='ticket_open_condition',
                fields=['closed_at'],
                condition=Q(closed_at__isnull=True),
            )
        ]

    content = models.TextField()
    closed_at = models.DateTimeField(null=True, default=None)
but I have no idea if it will speed up the query at all. The table will grow by about 200 tickets a day and will be queried around 10,000 times a day. I know that's not much, but UX (speed) really matters here. I'd be grateful for any suggestions on how to improve this model definition.
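One way to check whether the partial index is actually used is to ask Postgres for the query plan (a sketch, assuming Django 2.1+ where QuerySet.explain() exists):

# Print the plan for the hot query.
print(SupportTicket.objects.filter(closed_at__isnull=True).explain(analyze=True))
# An "Index Scan using ticket_open_condition" line means the partial index is being used;
# a "Seq Scan" usually just means the table is still small enough that the planner ignores it.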
I have the following classes:
class Instance(models.Model):
    products = models.ManyToManyField('Product', blank=True)

class Product(models.Model):
    description = HTMLField(blank=True, null=True)
    short_description = HTMLField(blank=True, null=True)
And this is the form I use to update Instances:
from django.db.models import Count

class InstanceModelForm(InstanceValidatorMixin, UpdateInstanceLastUpdatedMixin, forms.ModelForm):
    products = forms.ModelMultipleChoiceField(
        required=False,
        queryset=Product.objects.annotate(i_count=Count('instance')).order_by('i_count'),
    )

    class Meta:
        model = Instance
My instance-product table is sizable (~1000 rows), and ever since I added the queryset for products I am seeing web requests that time out due to Heroku's 30-second request limit.
My goal is to do something to this queryset such that my users are no longer timing out.
I have the following insights:
Accuracy doesn't matter much to me - it doesn't have to be very accurate. Yes, I would like to sort products by the count of instances each product is linked to, but if it's off by 5 or 10 it doesn't really matter that much.
Limited number of products - when my users are selecting products to link to an instance, they are primarily interested in products with fewer than 10 total linkages to instances. I don't know if a partial query will be accurate, but if this is possible I am open to trying it (see the sketch after this list).
Effort - I know there are frameworks out there that I can install to cache many things. I am looking for something lightweight that takes less than an hour to get up and running.
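A hedged sketch of the partial-query idea from the second point above (the cutoff of 10 comes from that point; the variable name is illustrative):

from django.db.models import Count

# Only offer products with fewer than 10 linked instances; this also
# keeps the rendered widget much smaller.
limited_products = (Product.objects
                    .annotate(i_count=Count('instance'))
                    .filter(i_count__lt=10)
                    .order_by('i_count'))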
First I would want to ensure that the performance issue actually comes from the query. I've tried to reproduce your problem:
>>> Instance.objects.count()
102499
>>> Product.objects.count()
1000
>>> sum(p.instance_set.count() for p in Product.objects.all())/Product.objects.count()
273.084
>>> list(Product.objects.annotate(i_count=Count('instance')).order_by('i_count'))
[...]
>>> from django.db import connection
>>> connection.queries[-1]
{'sql': 'SELECT "products_product"."id", "products_product"."description", "products_product"."short_description", COUNT("products_instance_products"."instance_id") AS "i_count" FROM "products_product" LEFT OUTER JOIN "products_instance_products" ON ("products_product"."id" = "products_instance_products"."product_id") GROUP BY "products_product"."id", "products_product"."description", "products_product"."short_description" ORDER BY "i_count" ASC', 'time': '0.189'}
By accident, I created a dataset that is probably quite a bit bigger than yours. As you can see, I have 1000 Products with an average of ~273 related Instances, but the query still takes less than a second (both on SQLite and PostgreSQL).
Use a one-off dyno with heroku run bash and check if you get the same numbers.
My guess is that your performance issues are caused by either:
- an n+1 query, where an extra query is made for each Product, e.g. in your Product.__str__ method, or
- the actual rendering of the MultipleChoiceField. By default, it renders as a <select> with an <option> for each Product. This can be quite slow, and even if it weren't, it would be pretty inconvenient to use. You might want to use a different widget, like django-select2.
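If you go the django-select2 route, here is a rough sketch (assuming django-select2 is installed; the widget and argument names follow its documented API, and the search_fields choice below is an assumption):

from django.db.models import Count
from django_select2.forms import ModelSelect2MultipleWidget

class InstanceModelForm(InstanceValidatorMixin, UpdateInstanceLastUpdatedMixin, forms.ModelForm):
    products = forms.ModelMultipleChoiceField(
        required=False,
        queryset=Product.objects.annotate(i_count=Count('instance')).order_by('i_count'),
        # Renders an AJAX-backed autocomplete instead of one huge <select>.
        widget=ModelSelect2MultipleWidget(
            model=Product,
            search_fields=['description__icontains'],
        ),
    )

    class Meta:
        model = Instance
        fields = ['products']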
I've got Django 1.8.5 and Python 3.4.3, and I'm trying to create a subquery that constrains my main data set - but the subquery itself (I think) needs a join in it. Or maybe there is a better way to do it.
Here's a trimmed down set of models:
class Lot(models.Model):
    lot_id = models.CharField(max_length=200, unique=True)

class Lot_Country(models.Model):
    lot = models.ForeignKey(Lot)
    country = CountryField()

class Discrete(models.Model):
    discrete_id = models.CharField(max_length=200, unique=True)
    master_id = models.ForeignKey(Inventory_Master)
    location = models.ForeignKey(Location)
    lot = models.ForeignKey(Lot)
I am filtering on various attributes of Discrete (which is discrete supply), and I want to go "up" through Lot and over to Lot_Country; that is, I only want to get rows from Discrete if the Lot associated with that row has an entry in Lot_Country for my appropriate country (let's say US).
I've tried something like this:
oklots=list(Lot_Country.objects.filter(country='US'))
But, first of all, that gives me the str representation back, which I don't really want (I changed it to be lot_id, but that's a hack).
What's the best way to constrain Discrete through Lot and over to Lot_Country? In SQL I would just join in the subquery (or even in the main query - maybe that's what I need? I guess I don't know how to join up to a parent then down into that parent's other child...)
Thanks in advance for your help.
I'm not sure what you mean by "it gives me the str back"... Lot_Country.objects.filter(country='US') will return a queryset. Of course if you print it in your console, you will see a string.
I also think your models need refactoring. The way you have currently defined them, you can associate multiple Lot_Countrys with one Lot, and a country can only be associated with one lot.
If I understand your general model correctly that isn't what you want - you want to associate multiple Lots with one Lot_Country. To do that you need to reverse your foreign key relationship (i.e., put it inside the Lot).
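A minimal sketch of that refactor (the field name lot_country is an assumption):

class Lot_Country(models.Model):
    country = CountryField()

class Lot(models.Model):
    lot_id = models.CharField(max_length=200, unique=True)
    lot_country = models.ForeignKey(Lot_Country)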
Then, for fetching all the Discrete lots that are in a given country, you would do:
discretes_in_us = Discrete.objects.filter(lot__lot_country__country='US')
Which will give you a queryset of all Discretes whose Lot is in the US.
These are simplified models to demonstrate my problem:
class User(models.Model):
    username = models.CharField(max_length=30)
    total_readers = models.IntegerField(default=0)

class Book(models.Model):
    author = models.ForeignKey(User)
    title = models.CharField(max_length=100)

class Reader(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
So, we have Users, Books and Readers (Users, who have read a Book). Thus, Reader is basically a many-to-many relationship between Book and User.
Now let's say the current user reads a book. I'd like to update the number of total readers for all books by this book's author:
# get the book (as an example pk=1)
book = Book.objects.get(pk=1)
# save Reader object for this user and this book
Reader(user=request.user, book=book).save()
# count and save the total number of readers for this author in all his books
book.author.total_readers = Reader.objects.filter(book__author=book.author).count()
book.author.save()
By doing so, Django creates a LEFT OUTER JOIN query for PostgreSQL and we get the expected result. However, the database tables are huge and this has become a bottleneck.
In this example, we could simply increase the total_readers by one on each view, instead of actually counting the database rows. However, this is just a simplified model structure and we cannot do this in reality here.
What I can do is create another field in the Reader model called book_author_id. Thus, I denormalize the data and can count the Reader objects without PostgreSQL having to make the LEFT OUTER JOIN with the User table.
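A minimal sketch of that denormalization, in the question's own terms (keeping book_author_id in sync on save is omitted here):

class Reader(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
    # Redundant copy of book.author_id so the count can skip the join.
    book_author_id = models.IntegerField(db_index=True)

# The count then becomes a single-table query:
# Reader.objects.filter(book_author_id=book.author_id).count()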
Finally, here's my question: Is it possible to create some sort of database index, so that PostgreSQL handles this denormalization automatically? Or do I really have to create this additional model field and redundantly store the author's PK in there?
EDIT - to point out the essential question: I got several great answers, which work for a lot of scenarios. However, they don't solve this actual problem. The only thing I'd like to know, is if it's possible to have PostgreSQL handle such a denormalization automatically - e.g. by creating some sort of database index.
Sometimes this kind of query can serve you better:
book.author.total_readers = Reader.objects.filter(book__in=Book.objects.filter(author=book.author)).count()
That will generate a query with a sub-query, which sometimes has better performance than a query with a join. You can even go further and end up with two separate queries:
book.author.total_readers = Reader.objects.filter(book_id__in=Book.objects.filter(author=book.author).values_list('id', flat=True)).count()
That will generate two queries: one will retrieve the list of all book IDs for that author, and the second will retrieve the count of reads for books whose ID is in that list.
A good solution may also be to create a batch task that runs, for example, once per hour and counts up all reads, but that way you won't have a live-refreshing count of reads.
You can also create a Celery task that runs just after a read is created to generate the new value for the author. That way you won't have a long response time, and the delay between creating a read and counting it won't be long.
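A minimal sketch of the Celery idea, assuming a standard Celery setup (the task name is an assumption):

from celery import shared_task

@shared_task
def refresh_total_readers(author_id):
    # Recount outside the request/response cycle.
    total = Reader.objects.filter(book__author_id=author_id).count()
    User.objects.filter(pk=author_id).update(total_readers=total)

# In the view, right after Reader(user=request.user, book=book).save():
# refresh_total_readers.delay(book.author_id)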
It's always way better to solve bottlenecks of this sort with good design and maybe a little bit of caching rather than duplicating data in the way you suggest. The total_readers field is data you should generate instead of recording.
class User(models.Model):
    username = models.CharField(max_length=30)

    @property
    def total_readers(self):
        # caching_client stands in for whatever cache you use,
        # e.g. django.core.cache.cache
        cached_value = caching_client.get("readers_" + self.username, None)
        if cached_value is None:
            cached_value = self.readers()
            caching_client.set("readers_" + self.username, cached_value)
        return cached_value

    def readers(self):
        return Reader.objects.filter(book__author=self).count()
There are libraries that do the caching via decorators, but I felt it was a pattern you would benefit from seeing expressly. You can also attach a TTL to the cache so that you ensure the value can't be wrong for longer than the TTL. You can also regenerate the cache upon creation of a Reader object.
You might actually get some mileage out of declaring an m2m and defining through relationships, but I have no experience with it.
Imagine a hostel keeping track of whether or not a room is available on any given night. In addition, if a party of more than 1 guest is looking for a room, they will only want a room with at least that many beds available.
Given a date range, I would like to find rooms that are available and have at least the number of beds as there are guests (along with other filtering).
How can I go about that without effectively chaining ANDs with .filters? (Which is how it works now - and is making my database very sad.)
I'm certainly open to a different scheme for storing the availability data if needed too.
Thanks! (Hypothetical classes below to give a better sense of the problem.)
class Room(models.Model):
    name = models.CharField(max_length=100)

class RoomAvailability(models.Model):
    room = models.ForeignKey(Room)
    date = models.DateField()
    beds = models.IntegerField(default=1)
from django.db.models import Sum

available_rooms = (Room.objects
                   .filter(roomavailability__date__range=(start_date, end_date))
                   .values('roomavailability__date', 'pk')
                   .annotate(sum=Sum('roomavailability__beds'))
                   .filter(sum__gte=min_beds))
Update: I forgot that we need room availability per day. This query returns the available dates along with each room's PK.
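As a follow-up, here is a sketch of one way to require enough beds on every night of the stay, under the assumption that each night in the inclusive range has its own RoomAvailability row (names are illustrative):

from django.db.models import Count

num_nights = (end_date - start_date).days + 1  # inclusive date range

fully_available_rooms = (Room.objects
                         # Both conditions must hold on the same RoomAvailability row.
                         .filter(roomavailability__date__range=(start_date, end_date),
                                 roomavailability__beds__gte=min_beds)
                         # Keep only rooms with a qualifying row for every night.
                         .annotate(nights=Count('roomavailability__date', distinct=True))
                         .filter(nights=num_nights))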