Converting SQL to something that Django can use

I am working on converting some relatively complex SQL into something that Django can play with. I am trying not to just use the raw SQL, since I think playing with the standard Django toolkit will help me learn more about Django.
I have already managed to break the SQL up into chunks, and am tackling them piecemeal to make things a little easier.
Here is the SQL in question:
SELECT i.year, i.brand, i.desc, i.colour, i.size, i.mpn, i.url,
COALESCE(DATE_FORMAT(i_eta.eta, '%M %Y'),'Unknown')
as eta
FROM i
JOIN i_eta ON i_eta.mpn = i.mpn
WHERE category LIKE 'kids'
ORDER BY i.brand, i.desc, i.colour, FIELD(size, 'xxl','xl','l','ml','m','s','xs','xxs') DESC, size+0, size
Here is what I have (trying to convert line by line):
(grabbed automatically when performing filters)
(have to figure out the Django docs on Coalesce for the syntax)
DB alias: haven't been able to find this yet - it is crucial, since there is a DB view that requires it
already included in the original query
.select_related()?
.filter(category="kids")
.objects.order_by('brand', 'desc', 'colour') - don't know how to deal with SQL's FIELD() function
Any advice would be appreciated!

Here's how I would structure this.
First, I'm assuming your models for i and i_eta look something like this:
class I(models.Model):
    mpn = models.CharField(max_length=30, primary_key=True)
    year = models.CharField(max_length=30)
    brand = models.CharField(max_length=30)
    desc = models.CharField(max_length=100)
    colour = models.CharField(max_length=30)
    size = models.CharField(max_length=3)

class IEta(models.Model):
    i = models.ForeignKey(I, on_delete=models.CASCADE)
    eta = models.DateField()
General thoughts:
To write the coalesce in Django: I would not replace nulls with "Unknown" in the ORM. This is a presentation-layer concern: it should be dealt with in a template.
For date formatting, you can do that in Python (or in the template) rather than in SQL.
Not sure what a DB alias is.
For using multiple tables together, you can use either select_related(), prefetch_related(), or do nothing.
select_related() will perform a join.
prefetch_related() will get the foreign key IDs from the first queryset, then generate a query like SELECT * FROM table WHERE id IN (12, 13, 14).
Doing nothing will work automatically, but has the disadvantage of the N+1 SELECT problem.
I generally prefer prefetch_related().
For customizing the sort order of the size field, you have three options. My preference would be option 1, but any of the three will work.
Denormalize the sort criteria. Add a new field called size_numeric. Override the save() method to populate this field when saving new instances, giving xxl the value 1, xl the value 2, and so on (see the sketch after this list).
Sort in Python. Essentially, you use Python's built-in sorting methods to do the sort, rather than sorting it in the database. This doesn't work well if you have thousands of results.
Invoke the MySQL function. Essentially, using annotate(), you add the output of a function to the queryset. order_by() can sort by that function.
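Here is a minimal sketch of option 1, assuming the models above plus a category field on I (the question filters on category, but it isn't in the model listing):

from django.db import models

SIZE_ORDER = {'xxl': 1, 'xl': 2, 'l': 3, 'ml': 4, 'm': 5, 's': 6, 'xs': 7, 'xxs': 8}

class I(models.Model):
    mpn = models.CharField(max_length=30, primary_key=True)
    # ... other fields as above ...
    size = models.CharField(max_length=3)
    size_numeric = models.IntegerField(default=0)  # denormalized sort key

    def save(self, *args, **kwargs):
        # Keep the sort key in step with the size label on every save.
        self.size_numeric = SIZE_ORDER.get(self.size, len(SIZE_ORDER) + 1)
        super().save(*args, **kwargs)

items = (I.objects
         .filter(category='kids')       # assumed field, see above
         .prefetch_related('ieta_set')  # default reverse name for IEta.i
         .order_by('brand', 'desc', 'colour', 'size_numeric'))

The COALESCE/DATE_FORMAT piece then lives in the template, along the lines of {{ eta|date:"F Y"|default:"Unknown" }} once you have the related IEta in hand.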

Related

is it possible to add db_index = True to a field that is not unique (django)

I have a model that has some fields like:
current_datetime = models.TimeField(auto_now_add=True)
new_datetime = models.DateTimeField(null=True, db_index=True)
and the data would look like:
current_datetime = 2023-01-22T09:42:00+0330, new_datetime = 2023-01-22T09:00:00+0330
current_datetime = 2023-01-22T09:52:00+0330, new_datetime = 2023-01-22T09:00:00+0330
current_datetime = 2023-01-22T10:02:00+0330, new_datetime = 2023-01-22T10:00:00+0330
Is it possible for new_datetime to have db_index=True?
The reason I want this index is that there are many rows (more than 200,000, with more added every day), and there is a page where the user can choose a datetime range and see the results (it's a statistical website). I want to send a query filtered on that datetime range, so it should be fast. By the way, I am using PostgreSQL.
Also, if you have tips for handling data on websites like this, I would be glad to hear them.
Thanks.
Yes, it is possible to set db_index=True on a DateTimeField. This can improve the performance of queries that sort or filter by that field (see the sketch after this list).
Beyond the index itself, a few other things help with queries like this:
To evaluate the query plan and detect any sluggish operations or missing indexes, take advantage of your database's EXPLAIN command.
Use the "limit" and "offset" parameters within your queries to fetch only the necessary data.
For retrieving associated data in a single query rather than numerous queries, use the select_related() and prefetch_related() methods in your Django queries.
To store the outcomes of elaborate queries and avoid running the same query multiple times, make use of a caching system such as Redis or Memcached.
Moreover, if there are too many rows and old data is not required for long periods, you can consider archiving that information in another table or database.
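A minimal sketch of the indexed field and the range filter it serves (the model and variable names here are illustrative, not from the question):

from datetime import datetime, timezone
from django.db import models

class Run(models.Model):
    new_datetime = models.DateTimeField(null=True, db_index=True)

# PostgreSQL can answer this range filter from the B-tree index
# instead of scanning the whole table:
start = datetime(2023, 1, 22, 9, 0, tzinfo=timezone.utc)
end = datetime(2023, 1, 22, 10, 0, tzinfo=timezone.utc)
runs = Run.objects.filter(new_datetime__range=(start, end))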

Solving a slow query with a Foreignkey that has isnull=False and order_by in a Django ListView

I have a Django ListView that allows paginating through 'active' People.
The (simplified) models:
class Person(models.Model):
    name = models.CharField()
    # ...
    active_schedule = models.ForeignKey('Schedule', related_name='+', null=True, on_delete=models.SET_NULL)

class Schedule(models.Model):
    field = models.PositiveIntegerField(default=0)
    # ...
    person = models.ForeignKey(Person, related_name='schedules', on_delete=models.CASCADE)
The Person table contains almost 700,000 rows and the Schedule table contains just over 2,000,000 rows (on average every Person has 2-3 Schedule records, although many have none and a lot have more). For an 'active' Person, the active_schedule ForeignKey is set; there are about 5,000 of these at any time.
The ListView is supposed to show all active Persons, sorted by Schedule.field (and some other conditions, which don't seem to matter for this case).
The query then becomes:
(Person.objects
    .filter(active_schedule__isnull=False)
    .select_related('active_schedule')
    .order_by('active_schedule__field'))
Specifically the order_by on the related field makes this query terribly slow (that is: it takes about a second, which is too slow for a web app).
I was hoping the filter condition would select the 5000 records, which then become relatively easily sortable. But when I run explain on this query, it shows that the (Postgres) database is messing with many more rows:
Gather Merge (cost=224316.51..290280.48 rows=565366 width=227)
Workers Planned: 2
-> Sort (cost=223316.49..224023.19 rows=282683 width=227)
Sort Key: exampledb_schedule.field
-> Parallel Hash Join (cost=89795.12..135883.20 rows=282683 width=227)
Hash Cond: (exampledb_person.active_schedule_id = exampledb_schedule.id)
-> Parallel Seq Scan on exampledb_person (cost=0.00..21263.03 rows=282683 width=161)
Filter: (active_schedule_id IS NOT NULL)
-> Parallel Hash (cost=67411.27..67411.27 rows=924228 width=66)
-> Parallel Seq Scan on exampledb_schedule (cost=0.00..67411.27 rows=924228 width=66)
I recently changed the models to be this way. In a previous version I had a model with just the ~5,000 active Persons in it. Doing the order_by on this small table was considerably faster! I am hoping to achieve the same speed with the current models.
I tried retrieving just the fields needed for the ListView (using values()), which helps a little, but not much. I also tried setting the related_name on active_schedule and approaching the problem from Schedule, but that makes no difference. I tried putting a db_index on Schedule.field, but that seems only to make things slower. Conditional queries also did not help (although I probably have not tried all possibilities). I'm at a loss.
The SQL statement generated by the ORM query:
SELECT
"exampledb_person"."id",
"exampledb_person"."name",
...
"exampledb_person"."active_schedule_id",
"exampledb_person"."created",
"exampledb_person"."updated",
"exampledb_schedule"."id",
"exampledb_schedule"."person_id",
"exampledb_schedule"."field",
...
"exampledb_schedule"."created",
"exampledb_schedule"."updated"
FROM
"exampledb_person"
INNER JOIN
"exampledb_schedule"
ON ("exampledb_person"."active_schedule_id" = "exampledb_schedule"."id")
WHERE
"exampledb_person"."active_schedule_id" IS NOT NULL
ORDER BY
"exampledb_schedule"."field" ASC
(Some fields were left out, for simplicity.)
Is it possible to speed up this query, or should I revert to using a special model for the active Persons?
EDIT: When I change the query, just for comparison/testing, to sort on an unindexed field on Person, the query is equally slow. However, if I then add an index to that field, the query is fast! I had to try this, as the SQL statement indeed shows that it's ordering on "exampledb_schedule"."field" - a field without an index - but like I said: adding an index on that field makes no difference.
EDIT: I suppose it's also worth noting that a much simpler sort query directly on Schedule, whether on an indexed field or not, is MUCH faster. For instance, for this test I've added an index to Schedule.field; then the following query is blazing fast:
Schedule.objects.order_by('field')
Somewhere in here lies the solution...
The comments by #guarav and my edits pointed me in the direction of the solution, which had been staring me in the face for a while...
The filter clause in my question - filter(active_schedule__isnull=False) - seems to prevent the database from using the indexes. I wasn't aware of this, and had hoped a database expert would point me in this direction.
The solution is to filter on Schedule.field, which is 0 for inactive Person records and >0 for active ones:
(Person.objects
    .select_related('active_schedule')
    .filter(active_schedule__field__gte=1)
    .order_by('active_schedule__field'))
This query properly uses the indexes and is fast (20ms as opposed to ~1000ms).
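One possible refinement, not from the original thread: on PostgreSQL with Django 2.2+, a partial index can be declared to match exactly this filter-plus-order pattern (the index name below is illustrative):

from django.db import models
from django.db.models import Q

class Schedule(models.Model):
    field = models.PositiveIntegerField(default=0)
    person = models.ForeignKey('Person', related_name='schedules', on_delete=models.CASCADE)

    class Meta:
        indexes = [
            # Partial B-tree index covering only the ~5,000 active rows.
            models.Index(fields=['field'],
                         name='schedule_active_field_idx',
                         condition=Q(field__gte=1)),
        ]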

django subquery with a join in it

I've got Django 1.8.5 and Python 3.4.3, and I'm trying to create a subquery that constrains my main data set - but the subquery itself (I think) needs a join in it. Or maybe there is a better way to do it.
Here's a trimmed down set of models:
class Lot(models.Model):
    lot_id = models.CharField(max_length=200, unique=True)

class Lot_Country(models.Model):
    lot = models.ForeignKey(Lot)
    country = CountryField()

class Discrete(models.Model):
    discrete_id = models.CharField(max_length=200, unique=True)
    master_id = models.ForeignKey(Inventory_Master)
    location = models.ForeignKey(Location)
    lot = models.ForeignKey(Lot)
I am filtering on various attributes of Discrete (which is discrete supply), and I want to go "up" through Lot and over to Lot_Country, meaning: I only want rows from Discrete if the Lot associated with that row has an entry in Lot_Country for my appropriate country (let's say US).
I've tried something like this:
oklots=list(Lot_Country.objects.filter(country='US'))
But, first of all, that gives me the str back, which I don't really want (I changed it to return lot_id, but that's a hack).
What's the best way to constrain Discrete through Lot and over to Lot_Country? In SQL I would just join in the subquery (or even in the main query - maybe that's what I need? I guess I don't know how to join up to a parent then down into that parent's other child...)
Thanks in advance for your help.
I'm not sure what you mean by "it gives me the str back"... Lot_Country.objects.filter(country='US') will return a queryset. Of course if you print it in your console, you will see a string.
I also think your models need refactoring. The way you have currently defined it, you can associate multiple Lot_Countrys with one Lot, and a country can only be associated with one lot.
If I understand your general model correctly, that isn't what you want - you want to associate multiple Lots with one Lot_Country. To do that you need to reverse your foreign key relationship (i.e., put it inside Lot), as in the sketch below.
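A minimal sketch of that refactor, reusing the trimmed models from the question (on_delete is required on modern Django; 1.8 allows omitting it):

class Lot_Country(models.Model):
    country = CountryField()

class Lot(models.Model):
    lot_id = models.CharField(max_length=200, unique=True)
    lot_country = models.ForeignKey(Lot_Country, on_delete=models.CASCADE)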
Then, for fetching all the Discrete lots that are in a given country, you would do:
discretes_in_us = Discrete.objects.filter(lot__lot_country__country='US')
Which will give you a queryset of all Discretes whose Lot is in the US.

Select DISTINCT individual columns in django?

I'm curious if there's any way to do a query in Django that's not a "SELECT * FROM..." underneath. I'm trying to do a "SELECT DISTINCT columnName FROM ..." instead.
Specifically I have a model that looks like:
class ProductOrder(models.Model):
    Product = models.CharField(max_length=20, primary_key=True)
    Category = models.CharField(max_length=30)
    Rank = models.IntegerField()
where the Rank is a rank within a Category. I'd like to be able to iterate over all the Categories doing some operation on each rank within that category.
I'd like to first get a list of all the categories in the system and then query for all products in that category and repeat until every category is processed.
I'd rather avoid raw SQL, but if I have to go there, that'd be fine. Though I've never coded raw SQL in Django/Python before.
One way to get the list of distinct column names from the database is to use distinct() in conjunction with values().
In your case you can do the following to get the names of distinct categories:
q = ProductOrder.objects.values('Category').distinct()
print q.query # See for yourself.
# The query would look something like
# SELECT DISTINCT "app_productorder"."category" FROM "app_productorder"
There are a couple of things to remember here. First, this will return a ValuesQuerySet which behaves differently from a QuerySet. When you access say, the first element of q (above) you'll get a dictionary, NOT an instance of ProductOrder.
Second, it would be a good idea to read the warning note in the docs about using distinct(). The above example will work but all combinations of distinct() and values() may not.
PS: it is a good idea to use lower case names for fields in a model. In your case this would mean rewriting your model as shown below:
class ProductOrder(models.Model):
    product = models.CharField(max_length=20, primary_key=True)
    category = models.CharField(max_length=30)
    rank = models.IntegerField()
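Putting that together with the iteration described in the question, a minimal sketch (process is a hypothetical stand-in for the per-rank operation):

for row in ProductOrder.objects.values('category').distinct():
    orders = (ProductOrder.objects
              .filter(category=row['category'])
              .order_by('rank'))
    for order in orders:
        process(order)  # hypothetical operation on each rank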
It's actually quite simple if you're using PostgreSQL: just use distinct(*fields) (see the documentation).
ProductOrder.objects.all().distinct('category')
Note that this feature has been included in Django since 1.4.
Use order_by() with that field, and then do distinct():
ProductOrder.objects.order_by('category').values_list('category', flat=True).distinct()
The other answers are fine, but this is a little cleaner, in that it only gives the values like you would get from a DISTINCT query, without any cruft from Django.
>>> set(ProductOrder.objects.values_list('category', flat=True))
{u'category1', u'category2', u'category3', u'category4'}
or
>>> list(set(ProductOrder.objects.values_list('category', flat=True)))
[u'category1', u'category2', u'category3', u'category4']
And, it works without PostgreSQL.
This is less efficient than using a .distinct(), presuming that DISTINCT in your database is faster than a Python set, but it's great for noodling around in the shell.
Update:
This answer is great for making queries in the Django shell during development. DO NOT use this solution in production unless you are absolutely certain that you will always have a trivially small number of results before set() is applied. Otherwise, it's a terrible idea from a performance standpoint.

Django DB, finding Categories whose Items are all in a subset

I have two models:
class Category(models.Model):
    pass

class Item(models.Model):
    cat = models.ForeignKey(Category)
I am trying to return all Categories for which all of that category's items belong to a given subset of item ids. For example, all categories for which all of the items associated with that category have ids in the set [1,3,5].
How could this be done using Django's query syntax (as of 1.1 beta)? Ideally, all the work should be done in the database.
Category.objects.filter(item__id__in=[1, 3, 5])
Django creates the reverse relationship on the model without the foreign key. You can filter on it by using its related name (usually just the model name lowercased, but it can be manually overridden), two underscores, and the field name you want to query on.
Let's say you require all items to be in the following set:
allowable_items = set([1,3,4])
One brute-force solution would be to check the item_set for every category, like so:
categories_with_allowable_items = [
    category for category in Category.objects.all()
    if set([item.id for item in category.item_set.all()]) <= allowable_items
]
But we don't really have to check every category: categories_with_allowable_items is always going to be a subset of the categories related to the items whose ids are in allowable_items... so those are the only ones we have to check (and this should be faster):
# (The FK on Item is named cat, so that's the name select_related needs.)
categories_with_allowable_items = set([
    item.cat for item in
    Item.objects.select_related('cat').filter(pk__in=allowable_items)
    if set([sibling.id for sibling in item.cat.item_set.all()]) <= allowable_items
])
If performance isn't really an issue, then the latter of these two (if not the former) should be fine. If these are very large tables, you might have to come up with a more sophisticated solution. Also, if you're using a particularly old version of Python, remember that you'll have to import the sets module.
I've played around with this a bit. If QuerySet.extra() accepted a "having" parameter I think it would be possible to do it in the ORM with a bit of raw SQL in the HAVING clause. But it doesn't, so I think you'd have to write the whole query in raw SQL if you want the database doing the work.
EDIT:
This is the query that gets you part way there:
from django.db.models import Count
Category.objects.annotate(num_items=Count('item')).filter(num_items=...)
The problem is that for the query to work, "..." needs to be a correlated subquery that looks up, for each category, the number of its items in allowed_items. If .extra had a "having" argument, you'd do it like this:
Category.objects.annotate(num_items=Count('item')).extra(having="num_items=(SELECT COUNT(*) FROM app_item WHERE app_item.id in % AND app_item.cat_id = app_category.id)", having_params=[allowed_item_ids])
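For what it's worth, a pure-ORM way to keep all the work in the database without extra() or raw SQL (an alternative sketched against the models above, not from this thread) is to exclude any category that has an item outside the allowed set:

allowed = [1, 3, 5]
categories = (Category.objects
              # at least one item in the allowed set...
              .filter(item__id__in=allowed)
              # ...and no item outside it (becomes a NOT EXISTS subquery)
              .exclude(item__in=Item.objects.exclude(id__in=allowed))
              .distinct())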