Evaluating Django Chained QuerySets Locally - django

I am hoping someone can help me out with a quick question I have regarding chaining Django querysets. I am noticing a slow down because I am evaluating many data points in the database to create data trends. I was wondering if there was a way to have the chained filters evaluated locally instead of hitting the database. Here is a (crude) example:
pastries = Bakery.objects.filter(productType='pastry') # <--- will obviously always hit DB, when evaluated
cannoli = pastries.filter(specificType='cannoli') # <--- can this be evaluated locally instead of hitting the DB when evaluated, as long as pastries was evaluated?
I have checked the docs and I do not see anything specifying this, so I guess it's not possible, but I wanted to check with the 'braintrust' first ;-).
BTW - I know that I can do this myself by implementing some methods to loop through these datapoints and evaluate the criteria, but there are so many datapoints that my deadline does not permit me manually implementing this.
Thanks in advance.

QuerySet methods always produce SQL that returns the desired expression. This is why you cannot e.g. call various methods after slicing; SQL does not support that syntax. The ORM does nothing more than assemble said SQL. If you want fancier processing then you will need to perform it in Python code yourself.

Related

Django: SQL way (i.e using annotation, aggregate, subquery, when, case, window etc) vs python way

Sql way means using using annotation, aggregate, subquery, when, case, window etc. i.e get all the extra calculation columns from sql
Till now if I want to get some addition information from the data of each row like calculate something etc, I used to get the table in queryset and loop over each object and then store the desired result in a list of dicts and pass it to the template. Ofcourse i was using prefetch so that i can avoid N+1 queries.
But since django 1.11 we can do the same thing in more expressive way using sql (i.e using annotation, aggregate, subquery, when, case, window etc) and no python.
Disadvantage:
One disadvantage i found using sql way rather than python way is debugging.
I have to do complex calculations on the data of each queryset object. So that i can check step by step.
If i do it in the sql way i can only see the final result but not be able to track the steps.
Advantage:
I didnt tried, but heard from The Dramatic Benefits of Django Subqueries and Annotations that its quite fast.
Presently most of my sql takes less than 100 ms and most of the time goes in dom loading. So by using sql way will it help me any way.
I appreciate that Django created more functions which help to write the sql expressively.

What's wrong with django queryset? I get different answer when access the same indice

I found that the objects could be duplicate in a queryset. However, when I try to access each of the object and do nothing, it changes and seems to be right.
Here are the commands I have typed into the shell
At first I gained a queryset orderby the field 'receiveTime'. Then it seems that ds[1996] equals to ds[1997]. And I try to use the loop:
for d in ds:
pass
Then the ds[1996] isn't equal to ds[1997], but what have I done?
Maybe it is a feature of the lazy search?
plus 1:I have reproduced it just now. I didn't do any inserting or deleting just now.
These are the commands I just typed into the shell.
plus 2:I have seen the raw sql queries when I call the ds[0] and ds[1] which I have shown in the picture 2. The sql queries are correct but the answer seems to be wrong. I think maybe the reason is that the sorting parameter receiveTime of two objects are the same, which lead to the disorder of the objects?
Here are the raw sql queries
Replace order_by("receive_time") with order_by("receive_time", "id"). PostgreSQL uses qsort which is an unstable sort. Given only receive_time, if values are the same, the order is not guaranteed.
Don't post code or logs in images. Ever.

Will Django use previously-evaluated results when applying additional filters to a query set?

Let's say I need to do some work both on a set of model objects, as well as a subset of the first set:
things = Thing.objects.filter(active=True)
for thing in things: # (1)
pass # do something with each `thing`
special_things = things.filter(special=True)
for thing in special_things: # (2)
pass # do something more with these things
My understanding is that at point (1) marked in the code above, an actual SQL query something like SELECT * FROM things_table WHERE active=1 will get executed against the database. The QuerySet documentation also says:
When a QuerySet is evaluated, it typically caches its results.
Now my question is, what happens at point (2) in the example Python code above?
Will Django execute a second SQL query, something like SELECT * FROM things_table WHERE active=1 AND special=1?
Or, will it use the cached result from earlier, automatically doing for me behind the scenes something like the more optimal filter(lambda d: d.special == True, things), i.e. avoiding a needless second trip to the database?
Either way, is the current behavior guaranteed (by documentation or something) or should I not rely on it? For example, it is not only a point of optimization, but could also make a possible logic difference if the database table is modified by another thread/process between the two potential queries.
It will execute a second SQL query. filter creates a new queryset, which doesn't copy the results cache.
As for guarantees - well, the docs specify that filter returns a new queryset object. I think you can be confident that that new queryset won't have cached results yet. As further support, the "when are querysets evaluated" docs suggest using .all() to get a new queryset if you want to pick up possibly changed results:
If the data in the database might have changed since a QuerySet was
evaluated, you can get updated results for the same query by calling
all() on a previously evaluated QuerySet.

How do I tell if a Django QuerySet has been evaluated?

I'm creating a Django queryset by hand and want to just use the Django ORM to read the resulting querset.query SQL itself without hitting my DB.
I know Django quersets are lazy and I see all the ops that trigger a queryset being evaluated:
https://docs.djangoproject.com/en/1.10/ref/models/querysets/#when-querysets-are-evaluated
But... what if I just want to verify my code is purely building the queryset guts but ISN'T evaluating and hitting my DB yet inadvertently? Are there any attributes on the queryset object I can use to verify it hasn't been evaluated without actually evaluating it?
For querysets that use a select to return lists of model instances, like a basic filter or exclude, the _result_cache attribute is None if the queryset has not been evaluated, or a list of results if it has. The usual caveats about non-public attributes apply.
Note that printing a queryset - although the docs note calling repr() as an evaluation trigger - does not in fact evaluate the original queryset. Instead, internally the original queryset chains into a new one so that it can limit the amount of data printed without changing the limits of the original queryset. It's true that it evaluates a subset of the original queryset and therefore hits the DB, so that's another weakness of this approach if you're in fact trying to use it to monitor all DB traffic.
For other querysets (count, delete, etc) I'm not sure there is a simple way. Maybe watch your database logs, or run in DEBUG mode and check connection.queries as described here:
https://docs.djangoproject.com/en/dev/faq/models/#how-can-i-see-the-raw-sql-queries-django-is-running

use fuzzy matching in django queryset filter

Is there a way to use fuzzy matching in a django queryset filter?
I'm looking for something along the lines of:
Object.objects.filter(fuzzymatch(namevariable)__gt=.9)
or is there a way to use lambda functions, or something similar in django queries, and if so, how much would it affect performance time (given that I have a stable set of ~6000 objects in my database that I want to match to)
(realized I should probably put my comments in the question)
I need something stronger than contains, something along the lines of difflib. I'm basically trying to get around doing a Object.objects.all() and then a list comprehension with fuzzy matching.
(although I'm not necessarily sure that doing that would be much slower than trying to filter based on a function, so if you have thoughts on that I'm happy to listen)
also, even though it's not exactly what I want, I'd be open to some kind of tokenized opposite-contains, like:
Object.objects.filter(['Virginia', 'Tech']__in=Object.name)
Where something like "Virginia Technical Institute" would be returned. Although case insensitive, preferably.
When you're using the ORM, the thing to understand is that everything you do converts to SQL commands and it's the performance of the underlying queries on the underlying database that matter. Case in point:
SELECT COUNT (*) ...
Is that fast? Depends on whether your database stores any records to give you that information - MySQL/MyISAM does, MySQL/InnoDB does not. In English - this is one lookup in MYISAM, and n in InnoDB.
Next thing - in order to do exact match lookups efficiently in SQL you have to tell it when you create the table - you can't just expect it to understand. For this purpose SQL has the INDEX statement - in django, use db_index=True in the field options of your model. Bear in mind that this has an added performance hit on writes (to create the index) and obviously extra storage is needed (for the datastructure) so you cannot "INDEX all the things". Also, I don't think it will help for fuzzy matching - but it's worth noting anyway.
Next consideration - how do we do fuzzy matching in SQL? Well apparently LIKE and CONTAINS allow a certain amount of searching and wildcard-results to be executed in SQL. These are T-SQL links - translate for your database server :) You can achieve this via Model.objects.get(fieldname__contains=value) which will produce LIKE SQL, or similar. There are a number of options available there for different lookups.
This may or may not be powerful enough for you - I'm not sure.
Now, for the big question: performance. Chances are if you're doing a contains search that the SQL server will have to hit all of the rows in the database - don't take my word on that, but it would be my bet - even with indexing on. With 6000 rows this might not take all that long; then again, if you're doing this on a per-connection-to-your-app basis it's probably going to create a slowdown.
Next thing to understand about the ORM: if you do this:
Model.objects.get(fieldname__contains=value)
Model.objects.get(fieldname__contains=value)
You will issue two queries to the database server. In other words, the ORM doesn't always cache the results - so you might just want to do an .all() and search in memory. Do read about caching and querysets.
Further on on that last page, you'll also see Q objects - useful for more complicated queries.
So in summary then:
SQL contains some basic fuzzy matching-like parameters.
Whether or not these are sufficient depends on your needs.
How they perform depends on your SQL server - definitely measure it.
Whether you can cache these results in memory depends on how likely scaling is - again might be worth measuring the memory commit as a result - if you can share between instances and if the cache will be frequently invalidated (if it will be, don't do it).
Ultimately, I'd start by getting your fuzzy matching working, then measure, then tweak, then measure until you work out how to improve performance. 99% of this I learnt doing exactly that :)
with postgres as database, you can use TrigramSimilarity to do fuzzy search and rank your results on different weight as well. Here is the link to documentation :
https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/search/#trigram-similarity
For full text search you can refer to https://czep.net/17/full-text-search.html
If you need something stronger than contains lookup, have a look at regex lookups: https://docs.djangoproject.com/en/1.0/ref/models/querysets/#regex