I'm creating a Django queryset by hand and want to just use the Django ORM to read the resulting queryset.query SQL itself without hitting my DB.
I know Django querysets are lazy, and I've seen the list of operations that trigger a queryset being evaluated:
https://docs.djangoproject.com/en/1.10/ref/models/querysets/#when-querysets-are-evaluated
But what if I just want to verify that my code is only building the queryset and ISN'T inadvertently evaluating it and hitting my DB? Are there any attributes on the queryset object I can use to verify it hasn't been evaluated, without actually evaluating it?
For querysets that use a select to return lists of model instances, like a basic filter or exclude, the _result_cache attribute is None if the queryset has not been evaluated, or a list of results if it has. The usual caveats about non-public attributes apply.
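A minimal sketch of that check as a helper. The name is_evaluated is made up for illustration, and the two SimpleNamespace objects are stand-ins so the snippet runs without a Django project; with a real queryset you would pass the queryset itself.

```python
from types import SimpleNamespace


def is_evaluated(queryset):
    """Return True if the queryset already holds fetched rows."""
    # _result_cache is a private Django attribute: None before evaluation,
    # a list (possibly empty) afterwards. It may change between versions.
    return getattr(queryset, "_result_cache", None) is not None


# Stand-ins for an unevaluated and an evaluated queryset (assumption:
# only the _result_cache attribute matters for this check).
fresh = SimpleNamespace(_result_cache=None)
fetched = SimpleNamespace(_result_cache=[])

print(is_evaluated(fresh), is_evaluated(fetched))
```

Note that an evaluated queryset with zero rows has an empty list as its cache, which is why the helper compares against None rather than testing truthiness.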
Note that printing a queryset does not in fact evaluate the original queryset, although the docs list calling repr() as an evaluation trigger. Internally, the original queryset is chained into a new, sliced one so that the amount of data printed can be limited without changing the limits of the original. That new queryset is evaluated and does hit the DB while the original's _result_cache stays None, so this is another weakness of the approach if you are trying to use it to monitor all DB traffic.
For other querysets (count, delete, etc) I'm not sure there is a simple way. Maybe watch your database logs, or run in DEBUG mode and check connection.queries as described here:
https://docs.djangoproject.com/en/dev/faq/models/#how-can-i-see-the-raw-sql-queries-django-is-running
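One way to turn the connection.queries idea into a check is a small context manager that counts how many entries were added while a block of code ran. This is only a sketch: count_queries and FakeConnection are hypothetical names, and FakeConnection is a stand-in so the snippet runs on its own. With Django you would pass django.db.connection instead, and the count is only meaningful when DEBUG is True, because Django does not record queries otherwise.

```python
from contextlib import contextmanager


@contextmanager
def count_queries(connection):
    """Yield a dict whose "count" entry is filled in when the block exits."""
    # Assumes connection.queries gains one entry per executed statement.
    result = {"count": None}
    start = len(connection.queries)
    try:
        yield result
    finally:
        result["count"] = len(connection.queries) - start


class FakeConnection:
    """Stand-in that records statements the way connection.queries does."""

    def __init__(self):
        self.queries = []

    def execute(self, sql):
        self.queries.append({"sql": sql, "time": "0.000"})


conn = FakeConnection()

with count_queries(conn) as building:
    pass  # code that should only build a queryset goes here

with count_queries(conn) as evaluating:
    conn.execute("SELECT 1")

print(building["count"], evaluating["count"])
```

In a test suite, Django's own assertNumQueries does the same job without relying on DEBUG.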
Related
Let's say I am trying to query a table like so:
if MyModel.objects.filter(field1='some-value', field2='some-value').exists():
    obj = MyModel.objects.select_related('related_model_1', 'related_model_2').get(field1='some-value', field2='some-value')
else:
    return Response({'detail': 'Not found'}, status=status.HTTP_404_NOT_FOUND)
Am I incurring a performance cost by checking the existence and then selecting the related fields? Or is it small enough to be negligible?
Yes, the exists() check is an extra database query, though it is about the cheapest query Django can issue.
As mentioned in the docs:
Returns True if the QuerySet contains any results, and False if not. This tries to perform the query in the simplest and fastest way possible, but it does execute nearly the same query as a normal QuerySet query.
and
Additionally, if a some_queryset has not yet been evaluated, but you know that it will be at some point, then using some_queryset.exists() will do more overall work (one query for the existence check plus an extra one to later retrieve the results) than simply using bool(some_queryset), which retrieves the results and then checks if any were returned.
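Following that reasoning, the existence check and the fetch can be folded into one query by calling get() and handling the "not found" case. A rough sketch, where get_or_none is a hypothetical helper, and FakeModel and FakeQuerySet are stand-ins that mimic only the two things the helper relies on (queryset.get() and queryset.model.DoesNotExist):

```python
def get_or_none(queryset, **lookups):
    """Return the single matching object, or None if there is no match."""
    try:
        return queryset.get(**lookups)
    except queryset.model.DoesNotExist:
        return None


class FakeModel:
    """Stand-in for a model class; real models define DoesNotExist too."""

    class DoesNotExist(Exception):
        pass


class FakeQuerySet:
    """Stand-in queryset over a list of dicts, for illustration only."""

    model = FakeModel

    def __init__(self, rows):
        self.rows = rows

    def get(self, **lookups):
        for row in self.rows:
            if all(row.get(key) == value for key, value in lookups.items()):
                return row
        raise self.model.DoesNotExist()


rows = FakeQuerySet([{"field1": "some-value", "field2": "some-value"}])
found = get_or_none(rows, field1="some-value", field2="some-value")
missing = get_or_none(rows, field1="other-value")
```

With the models from the question, the queryset argument would be MyModel.objects.select_related('related_model_1', 'related_model_2'), and the view would return the 404 response when the helper returns None. The helper deliberately lets MultipleObjectsReturned propagate, since get() raises it when more than one row matches.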
Let's say I need to do some work both on a set of model objects, as well as a subset of the first set:
things = Thing.objects.filter(active=True)
for thing in things:  # (1)
    pass  # do something with each `thing`
special_things = things.filter(special=True)
for thing in special_things:  # (2)
    pass  # do something more with these things
My understanding is that at point (1) marked in the code above, an actual SQL query something like SELECT * FROM things_table WHERE active=1 will get executed against the database. The QuerySet documentation also says:
When a QuerySet is evaluated, it typically caches its results.
Now my question is, what happens at point (2) in the example Python code above?
Will Django execute a second SQL query, something like SELECT * FROM things_table WHERE active=1 AND special=1?
Or, will it use the cached result from earlier, automatically doing for me behind the scenes something like the more optimal filter(lambda d: d.special == True, things), i.e. avoiding a needless second trip to the database?
Either way, is the current behavior guaranteed (by documentation or something) or should I not rely on it? For example, it is not only a point of optimization, but could also make a possible logic difference if the database table is modified by another thread/process between the two potential queries.
It will execute a second SQL query. filter creates a new queryset, which doesn't copy the results cache.
As for guarantees - well, the docs specify that filter returns a new queryset object. I think you can be confident that that new queryset won't have cached results yet. As further support, the "when are querysets evaluated" docs suggest using .all() to get a new queryset if you want to pick up possibly changed results:
If the data in the database might have changed since a QuerySet was
evaluated, you can get updated results for the same query by calling
all() on a previously evaluated QuerySet.
I have a concern with django subqueries using the django ORM. When we fetch a queryset or perform a DB operation, I have the option of bypassing all assumptions that django might make for the database that needs to be used by forcing usage of the specific database that I want.
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
The above disregards any database routers I might have set and goes straight to 'some_db'.
But if my models approximately look like so :-
class Author(models.Model):
    author_name = models.CharField(max_length=255)
    author_address = models.CharField(max_length=255)

class Book(models.Model):
    book_name = models.CharField(max_length=255)
    author = models.ForeignKey(Author, null=True)
And I fetch a QuerySet representing all books that are called Mark like so:-
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
Then later if somewhere in the code I trigger a subquery by doing something like:-
if b_det:
    auth_address = b_det[0].author.author_address
Then this does not make use of the original database 'some_db' that I had specified early on for the main query. This again goes through the routers and picks up (possibly) the incorrect database.
Why does Django do this? IMHO, if I have forced a specific database for the original query, then the same database should be used for the subquery too. Why must the database routers come into the picture at all?
This is not a subquery in the strict SQL sense of the word. What you are actually doing here is to execute one query and use the result of that to find related items.
You can chain filters and do lots of other operations on a queryset, and it will not be executed until something evaluates it, such as iterating over it, testing it in a boolean context, or indexing it. Here you are actually indexing it:
auth_address = b_det[0].#rest of code
So you have a materialized query, and you are now trying to find the address of the related author. That requires another query, and that second query carries no using() call, so Django is free to choose which database to use. You can overcome this by using select_related, which fetches the author in the same query as the book.
I've cached a common queryset, which I would like to filter on different fields depending on the situation. I'm wondering whether filtering an evaluated queryset loses the advantage of caching it in the first place. Does Django just build another queryset from scratch that combines the querysets involved in creating the cached queryset with the filter that I apply afterwards?
Yes, the results get thrown out.
You can see this from the source: filter() calls _filter_or_exclude(), which calls _clone() and then adds to its query. _clone, you can see, doesn't set the _result_cache attribute.
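The effect can be imitated with a toy class. To be clear, this is not Django's code: LazyRows and fetch are invented for illustration, and the only things borrowed from Django are the _result_cache name and the idea that filter() returns a fresh object without copying the cache.

```python
class LazyRows:
    """Toy stand-in for a lazy queryset; not Django's implementation."""

    def __init__(self, source, predicates=()):
        self._source = source            # callable that "hits the database"
        self._predicates = tuple(predicates)
        self._result_cache = None        # filled in on first iteration

    def filter(self, predicate):
        # Returns a new object with one more condition. The cache is
        # deliberately not copied, so the clone starts out unevaluated.
        return LazyRows(self._source, self._predicates + (predicate,))

    def __iter__(self):
        if self._result_cache is None:
            rows = self._source()
            self._result_cache = [
                row for row in rows
                if all(predicate(row) for predicate in self._predicates)
            ]
        return iter(self._result_cache)


fetches = []  # one entry per simulated database round trip


def fetch():
    fetches.append(1)
    return [{"id": 1, "special": True}, {"id": 2, "special": False}]


common = LazyRows(fetch)
list(common)
list(common)                  # served from the cache, no second fetch
special = common.filter(lambda row: row["special"])
list(special)                 # the clone has no cache, so it fetches again

print(len(fetches))
```

Iterating common twice costs one fetch, while iterating the filtered clone costs another one, which mirrors the behavior described for filter() on an evaluated queryset.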
In general, it's not really clear what it could possibly do to keep the common results. If it's a complicated query with a small result set, it could be replaced by just issuing SQL that checks that the primary key is one of the results you've found, but that's not always going to be more efficient, and in some situations it would confusingly mess with the semantics (if the DB changes in a way that affects the query results in the time between when it's cached and when you do the additional filter).
If you want to force this behavior of saving the IDs manually, you can do that:
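# list() forces the query to run now, so the IDs themselves are saved;
# passing the unevaluated queryset to pk__in would become a subquery instead.
pks = list(SomeObject.objects.filter(...).values_list('pk', flat=True))
some_of_them = SomeObject.objects.filter(pk__in=pks).filter(...)
others = SomeObject.objects.filter(pk__in=pks).filter(...)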
You can also of course just do the filtering in Python, e.g. by
common = SomeObject.objects.filter(...)
some_of_them = [m for m in common if m.attribute == 'foo']
others = [m for m in common if m.other_attribute == 'bar']
(You could also use filter(lambda m: m.attribute == 'foo', common) if you preferred, or wrap the definition of common in list to be more explicit.)
Whether one of these or simply reissuing the query is faster depends a lot on the size of the sets involved, the complexity of the filters, and what indices are present.
I am hoping someone can help me out with a quick question I have regarding chaining Django querysets. I am noticing a slow down because I am evaluating many data points in the database to create data trends. I was wondering if there was a way to have the chained filters evaluated locally instead of hitting the database. Here is a (crude) example:
pastries = Bakery.objects.filter(productType='pastry') # <--- will obviously always hit DB, when evaluated
cannoli = pastries.filter(specificType='cannoli') # <--- can this be evaluated locally instead of hitting the DB when evaluated, as long as pastries was evaluated?
I have checked the docs and I do not see anything specifying this, so I guess it's not possible, but I wanted to check with the 'braintrust' first ;-).
BTW - I know that I can do this myself by implementing some methods to loop through these datapoints and evaluate the criteria, but there are so many datapoints that my deadline does not permit me manually implementing this.
Thanks in advance.
QuerySet methods always produce SQL that returns the desired expression. This is why you cannot e.g. call various methods after slicing; SQL does not support that syntax. The ORM does nothing more than assemble said SQL. If you want fancier processing then you will need to perform it in Python code yourself.
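For the pastry example, doing it in Python can be as small as evaluating the broad queryset once and bucketing the rows by the field of interest. A sketch under assumptions: group_by_attr is a hypothetical helper, and the SimpleNamespace objects stand in for model instances so the snippet runs without a database. With Django, rows would be the evaluated pastries queryset.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_attr(rows, attr):
    """Bucket already-fetched objects by the value of one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[getattr(row, attr)].append(row)
    return dict(groups)


# Stand-ins for Bakery instances returned by the first (and only) query.
pastries = [
    SimpleNamespace(name="a", specificType="cannoli"),
    SimpleNamespace(name="b", specificType="eclair"),
    SimpleNamespace(name="c", specificType="cannoli"),
]

by_type = group_by_attr(pastries, "specificType")
cannoli = by_type.get("cannoli", [])

print([p.name for p in cannoli])
```

This makes a single pass over the data, so asking for another type afterwards is a dictionary lookup rather than another query. It only pays off when the full result set fits comfortably in memory.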