Django QuerySet Limit hits database too many times? - django

I've recently stumbled upon a massive bottleneck on a production website, but only after updating from Django 1.11 to 2.1.
Here is my simple slice of code:
pages = Page.objects.filter(cat="news_item").order_by('-created')[:2]
This in turn creates around 30 queries, roughly the number of pages under that specific filter.
I have now implemented a somewhat hacky way to get rid of the 32 queries, which I'm not satisfied with:
pages = [Page.objects.filter(cat='news_item').order_by('-created')[i] for i in range(0,2)]
Speed is notably affected. A few other chunks of code used this method and caused >400 queries per page load; I have since adapted these to use a combination of the above code and Model.objects.raw().
Did something change in Django 2.0/2.1 that I missed, or does the [:2] limit not work correctly?

Weirdest issue/bug/confusion I've ever seen.
Doing the following queries the database only once:
pages = Page.objects.filter(cat="news_item").order_by('-created')[:2:1]
I noted that the Django documentation states the following:
https://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
Generally, slicing a QuerySet returns a new QuerySet – it doesn’t evaluate the query. An exception is if you use the “step” parameter of Python slice syntax. For example, this would actually execute the query in order to return a list of every second object of the first 10:
Entry.objects.all()[:10:2]
So using the weird trick above forces this basic piece of code to evaluate and query the database immediately, causing only one database query instead of 30+.
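An equivalent way to force evaluation up front, if you'd rather not rely on the step parameter, is to wrap the sliced QuerySet in list() (a sketch based on the code above, not taken from the original post):

pages = list(Page.objects.filter(cat="news_item").order_by('-created')[:2])
# list() evaluates the query once and keeps the results as a plain list,
# so later iteration or indexing doesn't go back to the database.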

Related

Optimized Django Queryset

I have the following function to determine who downloaded a certain book:
@cached_property
def get_downloader_info(self):
    return self.downloaders.select_related('user').values(
        'user__username', 'user__full_name')
Since I'm only using two fields, does it make sense to use .defer() on the remaining fields?
I tried to use .only(), but I get an error that some fields are not JSON serializable.
I'm open to all suggestions, if any, for optimizing this queryset.
Thank you!
Before you try every possible optimization, you should get your hands on the SQL query generated by the ORM (you can print it to stdout or use something like django-debug-toolbar) and see what is slow about it. After that, I suggest you run that query with EXPLAIN ANALYZE and find out what is slow at the database level. If the query is slow because a lot of data has to be transferred, then it makes a lot of sense to use only or defer. Using only and defer (or values) gives you better performance only if you need to retrieve a lot of data; it does not make your database's job much easier (unless you really do read a lot of data, of course).
Since you are using Django and PostgreSQL, you can get a psql session with manage.py dbshell and get query timings with \timing.
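For example, a minimal sketch of inspecting the generated SQL, assuming the queryset from the question and DEBUG = True (`book` stands in for an instance of the model that owns the downloaders relation):

from django.db import connection

qs = book.downloaders.select_related('user').values(
    'user__username', 'user__full_name')

# str(qs.query) shows the SQL Django will generate for this queryset.
print(qs.query)

# After the queryset is evaluated, connection.queries (populated when
# DEBUG = True) lists the executed statements with their timings.
results = list(qs)
print(connection.queries[-1])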

Django's query result caching

I am creating a website using Django 1.7 and GeoDjango. I've hit the point where I need to optimize the website's speed.
One of the bottlenecks is query execution. There are some queries which run slowly even when optimized.
So I'd like to cache query results and store them in Redis.
The problem I'm running into is that I cannot cache some query results, particularly the ones containing geometry types and distance calculations: I hit a "TypeError: can't pickle Binary objects" error.
What is the recommended/right way of caching Django/GeoDjango QuerySets ?
It turns out the main problems with storing QuerySets are that:
QuerySets are lazy
To evaluate them you need to serialize them [link]
Not all QuerySets can be serialized, because Python's serializer (pickle) has its own limitations [link]
The best solution I found is to cache the query results in the template.
So in my template "sample.html" I write something like:
{% cache 600 slow_query_results %}
<!-- result of page generation -->
{% endcache %}
And in view i do:
from django.core.cache import cache
from django.core.cache.utils import make_template_fragment_key
...
slow_query_results_key = make_template_fragment_key('slow_query_results')
if not cache.get(slow_query_results_key):
    # recompute only when the cached fragment has expired
    slow_query_results = perform_some_slow_query()
This method is fine because the data stored in the cache is plain text (the rendered HTML fragment), so there should be no problems or exceptions while storing it.
The main drawbacks are that:
The cache might contain repetitive, similar data. This can happen when you cache an HTML fragment containing language translation strings and so on: in some circumstances you'll have to use the language as a parameter when generating the cache key, and if you have translations for 2 languages you'll have 2 caches of the same data.
You'll have to invalidate the cache whenever you change your HTML. This can become a real pain if the HTML in the block you are caching changes continuously.
I personally think that problem 1) is not a big deal. Problem 2) can be avoided with good planning of the site structure and by knowing that you can do mass invalidation of cache keys in Redis. [link] This is possible because the cache is stored under keys of the following format: ":1:template.cache.slow_query.8a5b358dfc28a6bc1b3397e398d28b66"
So it should be possible to delete all cache keys related to a given caching block.
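If you are using the django-redis backend, that mass invalidation can be done with its delete_pattern() extension (a sketch; delete_pattern() is specific to django-redis and not part of Django's generic cache API):

from django.core.cache import cache

# Remove every cached copy of the template fragment named
# "slow_query_results", whatever its vary-on arguments were.
cache.delete_pattern("template.cache.slow_query_results.*")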

Datastore NDB best practices when querying and extracting thousands of rows

I'm using the High Replication Datastore, along with ndb. I have a kind with over 27,000 entities, which isn't that much. Supposedly the datastore is efficient in querying and extracting large amounts of data, but whenever I query over that kind, queries take a long time to finish (I've even got DeadlineExceededErrors).
I have a model where I store keywords and URLs I want to index in Google:
class Keywords(ndb.Model):
    keyword = ndb.StringProperty(indexed=True)
    url = ndb.StringProperty(indexed=True)
    number_articles = ndb.IntegerProperty(indexed=True)
    # Some other attributes... All attributes are indexed
My current use cases are to build my sitemap, and to fetch my top 20 keywords to link from my home page.
When I fetch many entities, I usually do:
Keywords.query().fetch() # For the sitemap, as I want all of the urls
Keywords.query(Keywords.number_articles > 5).fetch() # For the homepage, I want to link to keywords with more than 5 articles
Is there a better way to extract data?
I've tried to index data into the Search API, and I've seen huge speed gains. Even though this works, I don't think it's ideal to replicate data from the Datastore into Search API with basically the same fields.
Thanks in advance!
I would split this functionality.
For the home page you can use your second query, but add the limit=20 parameter, as advised by Bruyere. Such a request should run very fast if you have the right index.
The sitemap is a bigger issue. Usually, to process a large number of entities, you use MapReduce.
That's probably a good idea, but only if you don't have too many requests to the sitemap. It can also be the only solution if you update Keywords entities often and want the sitemap to be as up to date as possible.
Another option is to generate the sitemap in a task, save it as a blob, and serve that blob in the request. That is really quick. If your updates to the Keywords entities are not very frequent, you can run this task after any update. If you have many updates, you can schedule the task to run periodically with cron. As you have had success with the Search API, this is probably the best option for you.
Generally speaking, I don't think it's a good idea to use the Datastore to retrieve large amounts of data. I recommend looking at least at the Datastore comparison with traditional databases. It's designed to handle large databases, but not necessarily large result sets; I would say the Datastore is designed to handle large amounts of small requests.
DB speed is related to the number of results returned, not the number of records in the DB. You say:
to build my Sitemap, and to fetch my top 20 keywords
If that's the case, add limit=20 to both fetches. If you do it that way, then use run() instead, as per the docs:
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
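A rough sketch of the suggestions above against the ndb model from the question (the ordering and the projection query are my own additions, not from the answer):

# Home page: only the 20 keywords with the most articles.
top_keywords = Keywords.query(Keywords.number_articles > 5) \
                       .order(-Keywords.number_articles) \
                       .fetch(20)

# Sitemap: a projection query returns only the url property,
# which keeps the fetched entities small.
urls = [k.url for k in Keywords.query().fetch(projection=[Keywords.url])]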

Slow page generation in Django with 50+ sql queries per page

In my Django app I noticed that pages with a big number of SQL queries load considerably slower than other pages. I'm not new to web development, and I mainly deal with a resource hog like Drupal, but even Drupal with its 150-200 SQL queries per page generates a page in 0.5-0.7 sec.
Django, on the other hand, performs really badly with a more or less average number of queries per page. For example, one of my pages generates 60 queries like this:
SELECT `gamenode_gamenode`.`id`, `gamenode_gamenode`.`title`, `gamenode_gamenode`.`short_desc`, `gamenode_gamenode`.`full_desc`, `gamenode_gamenode`.`slug`, `gamenode_gamenode`.`type`, `gamenode_gamenode`.`source_gameid`, `gamenode_gamenode`.`created`, `gamenode_gamenode`.`updated`, `gamenode_gamenode`.`status`, `gamenode_gamenode`.`promote`, `gamenode_gamenode`.`sticky`, `gamenode_gamenode`.`hit_count`, `gamenode_gamenode`.`game_rank`, `gamenode_gamenode`.`share_count`, `gamenode_gamenode`.`like_count`, `gamenode_gamenode`.`comment_count` FROM `gamenode_gamenode` WHERE `gamenode_gamenode`.`id` = 1058
and outputs the data as a simple string, and it takes 1200 ms to generate the page! I did this just as a test to generate many fairly simple queries. If I lower the number of queries to 10-15, page generation time comes back to a more or less acceptable number.
So I have a question: why is Django so slow when there are many SQL queries on the page? I did similar comparisons using Rails, Symfony and Drupal, and all these "resource hogs" performed way better than Django. Am I doing something wrong, is there some "secret" setting to make things faster in Django, or do Djangonauts consider such times normal and just strive to write code that produces as few queries as possible? Please help me figure this out.
Yes, Django's ORM is pretty slow. You have three choices for dealing with this:
Complain about it.
Switch to another web application framework.
Make some effort to understand why your application is generating so many database queries, and learn how to use Django's ORM effectively so as to reduce the number of queries.
(1) might be psychologically satisfying but won't solve your problem; (2) is off-topic here at Stack Overflow (but you might look at Wikipedia's Comparison of web application frameworks).
We can help you with (3), but only if you show us some more of your code. The query you quoted looks like a typical query that Django would generate for a call to get():
GameNode.objects.get(id = 1058)
You shouldn't be running more than a couple of queries like this on a page: if you want to get many GameNodes you need to get them in a single query:
GameNode.objects.filter(<criteria>)
Or if the GameNode objects are related to some other object by a foreign key on another model that you are querying, then you could fetch all the related GameNode objects by using Django's select_related() method.
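For example, a sketch of collapsing the per-id lookups into a single query (node_ids is a hypothetical list of the ids the page currently fetches one at a time):

# One query for all nodes instead of one query per node.
nodes = GameNode.objects.filter(id__in=node_ids)

# A dict makes subsequent per-id lookups in the view/template cheap.
node_by_id = {node.id: node for node in nodes}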
There's almost always a way to speed things up (see this testimonial) but we need to know the details before we can say how to do it.

Django: How can I protect against concurrent modification of database entries

Is there a way to protect against concurrent modification of the same database entry by two or more users?
It would be acceptable to show an error message to the user performing the second commit/save operation, but data should not be silently overwritten.
I think locking the entry is not an option, as a user might use the "Back" button or simply close his browser, leaving the lock in place forever.
This is how I do optimistic locking in Django:
updated = Entry.objects.filter(Q(id=e.id) & Q(version=e.version)) \
              .update(updated_field=new_value, version=e.version + 1)
if not updated:
    raise ConcurrentModificationException()
The code listed above can be implemented as a method in Custom Manager.
I am making the following assumptions:
filter().update() will result in a single database query because filter is lazy
a database query is atomic
These assumptions are enough to ensure that no one else has updated the entry in the meantime. If multiple rows are updated this way, you should use transactions.
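A sketch of wrapping that check in a custom manager method, as suggested above (the method name is illustrative, not from the original answer):

from django.db import models


class ConcurrentModificationException(Exception):
    pass


class EntryManager(models.Manager):
    def update_with_version(self, entry, **new_values):
        # A single UPDATE ... WHERE id = ... AND version = ...; the row is
        # only touched if nobody has bumped the version in the meantime.
        updated = self.filter(id=entry.id, version=entry.version).update(
            version=entry.version + 1, **new_values)
        if not updated:
            raise ConcurrentModificationException(
                "Entry %s was modified concurrently" % entry.id)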
WARNING, from the Django docs:
Be aware that the update() method is converted directly to an SQL statement. It is a bulk operation for direct updates. It doesn't run any save() methods on your models, or emit the pre_save or post_save signals.
This question is a bit old and my answer a bit late, but from what I understand this has been fixed in Django 1.4 using:
select_for_update(nowait=True)
see the docs
Returns a queryset that will lock rows until the end of the transaction, generating a SELECT ... FOR UPDATE SQL statement on supported databases.
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
Of course this will only work if the backend supports the "select for update" feature, which SQLite, for example, doesn't. Unfortunately, nowait=True is not supported by MySQL; there you have to use nowait=False, which will only block until the lock is released.
Actually, transactions don't help you much here ... unless you want to have transactions running over multiple HTTP requests (which you most probably don't want).
What we usually use in those cases is "Optimistic Locking". The Django ORM doesn't support that as far as I know. But there has been some discussion about adding this feature.
So you are on your own. Basically, what you should do is add a "version" field to your model and pass it to the user as a hidden field. The normal cycle for an update is:
the app reads the data and shows it to the user
the user modifies the data
the user posts the data back
the app saves it to the database
To implement optimistic locking, when you save the data you check whether the version you got back from the user is the same as the one in the database; if so, you update the database and increment the version. If they are not the same, it means there has been a change since the data was loaded.
You can do that with a single SQL call, with something like:
UPDATE ... WHERE version = 'version_from_user';
This call will update the database only if the version is still the same.
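In the Django ORM the same single-statement check looks much like the snippet earlier in this thread; a sketch using F() so the increment happens in the database rather than in Python:

from django.db.models import F

# The WHERE clause carries the version the user originally read; the
# UPDATE only succeeds if that version is still current in the database.
updated = Entry.objects.filter(id=e.id, version=e.version).update(
    updated_field=new_value, version=F('version') + 1)
if not updated:
    raise ConcurrentModificationException()  # someone else saved first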
Django 1.11 has three convenient options to handle this situation depending on your business logic requirements:
Something.objects.select_for_update() will block until the model becomes free
Something.objects.select_for_update(nowait=True) and catch DatabaseError if the model is currently locked for update
Something.objects.select_for_update(skip_locked=True) will not return the objects that are currently locked
In my application, which has both interactive and batch workflows on various models, I found these three options to solve most of my concurrent processing scenarios.
The "waiting" select_for_update is very convenient in sequential batch processes - I want them all to execute, but let them take their time. The nowait is used when an user wants to modify an object that is currently locked for update - I will just tell them it's being modified at this moment.
The skip_locked is useful for another type of update, when users can trigger a rescan of an object - and I don't care who triggers it, as long as it's triggered, so skip_locked allows me to silently skip the duplicated triggers.
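A sketch of how the three variants above are typically used (Something, some_pk, the done field, and process() are placeholders of mine; select_for_update() must be called inside a transaction):

from django.db import DatabaseError, transaction

# 1. Block until the row is free, then work with it.
with transaction.atomic():
    obj = Something.objects.select_for_update().get(pk=some_pk)
    # ... modify obj while the row is locked ...
    obj.save()

# 2. Fail immediately if the row is already locked.
try:
    with transaction.atomic():
        obj = Something.objects.select_for_update(nowait=True).get(pk=some_pk)
except DatabaseError:
    pass  # tell the user the object is being modified right now

# 3. Silently skip rows that another transaction has already locked.
with transaction.atomic():
    for obj in Something.objects.select_for_update(skip_locked=True).filter(done=False):
        process(obj)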
For future reference, check out https://github.com/RobCombs/django-locking. It does locking in a way that doesn't leave everlasting locks, by a mixture of javascript unlocking when the user leaves the page, and lock timeouts (e.g. in case the user's browser crashes). The documentation is pretty complete.
You should probably use the Django transaction middleware at least, regardless of this problem.
As to your actual problem of having multiple users editing the same data... yes, use locking. OR:
Check what version a user is updating against (do this securely, so users can't simply hack the system to say they were updating the latest copy!), and only update if that version is current. Otherwise, send the user back a new page with the original version they were editing, their submitted version, and the new version(s) written by others. Ask them to merge the changes into one, completely up-to-date version. You might try to auto-merge these using a toolset like diff+patch, but you'll need to have the manual merge method working for failure cases anyway, so start with that. Also, you'll need to preserve version history, and allow admins to revert changes, in case someone unintentionally or intentionally messes up the merge. But you should probably have that anyway.
There's very likely a django app/library that does most of this for you.
Another thing to look for is the word "atomic". An atomic operation means that your database change will either happen successfully, or fail obviously. A quick search shows this question asking about atomic operations in Django.
The idea above
updated = Entry.objects.filter(Q(id=e.id) & Q(version=e.version)) \
              .update(updated_field=new_value, version=e.version + 1)
if not updated:
    raise ConcurrentModificationException()
looks great and should work fine even without serializable transactions.
The problem is how to augment the default .save() behaviour so that you don't have to do manual plumbing to call the .update() method.
I looked at the Custom Manager idea.
My plan is to override the Manager _update method that is called by Model.save_base() to perform the update.
This is the current code in Django 1.3
def _update(self, values, **kwargs):
    return self.get_query_set()._update(values, **kwargs)
What needs to be done IMHO is something like:
def _update(self, values, **kwargs):
    # TODO: get the version field value
    v = self.get_version_field_value(values[0])
    return self.get_query_set().filter(Q(version=v))._update(values, **kwargs)
A similar thing needs to happen on delete. However, delete is a bit more difficult, as Django implements quite some voodoo in this area through django.db.models.deletion.Collector.
It is weird that a modern tool like Django lacks guidance for optimistic concurrency control.
I will update this post when I solve the riddle. Hopefully the solution will be nice and Pythonic and won't involve tons of coding, weird views, skipping essential pieces of Django, etc.
To be safe, the database needs to support transactions.
If the fields are "free-form", e.g. text, and you need to allow several users to edit the same fields (you can't have single-user ownership of the data), you could store the original data in a variable.
When the user commits, check whether the input data has changed from the original data (if not, you don't need to bother the DB by rewriting old data).
If the original data is the same as the current data in the DB, you can save; if it has changed, you can show the user the difference and ask the user what to do.
If the fields are numbers, e.g. an account balance, the number of items in a store, etc., you can handle it more automatically: calculate the difference between the original value (stored when the user started filling out the form) and the new value, then start a transaction, read the current value, add the difference, and end the transaction. If you can't have negative values, you should abort the transaction if the result is negative, and tell the user.
I don't know django, so I can't give you teh cod3s.. ;)
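In Django terms, a rough sketch of the numeric case described above, using select_for_update() so the read-modify-write is safe (Account, balance, and the surrounding names are hypothetical):

from django.db import transaction

delta = submitted_balance - original_balance  # difference entered by the user

with transaction.atomic():
    # Lock the row, re-read the current value, and apply the difference.
    account = Account.objects.select_for_update().get(pk=account_id)
    if account.balance + delta < 0:
        raise ValueError("this change would make the balance negative")
    account.balance += delta
    account.save()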
From here:
How to prevent overwriting an object someone else has modified
I'm assuming that the timestamp will be held as a hidden field in the form you're trying to save the details of.
def save(self):
    if self.id:
        foo = Foo.objects.get(pk=self.id)
        if foo.timestamp > self.timestamp:
            raise Exception("trying to save outdated Foo")
    super(Foo, self).save()