Django's query result caching


I am building a website with Django 1.7 and GeoDjango, and I've reached the point where I need to optimize its speed.
One of the bottlenecks is query execution: some queries run slowly even after optimization.
So I'd like to cache query results and store them in Redis.
The problem is that I cannot cache some query results, particularly those containing geometry types and distance calculations. I get a "TypeError: can't pickle Binary objects" error.
What is the recommended way to cache Django/GeoDjango QuerySets?

It turns out the main problems with storing QuerySets are:
1) QuerySets are lazy, so they have to be evaluated before they can be stored, and serializing (pickling) one forces that evaluation. [link]
2) Not all QuerySets can be serialized, because Python's serializer (pickle) has its own limitations. [link]
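To illustrate the pickle limitation, here is a minimal sketch using only the standard library. It does not use GeoDjango at all: a memoryview stands in for the "Binary" object from the error message (that substitution is my assumption), and the row layout is made up for the example.

```python
import pickle

# A row as a database driver might return it. The geometry arrives wrapped
# in an object that pickle refuses to serialize; memoryview is used here
# only as a stand-in for the "Binary" object from the error message.
row = {"name": "park", "geom": memoryview(b"\x01\x02\x03")}

try:
    pickle.dumps(row)
    picklable = True
except TypeError:
    picklable = False  # pickle cannot handle the wrapped binary value

# Converting the problematic field to plain bytes makes the row cacheable.
plain_row = {"name": row["name"], "geom": bytes(row["geom"])}
restored = pickle.loads(pickle.dumps(plain_row))
```

The same idea would apply to query results: convert each row to plain Python values (strings, numbers, bytes) before handing it to the cache.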
The best solution I found is to cache the query results in the template.
So in my template "sample.html" I write something like:
{% cache 600 slow_query_results %}
<!-- result of page generation -->
{% endcache %}
And in the view I do:
from django.core.cache import cache
from django.core.cache.utils import make_template_fragment_key
...
slow_query_results_key = make_template_fragment_key('slow_query_results')
if not cache.get(slow_query_results_key):
    # fragment is not cached yet: run the slow query so the template can render it
    slow_query_results = perform_some_slow_query()
This method works well because the data stored in the cache is plain text (rendered HTML), so storing it should not raise any serialization errors.
The main drawbacks are:
1) The cache might contain repetitive, near-identical data. This can happen when you cache an HTML fragment containing translated strings, for example. In some circumstances you'll have to use the language as a parameter when generating the cache key, so if you have translations for 2 languages you'll have 2 cached copies of the same data.
2) You'll have to invalidate the cache whenever you change your HTML. This can become a real pain if the HTML inside the cached block changes continuously.
I personally think that problem 1) is not a big deal. Problem 2) can be avoided by planning the site structure well and by knowing that you can invalidate cache keys in bulk in Redis. [link] This is possible because the cache is stored under keys of the following format: ":1:template.cache.slow_query.8a5b358dfc28a6bc1b3397e398d28b66"
So it should be possible to delete all cache keys related to a given caching block.
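The bulk deletion could look roughly like this. It is a minimal sketch, not a tested recipe: the helper name delete_fragment_keys is hypothetical, the client is assumed to behave like a redis-py redis.Redis instance (scan_iter and delete), and the key layout is assumed to match the example key above.

```python
def delete_fragment_keys(client, fragment_name, prefix=":1:template.cache."):
    """Delete every cached variant of one template fragment.

    `client` is assumed to behave like a redis-py `redis.Redis` instance,
    i.e. it provides `scan_iter(match=...)` and `delete(*keys)`.
    The key layout is an assumption taken from the example key above.
    """
    pattern = "{}{}.*".format(prefix, fragment_name)
    # SCAN walks the keyspace incrementally instead of blocking like KEYS.
    keys = list(client.scan_iter(match=pattern))
    if keys:
        client.delete(*keys)
    return len(keys)
```

With a real connection this would be called as something like delete_fragment_keys(redis.Redis(), "slow_query_results"). Check the prefix and version part of the keys in your own Redis instance first, since they depend on the cache settings.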

Related

Django QuerySet Limit hits database too many times?

I recently stumbled upon a massive bottleneck on a production website, right after updating from Django 1.11 to 2.1.
Here is my simple slice of code:
pages = Page.objects.filter(cat="news_item").order_by('-created')[:2]
This creates around 30 queries, roughly the number of pages matching that filter.
I have now implemented a somewhat hacky workaround for the 32 queries, which I'm not satisfied with:
pages = [Page.objects.filter(cat='news_item').order_by('-created')[i] for i in range(0,2)]
Speed is notably affected. A few other chunks of code used this pattern and caused more than 400 queries per page load. I have since adapted them to use a combination of the above code and Model.objects.raw.
Did something change in Django 2.0/2.1 that I missed, or does the [:2] limit not work correctly?

Weirdest issue/bug/confusion I've ever seen.
Doing the following only queries once:
pages = Page.objects.filter(cat="news_item").order_by('-created')[:2:1]
The Django documentation on limiting QuerySets states:
https://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
Generally, slicing a QuerySet returns a new QuerySet – it doesn’t evaluate the query. An exception is if you use the “step” parameter of Python slice syntax. For example, this would actually execute the query in order to return a list of every second object of the first 10:
Entry.objects.all()[:10:2]
So this weird trick forces this basic piece of code to be evaluated and to query the database immediately, causing only one database query instead of 30+.
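The documented behaviour (a plain slice stays lazy, a slice with a step is evaluated at once) can be modelled with a toy class. This is only a sketch to show the difference: LazyRows is a made-up stand-in, not Django's QuerySet, and the "query" log entries simulate database hits.

```python
class LazyRows:
    """Toy stand-in for a lazily evaluated query (not Django's real QuerySet)."""

    def __init__(self, rows, log, start=0, stop=None):
        self._rows = rows   # pretend this is the database table
        self._log = log     # records each simulated database hit
        self._start, self._stop = start, stop

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        # Simplification: slices are always taken from the unsliced rows.
        limited = LazyRows(self._rows, self._log, key.start or 0, key.stop)
        if key.step is None:
            return limited                 # still lazy: nothing has run yet
        return list(limited)[::key.step]   # step given: run once, return a list

    def __iter__(self):
        self._log.append("query")          # one simulated round trip
        return iter(self._rows[self._start:self._stop])


log = []
rows = LazyRows(["a", "b", "c", "d"], log)

lazy = rows[:2]       # no simulated query has run yet
assert log == []

eager = rows[:2:1]    # evaluated immediately into a plain list
assert eager == ["a", "b"] and log == ["query"]
```

In real Django code the less surprising way to force a single evaluation is to wrap the slice in list(), e.g. list(queryset[:2]), and then work with that list.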

Finding when a django cache was set

I'm trying to implement intelligent cache invalidation in Django for my app with an algorithm of the sort:
page = get_cache_for_item(item_pk + some_key)
cache_set_at = page.SOMETHING_HERE
modified = models.Object.objects.filter(pk=item_pk, modified__gt=cache_set_at).exists()  # cheap call
if modified:
    page = get_from_database_and_build_slowly(item_pk)
    set_cache_for_item(item_pk + some_key, page)
return page
The intent is to make a quick database call to check the modified time and, if and only if the page was modified since the cache was set, rebuild the page using the resource- and database-intensive path.
Unfortunately, I can't figure out how to get the time the cache was set at the SOMETHING_HERE step.
Is this possible?

Django does not seem to store that information. If it is stored at all, it is stored by the cache implementation of your choice.
This is, for example, how Django stores a key in memcached:
def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
    key = self.make_key(key, version=version)
    if not self._cache.set(key, value, self.get_backend_timeout(timeout)):
        # make sure the key doesn't keep its old value in case of failure to set (memcached's 1MB limit)
        self._cache.delete(key)
Django does not store the creation time and lets the cache handle the timeout. So, if anywhere, you should look in the cache backend of your choice. I know that Redis, for example, does not store that value either, so you will not be able to make it work with Redis, even if you bypass Django's cache and look into Redis directly.
I think your best choice is to store the creation time yourself somehow. You could override @cache_page, or simply create an improved @smart_cache_page, and store the creation timestamp there.
EDIT:
There might be other easier ways to achieve that. You could use post_save signals. Something like this: Expire a view-cache in Django?
Read carefully through it since the implementation depends on your Django version.
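Storing the creation time yourself can be as simple as caching a small wrapper around the value. This is a minimal sketch: the helper names are hypothetical, and `cache` is assumed to be any object with Django-style get(key) and set(key, value, ...) methods, such as django.core.cache.cache.

```python
import time


def set_with_timestamp(cache, key, value, **set_kwargs):
    """Store the value together with the time it was cached.

    Extra keyword arguments (for example timeout=...) are passed through
    to cache.set() unchanged.
    """
    cache.set(key, {"cached_at": time.time(), "value": value}, **set_kwargs)


def get_with_timestamp(cache, key):
    """Return (value, cached_at), or (None, None) on a cache miss."""
    entry = cache.get(key)
    if entry is None:
        return None, None
    return entry["value"], entry["cached_at"]
```

The cached_at value is a Unix timestamp, so it would need converting to a datetime before being compared against a model's modified field in a filter such as modified__gt.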

Datastore NDB best practices when querying and extracting thousands of rows

I'm using the High Replication Datastore, along with ndb. I have a kind with over 27,000 entities, which isn't that much. Supposedly the datastore is efficient in querying and extracting large amounts of data, but whenever I query over that kind, queries take a long time to finish (I've even got DeadlineExceededErrors).
I have a model where I store keywords and URLs I want to index in Google:
class Keywords(ndb.Model):
    keyword = ndb.StringProperty(indexed=True)
    url = ndb.StringProperty(indexed=True)
    number_articles = ndb.IntegerProperty(indexed=True)
    # Some other attributes... All attributes are indexed
My current use cases are to build my sitemap and to fetch my top 20 keywords to link from my home page.
When I fetch many entities, I usually do:
Keywords.query().fetch() # For the sitemap, as I want all of the urls
Keywords.query(Keywords.number_articles > 5).fetch() # For the homepage, I want to link to keywords with more than 5 articles
Is there a better way to extract data?
I've tried to index data into the Search API, and I've seen huge speed gains. Even though this works, I don't think it's ideal to replicate data from the Datastore into Search API with basically the same fields.
Thanks in advance!

I would split this functionality.
For the home page you can use your second query but, as advised by Bruyere, add the limit=20 parameter. Such a request should run very fast if you have the right index.
The sitemap is a bigger issue. Usually, to process a large number of entities, you use MapReduce.
It's probably a good idea, but only if you don't get too many requests for the sitemap. It can also be the only solution if you update Keywords entities often and want the sitemap to be as up to date as possible.
Another option is to generate the sitemap in a task, save it as a blob, and serve that blob in the request. That is really quick. If updates to the Keywords entities are not very frequent, you can run this task after every update. If you have many updates, you can schedule the task to run periodically with cron. Since you have had success with the Search API, this is probably the best option for you.
Generally speaking, I don't think it's a good idea to use the datastore to retrieve large amounts of data. I recommend at least looking at the Datastore comparison with traditional databases. It's designed to handle large databases, but not necessarily large result sets. I would say the datastore is designed to handle a large number of small requests.

DB speed is related to the number of results returned, not the number of records in the DB. You say:
to build my Sitemap, and to fetch my top 20 keywords
If that's the case, add limit=20 to both fetches. If you do it that way, use run instead, as per the docs:
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch

Slow page generation in Django with 50+ sql queries per page

In my Django app I noticed that pages with a large number of SQL queries load considerably slower than other pages. I'm not new to web development, and I mainly deal with a resource hog like Drupal, but even Drupal, with its 150 - 200 SQL queries per page, generates a page in 0.5 - 0.7 sec.
Django, on the other hand, performs really badly with a more or less average number of queries per page. For example, one of my pages generates 60 queries like this:
SELECT `gamenode_gamenode`.`id`, `gamenode_gamenode`.`title`, `gamenode_gamenode`.`short_desc`, `gamenode_gamenode`.`full_desc`, `gamenode_gamenode`.`slug`, `gamenode_gamenode`.`type`, `gamenode_gamenode`.`source_gameid`, `gamenode_gamenode`.`created`, `gamenode_gamenode`.`updated`, `gamenode_gamenode`.`status`, `gamenode_gamenode`.`promote`, `gamenode_gamenode`.`sticky`, `gamenode_gamenode`.`hit_count`, `gamenode_gamenode`.`game_rank`, `gamenode_gamenode`.`share_count`, `gamenode_gamenode`.`like_count`, `gamenode_gamenode`.`comment_count` FROM `gamenode_gamenode` WHERE `gamenode_gamenode`.`id` = 1058
and outputs the data as a simple string, and it takes 1200 ms to generate the page! I did this just as a test to generate many fairly simple queries. If I lower the number of queries to 10 - 15, page generation time comes back to a more or less acceptable number.
So my question is: why is Django so slow when there are many SQL queries on a page? I ran similar comparisons with Rails, Symfony and Drupal, and all these "resource hogs" performed way better than Django. Am I doing something wrong, is there some "secret" setting to make things faster in Django, or do Djangonauts consider such times normal and just strive to write code that produces as few queries as possible? Please help me figure this out.

Yes, Django's ORM is pretty slow. You have three choices for dealing with this:
(1) Complain about it.
(2) Switch to another web application framework.
(3) Make some effort to understand why your application is generating so many database queries, and learn how to use Django's ORM effectively so as to reduce the number of queries.
(1) might be psychologically satisfying but won't solve your problem; (2) is off-topic here at Stack Overflow (but you might look at Wikipedia's Comparison of web application frameworks).
We can help you with (3), but only if you show us some more of your code. The query you quoted looks like a typical query that Django would generate for a call to get():
GameNode.objects.get(id = 1058)
You shouldn't be running more than a couple of queries like this on a page: if you want to get many GameNodes you need to get them in a single query:
GameNode.objects.filter(<criteria>)
Or if the GameNode objects are related to some other object by a foreign key on another model that you are querying, then you could fetch all the related GameNode objects by using Django's select_related() method.
There's almost always a way to speed things up (see this testimonial) but we need to know the details before we can say how to do it.
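To make the difference concrete without the asker's models, here is a small self-contained sketch using the standard library's sqlite3 module instead of Django. The table and column names are simplified assumptions based on the query in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gamenode (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO gamenode (id, title) VALUES (?, ?)",
    [(i, "game {}".format(i)) for i in range(1, 61)],
)

wanted_ids = list(range(1, 61))


def fetch_one_by_one(conn, ids):
    """One SELECT per id: the pattern behind the 60 queries in the question."""
    rows = []
    for node_id in ids:
        rows.append(
            conn.execute(
                "SELECT id, title FROM gamenode WHERE id = ?", (node_id,)
            ).fetchone()
        )
    return rows


def fetch_in_one_query(conn, ids):
    """A single SELECT ... WHERE id IN (...) returning the same rows."""
    placeholders = ", ".join("?" for _ in ids)
    sql = "SELECT id, title FROM gamenode WHERE id IN ({}) ORDER BY id".format(
        placeholders
    )
    return conn.execute(sql, ids).fetchall()


# Both approaches return identical rows; only the number of round trips differs.
assert fetch_one_by_one(conn, wanted_ids) == fetch_in_one_query(conn, wanted_ids)
```

In the Django ORM the single-query version corresponds to something like GameNode.objects.filter(id__in=wanted_ids) rather than calling get() in a loop.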

Invalidating a path from the Django cache recursively

I am deleting a single path from the Django cache like this:
from models import Graph
from django.http import HttpRequest
from django.utils.cache import get_cache_key
from django.db.models.signals import post_save
from django.core.cache import cache
def expire_page(path):
    request = HttpRequest()
    request.path = path
    key = get_cache_key(request)
    if cache.has_key(key):
        cache.delete(key)

def invalidate_cache(sender, instance, **kwargs):
    expire_page(instance.get_absolute_url())

post_save.connect(invalidate_cache, sender=Graph)
This works - but is there a way to delete recursively? My paths look like this:
/graph/123
/graph/123/2009-08-01/2009-10-21
Whenever the graph with id "123" is saved, the cache for both paths needs to be invalidated. Can this be done?

You might want to consider a generational caching strategy; it seems like a good fit for what you are trying to accomplish. With the code you have provided, you would store a "generation" number for each absolute URL. For example, you would initialize "/graph/123" with a generation of one, so its cache key becomes something like "/GENERATION/1/graph/123". When you want to expire the cache for that absolute URL, you increment its generation value (to two in this case). The next time someone looks up "/graph/123", the cache key becomes "/GENERATION/2/graph/123". This also solves the issue of expiring all the sub-pages, since their keys are built from the same generation number as "/graph/123".
It's a bit tricky to understand at first, but it is a really elegant caching strategy which, if done correctly, means you never have to actually delete anything from the cache. For more information, here is a presentation on generational caching; it's for Rails, but the concept is the same regardless of language.
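A rough sketch of the idea follows. All function names and the "generation:" key prefix are hypothetical, and `cache` is assumed to be any object with Django-style get(key) and set(key, value) methods.

```python
def generation_key(root):
    """Cache key that stores the current generation number for a root path."""
    return "generation:" + root


def current_generation(cache, root):
    """Read the generation for `root`, starting at 1 if none is stored yet."""
    generation = cache.get(generation_key(root))
    if generation is None:
        generation = 1
        cache.set(generation_key(root), generation)
    return generation


def page_key(cache, root, path):
    """Build the versioned key, e.g. '/GENERATION/1/graph/123/2009-08-01'."""
    return "/GENERATION/{}{}".format(current_generation(cache, root), path)


def expire_root(cache, root):
    """Bump the generation so every key built under `root` is orphaned."""
    cache.set(generation_key(root), current_generation(cache, root) + 1)
```

Here both "/graph/123" and "/graph/123/2009-08-01/2009-10-21" would be stored under keys built with page_key(cache, "/graph/123", path), so a single expire_root(cache, "/graph/123") call in the post_save handler makes both unreachable. Two caveats with this sketch: the read-then-write in expire_root is not atomic, and the generation counter itself should not be allowed to expire before the pages it versions.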

Another option is to use a cache that supports tagging keys and evicting keys by tag. Django's built-in cache API does not have support for this approach. But at least one cache backend (not part of Django proper) does have support.
DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django. To use DiskCache in your project simply install it and configure your CACHES setting.
Installation is easy with pip:
$ pip install diskcache
Then configure your CACHES setting:
CACHES = {
    'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': '/tmp/path/to/directory/',
    }
}
The cache set method is extended by an optional tag keyword argument like so:
from django.core.cache import cache
cache.set('/graph/123', value, tag='/graph/123')
cache.set('/graph/123/2009-08-01/2009-10-21', other_value, tag='/graph/123')
diskcache.DjangoCache uses a diskcache.FanoutCache internally. The corresponding FanoutCache is accessible through the _cache attribute and exposes an evict method. To evict all keys tagged with /graph/123 simply:
cache._cache.evict('/graph/123')
Though it may feel awkward to access an underscore-prefixed attribute, the DiskCache project is stable and unlikely to make significant changes to the DjangoCache implementation.
The Django cache benchmarks page has a discussion of alternative cache backends.
Disclaimer: I am the original author of the DiskCache project.

Check out shutil.rmtree() or os.removedirs(). I think the first is probably what you want.
Update based on several comments: Actually, the Django caching mechanism is more general and finer-grained than just using the path for the key (although you can use it at that level). We have some pages that have 7 or 8 separately cached subcomponents that expire based on a range of criteria. Our component cache names reflect the key objects (or object classes) and are used to identify what needs to be invalidated on certain updates.
All of our pages have an overall cache-key based on member/non-member status, but that is only about 95% of the page. The other 5% can change on a per-member basis and so is not cached at all.
How you iterate through your cache to find invalid items is a function of how it's actually stored. If it's files, you can simply use globs and/or recursive directory deletes; if it's some other mechanism, you'll have to use something else.
What my answer, and some of the comments by others, are trying to say is that how you accomplish cache invalidation is intimately tied to how you are using/storing the cache.
Second Update: @andybak: So I guess your comment means that all of my commercial Django sites are going to explode in flames? Thanks for the heads up on that. I notice you did not attempt an answer to the problem.
Knipknap's problem is that he has a group of cache items that appear to be related and in a hierarchy because of their names, but the key-generation logic of the cache mechanism obliterates that name by creating an MD5 hash of the path + vary_on. Since there is no trace of the original path/params you will have to exhaustively guess all possible path/params combinations, hoping you can find the right group. I have other hobbies that are more interesting.
If you wish to be able to find groups of cached items based on some combination of path and/or parameter values you must either use cache keys that can be pattern matched directly or some system that retains this information for use at search time.
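One way to "retain this information for use at search time" is to keep a small index of keys per group inside the cache itself. This is only a sketch of that idea, not what the author's sites actually do: the function names and the "group:" prefix are hypothetical, and `cache` is assumed to offer Django-style get, set and delete.

```python
def set_in_group(cache, group, key, value):
    """Cache a value and remember its key under `group`."""
    cache.set(key, value)
    members = cache.get("group:" + group) or []
    if key not in members:
        members.append(key)
    cache.set("group:" + group, members)


def invalidate_group(cache, group):
    """Delete every key recorded under `group`, then the record itself."""
    for key in cache.get("group:" + group) or []:
        cache.delete(key)
    cache.delete("group:" + group)
```

For the question's paths, both "/graph/123" and "/graph/123/2009-08-01/2009-10-21" would be stored with group "/graph/123" and then dropped together with invalidate_group(cache, "/graph/123"). The index update is not atomic, so concurrent writers could lose a member; treat it as an illustration only.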
Because we had needs not-unrelated to the OP's problem, we took control of template fragment caching -- and specifically key generation -- over 2 years ago. It allows us to use regexps in a number of ways to efficiently invalidate groups of related cached items. We also added a default timeout and vary_on variable names (resolved at run time) configurable in settings.py, changed the ordering of name & timeout because it made no sense to always have to override the default timeout in order to name the fragment, made the fragment_name resolvable (ie. it can be a variable) to work better with a multi-level template inheritance scheme, and a few other things.
The only reason for my initial answer, which was indeed wrong for current Django, was because I have been using saner cache keys for so long I literally forgot the simple mechanism we walked away from.
