Invalidating a path from the Django cache recursively - django

I am deleting a single path from the Django cache like this:
from models import Graph
from django.http import HttpRequest
from django.utils.cache import get_cache_key
from django.db.models.signals import post_save
from django.core.cache import cache

def expire_page(path):
    request = HttpRequest()
    request.path = path
    key = get_cache_key(request)
    if cache.has_key(key):
        cache.delete(key)

def invalidate_cache(sender, instance, **kwargs):
    expire_page(instance.get_absolute_url())

post_save.connect(invalidate_cache, sender = Graph)
This works - but is there a way to delete recursively? My paths look like this:
/graph/123
/graph/123/2009-08-01/2009-10-21
Whenever the graph with id "123" is saved, the cache for both paths needs to be invalidated. Can this be done?

You might want to consider employing a generational caching strategy; it seems like it would fit what you are trying to accomplish. In the code that you have provided, you would store a "generation" number for each absolute url. So for example you would initialize "/graph/123" to have a generation of one, and its cache key would become something like "/GENERATION/1/graph/123". When you want to expire the cache for that absolute url, you increment its generation value (to two in this case). That way, the next time someone goes to look up "/graph/123", the cache key becomes "/GENERATION/2/graph/123". This also solves the issue of expiring all the sub pages, since their cache keys should be built from the same generation number as "/graph/123".
It's a bit tricky to understand at first, but it is a really elegant caching strategy which, if done correctly, means you never have to actually delete anything from the cache. For more information, here is a presentation on generational caching; it's for Rails, but the concept is the same regardless of language.
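To make the idea above concrete, here is a minimal sketch of generational keys. It is not Django's API: a plain dict stands in for the cache backend, and the class name `GenerationalCache` and its methods are hypothetical names chosen for illustration. In a real backend the orphaned old-generation entries would simply age out.

```python
class GenerationalCache:
    """Toy generational cache; a plain dict stands in for the real cache backend."""

    def __init__(self):
        self._store = {}

    def _generation(self, root):
        # Each root path (e.g. "/graph/123") starts at generation 1.
        return self._store.setdefault("generation:" + root, 1)

    def _key(self, root, path):
        # The current generation number is baked into every key under the root.
        return "/GENERATION/%d%s" % (self._generation(root), path)

    def get(self, root, path):
        return self._store.get(self._key(root, path))

    def set(self, root, path, value):
        self._store[self._key(root, path)] = value

    def invalidate(self, root):
        # Bumping the generation orphans every key built from the old one.
        self._store["generation:" + root] = self._generation(root) + 1


cache = GenerationalCache()
cache.set("/graph/123", "/graph/123", "overview page")
cache.set("/graph/123", "/graph/123/2009-08-01/2009-10-21", "date-range page")
cache.invalidate("/graph/123")  # e.g. called from a post_save handler
```

After the `invalidate` call, lookups for both paths miss, because they now resolve to generation-2 keys that have not been written yet.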

Another option is to use a cache that supports tagging keys and evicting keys by tag. Django's built-in cache API does not have support for this approach. But at least one cache backend (not part of Django proper) does have support.
DiskCache is an Apache2 licensed disk and file backed cache library, written in pure Python, and compatible with Django. To use DiskCache in your project, simply install it and configure your CACHES setting.
Installation is easy with pip:
$ pip install diskcache
Then configure your CACHES setting:
CACHES = {
    'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': '/tmp/path/to/directory/',
    }
}
The cache set method is extended by an optional tag keyword argument like so:
from django.core.cache import cache
cache.set('/graph/123', value, tag='/graph/123')
cache.set('/graph/123/2009-08-01/2009-10-21', other_value, tag='/graph/123')
diskcache.DjangoCache uses a diskcache.FanoutCache internally. The corresponding FanoutCache is accessible through the _cache attribute and exposes an evict method. To evict all keys tagged with /graph/123 simply:
cache._cache.evict('/graph/123')
Though it may feel awkward to access an underscore-prefixed attribute, the DiskCache project is stable and unlikely to make significant changes to the DjangoCache implementation.
The Django cache benchmarks page has a discussion of alternative cache backends.
Disclaimer: I am the original author of the DiskCache project.

Check out shutil.rmtree() or os.removedirs(). I think the first is probably what you want.
Update based on several comments: Actually, the Django caching mechanism is more general and finer-grained than just using the path for the key (although you can use it at that level). We have some pages that have 7 or 8 separately cached subcomponents that expire based on a range of criteria. Our component cache names reflect the key objects (or object classes) and are used to identify what needs to be invalidated on certain updates.
All of our pages have an overall cache-key based on member/non-member status, but that is only about 95% of the page. The other 5% can change on a per-member basis and so is not cached at all.
How you iterate through your cache to find invalid items is a function of how it's actually stored. If it's files, you can simply use globs and/or recursive directory deletes; if it's some other mechanism, then you'll have to use something else.
What my answer, and some of the comments by others, are trying to say is that how you accomplish cache invalidation is intimately tied to how you are using/storing the cache.
Second Update: @andybak: So I guess your comment means that all of my commercial Django sites are going to explode in flames? Thanks for the heads up on that. I notice you did not attempt an answer to the problem.
Knipknap's problem is that he has a group of cache items that appear to be related and in a hierarchy because of their names, but the key-generation logic of the cache mechanism obliterates that name by creating an MD5 hash of the path + vary_on. Since there is no trace of the original path/params you will have to exhaustively guess all possible path/params combinations, hoping you can find the right group. I have other hobbies that are more interesting.
If you wish to be able to find groups of cached items based on some combination of path and/or parameter values you must either use cache keys that can be pattern matched directly or some system that retains this information for use at search time.
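As a rough illustration of "some system that retains this information", the sketch below keeps a side index from readable paths to their hashed keys so that a whole path prefix can be invalidated. This is not the answerer's actual implementation, which is not shown here; the class `PathIndexedCache` is a hypothetical name and plain dicts stand in for the cache backend.

```python
import hashlib


class PathIndexedCache:
    """Toy cache that hashes keys but remembers which path produced each key."""

    def __init__(self):
        self._store = {}   # hashed key -> cached value
        self._paths = {}   # readable path -> hashed key

    @staticmethod
    def _hash(path):
        # Mimics an opaque cache key: the path cannot be recovered from it.
        return hashlib.md5(path.encode("utf-8")).hexdigest()

    def set(self, path, value):
        key = self._hash(path)
        self._store[key] = value
        self._paths[path] = key

    def get(self, path):
        return self._store.get(self._hash(path))

    def invalidate_prefix(self, prefix):
        # Match the prefix itself and anything below it, but not "/graph/1234".
        doomed = [p for p in self._paths
                  if p == prefix or p.startswith(prefix + "/")]
        for path in doomed:
            self._store.pop(self._paths.pop(path), None)
        return sorted(doomed)


pages = PathIndexedCache()
pages.set("/graph/123", "overview")
pages.set("/graph/123/2009-08-01/2009-10-21", "date range")
pages.set("/graph/1234", "another graph")
removed = pages.invalidate_prefix("/graph/123")
```

The index has to live somewhere that survives as long as the cache entries do, which is the main cost of this approach.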
Because we had needs not-unrelated to the OP's problem, we took control of template fragment caching -- and specifically key generation -- over 2 years ago. It allows us to use regexps in a number of ways to efficiently invalidate groups of related cached items. We also added a default timeout and vary_on variable names (resolved at run time) configurable in settings.py, changed the ordering of name & timeout because it made no sense to always have to override the default timeout in order to name the fragment, made the fragment_name resolvable (ie. it can be a variable) to work better with a multi-level template inheritance scheme, and a few other things.
The only reason for my initial answer, which was indeed wrong for current Django, was because I have been using saner cache keys for so long I literally forgot the simple mechanism we walked away from.

Related

Django: Making sure a complex object is accessible throughout multiple view calls

For a project, I am trying to create a web-app that, among other things, allows training of machine learning agents using python libraries such as Dedupe or TensorFlow. In cases such as Dedupe, I need to provide an interface for active learning, which I currently implement through jquery based ajax calls to a view that takes and sends the necessary training data.
The problem is that I need this agent object to stay alive throughout multiple view calls and be accessible by each individual call. I have tried realizing this via the built-in cache system using Memcached, but the serialization does not seem to keep all the info intact, and while I am technically able to restore the object from the cache, this appears to break the training algorithm.
Essentially, I want to keep the object alive within the application itself (rather than an external memory store) and be able to access it from another view, but I am at a bit of a loss of how to realize this.
If someone knows the proper technique to achieve this, I would be very grateful.
Thanks in advance!
To follow up with this question, I have since realized that the behavior shown seemed to have been an effect of trying to use the result of a method call from the object loaded from cache directly in the return properties of a view. Specifically, my code looked as follows:
#model is the object loaded from cache
#this returns the wrong object (same object as on an earlier call)
return JsonResponse({"pairs": model.uncertain_pairs()})
and was changed to the following
#model is the object loaded from cache
#this returns the correct object (calls and returns the model.uncertain_pairs() method properly)
uncertain = model.uncertain_pairs()
return JsonResponse({"pairs": uncertain})
I am unsure if this specifically happens due to an implementation from Dedupe or Django side or due to Python, but this has undoubtedly fixed the issue.
To return to the question, Django does seem to be able to properly (de-)serialize objects and their properties in cache, as long as the cache is set up properly (see "Apparent bug storing large keys in django memcached", which I also had to deal with).

Finding when a django cache was set

I'm trying to implement intelligent cache invalidation in Django for my app with an algorithm of the sort:
page = get_cache_for_item(item_pk + some_key)
cache_set_at = page.SOMETHING_HERE
modified = models.Object.objects.filter(pk=item_pk, modified__gt=cache_set_at).exists()  # Cheap call
if modified:
    page = get_from_database_and_build_slowly(item_pk)
    set_cache_for_item(item_pk + some_key, page)
return page
The intent is to make a quick call to the database to get the modified time and, if and only if the page was modified since the cache was set, rebuild the page using the resource- and database-intensive build step.
Unfortunately, I can't figure out how to get the time the cache was set at the step SOMETHING_HERE.
Is this possible?
Django does not seem to store that information. The information is (if stored) in the cache implementation of your choice.
This is, for example, the way Django stores a key in memcached.
def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
    key = self.make_key(key, version=version)
    if not self._cache.set(key, value, self.get_backend_timeout(timeout)):
        # make sure the key doesn't keep its old value in case of failure to set (memcached's 1MB limit)
        self._cache.delete(key)
Django does not store the creation time and lets the cache handle the timeout. So if this information is available anywhere, it is in the cache backend of your choice. I know that Redis, for example, does not store that value either, so you will not be able to make it work with Redis at all, even if you bypass Django's cache and query Redis directly.
I think your best choice is to store the creation time yourself somehow. You could override @cache_page, or simply create an improved @smart_cache_page, and store the creation timestamp there.
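One way to picture "storing the timestamp yourself" is to cache the page together with the time it was written. The sketch below is only an illustration under stated assumptions: the dict `_cache` stands in for `django.core.cache.cache`, and `set_with_timestamp`, `get_with_timestamp` and `get_page` are hypothetical helper names, not part of Django.

```python
import time

_cache = {}  # stand-in for django.core.cache.cache (assumption for this sketch)


def set_with_timestamp(key, page, now=None):
    # Store the page and the time it was cached as one value.
    set_at = time.time() if now is None else now
    _cache[key] = {"page": page, "set_at": set_at}


def get_with_timestamp(key):
    entry = _cache.get(key)
    if entry is None:
        return None, None
    return entry["page"], entry["set_at"]


def get_page(key, last_modified, build):
    """Return the cached page unless the item changed after it was cached."""
    page, set_at = get_with_timestamp(key)
    if page is None or last_modified > set_at:
        page = build()  # the slow, database-intensive step
        set_with_timestamp(key, page)
    return page
```

Here `last_modified` would come from the cheap database query in the question, and `build` is whatever callable produces the page.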
EDIT:
There might be other easier ways to achieve that. You could use post_save signals. Something like this: Expire a view-cache in Django?
Read carefully through it since the implementation depends on your Django version.

Running multiple sites on the same python process

In our company we make news portals for a pretty big number of local newspapers (currently 13, going to 30 next month and more in the future), each with 2k to 100k page views/day. Since we are evolving from a situation where each site was heavily customized to one where each difference is a matter of configuration or custom template, our software is already pretty much the same for all sites. Right now our deployment strategy is one gunicorn instance for each site (with 1-17 workers each, depending on the site traffic), on a 16-core server with 12GB RAM. The problem with this setup is that each worker (regular pre-forked gunicorn) takes 110MB, whether it's being used or not. With the new sites we would need to add more RAM just to serve not that many more requests, so basically it doesn't scale. Also, since we are moving away from a model where each site is independent, each site has its own database, and I quite like it that way, especially since we are using relational databases (mysql, but migrating to pgsql), so it's much easier to shard this way.
I'm doing some research and experimenting with running all sites on one gunicorn instance, so I could use the servers fully and add more servers behind a load balancer when it came to it. The problem is that django assumes in a lot of places that only one site is running per process, so for what I've thought of so far I'd have to implement:
A middleware that takes the HTTP_HOST from the request and places an identifier on a threadlocal variable.
A template loader that uses that variable to load custom templates accordingly.
Monkey patch django.db.models.Model, probably adding a metaclass (not even sure that's possible, but I think I would need it because of the custom managers we sometimes need to use) that would replace the managers with ones that first call db_manager(identifier) on the original manager and then call the intended method. I would also need to override the save and delete methods to always include the using=identifier parameter.
I guess I would need to stop using inclusion_tag decorators, not a big problem, but I need to think of other cases like this.
Heavy and ugly patching of urlresolvers if I need custom or extra urls for each site. I don't need them now, but probably will at some point.
And this is just what I came up with without even implementing it and seeing where it breaks; I'm sure I'd need many more changes for it to work. So I really don't want to do it, especially with the extra maintenance effort I'll need, but I don't see any alternatives and would love to learn that someone already solved this in a better way. Of course I could also stop using django altogether (I already have many reasons to do so), but that would mean a major rewrite and having to maintain two incompatible branches of the software until the new one reached feature parity with the django version, so to me it seems even worse than all the ugly hacks.
I've recently developed an e-commerce system with similar requirements -- many instances running from the same project sharing almost everything. The previous version of the system was a bunch of independent installations (~30) so it was pretty unmaintainable. I'm sure the requirements still differ from yours (for example, all instances shared the same models in my case), but it still might be useful to share my experience.
You are right that Django doesn't help with scenarios like this out of the box, but it's actually surprisingly easy to work around it. Here is a brief description of what I did.
I could see a synergy between what I wanted to achieve and django.contrib.sites, also because many third-party Django apps out there know how to work with it and use it, for example, to generate absolute URLs to the current site. The major problem with sites is that it wants you to specify the current site id in settings.SITE_ID, which is a very naive approach to the multi host problem. What one naturally wants, and what you also mention, is to determine the current site from the Host request header. To fix this problem, I borrowed the hook idea from django-multisite: https://github.com/shestera/django-multisite/blob/master/multisite/threadlocals.py#L19
Next I created an app encapsulating all the functionality related to the multi host aspect of my project. In my case the app was called stores and among other things it featured two important classes: stores.middleware.StoreMiddleware and stores.models.Store.
The model class is a subclass of django.contrib.sites.models.Site. The good thing about subclassing Site is that you can pass a Store to any function where a Site is expected. So you are effectively still just using the old, well documented and tested sites framework. To the Store class I added all the fields needed to configure all the different stores. So it's got fields like urlconf, theme, robots_txt and whatnot.
The middleware class's job was to match the Host header with the corresponding Store instance in the database. Once the matching Store was retrieved, it would patch the SITE_ID in a way similar to https://github.com/shestera/django-multisite/blob/master/multisite/middleware.py. Also, it looked at the store's urlconf and, if it was not None, it would set request.urlconf to apply its special URL requirements. After that, the current Store instance was stored in request.store. This has proven to be incredibly useful, because I was able to do things like this in my views:
def homepage(request):
    featured = Product.objects.filter(featured=True, store=request.store)
    ...
request.store became a natural additional dimension of the request object throughout the project for me.
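For readers who want to picture the middleware described above, here is a rough, Django-free sketch of its lookup logic. It is not the original `stores.middleware.StoreMiddleware`: the `STORES` dict is a hypothetical stand-in for the `Store` table, the domains and urlconf names are made up, the SITE_ID patching is left out, and `SimpleNamespace` objects stand in for Django requests.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the Store table: domain -> store settings.
STORES = {
    "shop-a.example": SimpleNamespace(domain="shop-a.example", urlconf="shop_a.urls"),
    "shop-b.example": SimpleNamespace(domain="shop-b.example", urlconf=None),
}
DOMAIN_SUFFIX = ".dev:8000"  # development suffix, as mentioned in the answer


class StoreMiddleware:
    """Callable middleware shape: wraps the next handler (get_response)."""

    def __init__(self, get_response, stores=STORES, suffix=DOMAIN_SUFFIX):
        self.get_response = get_response
        self.stores = stores
        self.suffix = suffix

    def __call__(self, request):
        host = request.META.get("HTTP_HOST", "")
        # Strip the development suffix before looking up the store.
        if self.suffix and host.endswith(self.suffix):
            host = host[:-len(self.suffix)]
        store = self.stores.get(host)
        request.store = store
        # Only override the URL configuration when the store defines one.
        if store is not None and store.urlconf is not None:
            request.urlconf = store.urlconf
        return self.get_response(request)


middleware = StoreMiddleware(lambda request: request)
request_a = middleware(SimpleNamespace(META={"HTTP_HOST": "shop-a.example.dev:8000"}))
request_b = middleware(SimpleNamespace(META={"HTTP_HOST": "shop-b.example"}))
```

A real version would query the database (ideally with some caching) and decide what to do when no store matches the host.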
Another thing that was defined on the Store class was a function get_absolute_url whose implementation looked roughly like this:
def get_absolute_url(self, to='/'):
    """
    Return an absolute url to this `Store` or to `to` on this store.
    The URL includes http:// and the domain name of the store.
    `to` can be an object with `get_absolute_url()` or an absolute path as string.
    """
    if isinstance(to, basestring):
        path = to
    elif hasattr(to, 'get_absolute_url'):
        path = to.get_absolute_url()
    else:
        raise ValueError(
            'Invalid argument (need a string or an object with get_absolute_url): %s' % to
        )
    url = 'http://%s%s%s' % (
        self.domain,
        # This setting allowed for a sane development environment
        # where I just set it to ".dev:8000" and configured `dnsmasq`.
        # The same value was also removed from the `Host` value in the middleware
        # before looking up the `Store` in database.
        settings.DOMAIN_SUFFIX,
        path
    )
    return url
So I could easily generate URLs to objects on stores other than the current one, e.g.:
# Redirect to `product` on `store`.
redirect(store.get_absolute_url(product))
This was basically all I needed to be able to implement a system allowing users to create a new e-shop living on its own domain via the Django admin.

Django Caching - How to generate custom key names?

Right now, I am retrieving information from an API, and I would like to cache the information I get back, so I do not have to constantly hit their server and use up my max API call requests. Right now, a user can search up a particular keyword, like "grapes", I would like to cache the retrieved string by calling "cache.set(search_result, info_retrieved, 600)" where "search_result" is the user's search result, in this case, "grapes". I want the key to be the user's search result, which is "grapes". I cannot do this since the cache requires the key to be a string. How can I get around this? I cannot use a database because the information updates too often.
I could use a database, but I would be writing information to it, then deleting it after a few minutes, which seems impractical. So, I just want to cache it temporarily.
As Shawn Chin mentioned, you should already have a string "version" of your search query, which would work just fine as a cache key.
One limitation with memcached (not sure about other backends) is that certain characters (notably, spaces) are not allowed in keys. The easiest way to get around this is to hash your string key into a hex digest and use that as a key:
from hashlib import sha1
key = sha1('grapes'.encode('utf-8')).hexdigest() # '35c4cdb50a9a6b4475da4a66d955ef2a9e1acc39'
If you might have different results for different users (or based on whatever criteria), you can tag/salt/flavor the key with a string representation of that information:
from hashlib import sha1
key = sha1(('%s:%s:%s' % (user.id, session.sessionid, 'grapes')).encode('utf-8')).hexdigest()
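The two snippets above can be folded into one small helper. This is just a sketch: `make_cache_key` and its `prefix` parameter are hypothetical names, not part of Django or memcached.

```python
from hashlib import sha1


def make_cache_key(query, *flavors, prefix="search"):
    """Build a memcached-safe key from a search query plus optional flavors."""
    # Flavors (user id, session id, ...) come first, then the query itself.
    raw = ":".join([str(flavor) for flavor in flavors] + [query])
    # The hex digest contains no spaces or control characters.
    return "%s:%s" % (prefix, sha1(raw.encode("utf-8")).hexdigest())


key = make_cache_key("red grapes", 42)
```

The resulting key could then be passed to a call such as cache.set(key, info_retrieved, 600), as in the question.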
You could also use django-newcache:
Newcache is an improved memcached cache backend for Django. It provides four major advantages over Django's built-in cache backend:
It supports pylibmc.
It allows for a function to be run on each key before it's sent to memcached.
It supports setting cache keys with infinite timeouts.
It mitigates the thundering herd problem.
It also has some pretty nice defaults. By default, the function that's run on each key is one that hashes, versions, and flavors the key. More on that later.

Practical rules for Django MiddleWare ordering?

The official documentation is a bit messy: 'before' and 'after' are used for the ordering of MiddleWare in a tuple, but in some places 'before' and 'after' refer to the request-response phases. Also, 'should be first' and 'should be last' are mixed, and it's not clear which end of the tuple counts as 'first'.
I do understand the difference, but it seems too complicated for a newbie in Django.
Can you suggest a correct ordering for the builtin MiddleWare classes (assuming we enable all of them) and, most importantly, explain WHY one goes before or after the others?
Here's the list, with the info from the docs I managed to find:
UpdateCacheMiddleware
    Before those that modify 'Vary:': SessionMiddleware, GZipMiddleware, LocaleMiddleware
GZipMiddleware
    Before any MW that may change or use the response body
    After UpdateCacheMiddleware: Modifies 'Vary:'
ConditionalGetMiddleware
    Before CommonMiddleware: uses its 'Etag:' header when USE_ETAGS=True
SessionMiddleware
    After UpdateCacheMiddleware: Modifies 'Vary:'
    Before TransactionMiddleware: we don't need transactions here
LocaleMiddleware
    One of the topmost, after SessionMiddleware, CacheMiddleware
    After UpdateCacheMiddleware: Modifies 'Vary:'
    After SessionMiddleware: uses session data
CommonMiddleware
    Before any MW that may change the response (it calculates ETags)
    After GZipMiddleware so it won't calculate an E-Tag on gzipped contents
    Close to the top: it redirects when APPEND_SLASH or PREPEND_WWW
CsrfViewMiddleware
    Before any view middleware that assumes that CSRF attacks have been dealt with
AuthenticationMiddleware
    After SessionMiddleware: uses session storage
MessageMiddleware
    After SessionMiddleware: can use Session-based storage
XViewMiddleware
TransactionMiddleware
    After MWs that use DB: SessionMiddleware (configurable to use DB)
    All *CacheMiddleWare is not affected (as an exception: uses own DB cursor)
FetchFromCacheMiddleware
    After those that modify 'Vary:': it uses them to pick a value for the cache hash-key
    After AuthenticationMiddleware so it's possible to use CACHE_MIDDLEWARE_ANONYMOUS_ONLY
FlatpageFallbackMiddleware
    Bottom: last resort
    Uses DB, however, is not a problem for TransactionMiddleware (yes?)
RedirectFallbackMiddleware
    Bottom: last resort
    Uses DB, however, is not a problem for TransactionMiddleware (yes?)
(I will add suggestions to this list to collect all of them in one place)
The most difficult part is that you have to consider both directions at the same time when setting the order. I would say that's a flaw in the design and I personally would opt for a separate request and response middleware order (so you wouldn't need hacks like FetchFromCacheMiddleware and UpdateCacheMiddleware).
But... alas, it's this way right now.
Either way, the idea of it all is that your request passes through the list of middlewares in top-down order for process_request and process_view. And it passes your response through process_response and process_exception in reverse order.
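The two directions described above can be simulated in a few lines. This is a simplified model, not Django's actual handler: it ignores process_view, process_exception and short-circuiting, and `run_middleware_stack` is a hypothetical name used only for this sketch.

```python
def run_middleware_stack(names):
    """Record the order in which middleware hooks would run for one request."""
    log = []
    # Request phase: top-down through the list.
    for name in names:
        log.append("process_request: " + name)
    log.append("view")
    # Response phase: bottom-up, so the first entry in the list runs last.
    for name in reversed(names):
        log.append("process_response: " + name)
    return log


order = run_middleware_stack([
    "UpdateCacheMiddleware",
    "SessionMiddleware",
    "FetchFromCacheMiddleware",
])
```

In this model UpdateCacheMiddleware is the first to see the request and the last to see the response, which is why it sits at the top of the list.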
With UpdateCacheMiddleware this means that any middleware that changes the Vary header of the HTTP response should be listed after it, so that it runs before UpdateCacheMiddleware in the response phase. If you change the order here, then it would be possible for some user to get a cached page meant for some other user.
How can you find out if the Vary header is changed by a middleware? You can either hope that there are docs available, or simply look at the source. It's usually quite obvious :)
One tip that can save your hair: place TransactionMiddleware in the list so that it cannot roll back database changes committed by other middlewares, when those changes should be committed whether or not the view raised an exception.