Django Caching - How to generate custom key names?

I am retrieving information from an API and would like to cache what I get back, so I do not have to constantly hit their server and use up my maximum number of API calls. A user can search for a particular keyword, like "grapes", and I would like to cache the retrieved string by calling cache.set(search_result, info_retrieved, 600), where search_result is the user's search term, in this case "grapes". In other words, I want the key to be the user's search term. The problem is that the cache requires the key to be a valid string, and an arbitrary search term is not always one. How can I get around this? I cannot use a database because the information updates too often.
I could use a database, but I would be writing information to it, then deleting it after a few minutes, which seems impractical. So, I just want to cache it temporarily.

As Shawn Chin mentioned, you should already have a string "version" of your search query, which would work just fine as a cache key.
One limitation with memcached (not sure about other backends) is that certain characters (notably, spaces) are not allowed in keys. The easiest way to get around this is to hash your string key into a hex digest and use that as a key:
from hashlib import sha1

key = sha1(b'grapes').hexdigest()  # '35c4cdb50a9a6b4475da4a66d955ef2a9e1acc39'
If you might have different results for different users (or based on whatever criteria), you can tag/salt/flavor the key with a string representation of that information:
from hashlib import sha1

key = sha1(('%s:%s:%s' % (user.id, session.session_key, 'grapes')).encode('utf-8')).hexdigest()
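For completeness, here is a sketch of how the hashed key might be used with the question's 600-second timeout. fetch_from_api is a hypothetical stand-in for whatever call hits the external API:

from hashlib import sha1
from django.core.cache import cache

def get_search_results(query, fetch_from_api):
    # Hash the raw search term so the cache key is always short and valid.
    key = 'search:%s' % sha1(query.encode('utf-8')).hexdigest()
    results = cache.get(key)
    if results is None:
        results = fetch_from_api(query)   # hypothetical external API call
        cache.set(key, results, 600)      # cache for 10 minutes, as in the question
    return results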
You could also use django-newcache:
Newcache is an improved memcached cache backend for Django. It provides four major advantages over Django's built-in cache backend:
It supports pylibmc.
It allows for a function to be run on each key before it's sent to memcached.
It supports setting cache keys with infinite timeouts.
It mitigates the thundering herd problem.
It also has some pretty nice defaults. By default, the function that's run on each key is one that hashes, versions, and flavors the key. More on that later.
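Current versions of Django offer something similar to newcache's key transformation out of the box: the KEY_FUNCTION option in the CACHES setting runs every key through a function of your choosing before it reaches the backend. A minimal sketch (module path and backend choice are illustrative) that hashes keys so spaces and overly long strings never reach memcached:

# myapp/cache.py
from hashlib import sha1

def hashed_key(key, key_prefix, version):
    # Hash the raw key so the final key is always short and memcached-safe.
    return '%s:%s:%s' % (key_prefix, version, sha1(key.encode('utf-8')).hexdigest())

# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        'LOCATION': '127.0.0.1:11211',
        'KEY_FUNCTION': 'myapp.cache.hashed_key',
    }
}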

Related

Why use pagination tokens?

I am implementing pagination on a webservice. My first thought was to use query params page and size, like Spring Data.
However, we are basing some of our design on the Google web service APIs. I notice that they use pagination tokens, with each page of results containing a nextPageToken. What are the advantages of this approach? Handling changing data? What kind of info would be encoded in such a token?
When paginating with an offset, inserts and deletes into the data between your requests will cause rows to be skipped or included twice. If you are able to keep track in the token of what you returned previously (via a key or whatever), you can guarantee you don't return the same result twice, even when there are inserts/deletes between requests.
I'm a little uncertain of how you would encode a token, but for a single table at least it seems you could use an encoded version of the primary key as a limit: "I just returned everything before key=200. Next time I'll only return things after 200." I guess this assumes a new item inserted between requests 1 and 2 will be given a key greater than existing keys.
https://use-the-index-luke.com/no-offset
One reason opaque strings are used for pagination tokens is that you can change how pagination is implemented without breaking your clients. A query param like page (I assume you mean a page number) is transparent to your clients and allows them to make assumptions about it.
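As an illustration (this is not how Google actually builds its tokens), a keyset-style nextPageToken can be as simple as an opaque, base64-encoded record of the last key returned:

import base64
import json

def encode_page_token(last_id):
    # Opaque to the client: just base64-encoded JSON.
    return base64.urlsafe_b64encode(json.dumps({'after': last_id}).encode()).decode()

def decode_page_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()))['after']

# Server side (pseudo-SQL): SELECT ... WHERE id > :after ORDER BY id LIMIT :size,
# then return nextPageToken = encode_page_token(last_row.id) with the page.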

Finding when a django cache was set

I'm trying to implement intelligent cache invalidation in Django for my app with an algorithm of the sort:
page = get_cache_for_item(item_pk + some_key)
cache_set_at = page.SOMETHING_HERE
modified = models.Object.filter(pk=item_pk, modified__gt=cache_set_at).exists()  # cheap call
if modified:
    page = get_from_database_and_build_slowly(item_pk)
    set_cache_for_item(item_pk + some_key)
return page
The intent is to make a quick call to the database to get the modified time and, if and only if the page was modified after the cache was set, rebuild the page with the resource- and database-intensive call.
Unfortunately, I can't figure out how to get the time the cache entry was set, at the step SOMETHING_HERE.
Is this possible?
Django does not seem to store that information. If it is stored at all, it is stored by the cache backend of your choice.
This is, for example, the way Django stores a key in memcached:
def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
    key = self.make_key(key, version=version)
    if not self._cache.set(key, value, self.get_backend_timeout(timeout)):
        # make sure the key doesn't keep its old value in case of failure to set
        # (memcached's 1MB limit)
        self._cache.delete(key)
Django does not store the creation time; it passes the timeout to the cache and lets the backend handle expiry. So if that information exists anywhere, it is inside the backend of your choice. I know that Redis, for example, does not store that value either, so you will not be able to make this work with Redis at all, even if you bypass Django's cache and query Redis directly.
I think your best option is to store that timestamp yourself. You could override the cache_page decorator, or simply create an improved smart_cache_page, and store the creation timestamp there.
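A minimal sketch of that idea, reusing the (hypothetical) helper names from the question and storing the creation time next to the cached value:

from django.core.cache import cache
from django.utils import timezone

def set_cache_for_item(key, page, timeout=600):
    # Store when the entry was created alongside the value itself, so it can
    # later be compared against the model's `modified` field.
    cache.set(key, {'set_at': timezone.now(), 'page': page}, timeout)

def get_cache_for_item(key):
    entry = cache.get(key)
    if entry is None:
        return None, None
    return entry['page'], entry['set_at']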
EDIT:
There might be other, easier ways to achieve this. You could use post_save signals. Something like this: Expire a view-cache in Django?
Read carefully through it since the implementation depends on your Django version.

Client side id generation strategy for REST web service

Let's say I want to build a REST service for making notes that looks something like this:
GET /notes/ // gives me all notes
GET /notes/{id} // gives the note identified by {id}
DELETE /notes/{id} // delete note
PUT /notes/{id} // creates a new note if there is no note identified by {id}
// otherwise the existing note is updated
Since I want my service to be idempotent, I'm using PUT to create and update my notes, which implies that the ids of new notes are set/generated by the client.
I thought of using GUIDs/UUIDs, but they are pretty long and would make remembering the URLs rather difficult. Also, from a database perspective, such long string ids can hurt performance when used as the primary key in big tables.
Do you know a good id generation strategy, that generates short ids and of course avoids collisions?
There is a reason why highly distributed systems (like git, MongoDB, etc.) use long UUIDs/hashes while centralized relational databases (or svn, for that matter) can simply use ints. There is no easy way of creating short ids on the client side in a distributed fashion. Either the server handles them, or you must live with wasteful ids. Typically they contain an encoded timestamp, a client/computer id, hashed content, etc.
That's why REST services typically use
POST /notes
for the non-idempotent create, and then read the Location: header in the response:
Location: /notes/42
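From the client's point of view the flow looks roughly like this (the endpoint and payload are made up):

import requests

# Let the server allocate the id.
resp = requests.post('https://api.example.com/notes', json={'text': 'buy milk'})
note_url = resp.headers['Location']  # e.g. '/notes/42'

# Subsequent operations on that note use the server-assigned URL and are idempotent.
requests.put('https://api.example.com' + note_url, json={'text': 'buy milk and bread'})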

Django: Creating a unique identifier for a user based on request.META values

I'm looking at creating an anonymous poll. However, I want to prevent users from voting twice. I was thinking of hashing some request.META values like so:
from hashlib import md5

request_id_keys = (
    'HTTP_ACCEPT_CHARSET',
    'HTTP_ACCEPT',
    'HTTP_ACCEPT_ENCODING',
    'HTTP_ACCEPT_LANGUAGE',
    'HTTP_CONNECTION',
    'HTTP_USER_AGENT',
    'REMOTE_ADDR',
)

request_id = md5(
    '|'.join(request.META.get(k, '') for k in request_id_keys).encode('utf-8')
).hexdigest()
My questions:
Good idea? Bad idea? Why?
Are some of these keys redundant or just overkill? Why?
Are some of these easily changeable? For example, I'm considering removing HTTP_USER_AGENT because I know that's just a simple config change.
Do you know of a better way to build this semi-unique identifier, one that is flexible enough to handle people sharing IPs (NAT), but where a simple config change won't create a new hash?
All of these params are fairly easy to change. Why not just use a cookie for that purpose? Something like evercookie, perhaps:
evercookie is a javascript API available that produces extremely persistent cookies in a browser. Its goal is to identify a client even after they've removed standard cookies, Flash cookies (Local Shared Objects or LSOs), and others.
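If a plain (signed) cookie is good enough for your threat model, a rough sketch of that approach in a Django view might look like this. The view and cookie names are made up, and a determined user can still clear cookies, which is where evercookie-style persistence comes in:

from django.http import HttpResponse

def vote(request, poll_id):
    cookie_name = 'voted_%s' % poll_id
    if request.get_signed_cookie(cookie_name, default=None):
        return HttpResponse('You have already voted.', status=403)
    # ... record the vote here ...
    response = HttpResponse('Thanks for voting!')
    # Signed so the value cannot be trivially forged; max_age of one year.
    response.set_signed_cookie(cookie_name, '1', max_age=60 * 60 * 24 * 365)
    return response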

Invalidating a path from the Django cache recursively

I am deleting a single path from the Django cache like this:
from models import Graph
from django.http import HttpRequest
from django.utils.cache import get_cache_key
from django.db.models.signals import post_save
from django.core.cache import cache
def expire_page(path):
    request = HttpRequest()
    request.path = path
    key = get_cache_key(request)
    if cache.has_key(key):
        cache.delete(key)

def invalidate_cache(sender, instance, **kwargs):
    expire_page(instance.get_absolute_url())

post_save.connect(invalidate_cache, sender=Graph)
This works - but is there a way to delete recursively? My paths look like this:
/graph/123
/graph/123/2009-08-01/2009-10-21
Whenever the graph with id "123" is saved, the cache for both paths needs to be invalidated. Can this be done?
You might want to consider employing a generational caching strategy; it seems like it would fit what you are trying to accomplish. In the code that you have provided, you would store a "generation" number for each absolute url. So, for example, you would initialize "/graph/123" to have a generation of one, and its cache key would become something like "/GENERATION/1/graph/123". When you want to expire the cache for that absolute url, you increment its generation value (to two in this case). That way, the next time someone looks up "/graph/123", the cache key becomes "/GENERATION/2/graph/123". This also solves the issue of expiring all the sub-pages, since their keys should be built from the same generation value as "/graph/123".
It's a bit tricky to understand at first, but it is a really elegant caching strategy which, done correctly, means you never have to actually delete anything from the cache. For more information, here is a presentation on generational caching; it's for Rails, but the concept is the same regardless of language.
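A minimal sketch of the generation idea using Django's cache API (the key layout and helper names are illustrative):

from django.core.cache import cache

def generation_for(base_path):
    # The generation counter itself lives in the cache; default to 1.
    return cache.get('generation:%s' % base_path, 1)

def generational_key(base_path, sub_path=''):
    return '/GENERATION/%s%s%s' % (generation_for(base_path), base_path, sub_path)

def expire_page_tree(base_path):
    # Bump the generation; stale entries are never read again and simply expire.
    try:
        cache.incr('generation:%s' % base_path)
    except ValueError:
        cache.set('generation:%s' % base_path, 2)

# Usage:
#   cache.set(generational_key('/graph/123'), page_html, 600)
#   cache.set(generational_key('/graph/123', '/2009-08-01/2009-10-21'), range_html, 600)
#   expire_page_tree('/graph/123')   # invalidates both in one step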
Another option is to use a cache that supports tagging keys and evicting keys by tag. Django's built-in cache API does not have support for this approach. But at least one cache backend (not part of Django proper) does have support.
DiskCache is an Apache2-licensed, disk- and file-backed cache library, written in pure Python and compatible with Django. To use DiskCache in your project, simply install it and configure your CACHES setting.
Installation is easy with pip:
$ pip install diskcache
Then configure your CACHES setting:
CACHES = {
    'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': '/tmp/path/to/directory/',
    }
}
The cache set method is extended by an optional tag keyword argument like so:
from django.core.cache import cache
cache.set('/graph/123', value, tag='/graph/123')
cache.set('/graph/123/2009-08-01/2009-10-21', other_value, tag='/graph/123')
diskcache.DjangoCache uses a diskcache.FanoutCache internally. The corresponding FanoutCache is accessible through the _cache attribute and exposes an evict method. To evict all keys tagged with /graph/123 simply:
cache._cache.evict('/graph/123')
Though it may feel awkward to access an underscore-prefixed attribute, the DiskCache project is stable and unlikely to make significant changes to the DjangoCache implementation.
The Django cache benchmarks page has a discussion of alternative cache backends.
Disclaimer: I am the original author of the DiskCache project.
Check out shutil.rmtree() or os.removedirs(). I think the first is probably what you want.
Update based on several comments: Actually, the Django caching mechanism is more general and finer-grained than just using the path for the key (although you can use it at that level). We have some pages that have 7 or 8 separately cached subcomponents that expire based on a range of criteria. Our component cache names reflect the key objects (or object classes) and are used to identify what needs to be invalidated on certain updates.
All of our pages have an overall cache-key based on member/non-member status, but that is only about 95% of the page. The other 5% can change on a per-member basis and so is not cached at all.
How you iterate through your cache to find invalid items is a function of how the cache is actually stored. If it's files, you can simply use globs and/or recursive directory deletes; if it's some other mechanism, you'll have to use something else.
What my answer, and some of the comments by others, are trying to say is that how you accomplish cache invalidation is intimately tied to how you are using/storing the cache.
Second Update: #andybak: So I guess your comment means that all of my commercial Django sites are going to explode in flames? Thanks for the heads up on that. I notice you did not attempt an answer to the problem.
Knipknap's problem is that he has a group of cache items that appear to be related and in a hierarchy because of their names, but the key-generation logic of the cache mechanism obliterates that name by creating an MD5 hash of the path + vary_on. Since there is no trace of the original path/params you will have to exhaustively guess all possible path/params combinations, hoping you can find the right group. I have other hobbies that are more interesting.
If you wish to be able to find groups of cached items based on some combination of path and/or parameter values you must either use cache keys that can be pattern matched directly or some system that retains this information for use at search time.
Because we had needs not unrelated to the OP's problem, we took control of template fragment caching, and specifically key generation, over two years ago. It allows us to use regexps in a number of ways to efficiently invalidate groups of related cached items. We also made several other changes: a default timeout and vary_on variable names (resolved at run time) configurable in settings.py, a reordering of name and timeout because it made no sense to always have to override the default timeout in order to name the fragment, and a resolvable fragment_name (i.e. it can be a variable) to work better with a multi-level template inheritance scheme, among a few other things.
The only reason for my initial answer, which was indeed wrong for current Django, is that I have been using saner cache keys for so long that I literally forgot the simple mechanism we walked away from.