Finding when a django cache was set - django

I'm trying to implement intelligent cache invalidation in Django for my app with an algorithm of the sort:
page = get_cache_for_item(item_pk + some_key)
cache_set_at = page.SOMETHING_HERE
modified = models.Object.filter(pk=item_pk,modified__gt=cache_set_at).exists() #Cheap call
if modified:
page = get_from_database_and_build_slowly(item_pk)
set_cache_for_item(item_pk + some_key)
return page
The intent, is I want to do a quick call to the database to get the modified time, and if and only if the page was modified since the cache was set, build the page using the resource and database intensive page.
Unfortunately, I can't figure out how to get the time a cache was set at at the step SOMETHING_HERE.
Is this possible?

Django does not seem to store that information. The information is (if stored) in the cache implementation of your choice.
This is for example, the way Django stores a key in memcached.
def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
key = self.make_key(key, version=version)
if not self._cache.set(key, value, self.get_backend_timeout(timeout)):
# make sure the key doesn't keep its old value in case of failure to set (memcached's 1MB limit)
self._cache.delete(key)
Django does not store the creation time and lets the cache handle the timeout. So if any, you should look into the cache of your choice. I know that Redis, for example, does not store that value either, so you will not be able to make it work at all with redis, even if u bypass Django's cache and look into Redis.
I think your best choice is to store the key yourself somehow. You can maybe override the #cache_page or simply create an improved #smart_cache_page and store the timestamp of creation there.
EDIT:
There might be other easier ways to achieve that. You could use post_save signals. Something like this: Expire a view-cache in Django?
Read carefully through it since the implementation depends on your Django version.

Related

django update_or_create(), see what got updated

My django app uses update_or_create() to update a bunch of records. In some cases, updates are really few within a ton of records, and it would be nice to know what got updated within those records. Is it possible to know what got updated (i.e fields whose values got changed)? If not, does any one has ideas of workarounds to achieve that?
This will be invoked from the shell, so ideally it would be nice to be prompted for confirmation just before a value is being changed within update_or_create(), but if not that, knowing what got changed will also help.
Update (more context): Thought I'd give more context here. The data in this Django app gets updated through various means (through users coming on the web site, through the admin page, through scripts (run from the shell) that populate data from a csv etc.). The above question is important mostly for the shell scripts that update data from csvs, hence a solution at the database/trigger/signal level may not be helpful here (I guess).
This is what I ended up doing:
for row in reader:
school_obj0, created = Org.objects.get_or_create(school_id = row[0])
if (school_obj0.name != row[1]):
print (school_obj0.name, '==>', row[1])
confirmation = input('proceed? [y/n]: ')
if (confirmation == 'y'):
school_obj1, created = Org.objects.update_or_create(
school_id = row[0], defaults={"name": row[1],})
Happy to know about improvements to this approach (please see the update in the question with more context)
This will be invoked from the shell, so ideally it would be nice to be
prompted for confirmation just before a value is being changed
Unfortunately, databases don't work like that. It's the responsibility of applications to provide this functionality. And django isn't an application. You can however use django to write an application that provides this functionality.
As for finding out whether an object was updated or created, that's what the return value gives you. A tuple where the second value is a flag for update or create

django cache conditional view processing latest_entry

I have a model called aps
class aps(model.Model):
u=models.ForeignKey(User, related_name='u')
when=models.DateTimeField() #datetime when row was inserted
a=models.ForeignKey(User, related_name='a')
a_read=models.BooleanField()
a_last_read=models.DateTimeField()
The records from aps are retrieved like this:
def displayAps(request, name)
apz=aps.objects.order_by('-when').filter(u=User.objects.get(username=name))
return render_to_response(template.html, {'apz':apz})
And further they are shown in template.html ...
What I am trying to achieve is actually what django docz mention here about conditional view processing
I do something like:
def latest_entry(request, name):
return aps.objects.filter().latest('when').when
#cache_page(60*15)
#last_modified(latest_entry)
def displayAps(request, name)
apz=aps.objects.order_by('-when').filter(u=User.objects.get(username=name))
return render_to_response(template.html, {'apz':apz})
If new rows are added the content is not modified, but retrieved from cache. I need to delete the cache files and 'shift refresh' the browser to see the new rows.
Anyone sees what I am doing wrong here?
If you take out the cache_page decorator, does it work?
The cache_page decorator is the first piece of code that actually gets called in your view function. It is checking the timestamp, and returning the cached data, as it is supposed to. If the cache hasn't expired, the last_modified decorator won't ever be called.
You probably want to be more careful about mixing conditional response processing and static caching, anyway. They accomplish similar things, but using very different mechanisms.
cache_page tells django to only use the view to render an actual response every n seconds. If another request comes in before that, the same rendered content will be returned to the client -- whether it is actually stale or not. This reduces your server load, but doesn't do anything to reduce your bandwidth.
last_modified handles the case where the client says "I have a version of this page that is this old; is it still good?" In that case, your server can check the database, and return a very short "It's still good" response if the database hasn't changed. This cuts down your bandwidth needs significantly for those cases, but you still need to go to the database to determine whether the client's cache is stale or not, so your server load may be almost the same.
Like I mentioned above, you can't just apply cache_page before last_modfied -- if the database has changed, cache_page won't know about it. Worse, if the cache timeout has expired, but the database hasn't changed, then you might end up caching the "304 Not modified" message, and sending that for all subsequent visitors for the next fifteen minutes.
You could apply the decorators in the other order, but you have to make a request from the database for each request, and you could still get into a situation where the database has changed, but the cache hasn't expired -- in that case, the client could still be getting the old version of the page, even though the server has already hit the database to determine that it has been updated.
The filter depends on some logged user or something, you should id this user in cookies and use django's vary_on_cookie.

Django Caching - How to generate custom key names?

Right now, I am retrieving information from an API, and I would like to cache the information I get back, so I do not have to constantly hit their server and use up my max API call requests. Right now, a user can search up a particular keyword, like "grapes", I would like to cache the retrieved string by calling "cache.set(search_result, info_retrieved, 600)" where "search_result" is the user's search result, in this case, "grapes". I want the key to be the user's search result, which is "grapes". I cannot do this since the cache requires the key to be a string. How can I get around this? I cannot use a database because the information updates too often.
I could use a database, but I would be writing information to it, then deleting it after a few minutes, which seems impractical. So, I just want to cache it temporarily.
As Shawn Chin mentioned, you should already have a string "version" of your search query, which would work just fine as a cache key.
One limitation with memcached (not sure about other backends) is that certain characters (notably, spaces) are not allowed in keys. The easiest way to get around this is to hash your string key into a hex digest and use that as a key:
from hashlib import sha1
key = sha1('grapes').hexdigest() # '35c4cdb50a9a6b4475da4a66d955ef2a9e1acc39'
If you might have different results for different users (or based on whatever criteria), you can tag/salt/flavor the key with a string representation of that information:
from hashlib import sha1
key = sha1('%s:%s:%s' % (user.id, session.sessionid, 'grapes')).hexdigest()
You could also use django-newcache:
Newcache is an improved memcached cache backend for Django. It provides four major advantages over Django's built-in cache backend:
It supports pylibmc.
It allows for a function to be run on each key before it's sent to memcached.
It supports setting cache keys with infinite timeouts.
It mitigates the thundering herd problem.
It also has some pretty nice defaults. By default, the function that's run on each key is one that hashes, versions, and flavors the key. More on that later.

Invalidating a path from the Django cache recursively

I am deleting a single path from the Django cache like this:
from models import Graph
from django.http import HttpRequest
from django.utils.cache import get_cache_key
from django.db.models.signals import post_save
from django.core.cache import cache
def expire_page(path):
request = HttpRequest()
request.path = path
key = get_cache_key(request)
if cache.has_key(key):
cache.delete(key)
def invalidate_cache(sender, instance, **kwargs):
expire_page(instance.get_absolute_url())
post_save.connect(invalidate_cache, sender = Graph)
This works - but is there a way to delete recursively? My paths look like this:
/graph/123
/graph/123/2009-08-01/2009-10-21
Whenever the graph with id "123" is saved, the cache for both paths needs to be invalidated. Can this be done?
You might want to consider employing a generational caching strategy, it seems like it would fit what you are trying to accomplish. In the code that you have provided, you would store a "generation" number for each absolute url. So for example you would initialize the "/graph/123" to have a generation of one, then its cache key would become something like "/GENERATION/1/graph/123". When you want to expire the cache for that absolute url you increment its generation value (to two in this case). That way, the next time someone goes to look up "/graph/123" the cache key becomes "/GENERATION/2/graph/123". This also solves the issue of expiring all the sub pages since they should be referring to the same cache key as "/graph/123".
Its a bit tricky to understand at first but it is a really elegant caching strategy which if done correctly means you never have to actually delete anything from cache. For more information here is a presentation on generational caching, its for Rails but the concept is the same, regardless of language.
Another option is to use a cache that supports tagging keys and evicting keys by tag. Django's built-in cache API does not have support for this approach. But at least one cache backend (not part of Django proper) does have support.
DiskCache* is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django. To use DiskCache in your project simply install it and configure your CACHES setting.
Installation is easy with pip:
$ pip install diskcache
Then configure your CACHES setting:
CACHES = {
'default': {
'BACKEND': 'diskcache.DjangoCache',
'LOCATION': '/tmp/path/to/directory/',
}
}
The cache set method is extended by an optional tag keyword argument like so:
from django.core.cache import cache
cache.set('/graph/123', value, tag='/graph/123')
cache.set('/graph/123/2009-08-01/2009-10-21', other_value, tag='/graph/123')
diskcache.DjangoCache uses a diskcache.FanoutCache internally. The corresponding FanoutCache is accessible through the _cache attribute and exposes an evict method. To evict all keys tagged with /graph/123 simply:
cache._cache.evict('/graph/123')
Though it may feel awkward to access an underscore-prefixed attribute, the DiskCache project is stable and unlikely to make significant changes to the DjangoCache implementation.
The Django cache benchmarks page has a discussion of alternative cache backends.
Disclaimer: I am the original author of the DiskCache project.
Checkout shutils.rmtree() or os.removedirs(). I think the first is probably what you want.
Update based on several comments: Actually, the Django caching mechanism is more general and finer-grained than just using the path for the key (although you can use it at that level). We have some pages that have 7 or 8 separately cached subcomponents that expire based on a range of criteria. Our component cache names reflect the key objects (or object classes) and are used to identify what needs to be invalidated on certain updates.
All of our pages have an overall cache-key based on member/non-member status, but that is only about 95% of the page. The other 5% can change on a per-member basis and so is not cached at all.
How you iterate through your cache to find invalid items is a function of how it's actually stored. If it's files you can use simply globs and/or recursive directory deletes, if it's some other mechanism then you'll have to use something else.
What my answer, and some of the comments by others, are trying to say is that how you accomplish cache invalidation is intimately tied to how you are using/storing the cache.
Second Update: #andybak: So I guess your comment means that all of my commercial Django sites are going to explode in flames? Thanks for the heads up on that. I notice you did not attempt an answer to the problem.
Knipknap's problem is that he has a group of cache items that appear to be related and in a hierarchy because of their names, but the key-generation logic of the cache mechanism obliterates that name by creating an MD5 hash of the path + vary_on. Since there is no trace of the original path/params you will have to exhaustively guess all possible path/params combinations, hoping you can find the right group. I have other hobbies that are more interesting.
If you wish to be able to find groups of cached items based on some combination of path and/or parameter values you must either use cache keys that can be pattern matched directly or some system that retains this information for use at search time.
Because we had needs not-unrelated to the OP's problem, we took control of template fragment caching -- and specifically key generation -- over 2 years ago. It allows us to use regexps in a number of ways to efficiently invalidate groups of related cached items. We also added a default timeout and vary_on variable names (resolved at run time) configurable in settings.py, changed the ordering of name & timeout because it made no sense to always have to override the default timeout in order to name the fragment, made the fragment_name resolvable (ie. it can be a variable) to work better with a multi-level template inheritance scheme, and a few other things.
The only reason for my initial answer, which was indeed wrong for current Django, was because I have been using saner cache keys for so long I literally forgot the simple mechanism we walked away from.

Django: How can I protect against concurrent modification of database entries

If there a way to protect against concurrent modifications of the same data base entry by two or more users?
It would be acceptable to show an error message to the user performing the second commit/save operation, but data should not be silently overwritten.
I think locking the entry is not an option, as a user might use the "Back" button or simply close his browser, leaving the lock for ever.
This is how I do optimistic locking in Django:
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
The code listed above can be implemented as a method in Custom Manager.
I am making the following assumptions:
filter().update() will result in a single database query because filter is lazy
a database query is atomic
These assumptions are enough to ensure that no one else has updated the entry before. If multiple rows are updated this way you should use transactions.
WARNING Django Doc:
Be aware that the update() method is
converted directly to an SQL
statement. It is a bulk operation for
direct updates. It doesn't run any
save() methods on your models, or emit
the pre_save or post_save signals
This question is a bit old and my answer a bit late, but after what I understand this has been fixed in Django 1.4 using:
select_for_update(nowait=True)
see the docs
Returns a queryset that will lock rows until the end of the transaction, generating a SELECT ... FOR UPDATE SQL statement on supported databases.
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
Of course this will only work if the back-end support the "select for update" feature, which for example sqlite doesn't. Unfortunately: nowait=True is not supported by MySql, there you have to use: nowait=False, which will only block until the lock is released.
Actually, transactions don't help you much here ... unless you want to have transactions running over multiple HTTP requests (which you most probably don't want).
What we usually use in those cases is "Optimistic Locking". The Django ORM doesn't support that as far as I know. But there has been some discussion about adding this feature.
So you are on your own. Basically, what you should do is add a "version" field to your model and pass it to the user as a hidden field. The normal cycle for an update is :
read the data and show it to the user
user modify data
user post the data
the app saves it back in the database.
To implement optimistic locking, when you save the data, you check if the version that you got back from the user is the same as the one in the database, and then update the database and increment the version. If they are not, it means that there has been a change since the data was loaded.
You can do that with a single SQL call with something like :
UPDATE ... WHERE version = 'version_from_user';
This call will update the database only if the version is still the same.
Django 1.11 has three convenient options to handle this situation depending on your business logic requirements:
Something.objects.select_for_update() will block until the model become free
Something.objects.select_for_update(nowait=True) and catch DatabaseError if the model is currently locked for update
Something.objects.select_for_update(skip_locked=True) will not return the objects that are currently locked
In my application, which has both interactive and batch workflows on various models, I found these three options to solve most of my concurrent processing scenarios.
The "waiting" select_for_update is very convenient in sequential batch processes - I want them all to execute, but let them take their time. The nowait is used when an user wants to modify an object that is currently locked for update - I will just tell them it's being modified at this moment.
The skip_locked is useful for another type of update, when users can trigger a rescan of an object - and I don't care who triggers it, as long as it's triggered, so skip_locked allows me to silently skip the duplicated triggers.
For future reference, check out https://github.com/RobCombs/django-locking. It does locking in a way that doesn't leave everlasting locks, by a mixture of javascript unlocking when the user leaves the page, and lock timeouts (e.g. in case the user's browser crashes). The documentation is pretty complete.
You should probably use the django transaction middleware at least, even regardless of this problem.
As to your actual problem of having multiple users editing the same data... yes, use locking. OR:
Check what version a user is updating against (do this securely, so users can't simply hack the system to say they were updating the latest copy!), and only update if that version is current. Otherwise, send the user back a new page with the original version they were editing, their submitted version, and the new version(s) written by others. Ask them to merge the changes into one, completely up-to-date version. You might try to auto-merge these using a toolset like diff+patch, but you'll need to have the manual merge method working for failure cases anyway, so start with that. Also, you'll need to preserve version history, and allow admins to revert changes, in case someone unintentionally or intentionally messes up the merge. But you should probably have that anyway.
There's very likely a django app/library that does most of this for you.
Another thing to look for is the word "atomic". An atomic operation means that your database change will either happen successfully, or fail obviously. A quick search shows this question asking about atomic operations in Django.
The idea above
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
looks great and should work fine even without serializable transactions.
The problem is how to augment the deafult .save() behavior as to not have to do manual plumbing to call the .update() method.
I looked at the Custom Manager idea.
My plan is to override the Manager _update method that is called by Model.save_base() to perform the update.
This is the current code in Django 1.3
def _update(self, values, **kwargs):
return self.get_query_set()._update(values, **kwargs)
What needs to be done IMHO is something like:
def _update(self, values, **kwargs):
#TODO Get version field value
v = self.get_version_field_value(values[0])
return self.get_query_set().filter(Q(version=v))._update(values, **kwargs)
Similar thing needs to happen on delete. However delete is a bit more difficult as Django is implementing quite some voodoo in this area through django.db.models.deletion.Collector.
It is weird that modren tool like Django lacks guidance for Optimictic Concurency Control.
I will update this post when I solve the riddle. Hopefully solution will be in a nice pythonic way that does not involve tons of coding, weird views, skipping essential pieces of Django etc.
To be safe the database needs to support transactions.
If the fields is "free-form" e.g. text etc. and you need to allow several users to be able to edit the same fields (you can't have single user ownership to the data), you could store the original data in a variable.
When the user committs, check if the input data has changed from the original data (if not, you don't need to bother the DB by rewriting old data),
if the original data compared to the current data in the db is the same you can save, if it has changed you can show the user the difference and ask the user what to do.
If the fields is numbers e.g. account balance, number of items in a store etc., you can handle it more automatically if you calculate the difference between the original value (stored when the user started filling out the form) and the new value you can start a transaction read the current value and add the difference, then end transaction. If you can't have negative values, you should abort the transaction if the result is negative, and tell the user.
I don't know django, so I can't give you teh cod3s.. ;)
From here:
How to prevent overwriting an object someone else has modified
I'm assuming that the timestamp will be held as a hidden field in the form you're trying to save the details of.
def save(self):
if(self.id):
foo = Foo.objects.get(pk=self.id)
if(foo.timestamp > self.timestamp):
raise Exception, "trying to save outdated Foo"
super(Foo, self).save()