Hierarchical cache in Django

What I want to do is mark some values in the cache as related, so that I can delete them all at once. For example, when I insert a new entry into the database, I want to delete everything in the cache that was based on the old database values.
I could always use cache.clear(), but that seems too brutal to me. I could store related values together in a dictionary and cache that dictionary. Or I could maintain some kind of index in an extra cache field. But all of this seems too complicated to me (and eventually slow?).
What do you think? Is there an existing solution? Or is my approach wrong? Thanks for any answers.

Are you using the cache API? It sounds like it.
This post, which pointed me to these slides, helped me create a nice generational caching system that let me build the hierarchy I wanted.
In short, you store a generation key (such as the group name) in your cache and incorporate its stored value into your key-creation function, so that you can invalidate a whole set of keys at once.
With this basic concept you can create highly complex hierarchies or just a simple group system.
For example:
import hashlib

from django.core.cache import cache


class Cache(object):
    def generate_cache_key(self, key, group=None):
        """
        Generate a cache key, relating keys via an outside source (the group).

        Produces a key such as 'key-your-key-here:group-1', where '1' is the
        group's current generation number, then hashes it.
        Note: adapted from pseudo code; treat it as a minimal sketch.
        """
        key_fragments = [('key', key)]
        if group:
            # The group's generation defaults to '1' until it is first bumped.
            key_fragments.append((group, cache.get(group, '1')))
        combined_key = ':'.join(
            '%s-%s' % (name, value) for name, value in key_fragments)
        return hashlib.md5(combined_key.encode('utf-8')).hexdigest()

    def increment_group(self, group):
        """
        Invalidate an entire group by bumping its generation number.
        """
        try:
            cache.incr(group)
        except ValueError:
            # incr raises ValueError if the key doesn't exist yet; jump to
            # generation 2, since generate_cache_key assumed '1'.
            cache.set(group, 2)

    def set(self, key, value, group=None):
        cache.set(self.generate_cache_key(key, group), value)

    def get(self, key, group=None):
        return cache.get(self.generate_cache_key(key, group))
# example (the instance is named c to avoid confusion with the django cache)
>>> c = Cache()
>>> c.set('key', 'value', 'somehow_related')
>>> c.set('key2', 'value2', 'somehow_related')
>>> c.increment_group('somehow_related')
>>> c.get('key')   # both invalidated now; returns None
>>> c.get('key2')  # both invalidated now; returns None

Caching a dict or something serialised (with JSON or the like) sounds good to me. The cache backends are key-value stores like memcached; they aren't hierarchical.
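For instance, a minimal sketch of that dictionary approach (the key names are purely illustrative): group the related values under one cache key, so a single delete invalidates them all.

import json

from django.core.cache import cache

# Everything derived from the same table lives under one key.
related = {
    'totals': {'jan': 1000, 'feb': 1500},
    'by_tag': {'tag1': 2500},
}
cache.set('report-values', json.dumps(related))

# After inserting a new database entry, invalidate the whole group at once.
cache.delete('report-values')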

Related

Efficient Bulk Update of Model Records in Django

I'm building a Django app that will periodically take information from an external source and use it to update model objects.
What I want to be able to do is create a QuerySet containing all the objects that might match the final list, then check which model objects need to be created, updated, and deleted, and then (ideally) perform the update in the fewest number of transactions, without performing any unnecessary DB operations.
Using update_or_create gets me most of the way to what I want to do.
jobs = get_current_jobs(host, user)
for host, user, name, defaults in jobs:
    obj, _ = Job.objects.update_or_create(host=host, user=user, name=name,
                                          defaults=defaults)
The problem with this approach is that it doesn't delete anything that no longer exists.
I could just delete everything up front, or do something dumb like
to_delete = set(Job.objects.filter(host=host, user=user)) - set(current)
(which is an option), but I feel like there must be a more elegant solution that doesn't require either deleting everything or reading everything into memory.
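For the delete step specifically, one way to avoid reading everything into memory is to let the database compute the set difference (a sketch; current_names is an assumed list of the identifying names from the external source):

current_names = [name for _, _, name, _ in jobs]
# Delete rows for this host/user whose name no longer appears in the feed,
# without loading the stale objects first.
Job.objects.filter(host=host, user=user).exclude(name__in=current_names).delete()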
You should use Redis for storage, with the redis-py package in your code. For example:
import json

import redis
import requests

pool = redis.StrictRedis('localhost')
time_in_seconds = 3600  # how long you want to keep the data

response = requests.get('url_to_ext_source')
# Redis values must be strings/bytes, so serialise the JSON payload first.
pool.set('api_response', json.dumps(response.json()), ex=time_in_seconds)

Django Sessions via Memcache: Cannot find session key manually

I recently migrated from database backed sessions to sessions stored via memcached using pylibmc.
Here are my CACHES, SESSION_CACHE_ALIAS and SESSION_ENGINE settings from my settings.py:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyLibMCCache',
        'LOCATION': ['127.0.0.1:11211'],
    }
}
SESSION_CACHE_ALIAS = 'default'
SESSION_ENGINE = 'django.contrib.sessions.backends.cache'
Everything is working fine behind the scenes and I can see that it is using the new caching system. Running the get_stats() method from pylibmc shows me the number of current items in the cache and I can see that it has gone up by 1.
The issue is I'm unable to grab the session manually using pylibmc.
Upon inspecting the request session data in views.py:
def my_view(request):
    if request.user.is_authenticated():
        print request.session.session_key
        # prints something like: "1ay2kcv7axb3nu5fwnwoyf85wkwsttz9"
        print request.session.cache_key
        # prints something like: "django.contrib.sessions.cache1ay2kcv7axb3nu5fwnwoyf85wkwsttz9"
        return HttpResponse(status=200)
    else:
        return HttpResponse(status=401)
I noticed that when printing cache_key, it prints with the default KEY_PREFIX, whereas session_key doesn't. Take a look at the comments in the code to see what I mean.
So I figured, "Ok great, one of these key names should work. Let me try grabbing the session data manually just for educational purposes":
import pylibmc
mc = pylibmc.Client(['127.0.0.1:11211'])
# Let's try key "1ay2kcv7axb3nu5fwnwoyf85wkwsttz9"
mc.get("1ay2kcv7axb3nu5fwnwoyf85wkwsttz9")
Hmm, nothing happens; no key exists by that name. OK, no worries, let's try the cache_key then; that should definitely work, right?
mc.get("django.contrib.sessions.cache1ay2kcv7axb3nu5fwnwoyf85wkwsttz9")
What? How am I still getting nothing back? As a test, I set and get a random key-value pair, and that works. I run get_stats() again just to make sure that the key does exist. I also test the web app to confirm that my session is working, and it is. So this leads me to conclude that there is a different naming scheme that I'm unaware of.
If so, what is the correct naming scheme?
Yes, the cache key used internally by Django is, in general, different from the key sent to the cache backend (in this case pylibmc/memcached). Let's call these two keys the Django cache key and the final cache key, respectively.
The django cache key given by request.session.cache_key is for use with Django's low-level cache API, e.g.:
>>> from django.core.cache import cache
>>> cache.get(request.session.cache_key)
{'_auth_user_hash': '1ay2kcv7axb3nu5fwnwoyf85wkwsttz9', '_auth_user_id': u'1', '_auth_user_backend': u'django.contrib.auth.backends.ModelBackend'}
The final cache key, on the other hand, is a composition of the key prefix, the Django cache key, and the cache version number. The make_key function (from the Django docs) below demonstrates how these three values are composed to generate this key:
def make_key(key, key_prefix, version):
    return ':'.join([key_prefix, str(version), key])
By default, key_prefix is the empty string and version is 1.
Finally, by inspecting make_key we find that the correct final cache key to pass to mc.get is
:1:django.contrib.sessions.cache1ay2kcv7axb3nu5fwnwoyf85wkwsttz9
which has the form <KEY_PREFIX>:<VERSION>:<KEY>.
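Continuing the example from the question, passing this final key to pylibmc directly should now return the session data:

>>> mc.get(':1:django.contrib.sessions.cache1ay2kcv7axb3nu5fwnwoyf85wkwsttz9')
{'_auth_user_hash': '1ay2kcv7axb3nu5fwnwoyf85wkwsttz9', '_auth_user_id': u'1', '_auth_user_backend': u'django.contrib.auth.backends.ModelBackend'}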
Note: the final cache key can be changed by defining KEY_FUNCTION in the cache settings.
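For example, a sketch of such a custom key function (the module path and behaviour here are illustrative):

# myproject/utils.py
def simple_key_function(key, key_prefix, version):
    # Drop the prefix and version so the final key equals the Django cache key.
    return key

# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyLibMCCache',
        'LOCATION': ['127.0.0.1:11211'],
        'KEY_FUNCTION': 'myproject.utils.simple_key_function',
    }
}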

How to invalidate several django cached values efficiently?

TLDR
Is there a way to mark cached values so I could do something like:
cache.filter('some_tag').clear()
Details
In my project I have the following model:
class Item(models.Model):
    month = models.DateField('month', null=False, blank=False, db_index=True)
    kg = models.BigIntegerField('kg')
    tags = models.ManyToManyField('Tag', related_name='items')
    # ... bunch of other fields used to filter data
And I have a report_view that returns the sum of kg by month and by tag according to the filters supplied in the URL query.
Something like this:
--------------------------------
|Tag    |jan   |feb   |mar   |
--------------------------------
|Tag 1  |1000  |1500  |2000  |
--------------------------------
|Tag 2  |1235  |4652  |0     |
--------------------------------
As my Item table already has more than 4 million records and is always growing, my report_view is cached.
So far I have all of this covered.
The problem is: the site user can change the tags on Items, and every time this happens I have to invalidate the cache, but I would like to do it in a more granular way.
For example, if a user changes a tag on an Item from January, that should invalidate all the totals for that month (I prefer to cache by month because changing one tag can have a cascading effect on others). However, I don't know all the views that have been cached, as there are thousands of possible filter combinations that change the URL.
What I have done so far:
Set a signal to invalidate all my caches when a tag changes:
@receiver(m2m_changed, sender=Item.tags.through)
def tags_changed(sender, **kwargs):
    cache.clear()
But this clears everything, which is not optimal in my case. Is there a way of doing something like cache.filter('some_tag').clear() with the Django cache framework?
https://martinfowler.com/bliki/TwoHardThings.html
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
Presuming you are using Django's cache middleware, you'll need to target the cache keys that are relevant. You can see how the cache key is generated in these two files in the Django project:
- https://github.com/django/django/blob/master/django/middleware/cache.py#L99
- https://github.com/django/django/blob/master/django/utils/cache.py#L367
- https://github.com/django/django/blob/master/django/utils/cache.py#L324
_generate_cache_key
def _generate_cache_key(request, method, headerlist, key_prefix):
    """Return a cache key from the headers given in the header list."""
    ctx = hashlib.md5()
    for header in headerlist:
        value = request.META.get(header)
        if value is not None:
            ctx.update(force_bytes(value))
    url = hashlib.md5(force_bytes(iri_to_uri(request.build_absolute_uri())))
    cache_key = 'views.decorators.cache.cache_page.%s.%s.%s.%s' % (
        key_prefix, method, url.hexdigest(), ctx.hexdigest())
    return _i18n_cache_key_suffix(request, cache_key)
The cache key is generated from attributes and headers of the request, with hashed values (e.g. the URL is hashed and used as part of the key). The Vary header in your response specifies other headers to use as part of the cache key.
If you understand how Django is caching your views and calculating your cache keys, you can use this to target the appropriate cache entries. This is still very difficult, though: because the URL is hashed, you can't target URL patterns directly (although with the django-redis backend you could use cache.delete_pattern(...), as described in https://stackoverflow.com/a/35629796/784648).
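A sketch of that pattern-based deletion, assuming the django-redis backend (delete_pattern is a django-redis extension, not part of Django's core cache API):

from django.core.cache import cache

# Remove every page-cache entry created by the cache middleware/decorator;
# the glob matches the un-hashed prefix Django uses for page-cache keys.
cache.delete_pattern('views.decorators.cache.cache_page.*')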
Django primarily relies on timeouts to invalidate the cache.
I would recommend looking into Django Cacheops; this package is designed to work with Django's ORM to cache and invalidate QuerySets. That seems a lot more practical for your needs: you want fine-grained invalidation on your Item QuerySets, and you simply will not get that from Django's cache middleware. Take a look at the GitHub repo; I've used it and it works well if you take the time to read the docs and understand it.
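For reference, a minimal cacheops configuration might look like this (a sketch; the app label and timeout are illustrative):

# settings.py
INSTALLED_APPS += ['cacheops']

CACHEOPS_REDIS = 'redis://localhost:6379/1'

CACHEOPS = {
    # Cache all Item queries for an hour; cacheops invalidates the affected
    # entries automatically when Items (or their tags) change.
    'myapp.item': {'ops': 'all', 'timeout': 60 * 60},
}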

Bulk create model objects in django

I have a lot of objects to save to the database, and I want to create model instances for them.
With Django I can create all the model instances with MyModel(data), and then I want to save them all.
Currently, I have something like this:
for item in items:
    obj = MyModel(name=item.name)
    obj.save()
I'm wondering if I can save the whole list of objects at once, e.g.:
objects = []
for item in items:
    objects.append(MyModel(name=item.name))
objects.save_all()
How can I save all the objects in one transaction?
As of Django 1.4 there is bulk_create, a manager method which takes an array of objects created using the class constructor. Check out the Django docs.
Use the bulk_create() method. It's standard in Django now.
Example:
Entry.objects.bulk_create([
    Entry(headline="Django 1.0 Released"),
    Entry(headline="Django 1.1 Announced"),
    Entry(headline="Breaking: Django is awesome"),
])
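As a side note, bulk_create also accepts a batch_size argument that caps how many rows go into a single INSERT, which helps when the list is very large (entries below stands for any list of unsaved Entry instances):

Entry.objects.bulk_create(entries, batch_size=500)  # 500 rows per query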
Manual transaction handling for the loop worked for me (Postgres 9.1):
from django.db import transaction

with transaction.atomic():
    for item in items:
        MyModel.objects.create(name=item.name)
In fact it's not the same as a 'native' database bulk insert, but it lets you avoid or decrease the cost of transport, ORM operations, and SQL query parsing.
import uuid

name = request.data.get('name')
period = request.data.get('period')
email = request.data.get('email')
prefix = request.data.get('prefix')
bulk_number = int(request.data.get('bulk_number'))

bulk_list = []
for _ in range(bulk_number):
    # Build a unique code from the user-supplied prefix.
    code = prefix + uuid.uuid4().hex.upper()
    bulk_list.append(
        DjangoModel(name=name, code=code, period=period, user=email))
created = DjangoModel.objects.bulk_create(bulk_list)
Here is how to bulk-create entities from a delimiter-separated file, leaving aside all unquoting and unescaping routines:
class SomeModel(Model):
    @classmethod
    def from_file(cls, file_obj, headers, delimiter):
        cls.objects.bulk_create([
            cls(**dict(zip(headers, line.split(delimiter))))
            for line in file_obj
        ], batch_size=None)
Using create will cause one query per new item. If you want to reduce the number of INSERT queries, you'll need to use something else.
I've had some success using the Bulk Insert snippet, even though the snippet is quite old.
Perhaps there are some changes required to get it working again.
http://djangosnippets.org/snippets/446/
Check out this blog post on the bulkops module.
On my Django 1.3 app I experienced a significant speedup.
The bulk_create() method is one of the ways to insert multiple records into a database table. Here is how bulk_create() is used:
Event.objects.bulk_create([
    Event(event_name="Event WF -001", event_type="sensor_value"),
    Event(event_name="Event WT -002", event_type="geozone"),
    Event(event_name="Event WD -001", event_type="outage"),
])
For a single-line implementation, you can use a lambda expression in a map:
map(lambda x: MyModel.objects.get_or_create(name=x), items)
Here, lambda matches each item in the items list to x and creates a database record if necessary. (Note: on Python 3, map is lazy, so wrap it in list(...) to force the queries to run; also, get_or_create issues at least one query per item, so this is not a true bulk insert.)
Lambda Documentation
The easiest way is to use the create manager method, which creates and saves the object in a single step:
for item in items:
    MyModel.objects.create(name=item.name)

Modifying Dictionary in Django Session Does Not Modify Session

I am storing dictionaries in my session referenced by a string key:
>>> request.session['my_dict'] = {'a': 1, 'b': 2, 'c': 3}
The problem I encountered was that when I modified the dictionary directly, the value would not be changed during the next request:
>>> request.session['my_dict'].pop('c')
3
>>> request.session['my_dict'].has_key('c')
False
# looks okay...
...
# Next request
>>> request.session['my_dict'].has_key('c')
True
# what gives!
As the documentation states, another option is to set
SESSION_SAVE_EVERY_REQUEST = True
which will make the session save on every request anyway. That might be worth it if this happens a lot in your code; I'm guessing the occasional additional overhead wouldn't be much, and it is far less than the potential problems from neglecting to include the
request.session.modified = True
line each time.
I apologize for "asking" a question to which I already know the answer, but this was frustrating enough that I thought the answer should be recorded on Stack Overflow. If anyone has something to add to my explanation I will award the "answer". I couldn't find the answer by searching based on the problem, but after searching based on the answer I found that my "problem" is documented behavior. It also turns out another person had this problem.
It turns out that SessionBase is a dictionary-like object that keeps track of when you modify its keys, and manually sets a modified attribute (there's also an accessed one). If you mess around with objects within those keys, however, SessionBase has no way to know that the objects were modified, and therefore your changes might not get stored in whatever backend you are using. (I'm using a database backend; I presume this problem applies to all backends, though.) This problem might not apply to models, since the backend is probably storing a reference to the model (and would therefore receive any changes when it loaded the model from the database), but it does apply to dictionaries (and perhaps any other base Python types that must be stored entirely in the session store).
The trick is that whenever you modify objects in the session that the session won't notice, you must explicitly tell the session that it is modified:
>>> request.session.modified = True
Hope this helps someone.
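An equivalent approach, mentioned in the Django docs' session "gotcha" section, is to rebind the key so the session detects the change itself:

>>> my_dict = request.session['my_dict']
>>> my_dict.pop('c')
3
>>> request.session['my_dict'] = my_dict  # reassignment marks the session as modified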
The way I got around this was to encapsulate any pop actions on the session into a method that takes care of the details (this method also accepts a view parameter so that session variables can be view-specific):
def session_pop(request, view, key, *args, **kwargs):
    """
    Either returns and removes the value of the key from request.session, or,
    if request.session[key] is a list, returns the result of a pop on this
    list.

    Also, if view is not None, only looks within request.session[view.func_name]
    so that view-specific session variables can be stored.
    """
    # Figure out which dictionary we want to operate on.
    dicto = {}
    if view is None:
        dicto = request.session
    elif view.func_name in request.session:
        dicto = request.session[view.func_name]
    if key in dicto:
        # This is redundant if `dicto == request.session`, but rather than
        # duplicate the logic to test whether we popped a list underneath
        # the root level of the session (which is also determined by `view`),
        # just explicitly set `modified`, since we certainly modified the
        # session here.
        request.session.modified = True
        # Return a non-list value directly.
        if not isinstance(dicto[key], list):
            return dicto.pop(key)
        # Pop from a list.
        elif len(dicto[key]) > 0:
            return dicto[key].pop()
    # Parse out a default from the args/kwargs.
    if len(args) > 0:
        default = args[0]
    elif 'default' in kwargs:
        default = kwargs['default']
    else:
        # If there wasn't one, complain.
        raise KeyError('Session does not have key "{0}" and no default '
                       'was provided'.format(key))
    return default
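A hypothetical usage inside a view, exercising the view-specific branch (the key name is illustrative):

from django.http import HttpResponse

def my_view(request):
    # Pop a view-scoped value; fall back to None instead of raising KeyError.
    message = session_pop(request, my_view, 'pending_message', default=None)
    return HttpResponse(message or 'no message')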
I'm not too surprised by this. I guess it's just like modifying the contents of a tuple:
>>> a = (1, [2], 3)
>>> print a
(1, [2], 3)
>>> a[1] = 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> a[1][0] = 4
>>> print a
(1, [4], 3)
But thanks anyway.