How to invalidate several Django cached values efficiently?

TLDR
Is there a way to mark cached values so I could do something like:
cache.filter('some_tag').clear()
Details
In my project I have the following model:
class Item(models.Model):
    month = models.DateField('month', null=False, blank=False, db_index=True)
    kg = models.BigIntegerField('kg')
    tags = models.ManyToManyField('Tag', related_name='items')
    # bunch of other fields used to filter data
And I have a report_view that returns the sum of kg by month and by tag according to the filters supplied in the URL query.
Something like this:
--------------------------------
|Tag |jan |feb |mar |
--------------------------------
|Tag 1 |1000 |1500 |2000 |
--------------------------------
|Tag 2 |1235 |4652 |0 |
--------------------------------
As my Item table already has more than 4 million records and is always growing, my report_view is cached.
So far, I have all of this covered.
The problem is: the site user can change the tags of the Items, and every time this occurs I have to invalidate the cache, but I would like to do it in a more granular way.
For example, if a user changes a tag on an Item from January, that should invalidate all the totals for that month (I prefer to cache by month because sometimes changing one tag has a cascading effect on others). However, I don't know all the views that have been cached, as there are thousands of possible combinations of filters that change the URL.
What I have done so far:
Set a signal to invalidate all my caches when a tag changes
@receiver(m2m_changed, sender=Item.tags.through)
def tags_changed(sender, **kwargs):
    cache.clear()
But this clears everything, which is not optimal in my case. Is there a way of doing something like cache.filter('some_tag').clear() with the Django cache framework?

https://martinfowler.com/bliki/TwoHardThings.html
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
Presuming you are using Django's Cache Middleware, you'll need to target the cache keys that are relevant. You can see how the cache key is generated in these files from the Django project:
- https://github.com/django/django/blob/master/django/middleware/cache.py#L99
- https://github.com/django/django/blob/master/django/utils/cache.py#L367
- https://github.com/django/django/blob/master/django/utils/cache.py#L324
_generate_cache_key
def _generate_cache_key(request, method, headerlist, key_prefix):
    """Return a cache key from the headers given in the header list."""
    ctx = hashlib.md5()
    for header in headerlist:
        value = request.META.get(header)
        if value is not None:
            ctx.update(force_bytes(value))
    url = hashlib.md5(force_bytes(iri_to_uri(request.build_absolute_uri())))
    cache_key = 'views.decorators.cache.cache_page.%s.%s.%s.%s' % (
        key_prefix, method, url.hexdigest(), ctx.hexdigest())
    return _i18n_cache_key_suffix(request, cache_key)
The cache key is generated from attributes and headers of the request, with values hashed (i.e. the URL is hashed and used as part of the key). The Vary header in your response specifies other headers to use as part of the cache key.
If you understand how Django is caching your views and calculating your cache keys, then you can use this to target the appropriate cache entries, but this is still very difficult: because the URL is hashed, you can't target URL patterns (otherwise you could use cache.delete_pattern(...) from django-redis, see https://stackoverflow.com/a/35629796/784648).
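If you do run django-redis and key the entries yourself (rather than letting the middleware hash the URL), pattern deletion looks roughly like this. This is only a sketch: delete_pattern is django-redis-specific, not part of Django's core cache API, and the key scheme is invented for illustration.
from django.core.cache import cache

# Store report totals under keys that embed the month, e.g.
# 'report:2019-01:<filter-hash>' (key scheme is hypothetical).
totals = {'Tag 1': 1000}  # example payload
cache.set('report:2019-01:abc123', totals, 60 * 60)

# When a tag on a January Item changes, drop every January entry.
# delete_pattern is provided by django-redis, not by Django core.
cache.delete_pattern('report:2019-01:*')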
Django primarily relies on timeout to invalidate the cache.
I would recommend looking into Django Cacheops; this package is designed to work with Django's ORM to cache and invalidate QuerySets. It seems much more practical for your needs, because you want fine-grained invalidation on your Item QuerySets, and you simply will not get that from Django's Cache Middleware. Take a look at the GitHub repo; I've used it and it works well if you take the time to read the docs and understand it.
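As a rough sketch of what that could look like (the app label, timeout, and Redis URL are assumptions, not tested configuration):
# settings.py
CACHEOPS_REDIS = 'redis://localhost:6379/1'
CACHEOPS = {
    # Cache all Item queryset fetches for 15 minutes; cacheops then
    # invalidates them automatically on saves and M2M changes.
    'myapp.item': {'ops': 'all', 'timeout': 60 * 15},
}

# Manual invalidation is also available when you need it,
# given some Item instance `some_item`:
from cacheops import invalidate_model, invalidate_obj
invalidate_obj(some_item)   # drop caches involving this one Item
invalidate_model(Item)      # drop all Item-related caches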

Related

How to invalidate cache_page in Django?

Here is the problem: I have a blog app and I cache the post output view for 5 minutes.
@cache_page(60 * 5)
def article(request, slug):
    ...
However, I'd like to invalidate the cache whenever a new comment is added to the post.
I'm wondering how best to do so?
I've seen this related question, but it is outdated.
I would cache it in a slightly different way:
from django.core.cache import cache
from django.shortcuts import render

def article(request, slug):
    cached_article = cache.get('article_%s' % slug)
    if not cached_article:
        cached_article = Article.objects.get(slug=slug)
        cache.set('article_%s' % slug, cached_article, 60 * 5)
    return render(request, 'article/detail.html', {'article': cached_article})
then, when saving a new comment to this article object:
# ...
# add the new comment to this article object, then:
# (cache.delete is a no-op when the key is absent, so no check is needed)
cache.delete('article_%s' % article.slug)
# ...
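If comments are created from more than one place, a post_save signal keeps the invalidation in one spot. A sketch, assuming a Comment model with a ForeignKey named article pointing at Article:
from django.core.cache import cache
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Comment)  # Comment model is assumed
def invalidate_article_cache(sender, instance, **kwargs):
    # Drop the cached article whenever one of its comments is saved.
    cache.delete('article_%s' % instance.article.slug)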
This was the first hit for me when searching for a solution, and the current answer wasn't terribly helpful, so after a lot of poking around Django's source, I have an answer for this one.
Yes, you can compute the key programmatically, but it takes a little work.
Django's page caching works by referencing the request object, specifically the request path and query string. This means that for every request to your page that has a different query string, you will have a different cache key. For most cases, this isn't likely to be a problem, since the page you want to cache/invalidate will be a known string like /blog/my-awesome-year, so to invalidate this, you just need to use Django's RequestFactory:
from django.core.cache import cache
from django.test import RequestFactory
from django.utils.cache import get_cache_key

# get_cache_key returns None if the page has never been cached,
# so guard the delete.
key = get_cache_key(RequestFactory().get("/blog/my-awesome-year"))
if key:
    cache.delete(key)
If your URLs are a fixed list of values (i.e. no differing query strings) then you can stop here. However, if you've got lots of different query strings (say ?q=xyz for a search page or something), then your best bet is probably to create a separate cache for each view. Then you can just pass cache="cachename" to cache_page() and later clear that entire cache with:
from django.core.cache import caches
caches["my_cache_name"].clear()
Important note about this tactic
It only really works for unauthenticated pages. The minute your user is logged in, the cookie data is made part of the cache key creation process, and therefore re-creating that key programmatically becomes much harder. I suppose you could try pulling the cookie data out of your session store, but there could be thousands of keys in there, and you'd have to invalidate/pre-cache each and every one of them.

python code for directory api to batch retrieve all users from domain

Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a CSV file, using the Python code below with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 in one call, that is, without having to loop, or else as a batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain all ~119,000 accounts successfully (it takes about 10 minutes to download), so I would like to minimize communication errors. Please advise if a better method exists or if a non-looping method is possible.
There's no way to do this as a batch, because you need to know each pageToken, and those are only returned as each page is retrieved. However, you can improve performance somewhat by requesting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
Since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of 5. While each call may take slightly longer, you should notice a performance increase because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to request only the user attributes you need from the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields,
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
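Since the communication errors are what force the reruns, wrapping each page fetch in an exponential backoff retry usually helps more than anything else. A sketch only; the retry count and sleep times are arbitrary choices:
import random
import time

from googleapiclient import errors  # HttpError lives here (apiclient.errors in older versions)

def fetch_page(dir_api, params, max_retries=5):
    """Fetch one page of users, retrying transient HTTP errors."""
    for attempt in range(max_retries):
        try:
            return dir_api.users().list(**params).execute()
        except errors.HttpError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # exponential backoff with jitter: ~1s, 2s, 4s, 8s, ...
            time.sleep((2 ** attempt) + random.random())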

What lightweight django caching options are there, for a simple inner loop, in a single session?

I have something that looks like this:
pages = Page.objects.prefetch_related("sections", "sections__tiles").all()
for page in pages:
    for section in page.sections.all():
        for tile in section.tiles.all():
            print(tile.author)
            # more stuff to build each page
This hits the SQL layer for the initial queries, then once per tile for the author lookup (the classic n+1 problem).
However, the optimal number of SQL queries is 1 + (count of unique authors).
I implemented a simple hash-based cache of authors and cut load time dramatically.
cache_author = {}

def author_cache(author_id):
    author = cache_author.get(author_id, None)
    if author:
        return author
    else:
        author = Author.objects.get(id=author_id)
        cache_author[author_id] = author
        return author
pages = Page.objects.prefetch_related("sections", "sections__tiles").all()
for page in pages:
    for section in page.sections.all():
        for tile in section.tiles.all():
            print(author_cache(tile.author_id))
            # more stuff to build each page
But it feels messy. What cleaner options are out there for reducing SQL overhead inside a single transaction?
Again, I can't be positive without seeing the models, but I tested this setup in Django 1.7 and it should get everything with only a few queries.
pages = Page.objects.prefetch_related(
    "sections",
    "sections__tiles",
    "sections__tiles__author",  # select_related can't follow reverse relations, so prefetch the authors too
)
for page in pages:
    for section in page.sections.all():
        for tile in section.tiles.all():
            print(tile.author)
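If you want to shave off one more query, a Prefetch object can pull each tile's author via a join on the tile query instead. A sketch, assuming a Tile model whose author field is the ForeignKey in question:
from django.db.models import Prefetch

pages = Page.objects.prefetch_related(
    "sections",
    # fetch tiles and their authors in a single joined query
    Prefetch("sections__tiles",
             queryset=Tile.objects.select_related("author")),
)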

Optimizing database queries in Django

I have a bit of code that is causing my page to load pretty slowly (49 queries in 128 ms). This is the landing page for my site, so it needs to load snappily.
The following is my views.py that creates a feed of latest updates on the site and is causing the slowest load times from what I can see in the Debug toolbar:
def product_feed(request):
    """Return all site activity from friends, etc."""
    latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')
    latestdesigns = Design.objects.all().order_by('-added')
    latest = list(latestparts) + list(latestdesigns)
    latestupdates = sorted(latest, key=lambda x: x.added, reverse=True)
    latestupdates = latestupdates[0:8]
    # only get the unique avatars that we need to put on the page so
    # we're not pinging for images for each update
    uniqueusers = User.objects.filter(id__in=Part.objects.values_list('adder', flat=True))
    return render_to_response("homepage.html", {
        "uniqueusers": uniqueusers,
        "latestupdates": latestupdates,
    }, context_instance=RequestContext(request))
The line that takes the most time seems to be:
latest = list(latestparts) + list(latestdesigns) (25 ms)
There are two more, at 17 ms (sitewide announcements) and 25 ms (adding tagged items to each product feed item), that I am also investigating.
Does anyone see any ways in which I can optimize the loading of my activity feed?
You never need more than 8 items, so limit your queries. And don't forget to make sure that the added field in both models is indexed.
latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')[:8]
latestdesigns = Design.objects.all().order_by('-added')[:8]
For bonus marks, eliminate the magic number.
After making those queries a bit faster, you might want to check out memcache to store the most common query results.
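For example, the assembled feed could be stored with Django's low-level cache API (a sketch; the key name, timeout, and helper function are invented for illustration):
from django.core.cache import cache

def get_latest_updates():
    # get_or_set (Django 1.9+) computes and stores the value on a cache miss
    return cache.get_or_set(
        'homepage_latest_updates',
        build_latest_updates,  # hypothetical helper running the feed queries
        60 * 5,
    )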
Moreover, I believe adder is a ForeignKey to the User model.
Part.objects.distinct().values_list('adder', flat=True)
The line above is a QuerySet of unique adder values, which I believe is exactly what you meant.
It saves you performing a subquery.

django admin actions on all the filtered objects

Admin actions can act on the selected objects in the list page.
Is it possible to act on all the filtered objects?
For example, say the admin searches for Product names that start with "T-shirt", which results in 400 products, and wants to increase the price of all of them by 10%.
If the admin can only modify a single page of results at a time, it will take a lot of effort.
Thanks
The custom actions are supposed to be used on a group of selected objects, so I don't think there is a standard way of doing what you want.
But I think I have a hack that might work for you... (meaning: use at your own risk and it is untested)
In your action function the request.GET will contain the q parameter used in the admin search. So if you type "T-Shirt" in the search, you should see request.GET look something like:
<QueryDict: {u'q': [u'T-Shirt']}>
You could completely disregard the queryset parameter that your custom action function receives and build your own queryset based on that request.GET q parameter. Something like:
def increase_price_10_percent(modeladmin, request, queryset):
    if request.GET.get('q') is None:
        # Add some error handling
        return
    queryset = Product.objects.filter(name__contains=request.GET['q'])
    # Your code to increase price by 10%

increase_price_10_percent.short_description = "Increase price 10% for all products in the search result"
I would make sure to forbid any requests where q is empty. And where you see name__contains, you should mimic whatever filter you created for the admin of your Product object (so, if the search only looks at the name field, name__contains might suffice; if it looks at the name and description, you would have a more complex filter here in the action function too).
I would also, maybe, add an intermediate page stating what models will be affected and have the user click on "I really know what I'm doing" confirmation button. Look at the code for django.contrib.admin.actions for an example of how to list what objects are being deleted. It should point you in the right direction.
NOTE: the users would still have to select something in the admin page, otherwise the action function would never get called.
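A rough shape for that intermediate confirmation step might be the following (untested; the template name, the "apply" flag, and the price field are all assumptions, and the template must re-POST the selected ids and action name for the second pass to run):
from django.db.models import F
from django.shortcuts import render

def increase_price_10_percent(modeladmin, request, queryset):
    if 'apply' in request.POST:
        # Second submit: the user confirmed, so perform the update.
        queryset.update(price=F('price') * 1.1)  # price field is assumed
        return None  # returning None sends the user back to the changelist
    # First submit: render an intermediate page listing the affected objects.
    return render(request, 'admin/confirm_price_increase.html',
                  {'items': queryset})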
This is a more generic solution. It is not fully tested (and it's pretty naive), so it might break with strange filters. For me it works with date filters, foreign key filters, and boolean filters.
def publish(modeladmin, request, queryset):
    # Rebuild the full filtered queryset from the changelist's GET params.
    # Naive: this assumes every GET parameter is a valid queryset filter.
    kwargs = {}
    for field, value in request.GET.items():
        kwargs.update({field: value})
    queryset = queryset.filter(**kwargs)
    queryset.update(published=True)
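To wire either action up, register it on the relevant ModelAdmin as usual (a standard Django pattern; the admin class name is illustrative):
from django.contrib import admin

class ProductAdmin(admin.ModelAdmin):
    search_fields = ['name']
    actions = [publish]  # or [increase_price_10_percent]

admin.site.register(Product, ProductAdmin)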