Django caching a large list

My Django application deals with 25 MB binary files. Each of them contains about 100,000 "records" of 256 bytes each.
It takes me about 7 seconds to read a binary file from disk and decode it using Python's struct module. I turn the data into a list of about 100,000 items, where each item is a dictionary with values of various types (float, string, etc.).
My Django views need to search through this list. Clearly 7 seconds is too long.
I've tried using Django's low-level caching API to cache the whole list, but that won't work because there's a maximum size limit of 1 MB for any single cached item. I've tried caching the 100,000 list items individually, but that takes a lot more than 7 seconds; most of the time is spent unpickling the items.
Is there a convenient way to store a large list in memory between requests? Can you think of another way to cache the object for use by my django app?
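For reference, the load step being described is roughly the following shape; the record layout, format string, and field names below are invented for illustration, since the question doesn't give them:

import struct

RECORD_SIZE = 256
# Hypothetical layout: two doubles plus two fixed-width byte fields (8 + 8 + 32 + 208 = 256 bytes).
RECORD_FORMAT = "<dd32s208s"

def load_records(path):
    records = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD_SIZE)
            if len(chunk) < RECORD_SIZE:
                break
            value_a, value_b, name, payload = struct.unpack(RECORD_FORMAT, chunk)
            records.append({
                "value_a": value_a,
                "value_b": value_b,
                "name": name.rstrip(b"\x00").decode("ascii", "ignore"),
                "payload": payload,
            })
    return records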

To raise the per-item size limit to 10 MB (larger than the default 1 MB), add
-I 10m
to /etc/memcached.conf and restart memcached.
Also edit this class in memcached.py, located in /usr/lib/python2.7/dist-packages/django/core/cache/backends, so it looks like this:
class MemcachedCache(BaseMemcachedCache):
    "An implementation of a cache binding using python-memcached"
    def __init__(self, server, params):
        import memcache
        memcache.SERVER_MAX_VALUE_LENGTH = 1024 * 1024 * 10  # raise the client-side limit to accept 10 MB values
        super(MemcachedCache, self).__init__(server, params,
                                             library=memcache,
                                             value_not_found_exception=ValueError)

I'm not able to add comments yet, but I wanted to share my quick fix for this problem, since I had the same issue with python-memcached behaving strangely when you change SERVER_MAX_VALUE_LENGTH at import time.
Besides the __init__ edit that FizxMike suggests, you can also override the _cache property in the same class. That way you can instantiate the python-memcached Client with server_max_value_length passed explicitly, like this:
from django.core.cache.backends.memcached import BaseMemcachedCache

DEFAULT_MAX_VALUE_LENGTH = 1024 * 1024

class MemcachedCache(BaseMemcachedCache):
    def __init__(self, server, params):
        # options from settings['CACHES'][connection]
        self._options = params.get("OPTIONS", {})
        import memcache
        memcache.SERVER_MAX_VALUE_LENGTH = self._options.get('SERVER_MAX_VALUE_LENGTH', DEFAULT_MAX_VALUE_LENGTH)
        super(MemcachedCache, self).__init__(server, params,
                                             library=memcache,
                                             value_not_found_exception=ValueError)

    @property
    def _cache(self):
        if getattr(self, '_client', None) is None:
            server_max_value_length = self._options.get("SERVER_MAX_VALUE_LENGTH", DEFAULT_MAX_VALUE_LENGTH)
            # one could optionally pass more parameters here through the OPTIONS settings;
            # simplified here for brevity
            self._client = self._lib.Client(self._servers,
                                            server_max_value_length=server_max_value_length)
        return self._client
I also prefer to create a separate backend that inherits from BaseMemcachedCache and use it instead of editing Django's code.
Here's the Django memcached backend module for reference:
https://github.com/django/django/blob/master/django/core/cache/backends/memcached.py
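If you go the custom-backend route, the idea is to point your cache settings at that class instead of patching Django's source. A minimal sketch, assuming the class above is saved as myproject/cache_backends.py (the module path, location, and 10 MB value are illustrative, not prescribed):

# settings.py
CACHES = {
    'default': {
        'BACKEND': 'myproject.cache_backends.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
        'OPTIONS': {
            # read by the custom backend above; keep it in sync with memcached's -I flag
            'SERVER_MAX_VALUE_LENGTH': 1024 * 1024 * 10,
        },
    },
}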
Thanks for all the help on this thread!

Related

How to invalidate several django cached values efficiently?

TLDR
Is there a way to mark cached values so I could do something like:
cache.filter('some_tag').clear()
Details
In my project I have the following model:
class Item(models.Model):
    month = models.DateField('month', null=False, blank=False, db_index=True)
    kg = models.BigIntegerField('kg')
    tags = models.ManyToManyField('Tag', related_name='items')
    # ... a bunch of other fields used to filter data
And I have a report_view that returns the sum of kg by month and by tag according to the filters supplied in the URL query.
Something like this:
------------------------------
| Tag   | jan  | feb  | mar  |
------------------------------
| Tag 1 | 1000 | 1500 | 2000 |
------------------------------
| Tag 2 | 1235 | 4652 |    0 |
------------------------------
As my Item table already has more than 4 million records and is always growing, my report_view is cached.
So far I got all of this covered.
The problem is: the site user can change the tags on Items, and every time this happens I have to invalidate the cache, but I would like to do it in a more granular way.
For example, if a user changes a tag on an Item from January, that should invalidate all the totals for that month (I prefer to cache by month because sometimes changing one tag has a cascading effect on others). However, I don't know all the views that have been cached, since there are thousands of possible filter combinations that change the URL.
What I have done so far:
Set a signal to invalidate all my caches when a tag changes
@receiver(m2m_changed, sender=Item.tags.through)
def tags_changed(sender, **kwargs):
    cache.clear()
But this clears everything, which is not optimal in my case. Is there a way of doing something like cache.filter('some_tag').clear() with the Django cache framework?
https://martinfowler.com/bliki/TwoHardThings.html
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
Presuming you are using Django's Cache Middleware, you'll need to target the cache keys that are relevant. You can see how they generate the cache key from these two files in the Django Project:
- https://github.com/django/django/blob/master/django/middleware/cache.py#L99
- https://github.com/django/django/blob/master/django/utils/cache.py#L367
- https://github.com/django/django/blob/master/django/utils/cache.py#L324
_generate_cache_key
def _generate_cache_key(request, method, headerlist, key_prefix):
    """Return a cache key from the headers given in the header list."""
    ctx = hashlib.md5()
    for header in headerlist:
        value = request.META.get(header)
        if value is not None:
            ctx.update(force_bytes(value))
    url = hashlib.md5(force_bytes(iri_to_uri(request.build_absolute_uri())))
    cache_key = 'views.decorators.cache.cache_page.%s.%s.%s.%s' % (
        key_prefix, method, url.hexdigest(), ctx.hexdigest())
    return _i18n_cache_key_suffix(request, cache_key)
The cache key is generated from attributes and headers of the request, with hashed values (e.g. the URL is hashed and used as part of the key). The Vary header in your response specifies other headers to use as part of the cache key.
If you understand how Django is caching your views and calculating your cache keys, you can use this to target the appropriate cache entries. But it is still very difficult: because the URL is hashed, you can't target URL patterns (otherwise you could use cache.delete_pattern(...), see https://stackoverflow.com/a/35629796/784648).
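One part of that key that is not hashed is the key_prefix, so if you give the cached view a predictable key_prefix you can still target it with a pattern delete. A hedged sketch of that idea, assuming the django-redis backend (delete_pattern() is a django-redis extension, not part of Django's core cache API):

from django.core.cache import cache
from django.views.decorators.cache import cache_page

# Give the cached view a predictable, unhashed prefix so it can be targeted later.
@cache_page(60 * 15, key_prefix="report_2020_01")
def report_view(request):
    ...

# Later, e.g. from the m2m_changed handler, drop every cached variant of that view.
cache.delete_pattern("*report_2020_01*")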
Django primarily relies on timeout to invalidate the cache.
I would recommend looking into Django Cacheops. This package is designed to work with Django's ORM to cache and invalidate QuerySets, which seems a lot more practical for your needs: you want fine-grained invalidation of your Item QuerySets, and you simply will not get that from Django's cache middleware. Take a look at the GitHub repo; I've used it and it works well if you take the time to read the docs and understand it.
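For reference, a minimal Cacheops configuration might look like the sketch below; the app label myapp and the timeout are assumptions, so check the cacheops README for the exact settings names in the version you install:

# settings.py
INSTALLED_APPS += ['cacheops']

CACHEOPS_REDIS = "redis://localhost:6379/1"

CACHEOPS = {
    # cache all queryset operations on Item and invalidate them automatically
    # when rows (including the M2M tag relations) change
    'myapp.item': {'ops': 'all', 'timeout': 60 * 60},
}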

Flask: streaming file with stream_with_context is very slow

The following code streams a Postgres BYTEA column to a browser:
from flask import Response, stream_with_context

@app.route('/api/1/zfile/<file_id>', methods=['GET'])
def download_file(file_id):
    file = ZFile.query.filter_by(id=file_id).first()
    return Response(stream_with_context(file.data), mimetype=file.mime_type)
It is extremely slow (approx. 6 minutes for 5 MB).
I am downloading with curl from the same host, so the network is not the issue.
I can also extract the file from the psql console in less than a second,
so the database side does not seem to be to blame either:
COPY (select f.data from z_file f where f.id = '4ec3rf') TO 'zazX.pdf' (FORMAT binary)
Update:
I have further evidence that the "fetch from the DB" step is not slow. If I write file.data to a file using
with open("/vagrant/zoz.pdf", 'wb') as output:
    output.write(file.data)
it also takes a fraction of a second. So the slowness is caused by the way Flask does the streaming.
I had this issue while using Flask to proxy streaming from another url using python-requests.
In this use case, the trick is setting the chunk_size parameter in iter_content:
import requests
from flask import Response, stream_with_context

def flask_view():
    ...
    req = requests.get(url, stream=True, params=args)
    return Response(
        stream_with_context(req.iter_content(chunk_size=1024)),
        content_type=req.headers['content-type']
    )
otherwise it will use chunk_size=1, which can slow things down quite a bit. In my case the streaming went from a couple of kB/s to several MB/s after increasing chunk_size.
Flask can also be given a generator that returns the whole payload in a single yield and will "know" how to deal with it. This returns in milliseconds:
from flask import Response, stream_with_context

@app.route('/api/1/zfile/<file_id>', methods=['GET'])
def download_file(file_id):
    file = ZFile.query.filter_by(id=file_id).first()

    def single_chunk_generator():
        yield file.data

    return Response(stream_with_context(single_chunk_generator()), mimetype=file.mime_type)
stream_with_context, when given an iterable, creates a generator that iterates through it and performs various checks on every element; for a bytes value like file.data that means one iteration per byte, which causes a huge performance hit.
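If you want to keep genuinely incremental streaming rather than one big yield, a generator that slices the bytes into larger chunks also avoids the per-byte overhead. A rough sketch along the lines of the question's view (the 1 MB chunk size is arbitrary):

from flask import Response, stream_with_context

@app.route('/api/1/zfile/<file_id>', methods=['GET'])
def download_file(file_id):
    file = ZFile.query.filter_by(id=file_id).first()

    def chunked(data, size=1024 * 1024):
        # Yield the payload in large slices instead of letting Flask iterate byte by byte.
        for start in range(0, len(data), size):
            yield data[start:start + size]

    return Response(stream_with_context(chunked(file.data)), mimetype=file.mime_type)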

How to invalidate cache_page in Django?

Here is the problem: I have a blog app and I cache the post output view for 5 minutes.
@cache_page(60 * 5)
def article(request, slug):
    ...
However, I'd like to invalidate the cache whenever a new comment is added to the post.
I'm wondering how best to do so?
I've seen this related question, but it is outdated.
I would cache it in a slightly different way:
from django.core.cache import cache

def article(request, slug):
    cached_article = cache.get('article_%s' % slug)
    if not cached_article:
        cached_article = Article.objects.get(slug=slug)
        cache.set('article_%s' % slug, cached_article, 60 * 5)
    return render(request, 'article/detail.html', {'article': cached_article})
Then, when saving a new comment for this article object:
# ...
# add the new comment to this article object, then
if cache.get('article_%s' % article.slug):
    cache.delete('article_%s' % article.slug)
# ...
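If you want this to happen automatically rather than inline in the comment-saving code, a post_save signal is a common place for it. A hedged sketch, assuming a Comment model with an article foreign key (both names are placeholders for whatever your comments app uses):

from django.core.cache import cache
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Comment  # hypothetical comment model with an `article` FK

@receiver(post_save, sender=Comment)
def invalidate_article_cache(sender, instance, **kwargs):
    # Drop the cached article whenever one of its comments is saved.
    cache.delete('article_%s' % instance.article.slug)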
This was the first hit for me when searching for a solution, and the current answer wasn't terribly helpful, so after a lot of poking around Django's source, I have an answer for this one.
Yes, you can determine the key programmatically, but it takes a little work.
Django's page caching works by referencing the request object, specifically the request path and query string. This means that every request to your page with a different query string will have a different cache key. For most cases this isn't likely to be a problem, since the page you want to cache/invalidate will be a known string like /blog/my-awesome-year, so to invalidate it you just need to use Django's RequestFactory:
from django.core.cache import cache
from django.test import RequestFactory
from django.urls import reverse
from django.utils.cache import get_cache_key
cache.delete(get_cache_key(RequestFactory().get("/blog/my-awesome-year")))
If your URLs are a fixed list of values (i.e. no differing query strings) then you can stop here. However, if you've got lots of different query strings (say ?q=xyz for a search page or something), then your best bet is probably to create a separate cache for each view. Then you can just pass cache="cachename" to cache_page() and later clear that entire cache with:
from django.core.cache import caches
caches["my_cache_name"].clear()
Important note about this tactic
It only really works for unauthenticated pages. The minute your user is logged in, the cookie data is made part of the cache key creation process, and therefore re-creating that key programmatically becomes much harder. I suppose you could try pulling the cookie data out of your session store, but there could be thousands of keys in there, and you'd have to invalidate/pre-cache each and every one of them.

python code for directory api to batch retrieve all users from domain

Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a CSV file, using the Python code below with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 in one go, that is, without having to loop, or else as a single batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain all ~119,000 accounts successfully (it takes about 10 minutes to download). I would like to minimize communication errors. Please advise if a better method exists or if a non-looping method is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of five. While each call may take slightly longer, you should notice a performance increase because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields,
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
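As a hedged illustration of that last point: the Python Google API client uses httplib2 underneath, which can be handed a local cache directory when the service object is built. Whether it actually helps depends on the API's response caching headers, and credentials below stands for whatever OAuth 2.0 credentials object the app already has:

import httplib2
from googleapiclient.discovery import build

# ".cache" is just a local directory httplib2 will use for HTTP response caching
http = credentials.authorize(httplib2.Http(cache=".cache"))
service = build("admin", "directory_v1", http=http)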

Django custom filter in admin using multiple databases

I have multiple databases: one managed by Django and one external database that contains information relevant for filtering inside the Django admin using SimpleListFilter.
As I can't have foreign keys across databases due to limitations in Django, I do a lookup in the external database to fetch, for example, a target version number. Based on that lookup list I reduce my queryset.
Now the problem is that my database is too large to filter down that way, as the resulting SQL query looks like the following:
SELECT `status`.`id`, `status`.`service_number`, `status`.`status`
FROM `status`
WHERE (`status`.`service_number` = '01xxx' OR `status`.`service_number` = '02xxx' OR `status`.`service_number` = '03xxx' ......
The list of ORs is too long and the resulting query can no longer be handled by the database; the error received is:
Django Version: 1.4.4
Exception Type: DatabaseError
Exception Value: (1153, "Got a packet bigger than 'max_allowed_packet' bytes")
I already increased max_allowed_packet in MySQL, but this time I don't think simply increasing that value again is the right way to go.
My SimpleListFilter looks like:
import operator
from functools import reduce

from django.contrib.admin import SimpleListFilter
from django.db.models import Q

class TargetFilter(SimpleListFilter):
    title = 'target'  # required by SimpleListFilter
    parameter_name = 'target'

    def lookups(self, request, model_admin):
        return (
            ('v1', 'V1.0'),
            ('v2', 'V2.0'),
        )

    def queryset(self, request, queryset):
        if self.value():
            lookup = []
            for i in Target.objects.using('externaldb').filter(target=self.value()).values('service_number').distinct():
                lookup.append(str(i['service_number']))
            qlist = [Q(**{'service_number': f}) for f in lookup]
            queryset = queryset.filter(reduce(operator.or_, qlist))
        return queryset
The listed code worked for years, but quickly became slower and now isn't working at all. I've tried using frozensets, but that doesn't seem to help.
Do you have an idea on how I can reduce very large sets?
Thanks for any hint!
A little late to the game, but I just started working with Django a couple of weeks ago and had huge tables to work with. For large queries I used raw SQL in Django, and cursors in particular, to limit the returned output and process it in batches. QuerySets fetch everything and load it into memory, which is not practical in large deployments.
Look at https://docs.djangoproject.com/en/3.0/topics/db/sql/ and, more specifically, the section "Executing custom SQL directly".
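A rough sketch of what that could look like for this case, assuming the externaldb alias from the question and a table/column naming that mirrors the ORM model (both are assumptions), with fetchmany() used to keep memory bounded:

from django.db import connections

def service_numbers_for_target(target_version):
    """Fetch matching service numbers from the external DB in batches."""
    service_numbers = []
    cursor = connections['externaldb'].cursor()
    cursor.execute(
        "SELECT DISTINCT service_number FROM target WHERE target = %s",
        [target_version],
    )
    while True:
        rows = cursor.fetchmany(1000)  # process in batches instead of loading everything
        if not rows:
            break
        service_numbers.extend(row[0] for row in rows)
    return service_numbers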