I'm developing a Django application on a shared server (Dreamhost).
A view that I'm implementing takes several HTTP GET parameters to perform database lookups and return serialized data. Some of these lookups generate several hundreds of kilobytes of data that is expensive to compute. Caching this data would be ideal as it would save both DB access and computation time. I have two questions:
The Django documentation mentions that the cache middleware doesn't cache requests with GET or POST parameters. Is there any way around this?
The Dreamhost wiki indicates that either Filesystem caching or Database caching are best suited for Dreamhost sites. Which of these will be better in terms of performance, setup, and maintainability. Are there any alternatives for shared hosting?
I'm also open to suggestions for other solutions to my problem.
Thanks in advance!
-Advait
About the cache requests with GET parameters:
Cache a django view that has URL parameters
Filesystem caching is usually fast enough, easy to setup, and maintenance is same as managing any directory. Delete the cache by removing the files in the cache directory.
Related
Which of the caching strategies can be implemented in django?
What are the pros and cons of each cache backends (in terms of ease of use and ease of developing)?
Which backend should be preferred for production etc.
There are several backends that Django supports.
Listing a few here with comments around the positives and negatives of each.
Memcached: The gold standard for caching. An in-memory service that can return keys at a very fast rate. Not a good choice if your keys are very large in size
Redis: A good alternative to Memcached when you want to cache very large keys (for example, large chunks of rendered JSON for an API).
Dynamodb: Another good alternative to Memcached when you want to cache very large keys. Also scales very well with little IT overhead.
Localmem: Only use for local testing; don’t go into production with this cache type
Database: It’s rare that you’ll find a use case where the database caching makes sense. It may be useful for local testing, but otherwise, avoid.
File system: Can be a trap. Although reading and writing files can be faster than making sql queries, it has some pitfalls. Each cache is local to the application server (not shared), and if you have a lot of cache keys, you can theoretically hit the file system limit for number of files allowed.
Dummy: A great backend to use for local testing when you want your data changes to be made immediately without caching. Be warned: permanently using dummy caching locally can hide bugs from you until they hit an environment where caching is enabled.
The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column. This works, but presents some problems I do not want to deal with:
No transactions
No simple way to keep the filesystem and database in sync
Complicates backups since data is stored in 2 places
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
The concern with this approach is performance. Hitting the database for every image request is not desirable. I need some kind of server-side caching system together with reasonable caching headers. For example, if someone requests "/media/documents/earth.jpg", the cache should be consulted first and if the file is not found there the database should be hit.
Questions:
What is a good cache tool for my purpose?
Given these requirements is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this. I have certain files that can be accessed only by certain people. For these I assume the request must go through the application since there would be no other way to check for authorizaton.
If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
tl;dr: stop worrying and deliver, "premature optimization is the root of all evil"
The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column.
The recommendation for using the file system is that you can have the images served directly by the web server instead of served by the application - web servers are very, very good at serving static files.
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
In general, replication is seldom used for static content. For a high traffic website, you have a dedicated server for static content - Django makes this very easy, that is what MEDIA_URL and STATIC_URL are for. Even if you are starting with the media served by the same web server, it is good practice to have it done by a separate virtual host (for example, have the app at http://www.example.com and the media at http://static.example.com even if serving both from the same machine).
Web servers are so good at serving static content that hardly you will need more than one. In practice you rarely hit the point where a dedicated server is not handling the load anymore, because by that time you will be using a CDN to cut your bandwidth bill, and the CDN will take most of the heat off the server.
If you choose to follow the "store on the file system" recommendation, don't worry about this until deployment, when the time arrives have a deployment expert at your side.
The concern with this approach is performance.
The performance hit you take when storing static content in the database is serving the image: it is somewhat negligible for small files - but for a large file, one app instance (or thread) will be stuck until the download finishes. Don't worry unless your images take too long to download.
Hitting the database for every image request is not desirable.
Honestly, why is that? Databases are designed to take hits. When you choose to store images in the database, performance is in the hands of the DBA now; as a developer you should stop thinking about it. When (and if) you hit any performance bottleneck related to database issues, consult a professional DBA, he will fix it.
1 - What is a good cache tool for my purpose?
Short story: this is static content, do the cache at the network layer (CDN, reverse caching proxy, etc). It is a problem for a professional network engineer, not for the developer.
There are many popular cache backends for Django, IMHO they are overkill for static content.
2 - Given these requirements is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this. I have certain files that can be accessed only by certain people. For these I assume the request must go through the application since there would be no other way to check for authorizaton.
Use an URL scheme that is unique and hard to guess, for example, with a path component made from a SHA2 hash of the file contents plus some secret token. Restrict service to requests refered by your site to avoid someone re-publishing the file URL. Use expiration headers if appropriate.
3 - If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
Again, ask yourself why are you concerned. The cache layer should be transparent to the developer. I'm not aware of any popular cache solution for Django that don't have such basic concern very well covered.
[update]
Very good points. I understand that replication is seldom used for static content. That's not the point though. How often other people use replication for files has no effect on the fact that not replicating/backing up your database is wrong. Other people may be fine with losing ACID just because some bit of data is binary; I'm not. As far as I'm concerned these files are "of the database" because there are database columns whose values reference the files. If backing up hard drives is something seldom done, does that mean I shouldn't back up my hard drive? NO!
Your concern is valid, I was just trying to explain why Django developers have a bias for this arrangement (dedicated webserver for static content), Django started at the news publishing industry where this approach works well because of its ratio of one trusted publisher for thousands of readers.
It is important to note that the recommended approach (IMHO) is not in ACID violation. Ok, Django does not erase older images stored in the filesystem when the record changes or is deleted - but PostgreSQL don't really erase tuples from disk immediately when you delete records, they are just marked to be vacuumed later. Pity that Django lacks a built-in "vacuum" for images, but it is very hard to write a general one, so I side with the core team - data safety comes first. Look for example at database migrations: they took so long to have database migrations incorporated in Django because it is a hard problem as well. While writing a generic solution is hard, writing specific ones is trivial - for some projects I have a "garbage collector" process that I run from crontab in the low traffic hours, this script simply delete all files that are not referenced by metadata in the database - and this dirty cron job is enough consistency for me.
If you choose to store images at the database that is all fine. There are trade-offs, but rest assured you don't have to worry about them as a developer, it is a problem for the "ops" part of DevOps.
My Django backend is always dynamic. It serves an iOS app similar to that of Instagram and Vine where users upload photos/videos and their followers can comment and like the content. Just for the sake of this question, imagine my backend serves an iOS app that is exactly like Instagram.
Many sources claim that using memcached can improve performance because it decreases the amount of hits that are made to the database.
My question is, for a backend that is already in dynamic in nature (always changing since users are uploading new pictures, commenting, liking, following new users etc..) what can I possibly cache?
It's a problem I've been thinking about for quite some time. I could cache the user profile data, but other than that, I don't know where else memcached would be useful.
Other sources mentioned using it everywhere in the backend where a 'GET' call is made but then I would need to set a suitable time limit to expire the cache since the app is always dynamic. What are your solutions and suggestions for getting around this problem?
You would cache whatever is being most frequently accessed from your Database. Make a list of the most frequent requests to get data from the database and cache the data in that priority.
Cache the most frequent requests based on category of the pictures
Cache based on users - power users go into cache (those which do a lot of data access)
Cache the most recent inserts (in case you have a page which shows the recently added posts/pictures)
I am sure you can come up with more scenarios. I am positive memcached (or any other caching) will help, even though your app is very 'dynamic'.
I have an app which has a search feature. This feature looks up the search term in a giant object (dictionary) that I cache for 24 hours. The object is about 50,000 keys and weighs roughly 10MB.
When I profile the memory usage on my hosting, I notice that after a few queries, the memory usage goes from around 50MB to over 450MB, prompting my hosting provider to kill the app.
So I'm wondering what is going on here. Specifically, how does the cache utilize the memory on each request and what can I do to fix this?
Django FileBasedCache is known for having performance issues. You can get a big picture on the following links:
A smarter filebasedcache for Django
Bug: File based cache not very efficient with large amounts of cached files
Bug was set as wont fix arguing:
I'm going to wontfix, on the grounds that the filesystem cache is intended as an easy way to test caching, not as a serious caching
strategy. The default cache size and the cull strategy implemented by
the file cache should make that obvious.
Consider using a KVS like Memcache or Redis as a caching strategy because they both support expiry. Also, consider a dedicated search like ElasticSearch if more anticipated features will be search-related.
Tools are howtos are available:
Installing memcached for a django project
http://code.google.com/p/memcached/wiki/NewStart
http://redis.io/commands/expire
https://github.com/bartTC/django-memcache-status
http://www.elasticsearch.org/guide/reference/index-modules/cache.html
I have a setup like, nginx in the front for serving static files and reverse proxying to apache for django via mod_wsgi and I want to implement memcached in my setup. I don't have a huge traffic that my server will not handle today but it will get larger soon, it's best to be ready before.
I see two options for me: The first one is using django's native memcached module which handles many things automatically (afaik, confirm on comments pls), such as when a database entry is updated, it removes the related key, and maybe user authenticated pages (confirm please).
The other one is implementing memcached on nginx. The responsible structure for caching should be the front server seems more semantic to me; I am not quite sure of that but it is like division of responsibility. However, if I choose this was, I have to write more code for releasing cache keys on updates and user auth's. This will take some time of course, but I am in no rush.
The first one is the easy way, second one is harder but seems more logical. What would be the best option in terms of manageability and response times and the work required to implement? Would it worth it?
Also, there is only one site I am hosting that would require caching right now, but it will be more sites in the future and they may not be based on python. You might want to consider this.
There may be an advantage to going the nginx route... but I'm not seeing it.
The advantages to using Django's module:
You can set data to cache, such as expensive queries and API call results, rather than be locked into caching the whole view.
It's easy, and then you can get back to making your application cool.