My Django backend is always dynamic. It serves an iOS app similar to that of Instagram and Vine where users upload photos/videos and their followers can comment and like the content. Just for the sake of this question, imagine my backend serves an iOS app that is exactly like Instagram.
Many sources claim that using memcached can improve performance because it decreases the amount of hits that are made to the database.
My question is, for a backend that is already in dynamic in nature (always changing since users are uploading new pictures, commenting, liking, following new users etc..) what can I possibly cache?
It's a problem I've been thinking about for quite some time. I could cache the user profile data, but other than that, I don't know where else memcached would be useful.
Other sources mentioned using it everywhere in the backend where a 'GET' call is made but then I would need to set a suitable time limit to expire the cache since the app is always dynamic. What are your solutions and suggestions for getting around this problem?
You would cache whatever is being most frequently accessed from your Database. Make a list of the most frequent requests to get data from the database and cache the data in that priority.
Cache the most frequent requests based on category of the pictures
Cache based on users - power users go into cache (those which do a lot of data access)
Cache the most recent inserts (in case you have a page which shows the recently added posts/pictures)
I am sure you can come up with more scenarios. I am positive memcached (or any other caching) will help, even though your app is very 'dynamic'.
Related
I have a data science type application where I am getting public information from FPDS and SAM gov't website. The site is currently on Heroku.
I would like cache views so if a person is researching more than one company they can quickly go back to earlier pages without having to fetch the results from the database every time.
Based on my limited knowledge that is what cashing does?
Second, I am looking at flash-caching and it doesn't appear to be that difficult to implement to the route's I would like to cache.
Now the question is on Heroku, you wouldn't use simplecashe would you? Would you use a different cache strategy? From the docs, the CASHE_TYPE can be simple, redis, memcached and several more. On Heroku would I need to store the cache on something like Redis or can I store it in memory? Ideally, to get everything up and running I would like the cache to be in memory.
Late answer to your question. Caching can be a number of techniques on client and server side to achieve a goal of reduced traffic, network transport, or speed.
I'll focus on one aspect from what you are asking: a redis integration with flask to achieve faster response from a flask app environment. Redis is 'blindingly' fast, imo, as an in-memory database. When I have many users asking for the same view (typically a report-style display), I can interrupt the view route to get the response from a named redis database, so that my flask server is not bound up in eternally regenerating the same contents, which in turn saves a good few cycles of the main back-end database. Of course, if the contents of that view/report change, I have to separately take care of that. Most importantly, Redis includes an expiry value for each entry, so one way of handling stale contents is to delete the redis contents ahead of the expiry time.
Let me know if you want sample code to demonstrate this.
I am building a Django web application and I'd like some advice on caching. I know very little about caching. I've read the caching chapter in the Django book, but am struggling to relate it to my real-world situation.
My application will be a web front-end on a Postgres database containing a largeish amount of data (150GB of server logs).
The database is read-only: the purpose of the application is to give users a simple way to query the data. For example, the user might ask for all rows from server X between dates A and B.
So my database needs to support very fast read operations, but it doesn't need to worry about write operations (much - I'll add new data once every few months, and it doesn't matter how long that takes).
It would be nice if clients making the same request could use a cache, rather than making another call to the Postgres database.
But I don't know what sort of cache I should be looking at: a web cache, or a database cache. Or even if Postgres is the best choice (I'd just like to use it because it works well with Django, and is so robust). Could anyone advise?
The Django book says memcached is the best cache with Django, but it runs in memory, and the results of some of these queries could be several GB, so memcached might fill up the machine's memory quickly. But perhaps I don't fully understand how memcached operates.
Your query should in no way return several GB of data. There's no practical reason to do so, as the user cannot absorb that much data at a time. Your result set should be paged, such that the user sees only 10, 25, whatever results at a time. That then allows you to also limit your query to only fetch 10, 25, whatever records at a time starting from a particular index based on the page number.
Caching search result pages is not a particularly good idea, regardless, though. For one, the odds that different users will ever conduct exactly the same search are pretty minimal, and you'll end up wasting RAM to cache result sets that will never be used again. Also, something like logs should be real-time. If you return a cached result set, there might be new, relevant results that are not included, obscuring the usefulness of your search.
As mentioned above you have limitations on what problems caching can solve. As you are building this application, then I see no reason why you couldn't just plug in Django Haystack and Whoosh and see how it performs, then switching to some of the other more Enterprise search backends is a breeze.
The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column. This works, but presents some problems I do not want to deal with:
No transactions
No simple way to keep the filesystem and database in sync
Complicates backups since data is stored in 2 places
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
The concern with this approach is performance. Hitting the database for every image request is not desirable. I need some kind of server-side caching system together with reasonable caching headers. For example, if someone requests "/media/documents/earth.jpg", the cache should be consulted first and if the file is not found there the database should be hit.
Questions:
What is a good cache tool for my purpose?
Given these requirements is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this. I have certain files that can be accessed only by certain people. For these I assume the request must go through the application since there would be no other way to check for authorizaton.
If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
tl;dr: stop worrying and deliver, "premature optimization is the root of all evil"
The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column.
The recommendation for using the file system is that you can have the images served directly by the web server instead of served by the application - web servers are very, very good at serving static files.
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
In general, replication is seldom used for static content. For a high traffic website, you have a dedicated server for static content - Django makes this very easy, that is what MEDIA_URL and STATIC_URL are for. Even if you are starting with the media served by the same web server, it is good practice to have it done by a separate virtual host (for example, have the app at http://www.example.com and the media at http://static.example.com even if serving both from the same machine).
Web servers are so good at serving static content that hardly you will need more than one. In practice you rarely hit the point where a dedicated server is not handling the load anymore, because by that time you will be using a CDN to cut your bandwidth bill, and the CDN will take most of the heat off the server.
If you choose to follow the "store on the file system" recommendation, don't worry about this until deployment, when the time arrives have a deployment expert at your side.
The concern with this approach is performance.
The performance hit you take when storing static content in the database is serving the image: it is somewhat negligible for small files - but for a large file, one app instance (or thread) will be stuck until the download finishes. Don't worry unless your images take too long to download.
Hitting the database for every image request is not desirable.
Honestly, why is that? Databases are designed to take hits. When you choose to store images in the database, performance is in the hands of the DBA now; as a developer you should stop thinking about it. When (and if) you hit any performance bottleneck related to database issues, consult a professional DBA, he will fix it.
1 - What is a good cache tool for my purpose?
Short story: this is static content, do the cache at the network layer (CDN, reverse caching proxy, etc). It is a problem for a professional network engineer, not for the developer.
There are many popular cache backends for Django, IMHO they are overkill for static content.
2 - Given these requirements is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this. I have certain files that can be accessed only by certain people. For these I assume the request must go through the application since there would be no other way to check for authorizaton.
Use an URL scheme that is unique and hard to guess, for example, with a path component made from a SHA2 hash of the file contents plus some secret token. Restrict service to requests refered by your site to avoid someone re-publishing the file URL. Use expiration headers if appropriate.
3 - If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
Again, ask yourself why are you concerned. The cache layer should be transparent to the developer. I'm not aware of any popular cache solution for Django that don't have such basic concern very well covered.
[update]
Very good points. I understand that replication is seldom used for static content. That's not the point though. How often other people use replication for files has no effect on the fact that not replicating/backing up your database is wrong. Other people may be fine with losing ACID just because some bit of data is binary; I'm not. As far as I'm concerned these files are "of the database" because there are database columns whose values reference the files. If backing up hard drives is something seldom done, does that mean I shouldn't back up my hard drive? NO!
Your concern is valid, I was just trying to explain why Django developers have a bias for this arrangement (dedicated webserver for static content), Django started at the news publishing industry where this approach works well because of its ratio of one trusted publisher for thousands of readers.
It is important to note that the recommended approach (IMHO) is not in ACID violation. Ok, Django does not erase older images stored in the filesystem when the record changes or is deleted - but PostgreSQL don't really erase tuples from disk immediately when you delete records, they are just marked to be vacuumed later. Pity that Django lacks a built-in "vacuum" for images, but it is very hard to write a general one, so I side with the core team - data safety comes first. Look for example at database migrations: they took so long to have database migrations incorporated in Django because it is a hard problem as well. While writing a generic solution is hard, writing specific ones is trivial - for some projects I have a "garbage collector" process that I run from crontab in the low traffic hours, this script simply delete all files that are not referenced by metadata in the database - and this dirty cron job is enough consistency for me.
If you choose to store images at the database that is all fine. There are trade-offs, but rest assured you don't have to worry about them as a developer, it is a problem for the "ops" part of DevOps.
I am currently trying to figure out he best practice in order to design my web services between a django administrated database (+ images) and a mobile app. My main concern is how to separate a bulk update (send every data in the database and all the files on the server) and a lighter, smaller update with only the new and / or modified objects (images or data.)
I have had access to a working code-base using a cronjob and states for each data field (new, modified, up to date) to generate either a reference data file or an update file. I find it to be very redundant and somewhat unelegant, in contradiction with the DRY spirit of Django (there are tons of lines of code, making it nearly unmaintainable.))
I find it very surprising that this aspect is almost un-documented, since web traffic is a crucial matter in mobile developpment.. Fetching everytime all the data served quickly becomes unsustainable as the database grows..
I would be very grateful for any lead or advice you could give me :-) Thx in advance !
Just have a last_modified DateTimeField in your table, and in your user's profile a last_synchronized DateTimeField. When the mobile app wants to synchronize, send the data which was modified after the last synchronization run, and update the last_synchronized field in the user's profile.
I'm a beginner to caching. I'm currently working on a small project with Django and will be implementing caching later via memcached.
I have a page with a video on it and the video has a bunch of comments. The only content on the page that is likely to change regularly is the comments and the "You are logged in as.../You are not logged in..." message.
I was thinking I could create a JSON file that serves the username and most recent comments, including it in the head with <script src="videojson.js"></script>. That way I could populate the HTML via Javascript instead of caching the whole page on a per-user basis.
Is this a suitable approach, or is the caching system smarter than I give it credit for?
How is the JavaScript going to get the json object? Are going to serve from a django view that the us calls? And in that view you will just pull out of memcached if available and DB if not?
That seems reasonable assuming your json isn't very big. If your comments change a lot and you have to spend a lot of time querying the db, building the json object and saving to memcache every time a new comment is written, it won't work well. But if you only fill the cache when your json expires, and you don't care about having the latest and greatest comments on there instantly, it should work.
One thing to point out is that if you aren't getting that much traffic now, you might be adding a level of complexity that won't give you much return on your time spent. But if you are using this to learn how to do caching then it is a good exercise.
Hope that helps