I want to know what the differences are between mod_cache and memcached. I'm working on a Django site, so I'm using mod_wsgi in Apache2. My question is: should I cache at the Apache level (mod_cache) or use django-memcached?
If you need a simple content cache, where your responses are cached, then mod_cache seems like an easier bet. It's really just meant to cache responses for you to either memory or disk.
Memcached is an entirely different beast in my opinion. It works entirely in memory, but it can work across multiple machines, and your backend layer can access results from the cache. So your backend can put things in the cache so it doesn't have to do any expensive work again. It goes beyond what a normal content cache like mod_cache gives you.
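As a minimal sketch of that last point, here is the "backend puts things in the cache" pattern. `DictCache` is a hypothetical in-process stand-in so the snippet runs on its own; in a Django project the same `get`/`set` calls would go to `django.core.cache.cache` backed by memcached. `expensive_report` and the key format are made up for illustration.

```python
import time


class DictCache:
    """Hypothetical stand-in with the same get/set shape as a memcached client."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and expires_at < time.monotonic():
            del self._data[key]  # entry has expired
            return default
        return value

    def set(self, key, value, timeout=300):
        self._data[key] = (value, time.monotonic() + timeout)


cache = DictCache()
calls = {"expensive_report": 0}  # counts how often the slow path runs


def expensive_report(user_id):
    """Placeholder for slow work such as a heavy query or an API call."""
    calls["expensive_report"] += 1
    return {"user": user_id, "total": sum(range(1000))}


def get_report(user_id):
    key = f"report:{user_id}"
    result = cache.get(key)
    if result is None:
        # Cache miss: do the work once, then store it for later requests.
        result = expensive_report(user_id)
        cache.set(key, result, timeout=60)
    return result
```

mod_cache never sees `expensive_report`; it can only store the finished HTTP response, which is the difference being described.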
I have a Java web server and am currently using the Guava library to handle my in-memory caching, which I use heavily. I now need to expand to multiple servers (2+) for failover and load balancing. In the process, I switched from an in-process cache to Memcache (an external service). However, I'm not terribly impressed with the results, as now for nearly every call I have to make an external call to another server, which is significantly slower than the in-memory cache.
I'm thinking instead of getting the data from Memcache, I could keep using a local cache on each server, and use RabbitMQ to notify the other servers when their caches need to be updated. So if one server makes a change to the underlying data, it would also broadcast a message to all other servers telling them their cache is now invalid. Every server is both broadcasting and listening for cache invalidation messages.
Does anyone know any potential pitfalls of this approach? I'm a little nervous because I can't find anyone else that is doing this in production. The only problems I see would be that each server needs more memory (in-memory cache), and it might take a little longer for any given server to get the updated data. Anything else?
I am a little bit confused about your problem here, so I am going to restate in a way that makes sense to me, then answer my version of your question. Please feel free to comment if I am not in line with what you are thinking.
You have a web application that uses a process-local memory cache for data. You want to expand to multiple nodes and keep this same structure for your program, rather than rely upon a 3rd party tool (memcached, Couchbase, Redis) with built-in cache replication. So, you are thinking about rolling your own using RabbitMQ to publish the changes out to the various nodes so they can update the local cache accordingly.
My initial reaction is that what you want to do is best done by rolling over to one of the above-mentioned tools. In addition to the obvious development and rigorous testing involved, Couchbase, Memcached, and Redis were all designed to solve the problem that you have.
Also, in theory you would run out of available memory in your application nodes as you scale horizontally, and then you will really have a mess. Once you get to the point when this limitation makes your app infeasible, you will end up using one of the tools anyway at which point all your hard work to design a custom solution will be for naught.
The only exceptions to this I can think of are if your app is heavily compute-intensive and does not use much memory. In this case, I think a RabbitMQ-based solution is easy, but you would need to have some sort of procedure in place to synchronize the cache between the servers on occasion, should messages be missed in RMQ. You would also need a way to handle node startup and shutdown.
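To make the moving parts concrete, here is a rough sketch of the idea under discussion: a local cache per node, an invalidation broadcast, and a TTL as a safety net for missed messages. It is written in Python rather than Java for brevity. `Bus` is a hypothetical in-process stand-in for a RabbitMQ fanout exchange, and `LocalCache`, the loader, and the 30-second TTL are assumptions for illustration, not anything from the original setup.

```python
import time


class Bus:
    """Hypothetical in-process stand-in for a RabbitMQ fanout exchange."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        for callback in self._subscribers:
            callback(key)


class LocalCache:
    """Per-node cache that drops entries when an invalidation arrives."""

    def __init__(self, bus, loader, ttl=30.0):
        self._entries = {}  # key -> (value, expires_at)
        self._bus = bus
        self._loader = loader
        self._ttl = ttl
        bus.subscribe(self._on_invalidate)

    def _on_invalidate(self, key):
        self._entries.pop(key, None)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]
        # Missing or expired: reload. The TTL bounds how long a node can
        # stay stale if an invalidation message was lost.
        value = self._loader(key)
        self._entries[key] = (value, time.monotonic() + self._ttl)
        return value

    def write(self, key, value, store):
        store[key] = value
        self._bus.publish(key)  # tell every node, including this one


database = {"price": 10}  # pretend shared data store
bus = Bus()
node_a = LocalCache(bus, database.get)
node_b = LocalCache(bus, database.get)

before = node_b.get("price")         # node_b now holds 10 locally
node_a.write("price", 12, database)  # broadcast drops node_b's copy
after = node_b.get("price")          # node_b reloads and sees 12
```

This sketch leaves out node startup and shutdown, and messages that arrive out of order, which is where most of the testing effort would go.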
Edit
In consideration of your statement in the comments that you are seeing access times in the hundreds of milliseconds, I'm going to advise that you first examine your setup. Typical read times for a single item in the cache from a Memcached (or Couchbase, or Redis, etc.) instance are sub-millisecond (somewhere around .1 milliseconds if I remember correctly), so your "problem child" of a cache server is several orders of magnitude from where it should be in terms of performance. Start there, then see if you still have the same problem.
We're using something similar for data which is read-only and doesn't need to be updated every time. I doubt that this is a good plan for you, though. Just imagine: you would need one more service on each instance to monitor the queue and apply changes to the in-memory storage. This is very hard to test.
Are you sure that most of the time is spent on communication between your servers? Maybe you are making multiple calls per request?
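If multiple calls do turn out to be the issue, batching is the usual remedy: memcached supports fetching several keys in one request, and most clients expose it (Django's cache API calls it `get_many`). The sketch below only counts round trips. `CountingCache` is a hypothetical stand-in, not a real client, and the key names are invented.

```python
class CountingCache:
    """Hypothetical stand-in that counts round trips to the cache server."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1  # one network round trip per key
        return self._data.get(key)

    def get_many(self, keys):
        self.round_trips += 1  # one round trip for the whole batch
        return {k: self._data[k] for k in keys if k in self._data}


keys = [f"user:{i}" for i in range(50)]
cache = CountingCache({k: k.upper() for k in keys})

# Fetching keys one at a time pays the network latency 50 times.
one_by_one = {k: cache.get(k) for k in keys}
trips_single = cache.round_trips

# Fetching them as a batch pays it once.
cache.round_trips = 0
batched = cache.get_many(keys)
trips_batched = cache.round_trips
```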
I have an app which has a search feature. This feature looks up the search term in a giant object (dictionary) that I cache for 24 hours. The object is about 50,000 keys and weighs roughly 10MB.
When I profile the memory usage on my hosting, I notice that after a few queries, the memory usage goes from around 50MB to over 450MB, prompting my hosting provider to kill the app.
So I'm wondering what is going on here. Specifically, how does the cache utilize the memory on each request and what can I do to fix this?
Django's FileBasedCache is known for having performance issues. You can get the big picture from the following links:
A smarter filebasedcache for Django
Bug: File based cache not very efficient with large amounts of cached files
The bug was closed as wontfix, with this argument:
I'm going to wontfix, on the grounds that the filesystem cache is intended as an easy way to test caching, not as a serious caching strategy. The default cache size and the cull strategy implemented by the file cache should make that obvious.
Consider using a key-value store like Memcached or Redis as your cache backend, because they both support expiry. Also consider a dedicated search engine like ElasticSearch if more of your anticipated features will be search-related.
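One way a key-value store can help here, offered as an assumption about the cause rather than a diagnosis: if the whole dictionary sits under a single cache key, every search has to read and deserialize the entire object just to check one term. The sketch below contrasts that with one entry per term. Plain dicts stand in for the cache backend, and the `search:` key prefix and the sample data are invented for illustration.

```python
import pickle

# Sample index: 50,000 terms, each mapped to a small list of hits.
index = {f"term{i}": [i, i + 1] for i in range(50_000)}

# One blob: every lookup deserializes the entire index.
blob_store = {"search_index": pickle.dumps(index)}


def lookup_blob(term):
    whole_index = pickle.loads(blob_store["search_index"])
    return whole_index.get(term)


# One entry per term: a lookup only touches the bytes for that term.
per_key_store = {
    f"search:{term}": pickle.dumps(hits) for term, hits in index.items()
}


def lookup_per_key(term):
    raw = per_key_store.get(f"search:{term}")
    return pickle.loads(raw) if raw is not None else None


# Bytes that must be read and unpickled to answer a single query.
bytes_read_blob = len(blob_store["search_index"])
bytes_read_per_key = len(per_key_store["search:term42"])
```

With Memcached or Redis each per-term entry can also carry its own expiry, so the 24-hour lifetime still applies.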
Tools and how-tos are available:
Installing memcached for a django project
http://code.google.com/p/memcached/wiki/NewStart
http://redis.io/commands/expire
https://github.com/bartTC/django-memcache-status
http://www.elasticsearch.org/guide/reference/index-modules/cache.html
How do you tune Django for better performance? Is there some guide? I have the following questions:
Is mod_wsgi the best solution?
Is there some opcode cache like in PHP?
How should I tune Apache?
How can I set up my models, so I have fewer/faster queries?
Can I use Memcache?
Comments on a few of your questions:
Is mod_wsgi the best solution?
Apache/mod_wsgi is adequate for most people because they will never have enough traffic to cause problems even if Apache hasn't been set up properly. The web server is generally never the bottleneck.
Is there some opcode cache like in PHP?
Python caches compiled code in memory and the processes persist across requests. You thus don't need a separate opcode caching product like PHP as that is what Python does by default. You just need to ensure you aren't using a hosting mechanism or configuration that would cause the processes to be thrown away on every request or too often. Don't use CGI for example.
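A small sketch of what that looks like in practice, using only the standard library. The path `myapp/views.py` is hypothetical and does not need to exist; `json` is just a convenient module to import twice.

```python
import importlib
import importlib.util
import sys

# Where CPython would store the compiled bytecode for a source file.
pyc_path = importlib.util.cache_from_source("myapp/views.py")

# Within one process, a second import returns the already-loaded module
# object from sys.modules instead of compiling the source again.
first = importlib.import_module("json")
second = importlib.import_module("json")
same_object = first is second and sys.modules["json"] is first
```

As long as the process keeps running between requests, that loaded module keeps being reused, which is why throwing processes away on every request (as CGI does) loses the benefit.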
How should I tune Apache?
Without knowing anything about your application or the system you are hosting it on, one can't give easy guidance as to how you need to set up Apache. This is because throughput, duration of requests, amount of memory used, amount of memory available, number of processors and much more come into play. If you haven't even written your application yet, then you are simply jumping the gun here, because until you know more about your application and production load you can't optimally tune Apache.
A few simple suggestions though.
Don't host PHP in same Apache.
Use Apache worker MPM.
Use mod_wsgi daemon mode and NOT embedded mode.
This alone will save you from causing too much grief for yourself to begin with.
If you genuinely need to better tune your complete stack, i.e., application and web server, and are not just prematurely optimising because you think you are going to have the next Facebook even though you haven't really written any code yet, then you need to start looking at performance monitoring tools to work out what your application is doing. Your application and database are going to be the real source of your problems, not the web server.
The sort of performance monitoring tool I am talking about is something like New Relic. Even then though, if you are very early days and haven't got anything deployed even, then that itself would be a waste of time. In other words, just get your code working first before worrying about how to run it optimally.
My setup is nginx in front, serving static files and reverse proxying to Apache, which runs Django via mod_wsgi. I want to add memcached to this setup. I don't have more traffic than my server can handle today, but it will grow soon, so it's best to be ready beforehand.
I see two options. The first is using Django's native memcached module, which handles many things automatically (AFAIK, please confirm in the comments), such as removing the related key when a database entry is updated, and maybe pages for authenticated users (confirm please).
The other is implementing memcached on nginx. Having the front server be responsible for caching seems more semantic to me; I am not quite sure of that, but it is like a division of responsibility. However, if I choose this way, I have to write more code for releasing cache keys on updates and for user authentication. This will take some time of course, but I am in no rush.
The first one is the easy way; the second one is harder but seems more logical. What would be the best option in terms of manageability, response times, and the work required to implement? Would it be worth it?
Also, there is only one site I am hosting that would require caching right now, but there will be more sites in the future and they may not be based on Python. You might want to take this into account.
There may be an advantage to going the nginx route... but I'm not seeing it.
The advantages to using Django's module:
You can set data to cache, such as expensive queries and API call results, rather than be locked into caching the whole view.
It's easy, and then you can get back to making your application cool.
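A minimal sketch of the first advantage, with one caveat: as far as I know, Django's cache framework does not drop keys on its own when a database row changes, so the code that writes the data also has to delete the key. `DictCache` is a hypothetical stand-in for `django.core.cache.cache` so the snippet runs without Django; `articles`, `article_key`, and the other names are invented for illustration.

```python
class DictCache:
    """Hypothetical stand-in for django.core.cache.cache (get/set/delete)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value, timeout=None):
        self._data[key] = value  # timeout is ignored in this stand-in

    def delete(self, key):
        self._data.pop(key, None)


cache = DictCache()
articles = {1: "First draft"}  # pretend database table
stats = {"queries": 0}         # counts simulated database hits


def article_key(article_id):
    return f"article:{article_id}"


def get_article(article_id):
    body = cache.get(article_key(article_id))
    if body is None:
        stats["queries"] += 1  # only a cache miss reaches the database
        body = articles[article_id]
        cache.set(article_key(article_id), body, timeout=300)
    return body


def update_article(article_id, body):
    articles[article_id] = body
    # Explicit invalidation: the next read reloads the fresh value.
    cache.delete(article_key(article_id))
```

Doing the same at the nginx layer would mean purging whole cached pages from outside the application, which is the extra work the question anticipates.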
I'm slowly getting into the position where one of my Django sites needs some robustness behind it. I'm currently running on a single VPS on a SQLite database with memcached. It's about as un-scaled as things can get.
If I bought another VPS account, what would I want to do?
Move to MySQL/PostgreSQL with replication? What's easiest? Does replication protect me from one server exploding? Are there concurrency downsides?
How do I load-balance between the two servers?
I'd put memcached on the new server too. If I put both IPs into the configuration, would that keep a copy of data on both servers? (I'm thinking of what happens to session data - currently stored in memcached)
I'm currently using Cherokee as the httpd - I'm sure this has its own set of issues. If you've any tips, let me know.
Am I going at this the wrong way? Is there an easier way to have faster, more robust django sites?
First step: switch from SQLite to a real production database (I like Postgres). This should happen long before you even think about a second VPS. SQLite essentially does not support concurrency at all. Personally, I wouldn't even consider deploying a live site on SQLite in the first place.
If your site is running on SQLite and is functioning, my guess is you are still quite a long ways from actually outgrowing your single VPS (unless it's already heavily loaded otherwise).
If/when you do need to add a second server, how you configure things depends on where you're actually seeing a bottleneck. Chances are it'll be the database, in which case a good step might be simply moving the database onto its own server (presuming you can guarantee low latency between the two VPSes) and loading the database server with as much RAM as you can afford. In general disk performance suffers most in a VPS, so another step to consider might be putting the DB onto raw metal.
I'd probably look at those steps before I'd think about DB replication or multiple web-tier servers, but it really depends on profiling your actual case (and how you value performance vs reliability).
Watching the Django Deployment Workshop by Jacob Kaplan-Moss should give you a good overview.
MySQL supports Master-Slave and Master-Master setups. I don't use PostgreSQL, so I can't comment on it.
You can use nginx as your load balancer. HAProxy is an option, too (Stack Overflow uses it).
Memcached distributes the objects over the servers; it does not keep a copy on each. If one server crashes, the data stored on it is lost.
I don't know Cherokee, but nginx is great.
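To illustrate the memcached point above: the client picks one server per key, so listing two IPs spreads the keys out rather than duplicating them. The sketch uses simple modulo hashing, which is an assumption made for clarity; real clients may use other schemes such as consistent hashing. The server addresses and key names are made up.

```python
import hashlib


def server_for(key, servers):
    """Pick exactly one server for a key (simplified modulo hashing)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


servers = ["10.0.0.1:11211", "10.0.0.2:11211"]
keys = [f"session:{i}" for i in range(1000)]

# Each key lands on one server only; nothing is mirrored.
placement = {key: server_for(key, servers) for key in keys}

# If the second server dies, every session stored there is gone.
lost_if_second_dies = [k for k, s in placement.items() if s == servers[1]]
```

For session data that must survive a crash, that suggests keeping sessions somewhere durable, such as the database, rather than relying on memcached alone.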