Implementing memcached: Django's native cache module, or nginx (front server, reverse proxy)?

I have a setup with nginx in front serving static files and reverse-proxying to Apache, which runs Django via mod_wsgi, and I want to add memcached to it. My traffic isn't so heavy today that the server can't handle it, but it will grow soon, and it's best to be ready beforehand.
I see two options: the first is using Django's native memcached support, which handles many things automatically (afaik; please confirm in the comments), such as removing the related key when a database entry is updated, and perhaps handling pages for authenticated users (please confirm).
The other is implementing caching in nginx. Making the front server responsible for caching seems more semantically correct to me; I'm not entirely sure of that, but it feels like a sound division of responsibility. However, if I choose this route, I have to write more code to release cache keys on updates and to handle user authentication. That will take some time of course, but I am in no rush.
The first option is the easy way; the second is harder but seems more logical. Which would be the best option in terms of manageability, response times, and the work required to implement it? Would it be worth it?
Also, there is only one site I'm hosting that would require caching right now, but there will be more sites in the future, and they may not be based on Python. You might want to consider this.

There may be an advantage to going the nginx route... but I'm not seeing it.
The advantages to using Django's module:
You can cache individual pieces of data, such as expensive queries and API call results, rather than being locked into caching the whole view (see the sketch below).
It's easy, and then you can get back to making your application cool.
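For illustration, here is a minimal sketch of Django's low-level cache API, assuming a configured cache backend; the key format and the fetch_weather() helper are invented stand-ins for any expensive lookup:

    from django.core.cache import cache

    def fetch_weather(city):
        # hypothetical stand-in for an expensive query or API call
        return {"city": city, "temp_c": 21}

    def get_weather(city):
        key = "weather:%s" % city
        data = cache.get(key)
        if data is None:
            data = fetch_weather(city)
            cache.set(key, data, timeout=15 * 60)  # keep for 15 minutes
        return data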


What exactly is caching and how do I add it to an app I have on Heroku?

I have a data-science-type application where I am pulling public information from the FPDS and SAM government websites. The site is currently on Heroku.
I would like to cache views so that if a person is researching more than one company, they can quickly go back to earlier pages without fetching the results from the database every time.
Based on my limited knowledge, that is what caching does?
Second, I am looking at Flask-Caching, and it doesn't appear to be that difficult to apply to the routes I would like to cache.
Now the question is: on Heroku, you wouldn't use SimpleCache, would you? Would you use a different cache strategy? From the docs, CACHE_TYPE can be simple, redis, memcached, and several more. On Heroku, would I need to store the cache in something like Redis, or can I store it in memory? Ideally, to get everything up and running, I would like the cache to be in memory.
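For reference, the kind of setup I have in mind looks roughly like this, a sketch assuming Flask-Caching with its in-memory SimpleCache backend; the route and the render_company() helper are made up:

    from flask import Flask
    from flask_caching import Cache

    app = Flask(__name__)
    app.config["CACHE_TYPE"] = "SimpleCache"  # "simple" in older Flask-Caching versions
    cache = Cache(app)

    def render_company(company_id):
        # stand-in for the slow FPDS/SAM lookups
        return "Company %d" % company_id

    @app.route("/company/<int:company_id>")
    @cache.cached(timeout=300)  # serve the cached page for 5 minutes
    def company(company_id):
        return render_company(company_id)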
Late answer to your question. Caching covers a number of techniques on the client and server side used to reduce traffic, network transport, or response time.
I'll focus on one aspect from what you are asking: a redis integration with flask to achieve faster response from a flask app environment. Redis is 'blindingly' fast, imo, as an in-memory database. When I have many users asking for the same view (typically a report-style display), I can interrupt the view route to get the response from a named redis database, so that my flask server is not bound up in eternally regenerating the same contents, which in turn saves a good few cycles of the main back-end database. Of course, if the contents of that view/report change, I have to separately take care of that. Most importantly, Redis includes an expiry value for each entry, so one way of handling stale contents is to delete the redis contents ahead of the expiry time.
Let me know if you want sample code to demonstrate this.
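In the meantime, here is a minimal sketch of the pattern, assuming a local Redis instance and the redis-py client; the /report route and the report contents are invented for illustration:

    import redis
    from flask import Flask

    app = Flask(__name__)
    r = redis.Redis(host="localhost", port=6379, db=0)

    def render_expensive_report():
        # stand-in for a slow, report-style view body
        return "<h1>Monthly report</h1>"

    @app.route("/report")
    def report():
        cached = r.get("report:monthly")
        if cached is not None:
            return cached.decode("utf-8")  # served straight from Redis
        html = render_expensive_report()
        r.set("report:monthly", html, ex=300)  # Redis expires the key after 5 minutes
        return html

Deleting the key (or simply letting it expire) is how you handle stale contents when the underlying data changes.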

Should I implement revisioning using database triggers or using django-reversion?

We're looking into implementing audit logs in our application and we're not sure how to do it correctly.
I know that django-reversion works and works well but there's a cost of using it.
The web server will have to make two round trips to the database when saving a record, even if the save is in the same transaction, because (at least in Postgres) the changes are first written to the database and committing the transaction is what makes them visible.
So this will block the web server until the revision is saved to the database if we're not using async I/O, which is currently the case. Even if we used async I/O, generating the revision's data takes CPU time, which again blocks the web server from handling other requests.
We can use database triggers instead but our DBA claims that offloading this sort of work to the database will use resources that are meant for handling more transactions.
Is using database triggers for this sort of work a bad idea?
We can scale both the web servers using a load balancer and the database using read/write replicas.
Are there any tradeoffs we're missing here?
What would help us decide?
You need to think about the pattern of db usage in your website, which may be unique to you. However, most web apps read from the db much more often than they write to it. In fact, it's fairly common to see optimisations done to help scale a web app which trade off more complicated 'save' operations for faster reads. An example would be denormalisation, where some data from related records is copied to the parent record on each save so as to avoid repeatedly doing complicated aggregate/join queries (sketched below).
This is just an example, but unless you know your specific situation is different I'd say don't worry about doing a bit of extra work on save.
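To make the denormalisation example concrete, here is a minimal sketch with invented models; the extra write on save buys cheaper reads:

    from django.db import models

    class Author(models.Model):
        name = models.CharField(max_length=100)
        # denormalised aggregate, updated on every Book save
        book_count = models.PositiveIntegerField(default=0)

    class Book(models.Model):
        author = models.ForeignKey(Author, on_delete=models.CASCADE)
        title = models.CharField(max_length=200)

        def save(self, *args, **kwargs):
            super().save(*args, **kwargs)
            # a bit of extra work on save so list pages avoid a COUNT() per author
            self.author.book_count = self.author.book_set.count()
            self.author.save(update_fields=["book_count"])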
One caveat would be to consider excluding some models from the revisioning system. For example, if you are using Django's db-backed sessions, the session records are saved on every request, and you'd want to avoid doing unnecessary work there.
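With django-reversion, for instance, only the models you explicitly register are versioned, so exclusion is the default; a quick sketch with invented models:

    import reversion
    from django.db import models

    @reversion.register()              # versioned: business data worth auditing
    class Invoice(models.Model):
        total = models.DecimalField(max_digits=10, decimal_places=2)

    class PageView(models.Model):      # not registered: high-churn, no audit value
        path = models.TextField()

Saves performed inside a reversion.create_revision() block then produce revisions only for the registered models.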
As for doing it via triggers vs Django app... I think the main considerations here are not to do with performance:
The Django app solution is more 'obvious' and 'maintainable': the app will be in your pip requirements file and in Django's INSTALLED_APPS, so it's obvious to other developers that it's there and working, and it doesn't need someone to remember to run custom SQL on the db server when you move to a new one.
With a db trigger solution you can be certain it will run whenever a record is changed by any means, whereas with a Django app, anyone changing records via a psql console will bypass it. Even in the Django ORM, certain bulk operations bypass the model save method and save signals. Sometimes this is desirable, however.
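For a flavour of the trigger approach, here is a sketch of installing a Postgres audit trigger from a Django migration; the app, table, and trigger names are invented, and the audit_log table is assumed to already exist:

    from django.db import migrations

    AUDIT_SQL = """
    CREATE OR REPLACE FUNCTION audit_row() RETURNS trigger AS $$
    BEGIN
        -- records every change, no matter which client made it
        INSERT INTO audit_log (table_name, row_data, changed_at)
        VALUES (TG_TABLE_NAME, row_to_json(NEW), now());
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER invoice_audit
    AFTER INSERT OR UPDATE ON myapp_invoice
    FOR EACH ROW EXECUTE PROCEDURE audit_row();
    """

    class Migration(migrations.Migration):
        dependencies = [("myapp", "0001_initial")]
        operations = [migrations.RunSQL(AUDIT_SQL)]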
Another thing I'd point out is that your production webserver will be multi-process/multi-threaded, so although a lengthy db write will block the webserver, it will only block the current process; the other processes can serve other requests concurrently. It won't block the whole webserver.
So again, unless you have a pattern of usage where you anticipate a high frequency of concurrent writes to the db, I'd say probably don't worry about it.

Tracing users' requests by logging their actions to the database in Django

I want to trace users' actions on my website by logging their requests to the database as plain text in Django.
I'm considering writing a custom decorator and placing it on every view that I want to trace.
However, I have some trouble with the design.
First of all, is such a logging mechanism reasonable, or will my log table grow so rapidly that it causes performance problems?
Secondly, how should my log table be designed?
I want to keep the keywords if the user calls the search view, or the item's id if the user calls the item detail view.
Besides, users' IP addresses should be kept, but how can I separate users who connect via a single IP address, as in many companies?
I'd be glad to explain in more detail if you think my question is unclear.
Thanks
I wouldn't do that. If this is a production service then you've got a proper web server running in front of it, right? Apache, or nginx, or something. That can do logging, and can do it well, and can write logs in a form that won't bloat your database, and there's a wealth of analytical tools for log analysis.
You are going to have to duplicate a lot of that functionality in your decorator, such as being able to switch it on or off or change the log level. The only thing you'll gain by doing it all in Django is the possibility of ultra-fine control, such as only logging views of blog posts with id numbers greater than X or something. But generally you wouldn't want that level of detail; you'd log everything and do any stripping at the analysis phase. You haven't given any reason so far why you need to do it from Django.
If you really want it in an RDBMS, reading an Apache log file into Postgres or MySQL or one of those expensive ones is fairly trivial.
One thing you should keep in mind is that SQL databases don't offer very good write performance (compared with reads), so if you are experiencing heavy load you should probably look at a better in-memory solution (e.g. a key-value store like Redis).
But keep in mind that, especially if you use a non-SQL solution, you should be clear about what you want to do with the collected data (just display something like a 'log', or do more in-depth searching/querying on it).
If you want to identify different users behind the same IP address, you should probably look at a cookie-based solution; if you are using Django's session framework, sessions are identified through a cookie by default, so you could simply use the session. Another option is doing the logging 'asynchronously' via JavaScript after the page has loaded in the browser, which could give you more possibilities for identifying the user and avoids extra load when generating the page.
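To tie this back to the decorator idea from the question, here is a minimal sketch that uses the session key to tell apart users behind one IP; the ActionLog model and trace() decorator are invented for illustration:

    import functools
    from django.db import models

    class ActionLog(models.Model):
        path = models.TextField()  # the full path keeps search keywords and item ids
        session_key = models.CharField(max_length=40, blank=True)
        ip = models.GenericIPAddressField(null=True)
        created = models.DateTimeField(auto_now_add=True)

    def trace(view):
        @functools.wraps(view)
        def wrapper(request, *args, **kwargs):
            response = view(request, *args, **kwargs)
            ActionLog.objects.create(
                path=request.get_full_path(),
                session_key=request.session.session_key or "",
                ip=request.META.get("REMOTE_ADDR"),
            )
            return response
        return wrapper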

How can I scale a web app with long response times, which currently uses Django?

I am writing a web application with Django on the server side. It takes ~4 seconds for the server to generate a response for the user. The application uses a weather API and has to make ~50 queries to that API for each user request.
The server side uses Python's urllib to call the weather API, and I used Python's threading to speed the process up because urllib is synchronous. I am using WSGI with Apache. The problem is that the WSGI stack is fully synchronous, and when many users use my application, they have to wait for one another's requests to finish. Since each request takes ~4 seconds, this is unacceptable.
I am kind of stuck, what can I do?
Thanks
If you are using mod_wsgi in a multithreaded configuration, or even a multi process configuration, one request should not block another from being able to do something. They should be able to run concurrently. If using a multithreaded configuration, are you sure that you aren't using some locking mechanism on some resource within your own application which precludes requests running through the same section of code? Another possibility is that you have configured Apache MPM and/or mod_wsgi daemon mode poorly so as to preclude concurrent requests.
Anyway, as mentioned in another answer, you are much better off looking at caching strategies to avoid the weather lookups in the first place, or offloading to client.
50 queries to an outside resource per request is probably a bad place to be, and probably not necessary at all.
The weather doesn't change all that quickly, so you can probably benefit enormously by just caching results for a while. Then it doesn't matter how many requests you're getting; you don't need to do more than a few queries per day.
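For instance, a minimal sketch using Django's cache framework (cache.get_or_set needs Django 1.9+); the fetch_from_weather_api() helper and key format are invented:

    from django.core.cache import cache

    def fetch_from_weather_api(lat, lon):
        # stand-in for one of the ~50 upstream API calls
        return {"lat": lat, "lon": lon, "temp_c": 18}

    def weather_for(lat, lon):
        # round coordinates so nearby lookups share a cache entry
        key = "wx:%.1f:%.1f" % (lat, lon)
        return cache.get_or_set(
            key,
            lambda: fetch_from_weather_api(lat, lon),
            timeout=6 * 60 * 60,  # re-query each location at most ~4 times a day
        )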
If that's not your situation, you might be able to get the client to do the work for you: refactor the code so that the weather API aggregation happens on the client in JavaScript, rather than funneling it all through the server.
Edit: based on comments you've posted, what you are asking for probably cannot be optimized within the constraints of the API you are using. The problem is that the service does a good job of abstracting away the differences between the many sources of weather information it aggregates into a nearest-location query. After all, weather stations provide only point data.
If you talk directly to the technical support people that provide the API, you might find that they are willing to support more complex queries (bounding box), for which they will give you instructions. More likely, though, they abstract that away because they don't want to actually reveal the resolution that their API actually provides, or because there is some technical reason in the way that they model their data or perform their calculations that would make such queries too difficult to support.
Without that or caching, you are just out of luck.

Django redundancy and replication over two VPS accounts

I'm slowly getting to the position where one of my Django sites needs some robustness behind it. I'm currently running on a single VPS with a SQLite database and memcached. It's about as un-scaled as things can get.
If I bought another VPS account, what would I want to do?
Move to MySQL/PostgreSQL with replication? What's easiest? Does replication protect me from one server exploding? Are there concurrency downsides?
How do I load-balance between the two servers?
I'd put memcached on the new server too. If I put both IPs into the configuration, would that keep a copy of the data on both servers? (I'm thinking of what happens to session data, currently stored in memcached.)
I'm currently using Cherokee as the httpd - I'm sure this has its own set of issues. If you've any tips, let me know.
Am I going at this the wrong way? Is there an easier way to have faster, more robust django sites?
First step: switch from SQLite to a real production database (I like Postgres). This should happen long before you even think about a second VPS. SQLite essentially does not support concurrency at all. Personally, I wouldn't even consider deploying a live site on SQLite in the first place.
If your site is running on SQLite and is functioning, my guess is you are still quite a long ways from actually outgrowing your single VPS (unless it's already heavily loaded otherwise).
If/when you do need to add a second server, how you configure things depends on where you're actually seeing a bottleneck. Chances are it'll be the database, in which case a good step might be simply moving the database onto its own server (presuming you can guarantee low latency between the two VPSes) and loading the database server with as much RAM as you can afford. In general disk performance suffers most in a VPS, so another step to consider might be putting the DB onto raw metal.
I'd probably look at those steps before I'd think about DB replication or multiple web-tier servers, but it really depends on profiling your actual case (and how you value performance vs reliability).
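If you do make the Postgres switch, the Django side is just a settings change; a sketch with invented credentials (the exact ENGINE path varies by Django version):

    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql_psycopg2",  # "postgresql" on newer Django
            "NAME": "mysite",
            "USER": "mysite",
            "PASSWORD": "change-me",
            "HOST": "127.0.0.1",  # later: the dedicated DB server's address
            "PORT": "5432",
        }
    }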
Watching the Django Deployment Workshop by Jacob Kaplan-Moss should give you a good overview.
MySQL supports master-slave and master-master setups; I don't use PostgreSQL, so I can't speak for it.
You can use nginx as your load balancer; HAProxy is an option too (SO uses it).
Memcached distributes the objects over the servers, so if one crashes, that part of the data is lost; it does not keep a copy on each server (see the settings sketch below).
I don't know Cherokee, but nginx is great.
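To illustrate the memcached point: listing both servers in Django's settings makes the client shard keys across them rather than mirror them; a sketch with invented addresses (the backend name depends on your Django version and memcached client library):

    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
            # keys are hashed across both nodes, not replicated
            "LOCATION": ["10.0.0.1:11211", "10.0.0.2:11211"],
        }
    }

So session data stored this way is spread over the two nodes; losing one node loses roughly half the entries, not the whole cache.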