I have a django app I want to migrate to dotcloud.
Many actions in Django internals and in my app are not asynchronous, i.e. they block the thread until they finish.
When I was using Apache, that didn't pose a problem since a different thread is opened on every request. But it doesn't seem to be the case in nginx/uwsgi that dotcloud use.
Seemingly, uwsgi has a --enable-threads and --threads options that can be used for multithreading, but:
It is not clear what version of uwsgi dotcloud use, and if they support these features
Since I have no one else asking about this, I was wondering if this is really the right way to get the concurrent requests running (using threads)
You could run Django with Gunicorn. Gunicorn, in turn, supports multiple worker classes, and people reported success running gunicorn+gevents+django together[1][2].
To use that on dotCloud, you will probably have to use dotCloud's custom service. If that's something that you want to try, I would personally start with dotCloud's reimplementation of python service using the custom service, and replace uwsgi with gunicorn in it.
I came here looking for some leads, which I found, thanks!
There was a fair amount of leg work left to actually get stuff working, though.
Here is an example app on github that uses gunicorn, gevent, and socketio on dotcloud:
https://github.com/t1m0thy/django-tictactoe/tree/dotcloud
Threads is a problem in python - GIL doesn't allow them to run simultaneously.
So multiprocessing is an answer.
Or you may take a look at gevent. Actually gevent is a kind of a hack (monkey patching of python stack) and so on, but it allows to launch green threads.
I'm not sure if gevent can be combined with django, but google knows ;)
Related
Introduction
I encountered this very interesting issue this week, better start with some facts:
pylibmc is not thread safe, when used as django memcached backend, starting multiple django instance directly in shell would crash when hit with concurrent requests.
if deploy with nginx + uWSGI, this problem with pylibmc magically dispear.
if you switch django cache backend to python-memcached, it too will solve this problem, but this question isn't about that.
Elaboration
start with the first fact, this is how I reproduced the pylibmc issue:
The failure of pylibmc
I have a django app which does a lot of memcached reading and writing, and there's this deployment strategy, that I start multiple django process in shell, binding to different ports (8001, 8002), and use nginx to do the balance.
I initiated two separate load test against these two django instance, using locust, and this is what happens:
In the above screenshot they both crashed and reported exactly the same issue, something like this:
Assertion "ptr->query_id == query_id +1" failed for function "memcached_get_by_key" likely for "Programmer error, the query_id was not incremented.", at libmemcached/get.cc:107
uWSGI to the rescue
So in the above case, we learned that multi-thread concurrent request towards memcached via pylibmc could cause issue, this somehow doesn't bother uWSGI with multiple worker process.
To prove that, I start uWSGI with the following settings included:
master = true
processes = 2
This tells uWSGI to start two worker process, I then tells nginx to server any django static files, and route non-static requests to uWSGI, to see what happens. With the server started, I launch the same locust test against django in localhost, and make sure there's enough requests per seconds to cause concurrent request against memcached, here's the result:
In the uWSGI console, there's no sign of dead worker processes, and no worker has been re-spawn, but looking at the upper part of the screenshot, there sure has been concurrent requests (5.6 req/s).
The question
I'm extremely curious about how uWSGI make this go away, and I couldn't learn that on their documentation, to recap, the question is:
How did uWSGI manage worker process, so that multi-thread memcached requests didn't cause django to crash?
In fact I'm not even sure that it's the way uWSGI manages worker processes that avoid this issue, or some other magic that comes with uWSGI that's doing the trick, I've seen something called a memcached router in their documentation that I didn't quite understand, does that relate?
Isn't it because you actually have two separate processes managed by uWSGI? As you are setting the processes option instead of the workers option, so you should actually have multiple uWSGI processes (I'm assuming a master + two workers because of the config you used). Each of those processes will have it's own loaded pylibmc, so there is not state sharing between threads (you haven't configured threads on uWSGI after all).
I'm trying to do a Django application with an asynchronous part: Websockets. Just as a little challenge, I want to mount everything in the same process. Tried Socket.IO but couldn't manage to actually use sockets, instead of longpolling (which killed my browser several times, until I gave up).
What I then tried was a not-so-maintained library based on gevent-websocket. However, had many errors and was not easy to debug.
Now I am trying a Tornado approach but AFAIK (please correct me if I'm wrong) integrating async with a regular django app wrapped by WSGIContainer (websockets would go through Tornado, regular connections through Django) will be a true server killer if a resource is heavy or, somehow, the Django ORM goes slow into heavy operations.
I was thinking on moving to Twisted/Cyclone. Before I move from one architecture with such issue to ANOTHER architecture with such issue, i'd like to ask:
Does Tornado (and/or Twisted) have an architecture of scheduling tasks in the same way Gevent does? (this means: when certain greenlets "block", they schedule themselves to other threads, at least until the operation finishes). I'm asking this because (please correct me if I'm wrong) a regular django view will not be suitable for stuff like #inlineCallbacks, and will cause the whole server to be blocked (incl. the websockets).
I'm new to async programming in python, so there's a huge change I have misinformation about more than one concept. Please help me clarifying this before I switch.
Neither Tornado nor Twisted have anything like gevent's magic to run (some) blocking code with the performance characteristics of asynchronous code. Idiomatic use of either Tornado or Twisted will be visible throughout your app in the form of callbacks and/or Futures/Deferreds.
In general, since you'll need to run multiple python processes anyway due to the GIL, it's usually best to dedicate some processes to websockets with Tornado/Twisted and other processes to Django with the WSGI container of your choice (and then put nginx or haproxy in front so it looks like a single service to the outside world).
If you still want to combine django and an asynchronous service in the same process, the next best solution is to use threads. If you want the two to share one listening port, the listener must be a websocket-aware HTTP server that can spawn other threads for WSGI requests. Tornado does not yet have a solution for this, although one is planned for version 4.1 (https://github.com/tornadoweb/tornado/pull/1075). I believe Twisted's WSGI container does support running the WSGI workers in threads, but I don't have any experience with it myself. If you need them in the same process but do not need to share the same port, then you can simply run the IOLoop or Reactor in one thread and the WSGI container of your choice in another (with its associated worker threads).
First of all please let me be clear that I am a windows user and very new to the web world. For the past months I have been learning both python and django, and it has been a great experience for me. Now I have somehow created a small project that I would like to deploy in the production server. Since django has its built-in development server there was no problem for me. But now that I have to deploy it to a production server I googled around and found Nginx + uWSGI or Nginx + Gunicorn as the best option for it. And as uWSGI and Gunicord are incompatible with Windows, I think I should adapt Ubuntu or other Unix system.
So my questions are:
Just to be clear, as I will have to work with one of the above, please explain to me why do I need two servers?
If I have to adapt the Ubuntu environment, do I have to learn Ubuntu shell scripting, SSH and other stuff? Or the hosting provider will help me do that?
Please let me be aware of what else do I need for the above concerned.
Thank you so much for your time and please pardon if my question was a lame question. Hoping for positive response answers.
A typical configuration involves two server processes (which can be run together on the same actual hardware or virtual server) so that the proxy server in front can buffer slow clients. For instance: a slow client will connect to nginx with a request. Nginx will pass the request on to Gunicorn and Gunicorn will respond. Nginx will then consume the Gunicorn response immediately, freeing up the Gunicorn resources right away. At that point, the slow client can take as much time as it wants to consume the response from Nginx without tying up much in the way of server resources. Alternatives to the two-server-process model are to use async workers with Gunicorn and put Gunicorn itself in front, or to use an async-sync combo like Waitress. Nginx in front has the added benefit of doubling as a ready-to-use statics server, though.
Note that "slow clients" can describe: mobile phones that lose their connection and leave the TCP socket hanging until timeout mid-request; mobile phones that are just slow; unreliable connections of all types; hostile denial-of-service clients who are deliberately trying to use server resources; sometimes any old connection that has a hiccup or malfunction for any reason. So this is a problem that will affect nearly any site.
You won't need shell scripting per se but getting used to Ubuntu will take some time. There is a lot to learn even outside of scripting, like how to use the package manager, how to configure packages once they're installed in ways that won't confound future updates, etc. And you will definitely have to learn to use SSH; it is one of the most fundamental server administration tools in the *nix world.
An alternative to learning to use Ubuntu or another server platform is to use a Platform-as-a-Service option like Heroku, as PaaS hosting providers really will take care of all of that stuff for you. I recommend this approach. That having been said, even though I think PaaS is a good option for people who want to focus on development and not server admin regardless of their level of skill, it's also true that a little bit of experience with Linux server platforms goes a long way in helping you to understand the environment that your code runs in. So even if you go with PaaS, you would still benefit from tinkering with Ubuntu a little (or a lot).
Another benefit from a PaaS is that normally their infrastructure handles the Nginx part of the deal (buffering of slow requests via proxy). This is the case with Heroku, for instance. So you won't have to worry about that part of the infrastructure at all.
This part of the question is too broad to answer, but let me know in the comments if you need clarification.
I'm doing it almoast like in this tutorial: http://michal.karzynski.pl/blog/2013/06/09/django-nginx-gunicorn-virtualenv-supervisor/
Nginx is my proxy to django app running on gunicorn and its serving statics, virtualenv for my python enviroment, supervisor to watch my app's running.
It's possible you will run in some error's if not using Postgresql, ask then I will help (used MySQL in the past now it's Postgresql)
Firstly, there's no need to use Ubuntu if you're happier with Windows. I don't know if nginx works on Windows, but I'd be very surprised if it doesn't (in fact, here are the nginx docs for installing on Windows). Apache, meanwhile, definitely does work on Windows. The Django documentation has a full explanation of how to set up Apache/mod_wsgi to serve Django.
You don't need two servers. I'm not sure why you think you do: the usual reason for that is to have the static assets on a separate server, but you don't mention that as a reason. Since you're only talking about a small site, though, you don't even need to do that. One server configured to serve both Django and the static assets will do fine. Again, the docs explain exactly how to do that.
I'm currently working on a project and I'd like to integrate asynchronous task processing as well as some sort of message queue early on so that I'll be able to scale up quickly by simply adding message queue processor servers to the cluster.
I came across Celery a while back and it caught my eye. Since it's pretty well integrated with Django, I figured I'd get pretty good support with it. I'm just not really sure how to start, as there's a lot of configuration involved.
For now, I'm running just about everything out of my Django project (serving static files, pipeline, etc.) so I'd like to have a messaging queue built in to run with django runserver if possible. (Don't worry, this is only for development.) How can I get started using Celery with my existing Django project?
djkombu is now deprecated, the django transport is now directly integrated in the kombu package.
For defining the backend in your Django settings.py, you can use:
BROKER_BACKEND = "django"
You can find different transport aliases from Kombu here.
This was tested with django-celery 2.5.5, celery 2.5.3 and kombu 2.1.8.
Celery has quite a nice documentation, also for those getting started, but two facts worth being mentioned for beginners:
Use djkombu as the BROKER_BACKEND. This will give you a pretty simple message queue for development, where all messages are stored in the SQL database used by django. Due to celery's api you can easily replace it with a "real" message queue for production:
BROKER_TRANSPORT = "kombu.transport.django"
Django-celery has a setting CELERY_ALWAYS_EAGER. If set to True there will be no asynchronous background processing, all tasks that are getting called via celery will be run synchronously (so no need to start any additional celery workers - very useful for debugging as well).
My goal is to create an application that will be able to do long-lasting mainly system tasks, such as:
checking out code from the repositories,
copying directories between various localizations,
etc.
The problem is I need to prepare it somehow independently from the web browser. I mean that for example after starting the checkout/copy action, closing the web browser will not interrupt the action. So after going back to that site I can see that the copying goes on or another action started when the browser was closed...
I was searching through various tools, like RabbitMQ + Celery, Twisted, Pyro, XML-RPC but I don't know if any of these will be suitable for me. Has anyone encountered similar needs when creating Django app? Please let me know if there are any methods/packages that I should know. Code samples also will be more than welcome!
Thank you in advance for your suggestions!
(And sorry for my bad English. I'm working on it.)
Basically you need to have a process that runs outside of the request. The absolute simplest way to do this (on a Unix-like operating system, at least) is to fork():
if os.fork() == 0:
do_long_thing()
sys.exit(0)
… continue with request …
This has some downsides, though (ex, if the server crashes, the “long thing” will be lost)… Which is where, ex, Celery can come in handy. It will keep track of the jobs that need to be done, the results of jobs (success/failure/whatever) and make it easy to run the jobs on other machines.
Using Celery with a Redis backend (see Kombu's Redis transport) is very simple, so I would recommend looking there first.
You might need to have a process outside the request / response cycle. If that is the case, Celery with a Redis backend is what I would suggest looking into, as that integrates nicely with Django (as David Wolever suggested).
Another option is to create Django management commands, and then use cron to execute them at scheduled intervals.