Worker threads/Queues to process datasets after an upload?

Worker threads/Queues to process datasets after an upload? - django

I'm writing a web application with Django where users can upload files with statistical data.
The data needs to be processed before it can be properly used (each dataset can take up to a few minutes of time before processing is finished). My idea was to use a python thread for this and offload the data processing into a separate thread.
However, since I'm using uwsgi, I've read about a feature called "Spoolers". The documentation on that is rather short, but I think it might be what I'm looking for. Unfortunately the -Q option for uwsgi requires a directory, which confuses me.
Anyway, what are the best practices to implement something like worker threads which don't block uwsgi's web workers so I can reliably process data in the background while still having access to Django's database/models? Should I use threads instead?

All of the offloading subsystems need some kind of 'queue' to store the 'things to do'.
uWSGI Spooler uses a printer-like approach where each file in the directory is a task. When the task in done the file is removed. Other systems relies on more heavy/advanced servers like rabbitmq and so on.
Finally, do not directly use the low-level api of the spooler but rely on decorators:
http://projects.unbit.it/uwsgi/wiki/Decorators

Related

Multiprocess web server with ocaml

I want to make webserver with ocaml. It will have REST interface and will have no dependencies (just searching in constant data loaded to RAM on process startup) and serve read only queries (which can be served from any node - result will be the same).
I love OCaml, however, I have one problem that it can only process using on thread at a time.
I think of scaling just by having nginx in front of it and load balance to multiple process instances running on different ports on the same server.
I don't think I'm the only one running into this issue, what would be the best tool to keep running few ocaml processes at a time and to ensure that if any of them crash they would be restarted and have different ports from each other (to load balance between them)?
I was thinking about standard linux service but I don't want to create like 4 hardcoded records and call service start webserver1 on each of them.

Is there a strong requirement for multiple operating system processes? Otherwise, it seems like you could just use something like cohttp with either lwt or async to handle concurrent requests in the same OS process, using multiple threads (and an event-loop).
As you mentioned REST, you might be interested in ocaml-webmachine which is based on cohttp and comes with well-commented examples.

Django and Websockets: How to create, in the same process, an efficient project with both WSGI and websockets?

I'm trying to do a Django application with an asynchronous part: Websockets. Just as a little challenge, I want to mount everything in the same process. Tried Socket.IO but couldn't manage to actually use sockets, instead of longpolling (which killed my browser several times, until I gave up).
What I then tried was a not-so-maintained library based on gevent-websocket. However, had many errors and was not easy to debug.
Now I am trying a Tornado approach but AFAIK (please correct me if I'm wrong) integrating async with a regular django app wrapped by WSGIContainer (websockets would go through Tornado, regular connections through Django) will be a true server killer if a resource is heavy or, somehow, the Django ORM goes slow into heavy operations.
I was thinking on moving to Twisted/Cyclone. Before I move from one architecture with such issue to ANOTHER architecture with such issue, i'd like to ask:
Does Tornado (and/or Twisted) have an architecture of scheduling tasks in the same way Gevent does? (this means: when certain greenlets "block", they schedule themselves to other threads, at least until the operation finishes). I'm asking this because (please correct me if I'm wrong) a regular django view will not be suitable for stuff like #inlineCallbacks, and will cause the whole server to be blocked (incl. the websockets).
I'm new to async programming in python, so there's a huge change I have misinformation about more than one concept. Please help me clarifying this before I switch.

Neither Tornado nor Twisted have anything like gevent's magic to run (some) blocking code with the performance characteristics of asynchronous code. Idiomatic use of either Tornado or Twisted will be visible throughout your app in the form of callbacks and/or Futures/Deferreds.
In general, since you'll need to run multiple python processes anyway due to the GIL, it's usually best to dedicate some processes to websockets with Tornado/Twisted and other processes to Django with the WSGI container of your choice (and then put nginx or haproxy in front so it looks like a single service to the outside world).
If you still want to combine django and an asynchronous service in the same process, the next best solution is to use threads. If you want the two to share one listening port, the listener must be a websocket-aware HTTP server that can spawn other threads for WSGI requests. Tornado does not yet have a solution for this, although one is planned for version 4.1 (https://github.com/tornadoweb/tornado/pull/1075). I believe Twisted's WSGI container does support running the WSGI workers in threads, but I don't have any experience with it myself. If you need them in the same process but do not need to share the same port, then you can simply run the IOLoop or Reactor in one thread and the WSGI container of your choice in another (with its associated worker threads).

How to use run distribute tasks on worker nodes in a Clojure app?

On Python/Django stack, we were used to using Celery along with RabbitMQ.
Everything was easily done.
However when we tried doing the same thing in Clojure land, what we could get was Langhour.
In our current naive implementation we have a worker system which has three core parts.
Publisher module
Subscriber module
Task module
We can start the system on any node in either publisher or subscriber mode.
They are connected to RabbitMQ server.
They share one worker_queue.
What we are doing is creating tasks in Task module, and then when we want to run a task on subscriber. we send an expression call to the method, in EDN format to Subscriber which then decodes this and runs the actual task using eval.
Now is using eval safe ? we are not running expressions generated by user or any third party system.Initially we were planning to use JSON to send the payload message but then EDN gave us a lot more flexibility and it works like a charm, as of now.
Also is there a better way to do this ?

Depends on you needs (and your team), I highly suggest Storm Project. You will get a distributed, fault tolerant and realtime computation and it is really easy to use.
Another nice thing in Storm that it supports a plethora of options as the datasource for the topologies. It can be for example: Apache Kafka, RabbitMQ, Kestrel, MongoDB. If you aren't satisfied, then you can write your own driver.
It is also has a web interface to see what is happening in your topology.

How do I handle blocking IO in mod_wsgi/django?

I am running Django under Apache+mod_wsgi in daemon mode with the following config:
WSGIDaemonProcess myserver processes=2 threads=15
My application does some IO on the backend, which could take several seconds.
def my_django_view:
content=... # Do some processing on backend file
return HttpResponse(content)
It appears that if I am processing more than 2 http requests that are handling this kind of IO, Django will simply block until one of the previous requests completes.
Is this expected behavior? Shouldn't threading help alleviate this i.e. shouldn't I be able to process up to 15 separate requests for a given WSGI process, before I see this kind of wait?
Or am I missing something here?

If the processing is in python, then Global Interpreter Lock is not being released -- in a single python process only one thread of python code can be executing at a time. The GIL is usually released inside C code though -- like most I/O, for example.
If this kind of processing is going to happen a lot, you might consider running a second "worker" application as a deamon, reading tasks from the database, performing the operations and writing resulsts back to the database. Apache might decide to kill processes that take too long to respond.

+1 to Radomir Dopieralski's answer.
If the task takes long you should delegate it to a process outside the request-response cycle, either by using a standard cron, or some distributed task queue like Celery

Databases for workload offloading were quite the thing in 2010, and a good idea then, but we've come a bit farther now.
We're using Apache Kafka as a queue to store our in-flight workload. So, Dataflow is now:
User -> Apache httpd -> Kafka -> python daemon processor
User post operation puts data into system to be processed via wsgi app that just writes it very fast to a Kafka queue. Minimal sanity checking is done in the post operation to keep it fast but find some obvious problems. Kafka stores the data very fast so the http response is zippy.
A separate set of python daemons pull data from Kafka and do processing on it. We actually have multiple processes that need to process it differently, but Kafka makes that fast by only writing once and having multiple readers read the same data if needed; no penalty for duplicate storage is incurred.
This allows very, very fast turnaround; optimal resource usage since we have other boxes offline handle the pull-from-kafka and can tune that to reduce lag as needed. Kafka is HA with same data written to multiple boxes in the cluster so my manager doesn't complain about 'what happens if' scenarios.
We're quite happy with Kafka. http://kafka.apache.org

System architecture: simple approach for setting up background tasks behind a web application -- will it work?

I have a Django web application and I have some tasks that should operate (or actually: be initiated) on the background.
The application is deployed as follows:
apache2-mpm-worker;
mod_wsgi in daemon mode (1 process, 15 threads).
The background tasks have the following characteristics:
they need to operate in a regular interval (every 5 minutes or so);
they require the application context (i.e. the application packages need to be available in memory);
they do not need any input other than database access, in order to perform some not-so-heavy tasks such as sending out e-mail and updating the state of the database.
Now I was thinking that the most simple approach to this problem would be simply to piggyback on the existing application process (as spawned by mod_wsgi). By implementing the task as part of the application and providing an HTTP interface for it, I would prevent the overhead of another process that is holding all of the application into memory. A simple cronjob can be setup that sends a request to this HTTP interface every 5 minutes and that would be it. Since the application process provides 15 threads and the tasks are quite lightweight and only running every 5 minutes, I figure they would not be hindering the performance of the web application's user-facing operations.
Yet... I have done some online research and I have seen nobody advocating this approach. Many articles suggest a significantly more complex approach based on a full-blown messaging component (such as Celery, which uses RabbitMQ). Although that's sexy, it sounds like overkill to me. Some articles suggest setting up a cronjob that executes a script which performs the tasks. But that doesn't feel very attractive either, as it results in creating a new process that loads the entire application into memory, performs some tiny task, and destroys the process again. And this is repeated every 5 minutes. Does not sound like an elegant solution.
So, I'm looking for some feedback on my suggested approach as described in the paragraph before the preceeding paragraph. Is my reasoning correct? Am I overlooking (potential) problems? What about my assumption that application's performance will not be impeded?

All are reasonable approaches depending on your specific requirements.
Another is to fire up a background thread within the process when the WSGI script is loaded. This background thread could simply sleep and wake up occasionally to perform required work and then go back to sleep.
This method necessitates though that you have at most one Django process which the background thread runs in to avoid different processing doing the same work on any database etc.
Using daemon mode with a single process as you are would satisfy that criteria. There are potentially other ways you could achieve that though even in a multiprocess configuration.

Note that celery works without RabbitMQ as well. It can use a ghetto queue (SQLite, MySQL, Postgres, etc, and Redis, MongoDB), which is useful in testing or for simple setups where RabbitMQ seems overkill.
See http://ask.github.com/celery/tutorials/otherqueues.html
(Using Celery with Redis/Database as the messaging queue.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js