I managed to get Django, RabbitMQ and Celery working on a single machine. I followed the instructions from here. Now I want to make them work together when they are on different servers. I do not want Django to know anything about Celery, nor Celery about Django.
So, basically, I just want Django to send a message to a RabbitMQ queue (probably an id, the type of task, maybe some other info), and then I want RabbitMQ to deliver that message (when possible) to Celery on another server. Celery and Django should not know about each other; basically I want an architecture where it is easy to replace any one of them.
Right now I have in my Django several calls like
create_project.apply_async(args, countdown=10)
I want to replace that with similar calls made directly to RabbitMQ (as I said, Django should not depend on Celery). Then RabbitMQ should notify Celery (when possible) and Celery will do its job (probably interacting with Django, but through a REST interface).
Also, I need to have Celery workers on two or more servers, and I want RabbitMQ to notify only one of them depending on some field in the message. If this is too complicated, I could just check in every task (on the different machines) whether this is something it should do (for example by checking an IP address field in the message) and, if it is not, stop executing the task.
How can I achieve this? If possible, I would prefer code + configuration examples, not just a theoretical explanation.
Edit:
I think that for my use case celery is total overhead. Simple RabbitMQ routing with custom clients will do the job. I already tried simple use case (one server) and it works perfectly. It should be easy to make communication multi-server ready. I do not like celery. It is "magical", hides too many details and it is not easy to configure. But I will leave this question alive, because I am interested in others' opinions.
The short of it
How can I achieve this?
Celery only sends the task name and a serialized set of parameters as the message body. That is, your scenario is absolutely in line with how Celery operates.
if possible I would prefer code + configuration examples not just theoretical explanation.
For the client app, i.e. your Django app, define stub tasks, like so:
@task
def foo():
    # Stub only: the client never runs this body, it just needs the task name.
    pass
For the Celery processing, on your remote server, define the actual tasks to be executed.
@task
def foo():
    # The real implementation of the task goes here.
    pass
It is important that the tasks live in the same Python package on both sides (i.e. app.tasks.py), otherwise Celery won't be able to match the message to the actual task.
Note that this also means your Django app becomes untestable if you have set CELERY_ALWAYS_EAGER=True, unless you make the Celery app's tasks.py available locally to the Django app.
Even Simpler Alternative
An alternative to the above stub tasks is to send tasks by name:
>>> app.send_task('tasks.add', args=[2, 2], kwargs={})
<AsyncResult: 373550e8-b9a0-4666-bc61-ace01fa4f91d>
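Why sending by name keeps the two sides independent can be shown without Celery at all. The following is a plain-Python sketch, not Celery's wire format or API: build_message, register and handle_message are hypothetical names, and the transport in between (RabbitMQ) is left out. The producer only needs the task name as a string; only the consumer holds the implementation.

```python
import json
import uuid


def build_message(task_name, args=(), kwargs=None):
    """Producer side: serialize the task name and its parameters."""
    return json.dumps({
        "id": str(uuid.uuid4()),
        "task": task_name,
        "args": list(args),
        "kwargs": kwargs or {},
    })


# Consumer side: a registry that maps task names to real functions.
REGISTRY = {}


def register(name):
    """Decorator that makes a function reachable under the given task name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("tasks.add")
def add(x, y):
    return x + y


def handle_message(body):
    """Look up the task by name and call it with the deserialized parameters."""
    message = json.loads(body)
    func = REGISTRY[message["task"]]
    return func(*message["args"], **message["kwargs"])
```

If the name in the message has no entry in the registry, the lookup fails, which is the same reason the task names must match on both sides.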
On Message Patterns
Also, I have need to have Celery workers on two or more servers and I want RabbitMQ to notify only one of them depending on some field in message.
RabbitMQ offers several messaging patterns, and its tutorials are quite well written and to the point. What you want (one message processed by one worker) is trivially achieved with a simple queue/exchange setup, which (with Celery at least) is the default if you don't do anything else. If you need specific workers to attend to specific tasks or respond to specific messages, use Celery's task routing, which works hand-in-hand with RabbitMQ's concept of queues and exchanges.
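One way to route on a field in the message is a Celery router function. This is a sketch under assumptions: the target_server keyword argument, the server names and the worker.<name> queue naming are all hypothetical, and the setting name assumes the Django-style configuration loaded with namespace="CELERY".

```python
# Hypothetical server names; each one gets its own queue.
KNOWN_SERVERS = {"a", "b"}


def route_by_target(name, args, kwargs, options, task=None, **kw):
    """Pick a queue from the (hypothetical) 'target_server' keyword argument."""
    target = (kwargs or {}).get("target_server")
    if target in KNOWN_SERVERS:
        return {"queue": f"worker.{target}"}
    # Returning None lets Celery fall back to the next router or the default queue.
    return None


# settings.py fragment: routers are consulted in order.
CELERY_TASK_ROUTES = (route_by_target,)
```

Each server would then start its worker on its own queue only, for example `celery -A proj worker -Q worker.a` (proj being a placeholder for the project module), so a message is delivered to exactly one of them.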
Trade-Offs
I think that for my use case celery is total overhead. Simple RabbitMQ routing with custom clients will do the job. I already tried simple use case (one server) and it works perfectly.
Of course, you may use RabbitMQ out of the box, at the cost of having to deal with the lower-level API that RabbitMQ provides. Celery adds a task abstraction that makes it very straightforward to build any producer/consumer scenario, essentially using just plain Python functions or methods. Note that this is not a better/worse judgement of either RabbitMQ or Celery -- as always with engineering decisions, there is a trade-off involved:
If you use Celery, you probably lose some of the flexibility of the RabbitMQ API, but you gain development speed and lower deployment complexity -- it basically just works.
If you use RabbitMQ directly, you gain flexibility, but with this comes deployment complexity that you need to manage yourself.
Depending on the requirements of your project, either approach may be valid - your call, really.
Any sufficiently advanced technology is indistinguishable from magic ;-)
I do not like celery. It is "magical", hides too many details and it is not easy to configure.
I would choose to disagree. It may be "magical" in Arthur C. Clarke's sense, but it certainly is rather easy to configure if you compare it to a plain RabbitMQ setup. Of course if you're also the guy who does the RabbitMQ setup, it may just add a layer of abstraction that you don't really gain anything from. Maybe your developers will?
Related
I'm working on a long-running request to a Django app (nginx reverse proxy, MySQL db, Celery/RabbitMQ/Redis setup) and have some doubts about the solution I should apply:
How it works: one feature of the app allows users to migrate thousands of objects from one system to another. Each migration is logged to a db, and users can download the history of the migration in CSV format: which objects have been migrated and with which status (success, errors, ...).
To get the history, a get request is sent to a django view, which returns, after serialization and rendering into csv, the download response.
Problem : the serialisation and rendering processes, for a large set of objects (e.g. 160 000) are quite long and the request times out.
Some solutions I was thinking about or found through previous searches are:
Increasing the amount of time before timeout: easy, but I saw everywhere that this is a global nginx setting and would affect every request on the server.
Using an asynchronous task handled by celery : the concept would be to make an initial request to the server, which would launch the serializing and rendering task with celery, and give a special httpresponse to the client. Then the client would regularly ask the server if the job is done, and the server would deliver the history at the end of processing. I like this one but I'm not sure about how to technically implement that.
Creating and temporarily storing the csv file on the server, and give the user a way to access it & to download it. I'm not a big fan of that one.
So my question is: has anyone already faced a similar problem? Do you have advice on the technical implementation of solution #2, or a better solution to propose?
Thanks!
Clearly you should use Celery + RabbitMQ/Redis. If you look at the docs, it's not that hard to set up.
The first question is whether to use RabbitMQ or Redis. There are many SO questions about this with good information about pros/cons.
The implementation in Django is really simple. You can just wrap Django functions as Celery tasks (with the @task decorator) and they will run asynchronously, so this is the easy part.
The problem I see in your project is that the server handling HTTP traffic is the same server running the long process. That can affect performance and user experience even if Celery is running in the background. Of course, that depends on how much traffic you are expecting on that machine and how many migrations can run at the same time.
One of the things you set up in Celery is the number of workers (concurrent processing units) available, so the number of cores in your machine will matter.
If you need to handle HTTP calls quickly, I would suggest delegating the migration process to another machine. Celery/Redis can be configured that way. Let's say you've got 2 servers. One would handle only normal Django calls (no Celery workers) and trigger Celery tasks on the other server (the one that actually runs the migration process). Both servers can connect to the same database.
But this is just an infrastructure optimization and you may not need it.
I hope this answers your question. If you have specific Celery issues it would be better to create another question.
I'm trying to do a Django application with an asynchronous part: Websockets. Just as a little challenge, I want to mount everything in the same process. Tried Socket.IO but couldn't manage to actually use sockets, instead of longpolling (which killed my browser several times, until I gave up).
What I then tried was a not-so-maintained library based on gevent-websocket. However, had many errors and was not easy to debug.
Now I am trying a Tornado approach but AFAIK (please correct me if I'm wrong) integrating async with a regular django app wrapped by WSGIContainer (websockets would go through Tornado, regular connections through Django) will be a true server killer if a resource is heavy or, somehow, the Django ORM goes slow into heavy operations.
I was thinking of moving to Twisted/Cyclone. Before I move from one architecture with such an issue to ANOTHER architecture with such an issue, I'd like to ask:
Does Tornado (and/or Twisted) have an architecture for scheduling tasks in the same way Gevent does? (This means: when certain greenlets "block", they yield to other greenlets, at least until the operation finishes.) I'm asking this because (please correct me if I'm wrong) a regular Django view will not be suitable for stuff like @inlineCallbacks, and will cause the whole server to be blocked (incl. the websockets).
I'm new to async programming in Python, so there's a huge chance I am misinformed about more than one concept. Please help me clarify this before I switch.
Neither Tornado nor Twisted has anything like gevent's magic to run (some) blocking code with the performance characteristics of asynchronous code. Idiomatic use of either Tornado or Twisted will be visible throughout your app in the form of callbacks and/or Futures/Deferreds.
In general, since you'll need to run multiple python processes anyway due to the GIL, it's usually best to dedicate some processes to websockets with Tornado/Twisted and other processes to Django with the WSGI container of your choice (and then put nginx or haproxy in front so it looks like a single service to the outside world).
If you still want to combine django and an asynchronous service in the same process, the next best solution is to use threads. If you want the two to share one listening port, the listener must be a websocket-aware HTTP server that can spawn other threads for WSGI requests. Tornado does not yet have a solution for this, although one is planned for version 4.1 (https://github.com/tornadoweb/tornado/pull/1075). I believe Twisted's WSGI container does support running the WSGI workers in threads, but I don't have any experience with it myself. If you need them in the same process but do not need to share the same port, then you can simply run the IOLoop or Reactor in one thread and the WSGI container of your choice in another (with its associated worker threads).
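The last option (event loop in one thread, blocking WSGI code in another) can be sketched with the standard library. This is an assumption-laden illustration: asyncio stands in for the Tornado IOLoop or Twisted reactor, and notify_clients and blocking_view are hypothetical placeholders for the websocket side and a Django view.

```python
import asyncio
import threading


def start_loop_in_thread():
    """Run an event loop forever in a background thread and return the loop."""
    loop = asyncio.new_event_loop()

    def run():
        asyncio.set_event_loop(loop)
        loop.run_forever()

    threading.Thread(target=run, daemon=True).start()
    return loop


async def notify_clients(message):
    """Placeholder for pushing a message to connected websockets."""
    await asyncio.sleep(0)
    return f"sent: {message}"


def blocking_view(loop, message):
    """Called from a WSGI worker thread: hand work to the loop thread safely."""
    future = asyncio.run_coroutine_threadsafe(notify_clients(message), loop)
    return future.result(timeout=5)
```

The important point is that the blocking thread never calls into the loop directly; it only schedules work through a thread-safe entry point, so a slow ORM query blocks its own thread and not the websocket loop.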
I have never used celery before and I'm also a django newbie so I'm not sure if I should use celery in my project.
Brief description of my project:
There is an API for sending (via SSH) jobs to scientific computation clusters. The API is an abstraction to the different scientific job queue vendors out there. http://saga-project.github.io/saga-python/
My project is basically about doing a web GUI for this API with django.
So, my concern is that, if I use celery, I would have a queue in the local web server and another one in each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of the features aren't fully finished. There is a function for checking the state of the remote job execution (running, finished, etc.) but the callback support for state changes is not ready. Here is where I think celery might be appropriate. I would have one or several periodic task(s) monitoring the job states.
Any advice on how to proceed please? No celery at all? celery for everything? celery just for the job states?
I use celery for similar purpose and it works well. Basically I have one node running celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, process the results for reporting or generating dependent tasks.
Each cluster node is running a very small Python server which takes the db id of its assigned job. It then calls into the main (HTTP) server to request the data it needs and finally posts the data back when complete. In my case, the individual nodes don't need to message each other and the run time of each task is very long (hours). This makes the delays introduced by central management and polling insignificant.
It would be possible to run a celery worker on each node taking tasks directly from the message queue. That approach is appealing. However, I have complex dependencies that are easier to work out from a centralized control. Also, I sometimes need to segment the cluster and centralized control makes this possible to do on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for execution of tasks which are too expensive to be executed in the handler of HTTP request (i.e. Django view). Consider making an HTTP request from Django view to some remote web server and think about latencies, possible timeouts, time for data transfer, etc. It also makes sense to queue computation intensive tasks taking much time for background execution with Celery.
We can only guess what a web GUI for the API should do. However, Celery fits very well for queuing requests to scientific computation clusters. It also lets you track the state of background tasks and their results.
I do not understand your concern about having many queues on different servers. You can have Django, Celery broker (implementing queues for tasks) and worker processes (consuming queues and executing Celery tasks) all on the same server.
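The periodic state check mentioned in the question could be scheduled with Celery beat. A sketch under assumptions: the setting name assumes the Django-style configuration loaded with namespace="CELERY", the task path jobs.tasks.poll_job_states is hypothetical, and get_state stands in for whatever call the remote job API offers for reading a job's state.

```python
# settings.py fragment: run the (hypothetical) polling task every 60 seconds.
CELERY_BEAT_SCHEDULE = {
    "poll-remote-job-states": {
        "task": "jobs.tasks.poll_job_states",
        "schedule": 60.0,
    },
}

# Jobs in these states no longer need to be checked.
FINAL_STATES = {"finished", "failed"}


def poll_job_states(known_states, get_state):
    """Return {job_id: new_state} for every unfinished job whose state changed.

    known_states maps job ids to the last state stored locally;
    get_state(job_id) asks the remote cluster for the current state.
    """
    changes = {}
    for job_id, old_state in known_states.items():
        if old_state in FINAL_STATES:
            continue
        new_state = get_state(job_id)
        if new_state != old_state:
            changes[job_id] = new_state
    return changes
```

The body of the Celery task would load the known states from the database, call this function, and save the changes, which replaces the missing callback support with polling.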
My goal is to create an application that will be able to do long-lasting mainly system tasks, such as:
checking out code from the repositories,
copying directories between various locations,
etc.
The problem is I need to prepare it somehow independently from the web browser. I mean that for example after starting the checkout/copy action, closing the web browser will not interrupt the action. So after going back to that site I can see that the copying goes on or another action started when the browser was closed...
I have looked at various tools, like RabbitMQ + Celery, Twisted, Pyro and XML-RPC, but I don't know if any of these will be suitable for me. Has anyone encountered similar needs when creating a Django app? Please let me know if there are any methods/packages that I should know about. Code samples are also more than welcome!
Thank you in advance for your suggestions!
(And sorry for my bad English. I'm working on it.)
Basically you need to have a process that runs outside of the request. The absolute simplest way to do this (on a Unix-like operating system, at least) is to fork():
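import os
import sys

if os.fork() == 0:
    # Child process: do the work, then exit instead of returning to the request.
    do_long_thing()
    sys.exit(0)
# Parent process: … continue with request …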
This has some downsides, though (e.g., if the server crashes, the "long thing" will be lost), which is where something like Celery can come in handy. It will keep track of the jobs that need to be done and the results of jobs (success/failure/whatever), and make it easy to run the jobs on other machines.
Using Celery with a Redis backend (see Kombu's Redis transport) is very simple, so I would recommend looking there first.
You might need to have a process outside the request / response cycle. If that is the case, Celery with a Redis backend is what I would suggest looking into, as that integrates nicely with Django (as David Wolever suggested).
Another option is to create Django management commands, and then use cron to execute them at scheduled intervals.
I have a Django web application and I have some tasks that should operate (or actually: be initiated) on the background.
The application is deployed as follows:
apache2-mpm-worker;
mod_wsgi in daemon mode (1 process, 15 threads).
The background tasks have the following characteristics:
they need to operate in a regular interval (every 5 minutes or so);
they require the application context (i.e. the application packages need to be available in memory);
they do not need any input other than database access, in order to perform some not-so-heavy tasks such as sending out e-mail and updating the state of the database.
Now I was thinking that the simplest approach to this problem would be to piggyback on the existing application process (as spawned by mod_wsgi). By implementing the task as part of the application and providing an HTTP interface for it, I would avoid the overhead of another process that holds the whole application in memory. A simple cronjob can be set up that sends a request to this HTTP interface every 5 minutes, and that would be it. Since the application process provides 15 threads and the tasks are quite lightweight and only run every 5 minutes, I figure they would not hinder the performance of the web application's user-facing operations.
Yet... I have done some online research and I have seen nobody advocating this approach. Many articles suggest a significantly more complex approach based on a full-blown messaging component (such as Celery, which uses RabbitMQ). Although that's sexy, it sounds like overkill to me. Some articles suggest setting up a cronjob that executes a script which performs the tasks. But that doesn't feel very attractive either, as it results in creating a new process that loads the entire application into memory, performs some tiny task, and destroys the process again. And this is repeated every 5 minutes. Does not sound like an elegant solution.
So, I'm looking for some feedback on my suggested approach, as described two paragraphs above. Is my reasoning correct? Am I overlooking (potential) problems? What about my assumption that the application's performance will not be impeded?
All are reasonable approaches depending on your specific requirements.
Another is to fire up a background thread within the process when the WSGI script is loaded. This background thread could simply sleep and wake up occasionally to perform required work and then go back to sleep.
This method does require, though, that you have at most one Django process in which the background thread runs, to avoid different processes doing the same work on the database etc.
Using daemon mode with a single process, as you are, would satisfy that criterion. There are potentially other ways you could achieve that, though, even in a multiprocess configuration.
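The background-thread approach could look roughly like this. It is a minimal sketch, assuming the class is instantiated once when the WSGI script is loaded; PeriodicWorker and the work callable are hypothetical names.

```python
import logging
import threading


class PeriodicWorker:
    """Call `work` every `interval_seconds` in a daemon thread until stopped."""

    def __init__(self, interval_seconds, work):
        self._interval = interval_seconds
        self._work = work
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout (time to work) and True after stop().
        while not self._stop.wait(self._interval):
            try:
                self._work()
            except Exception:
                # Keep the thread alive if a single run fails.
                logging.exception("periodic work failed")
```

In the WSGI script this would be something like `PeriodicWorker(300, send_pending_emails).start()`, where send_pending_emails is a placeholder for the actual task.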
Note that celery works without RabbitMQ as well. It can use a ghetto queue (SQLite, MySQL, Postgres, etc, and Redis, MongoDB), which is useful in testing or for simple setups where RabbitMQ seems overkill.
See http://ask.github.com/celery/tutorials/otherqueues.html
(Using Celery with Redis/Database as the messaging queue.)