I'm starting to get familiar with the RabbitMQ lingo so I'll try my best to explain. I'll be going into a public beta test in a few weeks and this is the setup I am hoping to achieve. I would like Django to be the producer, sending messages to a remote RabbitMQ box, with a separate Celery box listening on the RabbitMQ queue for tasks. So in total there would be three boxes: Django, RabbitMQ and Celery. So far, following the Celery docs, I have successfully run Django and Celery together on one machine and RabbitMQ on another. Django simply calls the task in the view:
add.delay(3, 3)
And the message is sent over to RabbitMQ. RabbitMQ sends it back to the same machine that the task was sent from (since Django and celery share the same box) and celery processes the task.
This is great for development purposes. However, having Django and Celery running on the same box isn't a great idea since both will have to compete for memory and CPU. The whole goal here is to get clients in and out of the HTTP Request cycle and have celery workers process the tasks. But the machine will slow down considerably if it is accepting HTTP requests and also processing tasks.
So I was wondering if there is a way to separate all of these from one another: have Django send the tasks, RabbitMQ forward them, and Celery process them (Producer, Broker, Consumer).
How can I go about doing this? Really simple examples would help!
You need to deploy your application's code on the third machine as well, and run only the command that starts the task workers there. The worker needs the same code because it has to import the task functions it executes.
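To see why the worker box needs a copy of the code, it helps to remember that a task message carries only the task's name and arguments, never the function body. The following is a toy model of that name-based dispatch using only the Python standard library. It is not Celery's API: `broker`, `TASKS`, `task`, `send_task` and `work_one` are hypothetical names, and an in-process queue stands in for the RabbitMQ box.

```python
import json
import queue

# Stand-in for the RabbitMQ box (assumption: a real broker sits on another host).
broker = queue.Queue()

# --- code that must be deployed on BOTH the Django box and the worker box ---
TASKS = {}


def task(func):
    """Register a function under its name so a worker can look it up later."""
    TASKS[func.__name__] = func
    return func


@task
def add(x, y):
    return x + y


# --- runs on the Django box (producer) ---
def send_task(name, *args):
    """Publish only the task name and its arguments, never the function body."""
    broker.put(json.dumps({"task": name, "args": list(args)}))


# --- runs on the worker box (consumer) ---
def work_one():
    """Take one message and run the matching function from the local registry."""
    message = json.loads(broker.get())
    func = TASKS[message["task"]]  # KeyError if the task code is not deployed here
    return func(*message["args"])


send_task("add", 3, 3)
print(work_one())  # 6
```

With Celery itself, the same split would mean pointing both boxes at the broker on the RabbitMQ machine and starting only the worker on the third machine, for example with `celery -A proj worker --loglevel=info`, where `proj` is a placeholder for your own project module.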
Related
I was reading the documentation of django-channels for the first time today and found the following line: "Channels will take care of scheduling them and running them all in parallel." Does this mean it handles Celery's tasks as well? I used to get confused by celery, rabbitmq and mqtt. I thought my understanding of them was clear:
celery - background job, task scheduling
rabbitmq - message broker, sends message to worker
mqtt - another message queuing protocol
In my understanding, celery does both background job tasks and the rabbitmq or mqtt tasks.
So my question is: when using django-channels, do we still need the stack listed above (celery, rabbitmq)? If so, why is it needed? I looked over several articles but could not get a clear picture. I feel their use cases are somewhat similar. Can anyone clear up my confusion with real-life examples?
I have never used celery before and I'm also a django newbie so I'm not sure if I should use celery in my project.
Brief description of my project:
There is an API for sending (via SSH) jobs to scientific computation clusters. The API is an abstraction to the different scientific job queue vendors out there. http://saga-project.github.io/saga-python/
My project is basically about doing a web GUI for this API with django.
So, my concern is that, if I use celery, I would have a queue in the local web server and another one in each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of the features aren't fully finished. There is a function for checking the state of the remote job execution (running, finished, etc.) but the callback support for state changes is not ready. Here is where I think celery might be appropriate. I would have one or several periodic task(s) monitoring the job states.
Any advice on how to proceed please? No celery at all? celery for everything? celery just for the job states?
I use celery for similar purpose and it works well. Basically I have one node running celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, process the results for reporting or generating dependent tasks.
Each cluster node is running a very small python server which takes a db id of its assigned job. It then calls into the main (http) server to request the data it needs and finally posts the data back when complete. In my case, the individual nodes don't need to message each other and the run time of each task is very long (hours). This makes the delays introduced by central management and polling insignificant.
It would be possible to run a celery worker on each node taking tasks directly from the message queue. That approach is appealing. However, I have complex dependencies that are easier to work out from a centralized control. Also, I sometimes need to segment the cluster and centralized control makes this possible to do on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for executing tasks that are too expensive to run in the handler of an HTTP request (i.e. a Django view). Consider making an HTTP request from a Django view to some remote web server and think about latencies, possible timeouts, time for data transfer, etc. It also makes sense to queue computation-intensive, long-running tasks for background execution with Celery.
We can only guess what a web GUI for the API should do. However, Celery is a very good fit for queuing requests to scientific computation clusters. It also lets you track the state of background tasks and their results.
I do not understand your concern about having many queues on different servers. You can have Django, the Celery broker (implementing queues for tasks) and the worker processes (consuming queues and executing Celery tasks) all on the same server.
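The periodic state check mentioned in the question can be sketched independently of any queueing library. This is a minimal sketch using only the standard library: `poll_once`, `TERMINAL_STATES`, the state names and the `get_state` callable are all hypothetical stand-ins for whatever the SAGA API actually exposes. It emulates state-change callbacks by comparing each job's current state with the last one seen.

```python
# Hypothetical state names; the real API may use different ones.
TERMINAL_STATES = {"finished", "failed"}


def poll_once(job_ids, get_state, last_seen, on_change):
    """Check each job once, report state changes, return jobs still worth polling."""
    still_active = []
    for job_id in job_ids:
        state = get_state(job_id)
        previous = last_seen.get(job_id)
        if state != previous:
            on_change(job_id, previous, state)
            last_seen[job_id] = state
        if state not in TERMINAL_STATES:
            still_active.append(job_id)
    return still_active


# Fake cluster used only for this demonstration.
fake_cluster = {"job-1": "running", "job-2": "finished"}
seen = {}
events = []

active = poll_once(
    ["job-1", "job-2"],
    fake_cluster.get,
    seen,
    lambda job, old, new: events.append((job, old, new)),
)
print(active)  # ['job-1']
```

A function like this could be called from a periodic Celery task (scheduled with Celery beat), which would keep Celery limited to job-state monitoring, one of the options raised in the question.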
I'm having a really hard time conceptualising how I can connect to the Twitter streaming API and process tweets via an admin interface provided by Django.
The main problem is starting a daemon from Django and having the ability to stop/start it, plus making sure there is provision for monitoring. I don't really want to use upstart for this purpose because I want to try and keep the project as self contained as possible.
I'm currently attempting the following and am unsure if it's perhaps the wrong way to go about things:
Start a celery task from Django which establishes a persistent connection to the streaming API
The above task creates subtasks which will process tweets and store them
Because celeryd runs as a daemon it will automatically run the first task again if the connection breaks and the task fails - does this mean I don't need any additional monitoring?
Does the above make sense or have I misunderstood how celery works?
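The plan above can be sketched with the reconnect logic inside the long-running task itself. As far as I know, a Celery worker does not re-run a task that raised an exception unless the task explicitly asks for a retry, so the sketch below does not rely on that. It uses only the standard library; `consume_stream`, `connect` and `handle_item` are hypothetical names, with `connect` standing in for opening the streaming API connection and `handle_item` for dispatching a subtask per tweet.

```python
import time


def consume_stream(connect, handle_item, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Read items from a stream, reconnecting with exponential backoff on failure."""
    attempt = 0
    while True:
        try:
            for item in connect():
                handle_item(item)
                attempt = 0  # a received item shows the connection is healthy again
            return  # the stream ended without an error
        except ConnectionError:
            attempt += 1
            if attempt > max_retries:
                raise  # give up and let the caller or a monitor decide what to do
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside a Celery task, `handle_item` could be something like `process_tweet.delay` (a hypothetical subtask). Some external monitoring is still useful for the case where `consume_stream` finally gives up.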
I have everything I want to do with django-celery working on my development machine locally. I run Django, djcelery, Celery and the broker (Amazon SQS). It sends tasks and just works.
I can set this all up like I have done locally (i.e. all on one machine), but what happens when I want to distribute tasks to another machine, share tasks, etc.? Is this a copy of the current machine (with Django, djcelery and celery), all connecting to the same SQS? How does this work? If they all connect to the same broker, do they just 'know'? Or does it not work like this?
Is it ok to start off with everything on one machine like I did in development (I will daemonize celery in production)?
Amazon SQS (Simple Queue Service) is a queueing service: jobs go in, wait to be run, and are removed from the queue once complete. Celery simply reads from this queue.
Celery can scale both horizontally and vertically. Do you need Celery to process more jobs faster? Give your machine more resources and increase the worker count (vertical scaling), or boot more, smaller machines (horizontal scaling). Either way, your Celery workers all consume the same queue on SQS. How the rest of your infrastructure is affected depends on what your Celery jobs do. If they write to a DB, more workers mean a higher load on your DB, so you would need to look at scaling that too.
It is OK to start off with the "all" on one machine approach. As the demand on your app grows you can start looking at moving celery workers off to more machines or give your all in one server more resources.
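The idea that every worker consumes the same queue, with each job going to exactly one of them, can be illustrated with a toy model. This sketch uses threads and an in-process queue in place of separate machines and SQS; `run_workers` and the worker names are hypothetical.

```python
import queue
import threading


def run_workers(jobs, worker_count):
    """Let several workers drain one shared queue; return (worker, job) pairs."""
    shared_queue = queue.Queue()
    for job in jobs:
        shared_queue.put(job)

    processed = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = shared_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                processed.append((name, job))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


results = run_workers(range(20), worker_count=4)
print(len(results))  # 20
```

Which worker picks up which job varies from run to run, but no job is handled twice. Adding workers, whether on the same machine or on new ones, only adds more consumers to the same queue.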
Does this help? :)
I'm developing celery tasks to aggregate social contents from facebook and twitter.
The tasks are as follows:
facebook_service_handler
facebook_contents_handler
image_resize
save_contents_info
'facebook_service_handler' and 'facebook_contents_handler' call the Facebook Open API using the urlopen function.
This works well when there are only a few urlopen requests (fewer than 4 or 5),
but once there are more than 4 or 5 urlopen requests, the worker stops working.
Also, when Celery stops, I kill redis and celeryd and then restart them both, and the last tasks are then executed.
Can anybody help me with this problem?
I'm working on Mac OS X Lion.
Ideally, you should have two different queues, one for network I/O (using eventlet, with which you can "raise" more processes) and one for the other tasks (using multiprocessing). If you feel that this is complicated, take a look at CELERYD_TASK_SOFT_TIME_LIMIT. I had similar problems when using urlopen within celery tasks, as the connection might hang and mess up the whole system.
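Besides the soft time limit, a hanging connection can also be bounded at the source by passing a timeout to urlopen. This is a complementary technique rather than what the answer above prescribes. The sketch below is written for Python 3 (`urllib.request`); `fetch_with_timeout` is a hypothetical helper, and its `opener` parameter exists only so the function can be exercised without network access.

```python
from urllib.request import urlopen


def fetch_with_timeout(url, timeout=10.0, opener=urlopen):
    """Return the response body, or None if the request fails or times out."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return None
```

A task that gets None back can then decide whether to retry or skip the item, instead of blocking its worker indefinitely.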