I'm developing celery tasks to aggregate social contents from facebook and twitter.
Tasks are as following
facebook_service_handler
facebook_contents_handler
image_resize
save_contents_info
'facebook_service_handler' and 'facebook_contents_handler' tasks use facebook open api with urlopen function.
It is working well when urlopen requests is not many. (under 4~5 times)
but when the urlopen request is over the 4~5, worker is not working anymore.
also,
When the celery is stopped, I break the redis and celeryd, and restart celeryd and redis.
last tasks are executed
any body help me about this problem??
I'm working it on the mac os lion.
Ideally, you should have two different queues, one for network I/O (using eventlet, with which you can "raise" more processes) and one for the other tasks (using multiprocessing). If you feel that this is complicated, take a look at CELERYD_TASK_SOFT_TIME_LIMIT. I had similar problems when using urllib.open within celery tasks, as the connection might hang and mess the whole system.
Related
I am sorry if its basics but I did not find any answers on the Internet comparing these two technologies. How should I decide when to use which as both can be used to schedule and process periodic tasks.
This is what an article says:
Django-celery :
Jobs are essential part of any application that does
some processing for you in the background. If your job is real time
Django application celery can be used.
Django-cronjobs :
django-cronjobs can be used to schedule periodic_task which is a
valid job. django-cronjobs is a simple Django app that runs registered
cron jobs via a management command.
can anyone explain me the difference between when should I choose which one and Why? Also I need to know why celery is used when the computing is distributed and why not cron jobs
The two things can be used for the same goal (background execution). However, if you are going to choose wisely, you should really understand that they are actually completely different things.
Here's what I wish someone had told me back when I was a noob (instead of the novice level that I have achieved today :)).
cron
The concept of a cron job is that we want a command / process to be executed on some schedule. Furthermore, we want that process to receive x,y,z parameters, run with a,b,c environment variables, and as user id 123.
Some cron systems may facilitate a few extra features, such as:
catching up on missed tasks (e.g. the server was off for a power outage all night and as soon as we turn it on, it runs the 8 instances of the command we normally run hourly).
might help you with the type of locking you normally do using a pid file in order to avoid parallel runs of the same command.
For the most part, cron systems are meant to be dumb: "just run this command at this time, thanks!".
Celery
The concept of Celery is much more sophisticated. It works with tasks, chains & chords of tasks, error handling, and (in most cases) collection of work result. It has a queue (or many queues) of work and a worker (or many). When a task (really just a message describing requested work) enters the queue it waits there until a worker is available to handle it. Much the same way as 1 or more employees at the DMV service a room full of waiting customers.
Furthermore, Celery can facilitate distributed work. That's a bit like (if I may torture the analogy a bit) - the difference between a DMV office where every worker shares the same phone, computer, copier, etc and a DMV where workers have dedicated resources and are never blocked by other workers.
Celery for web apps
In web applications, Celery is often used when a bit of web access results in a thing to be done that should be handled out of band of the conversation with the web browser. For example:
the web user just did something which should result in an email being sent. In order to send an email, your web server will need to contact a mail server. This could take time, the server could be busy, etc - we cant make the web user just wait, seeing nothing on their browser while we do this. Well, you can but it won't work reliably. So, we do that email send as a bit of work in the queue. That way, it can happen "whenever" and the web server can get back to communicating with the browser.
the user just submitted a credit card as payment. You're going to need to contact the card processor, but that might take several seconds. You might even have to contact them multiple times (e.g. they are really busy there right now). Again, you don't want your user's web browser to just sit blankly and you don't want a web server process or thread of execution tied up. Instead, you use Celery to create a job, you tell the browser to check back in a few seconds (or use a "web socket"), and your web server moves on and talks to other web users. When the browser checks back later, you lookup the task id and find out from celery whether it is finished and what the outcome was (card declined, etc).
Using Celery as cron
When you use Celery as a "cron system" all you are really doing is saying: "hey, can someone please generate work of X type on Y schedule". A process is created that runs continuously which sleeps most of the time and wakes up occasionally to inject a bit of work into the queue on the schedule you requested.
Usually the "hey someone" that you ask to do that for you is: celery beat and beat gets the schedule you want from somewhere in the database or from your settings file.
I searched for celery vs cron and found a few results that might be helpful to you.
https://www.reddit.com/r/Python/comments/m2dg8/explain_like_im_five_why_or_why_not_would_celery/
Why would running scheduled tasks with Celery be preferable over crontab?
Distributed task queues (Ex. Celery) vs crontab scripts
I'm starting to get familiar with the RabbitMQ lingo so I'll try my best to explain. I'll be going into a public beta test in a few weeks and this is the set up I am hoping to achieve. I would like Django to be the producer; producing messages to a remote RabbitMQ box and another Celery box listening on the RabbitMQ queue for tasks. So in total there would be three boxes. Django, RabbitMQ & Celery. So far, from the Celery docs, I have successfully been able to run Django and Celery together and Rabbit MQ on another machine. Django simply calls the task in the view:
add.delay(3, 3)
And the message is sent over to RabbitMQ. RabbitMQ sends it back to the same machine that the task was sent from (since Django and celery share the same box) and celery processes the task.
This is great for development purposes. However, having Django and Celery running on the same box isn't a great idea since both will have to compete for memory and CPU. The whole goal here is to get clients in and out of the HTTP Request cycle and have celery workers process the tasks. But the machine will slow down considerably if it is accepting HTTP requests and also processing tasks.
So I was wondering is there was a way to make this all separate from one another. Have Django send the tasks, RabbitMQ forward them, and Celery process them (Producer, Broker, Consumer).
How can I go about doing this? Really simple examples would help!
What you need is to deploy the code of your application on the third machine and execute there only the command that handles the tasks. You need to have the code on that machine also.
I have never used celery before and I'm also a django newbie so I'm not sure if I should use celery in my project.
Brief description of my project:
There is an API for sending (via SSH) jobs to scientific computation clusters. The API is an abstraction to the different scientific job queue vendors out there. http://saga-project.github.io/saga-python/
My project is basically about doing a web GUI for this API with django.
So, my concern is that, if I use celery, I would have a queue in the local web server and another one in each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of the features aren't fully finished. There is a function for checking the state of the remote job execution (running, finished, etc.) but the callback support for state changes is not ready. Here is where I think celery might be appropriate. I would have one or several periodic task(s) monitoring the job states.
Any advice on how to proceed please? No celery at all? celery for everything? celery just for the job states?
I use celery for similar purpose and it works well. Basically I have one node running celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, process the results for reporting or generating dependent tasks.
Each cluster node is running a very small python server which takes a db id of it's assigned job. It then calls into the main (http) server to request the data it needs and finally posts the data back when complete. In my case, the individual nodes don't need to message each other and run time of each task is very long (hours). This makes the delays introduced by central management and polling insignificant.
It would be possible to run a celery worker on each node taking tasks directly from the message queue. That approach is appealing. However, I have complex dependencies that are easier to work out from a centralized control. Also, I sometimes need to segment the cluster and centralized control makes this possible to do on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for execution of tasks which are too expensive to be executed in the handler of HTTP request (i.e. Django view). Consider making an HTTP request from Django view to some remote web server and think about latencies, possible timeouts, time for data transfer, etc. It also makes sense to queue computation intensive tasks taking much time for background execution with Celery.
We can only guess what web GUI for API should do. However Celery fits very well for queuing requests to scientific computation clusters. It also allows to track the state of background task and their results.
I do not understand your concern about having many queues on different servers. You can have Django, Celery broker (implementing queues for tasks) and worker processes (consuming queues and executing Celery tasks) all on the same server.
I have a django app I want to migrate to dotcloud.
Many actions in Django internals and in my app are not asynchronous, i.e. they block the thread until they finish.
When I was using Apache, that didn't pose a problem since a different thread is opened on every request. But it doesn't seem to be the case in nginx/uwsgi that dotcloud use.
Seemingly, uwsgi has a --enable-threads and --threads options that can be used for multithreading, but:
It is not clear what version of uwsgi dotcloud use, and if they support these features
Since I have no one else asking about this, I was wondering if this is really the right way to get the concurrent requests running (using threads)
You could run Django with Gunicorn. Gunicorn, in turn, supports multiple worker classes, and people reported success running gunicorn+gevents+django together[1][2].
To use that on dotCloud, you will probably have to use dotCloud's custom service. If that's something that you want to try, I would personally start with dotCloud's reimplementation of python service using the custom service, and replace uwsgi with gunicorn in it.
I came here looking for some leads, which I found, thanks!
There was a fair amount of leg work left to actually get stuff working, though.
Here is an example app on github that uses gunicorn, gevent, and socketio on dotcloud:
https://github.com/t1m0thy/django-tictactoe/tree/dotcloud
Threads is a problem in python - GIL doesn't allow them to run simultaneously.
So multiprocessing is an answer.
Or you may take a look at gevent. Actually gevent is a kind of a hack (monkey patching of python stack) and so on, but it allows to launch green threads.
I'm not sure if gevent can be combined with django, but google knows ;)
I'm currently working on a project and I'd like to integrate asynchronous task processing as well as some sort of message queue early on so that I'll be able to scale up quickly by simply adding message queue processor servers to the cluster.
I came across Celery a while back and it caught my eye. Since it's pretty well integrated with Django, I figured I'd get pretty good support with it. I'm just not really sure how to start, as there's a lot of configuration involved.
For now, I'm running just about everything out of my Django project (serving static files, pipeline, etc.) so I'd like to have a messaging queue built in to run with django runserver if possible. (Don't worry, this is only for development.) How can I get started using Celery with my existing Django project?
djkombu is now deprecated, the django transport is now directly integrated in the kombu package.
For defining the backend in your Django settings.py, you can use:
BROKER_BACKEND = "django"
You can find different transport aliases from Kombu here.
This was tested with django-celery 2.5.5, celery 2.5.3 and kombu 2.1.8.
Celery has quite a nice documentation, also for those getting started, but two facts worth being mentioned for beginners:
Use djkombu as the BROKER_BACKEND. This will give you a pretty simple message queue for development, where all messages are stored in the SQL database used by django. Due to celery's api you can easily replace it with a "real" message queue for production:
BROKER_TRANSPORT = "kombu.transport.django"
Django-celery has a setting CELERY_ALWAYS_EAGER. If set to True there will be no asynchronous background processing, all tasks that are getting called via celery will be run synchronously (so no need to start any additional celery workers - very useful for debugging as well).