I have a setup with multiple EC2 instances, and I have Celery workers running on all of them. I also have one box with celerybeat running on it. I'm able to get celerybeat to dispatch tasks that run on the remaining Celery workers.
Is there a way to define a required task that every Celery instance has to run? The use case would be clearing logs, running basic sanity checks on the boxes, and so on.
I have read the below:
http://docs.celeryproject.org/en/latest/userguide/workers.html
Celery actually works as a pull-based system rather than a push-based one: you put a task into the queue, and one of the available workers picks it up and executes it.
From your question I assume that by "instances" you mean the servers on which the Celery workers are running, and that you would like to run the same task on all of those servers.
One option is to enqueue as many tasks as you have servers and specify an exact routing key for each one (e.g. the worker's id):
mytask.apply_async(kwargs={'a': 1, 'b': 2}, routing_key='aaabbc-dddeeff-243453')
mytask.apply_async(kwargs={'a': 1, 'b': 2}, routing_key='bbbbbb-fffddd-dabcfe')
...
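A minimal sketch of the same per-server idea using dedicated queues rather than raw routing keys (the queue names server1/server2, the proj app, and mytask are placeholders):
from kombu import Queue

CELERY_QUEUES = (
    Queue('server1'),
    Queue('server2'),
)

# Each box starts a worker that consumes only from its own queue, e.g.:
#   celery -A proj worker -Q server1
#   celery -A proj worker -Q server2

# The producer then enqueues one copy of the task per server:
mytask.apply_async(kwargs={'a': 1, 'b': 2}, queue='server1')
mytask.apply_async(kwargs={'a': 1, 'b': 2}, queue='server2')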
Broadcast
Celery can also support broadcast routing. Here is an
example exchange broadcast_tasks that delivers copies of tasks to all
workers connected to it:
from kombu.common import Broadcast
CELERY_QUEUES = (Broadcast('broadcast_tasks'), )
CELERY_ROUTES = {'tasks.reload_cache': {'queue': 'broadcast_tasks'}}
Now the tasks.reload_cache task will be sent to every worker consuming
from this queue.
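For completeness, a hedged sketch of how this could be wired up end to end (the proj app name, the import path, and the maintenance logic are placeholders; the Broadcast config above is assumed):
# tasks.py (sketch)
from proj.celery import app  # your Celery app instance (placeholder import)

@app.task
def reload_cache():
    # clear logs, run basic sanity checks, etc. on this box
    pass

# Every box starts a worker that also consumes the broadcast queue, e.g.:
#   celery -A proj worker -Q broadcast_tasks,celery

# Dispatching once from anywhere sends a copy to every connected worker:
# reload_cache.delay()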
Related
We have a web app made with Pyramid and served through Gunicorn + nginx. It works with 8 worker threads/processes.
We needed to run scheduled jobs, and we chose APScheduler. Here is how we launch it:
from apscheduler.events import EVENT_JOB_EXECUTED, EVENT_JOB_ERROR
from apscheduler.scheduler import Scheduler
rerun_monitor = Scheduler()
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run, seconds=JOB_INTERVAL)
The issue is that all of Gunicorn's worker processes pick the scheduler up. We tried implementing a file lock, but it does not seem like a good enough solution. What would be the best way to make sure that, at any given time, only one of the worker processes picks the scheduled event up, and no other process picks it up until the next JOB_INTERVAL?
The solution needs to work even with mod_wsgi, in case we decide to switch to Apache2 + mod_wsgi later. It also needs to work with the single-process development server, which is Waitress.
Update from the bounty sponsor
I'm facing the same issue described by the OP, just with a Django app. I'm fairly sure adding this detail won't change much about the original question. For this reason, and to gain a bit more visibility, I have also tagged this question with django.
Because Gunicorn starts with 8 workers (in your example), it forks the app into 8 processes. These 8 processes are forked from the master process, which monitors each of their statuses and has the ability to add/remove workers.
Each process gets its own copy of your APScheduler object, which initially is an exact copy of the master process's APScheduler. The result is that with n workers, each job ends up being executed n times, once per process.
A hack around this is to run gunicorn with the following options:
env/bin/gunicorn module_containing_app:app -b 0.0.0.0:8080 --workers 3 --preload
The --preload flag tells Gunicorn to "load the app before forking the worker processes". By doing so, each worker is "given a copy of the app, already instantiated by the Master, rather than instantiating the app itself". This means the following code only executes once in the Master process:
rerun_monitor = Scheduler()
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run, seconds=JOB_INTERVAL)
Additionally, we need to set the jobstore to be anything other than :memory:. This way, although each worker is its own independent process, unable to communicate with the other 7, using a local database (rather than memory) guarantees a single point of truth for CRUD operations on the jobstore.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

rerun_monitor = BackgroundScheduler(
    jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
rerun_monitor.start()
rerun_monitor.add_job(job_to_be_run, 'interval', seconds=JOB_INTERVAL)
Lastly, we want to use the BackgroundScheduler because of its implementation of start(). When we call start() on the BackgroundScheduler, a new thread is spun up in the background, and that thread is responsible for scheduling and executing jobs. This is significant because, remember, in step (1) the --preload flag means we only execute start() once, in the master Gunicorn process. By definition, forked processes do not inherit the threads of their parent, so none of the workers runs the BackgroundScheduler thread.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

rerun_monitor = BackgroundScheduler(
    jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
rerun_monitor.start()
rerun_monitor.add_job(job_to_be_run, 'interval', seconds=JOB_INTERVAL)
As a result of all of this, every Gunicorn worker has an APScheduler that has been tricked into a "STARTED" state, but that isn't actually running, because it drops the threads of its parent. Each instance is also capable of updating the jobstore database, just not executing any jobs.
Check out Flask-APScheduler for a quick way to run APScheduler in a web server (like Gunicorn) and enable CRUD operations for each job.
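For reference, a minimal hedged sketch of wiring Flask-APScheduler into a Flask app (the job function and interval are placeholders):
from flask import Flask
from flask_apscheduler import APScheduler

app = Flask(__name__)

scheduler = APScheduler()
scheduler.init_app(app)
scheduler.start()

# Jobs can then be registered against the wrapped scheduler, e.g.:
# scheduler.add_job(id='cleanup', func=job_to_be_run,
#                   trigger='interval', seconds=JOB_INTERVAL)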
I found a fix that worked for a Django project having a very similar issue. I simply bind a TCP socket the first time the scheduler starts and check against it subsequently. I think the following code can work for you as well, with minor tweaks.
import socket

try:
    # Try to claim a well-known local port; only the first process succeeds.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 47200))
except socket.error:
    print("!!!scheduler already started, DO NOTHING")
else:
    from apscheduler.schedulers.background import BackgroundScheduler
    scheduler = BackgroundScheduler()
    scheduler.start()
    print("scheduler started")
Short answer: You can't do it properly without consequences.
I'm using Gunicorn as an example, but it is essentially the same for uWSGI. There are various hacks for running it with multiple worker processes, to name a few:
use --preload option
use on_starting hook to start the APScheduler background scheduler
use when_ready hook to start the APScheduler background scheduler
They work to some extent, but you may run into the following problems:
workers timing out frequently
the scheduler hanging when there are no jobs (https://github.com/agronholm/apscheduler/issues/305)
APScheduler is designed to run in a single process, where it has complete control over adding jobs to job stores. It uses threading.Event's wait() and set() methods to coordinate; if these are run by different processes, the coordination doesn't work.
It is possible to run it under Gunicorn in a single process:
use only one worker process
use the post_worker_init hook to start the scheduler; this makes sure the scheduler runs only in the worker process, not the master process (see the sketch below)
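A minimal sketch of such a gunicorn.conf.py, assuming the app is served as myapp:app and that job_to_be_run and JOB_INTERVAL come from your own code (all names are placeholders):
# gunicorn.conf.py (sketch) -- run with: gunicorn -c gunicorn.conf.py myapp:app
workers = 1  # a single worker process, so only one scheduler instance exists

def post_worker_init(worker):
    # Runs inside the worker process after the fork, so the scheduler
    # thread lives in the worker rather than in the master process.
    from apscheduler.schedulers.background import BackgroundScheduler
    from myapp.jobs import job_to_be_run, JOB_INTERVAL  # hypothetical module

    scheduler = BackgroundScheduler()
    scheduler.add_job(job_to_be_run, 'interval', seconds=JOB_INTERVAL)
    scheduler.start()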
The author also pointed out that sharing the job store among multiple processes isn't possible: https://apscheduler.readthedocs.io/en/stable/faq.html#how-do-i-share-a-single-job-store-among-one-or-more-worker-processes He also provided a solution using RPyC.
While it's entirely doable to wrap APScheduler with a REST interface, you might want to consider serving it as a standalone app with one worker. In other words, if you have other endpoints, put them in another app where you can use multiple workers.
I'm having a lot of problems executing certain tasks with celery beat. Some tasks, like the one below, get triggered by beat, but the message is never received by RabbitMQ.
In my Django settings file I have the following periodic task:
CELERYBEAT_SCHEDULE = {
    ...
    'update_locations': {
        'task': 'cron.tasks.update_locations',
        'schedule': crontab(hour='10', minute='0')
    },
    ...
}
At 10:00 UTC, beat sends the task as expected:
[2015-05-13 10:00:00,046: DEBUG/MainProcess] cron.tasks.update_locations sent. id->a1c53d0e-96ca-4673-9d03-972888581176
but the message never arrives at RabbitMQ (I'm using the tracing module in RabbitMQ to track incoming messages). I have several other tasks which seem to run fine, but certain tasks like the one above never run. Running the task manually in Django with cron.tasks.update_locations.delay() works with no problem. Note that my RabbitMQ is on a different server than beat.
Is there anything I can do to ensure the message was actually sent and/or received by RabbitMQ? Is there a better or different way to schedule these tasks to ensure they run?
This is a bit hard to answer from such a minimal description.
Why is this in the Django settings file? I would have expected the Celery config settings to have their own config object.
Look at http://celery.readthedocs.org/en/latest/reference/celery.html#celery.Celery.config_from_object
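For example, a rough sketch of keeping the Celery configuration in its own module (module and app names are illustrative):
# proj/celery.py (sketch)
from celery import Celery

app = Celery('proj')
# Pull all Celery settings (broker URL, CELERYBEAT_SCHEDULE, ...) from a
# dedicated config module instead of mixing them into the Django settings.
app.config_from_object('proj.celeryconfig')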
I am a beginner with Django, and I have Celery installed.
I am confused about how Celery works: is queued work handled synchronously or asynchronously? Can other work be queued while the queued work is already being processed?
Celery is a task queuing system backed by a message queuing system. Celery allows you to invoke tasks asynchronously, in a way that doesn't block your process while the task finishes; if you do need the result, you can wait for it with AsyncResult.get().
Other tasks can be queued while a task is being processed, and if Celery is running more than one process/thread (which is the default case), tasks will be executed in parallel with each other.
It is your responsibility to make sure that related tasks are executed in the correct order; e.g. if the output of task A is an input to task B, then you should make sure that you get the result from task A before you start task B.
Read Avoid launching synchronous subtasks in the Celery documentation.
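For illustration, a minimal sketch with a hypothetical add task and a Redis broker:
from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task
def add(x, y):
    return x + y

# Queuing is non-blocking: delay() returns an AsyncResult immediately.
result = add.delay(2, 3)

# Block only if and when you actually need the value.
print(result.get(timeout=10))  # -> 5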
I think you're possibly a bit confused about what Celery does.
Celery isn't really responsible for queueing at all. That is taken care of by the queue itself - RabbitMQ, Redis, or whatever. The only way Celery gets involved at this end is as a library that you call inside your app to serialize the task into something suitable for putting onto the queue. Since that is done by your web app, it is exactly as synchronous or asynchronous as your app itself: usually, in production, you'd have multiple processes running your site, and each of those could put things onto the queue simultaneously, but each queueing action is done in-process.
The main point of Celery is the separate worker processes. This is where the asynchronous bit comes from: the workers run completely separately from your web app, and pick tasks off the queue as necessary. They are not at all involved in the process of putting tasks onto the queue in the first place.
I'm trying to do something in Celery that should be fairly simple, but I can't see an obvious configuration for it.
I've got a master host and a number of slave hosts. The master host runs a Django application that sometimes has to instruct the slaves to do things asynchronously. The task needs to be carried out by all the slaves, and there is no return value.
Celery appears to be an obvious choice. My knowledge of RabbitMQ tells me that I should have a scenario where a single fanout exchange exists on RabbitMQ, each Celery worker creates an exclusive queue and binds it to this exchange, and then every task request published by the master is queued to each worker and executed by every slave.
However, looking through the Celery docs, they seem geared towards the scenario where each worker carrying out the same task binds to the same queue. That won't work with a fanout exchange type, as fanout simply delivers one copy of the message per connected queue - so with a single shared queue only one worker would get it.
If I was using pika and pure python, I'd simply call something like channel.queue_declare(exclusive=True) and then bind that to my exchange, ensuring that each client has its own queue and gets a copy of the message.
How do you do this in celery?
This is short, as I'm on my phone. See the Celery docs -> User Guide -> Routing. You want the Broadcast entity in Kombu, which does what you describe; there's a section about it in the routing guide.
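A rough sketch of that approach (task and queue names are placeholders): Kombu's Broadcast queue declares a fanout exchange and gives each consuming worker its own auto-generated queue bound to it, which matches the channel.queue_declare(exclusive=True) pattern you would build by hand with pika.
from kombu.common import Broadcast

# Kombu declares a fanout exchange named 'all_slaves' and binds a uniquely
# named queue to it for every worker that consumes it.
CELERY_QUEUES = (Broadcast('all_slaves'),)
CELERY_ROUTES = {'tasks.run_maintenance': {'queue': 'all_slaves'}}

# Each slave starts a worker consuming the broadcast queue, e.g.:
#   celery -A proj worker -Q all_slaves
# The master publishes once, and every slave gets its own copy:
#   run_maintenance.delay()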
I've implemented a small test which uses Celery for message queueing, and I just want to make sure I understand how it works on a basic level (django-celery, using Redis as a broker).
My understanding is that when I place a call to start an asynchronous task, the task information is placed in Redis, and then a celeryd instance connected to the broker consumes and executes the task. Is this essentially what is happening?
If I set up a periodic task that's supposed to execute once every hour, does that task get executed on all task consumers? If so, is there a way to limit it so that only one consumer will ever execute a periodic task?
The workers will consume as many messages as the broker contains. If you have 8 workers but only 1 message, one of the 8 workers will consume that message and execute the task.
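As a hedged sketch of how that plays out for a periodic task (the task name and schedule are placeholders): a single celerybeat process publishes one message per interval, so exactly one worker consumes and runs it.
from celery.schedules import crontab

# Settings sketch: beat enqueues one message per hour; whichever worker
# picks it up first executes it, the others never see it.
CELERYBEAT_SCHEDULE = {
    'hourly-cleanup': {
        'task': 'tasks.cleanup',
        'schedule': crontab(minute=0),
    },
}
# Run exactly one celerybeat process; each additional beat instance would
# enqueue its own copy of the message.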