How to restart Celery gracefully without delaying tasks (Django)

We use Celery with our Django webapp to manage offline tasks; some of these tasks can run up to 120 seconds.
Whenever we make any code modifications, we need to restart Celery to have it reload the new Python code. Our current solution is to send a SIGTERM to the main Celery process (kill -s 15 `cat /var/run/celeryd.pid`), then to wait for it to die and restart it (python manage.py celeryd --pidfile=/var/run/celeryd.pid [...]).
Because of the long-running tasks, this usually means the shutdown will take a minute or two, during which no new tasks are processed, causing a noticeable delay to users currently on the site. I'm looking for a way to tell Celery to shut down, but then immediately launch a new Celery instance to start running new tasks.
Things that didn't work:
Sending SIGHUP to the main process: this caused Celery to attempt to "restart" by doing a warm shutdown and then relaunching itself. Not only does this take a long time, it doesn't even work, because apparently the new process launches before the old one dies, so the new one complains ERROR: Pidfile (/var/run/celeryd.pid) already exists. Seems we're already running? (PID: 13214) and dies immediately. (This looks like a bug in Celery itself; I've let them know about it.)
Sending SIGTERM to the main process and then immediately launching a new instance: same issue with the Pidfile.
Disabling the Pidfile entirely: without it, we have no way of telling which of the 30 Celery processes is the main process that needs to be sent a SIGTERM when we want it to do a warm shutdown. We also have no reliable way to check if the main process is still alive.

celeryd has an --autoreload option. If enabled, the celery worker (main process) will detect changes in Celery modules and restart all worker processes. In contrast to the SIGHUP signal, autoreload restarts each process independently when its currently executing task finishes. This means that while one worker process is restarting, the remaining processes can keep executing tasks.
http://celery.readthedocs.org/en/latest/userguide/workers.html#autoreloading
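For example, with Celery 3.x this would presumably look something like the following (a sketch only; --autoreload was removed in Celery 4.0, and whether the django-celery celeryd wrapper forwards the flag depends on your versions):
python manage.py celeryd --autoreload --pidfile=/var/run/celeryd.pid [...]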

I've recently fixed the bug with SIGHUP: https://github.com/celery/celery/pull/662

rm *.pyc
This causes the updated tasks to be reloaded. I discovered this trick recently; I just hope there are no nasty side effects.

Well, you are using SIGHUP (1) for a warm shutdown of Celery. I am not sure it actually causes a warm shutdown, but SIGINT (2) would. Try SIGINT in place of SIGHUP and then start Celery manually in your script (I guess).

Can you launch it with a custom pid file name, possibly timestamped, and key off of that to know which PID to kill?
CELERYD_PID_FILE="/var/run/celery/%n_{timestamp}.pid"
I don't know the timestamp syntax, but maybe you do, or you can find it?
Then use the current system time to kill off any old PIDs and launch a new one? See the sketch below.
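A rough sketch of that idea as a wrapper script (the paths and the $(date +%s) timestamping are assumptions, not a built-in Celery feature):
# Warm-shutdown workers started from any existing (older) pidfiles,
# then immediately start a fresh worker with a timestamped pidfile.
for old in /var/run/celery/celeryd_*.pid; do
    [ -e "$old" ] && kill -s TERM "$(cat "$old")"
done
PIDFILE="/var/run/celery/celeryd_$(date +%s).pid"
python manage.py celeryd --pidfile="$PIDFILE" [...]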

A little late, but that can be fixed by deleting the file called celerybeat.pid.
Worked for me.

I think you can try this:
kill -s HUP `cat /var/run/celeryd.pid`
python manage.py celeryd --pidfile=/var/run/celeryd.pid
HUP recycles every free worker process and leaves the workers that are currently executing tasks running; those old workers will exit on their own once their tasks finish. You can then safely start a new Celery main process and workers.
I've used this approach in our production environment and it seems safe so far. Hope this helps!

Related

Celery worker stop without any error after task completion

The Celery worker running on Ubuntu stops without any error after the task completes, but celery beat keeps running without any issues.
This problem started recently, after deploying to a new server; the DB schema and some code were also changed. Previously both the worker and beat were running fine.
Command used to run the celery worker:
celery -A base worker -l info --detach --logfile=logs/celery.log -n celery_worker
There is no error after the process completes, which takes around 80-90 minutes.
After completion, there is no celery worker process running and the next task doesn't execute.
What might be the issue here, and how can I debug it?

Exit server terminal after celery execution

I have successfully created a periodic task which runs each minute, in a Django app. Everything is running as expected, using celery -A proj worker -B.
I am aware that using celery -A proj worker -B to execute the task is not advised; however, it seems to be the only way for the task to be run periodically.
I am logging on to the server using Git Bash; after starting the worker, I would like to exit Git Bash with the celery tasks still being executed periodically.
When I press ctrl+fn+shift it is a cold worker exit, which stops execution completely (which is not desirable).
Any help?
If you are on a Linux server, you might want to use a process manager like supervisord or even systemd to keep your process running; see the sketch below.
On Windows, one might look at running Celery as a service or running it as part of RabbitMQ.
In WSL, it seems a .bat file can get WSL commands to run as a service.
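For the Linux case, a minimal systemd unit might look something like this (a sketch only; the paths, user and project name are assumptions, and running the worker with an embedded beat via -B is the pattern from the question rather than a general recommendation):
[Unit]
Description=Celery worker (with embedded beat) for proj
After=network.target

[Service]
User=celery
WorkingDirectory=/path/to/proj
ExecStart=/path/to/proj/venv/bin/celery -A proj worker -B -l info
Restart=always

[Install]
WantedBy=multi-user.target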

Celery task worker not updating in production

I have set up a django project on an EC2 instance with SQS as broker for celery, running through Supervisord. The problem started when I updated the parameter arguments for a task. On calling the task, I get an error on Sentry which clearly shows that the task is running the old code. How do I update it?
I have tried supervisorctl restart all but still there are issues. The strange thing is that for some arguments, the updated code runs while for some it does not.
I checked the logs for the celery worker and it doesn't receive the tasks which give me the error. I am running -P solo so there is only one worker (Ran ps auxww | grep 'celery worker' to check). Then who else is processing those tasks?
Any kind of help is appreciated.
P.S. I use RabbitMQ for local development and it works totally fine
Never use the same queue in different environments.
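One hedged way to enforce that in a Django/Celery project is to derive the queue name from the environment, so staging and production workers can never consume each other's tasks (a sketch; the ENVIRONMENT variable and project name are assumptions, and the setting is spelled task_default_queue in Celery 4+):
import os
from celery import Celery

app = Celery("myproject")
ENVIRONMENT = os.environ.get("ENVIRONMENT", "dev")  # assumed environment variable

# Each environment gets its own SQS queue name.
app.conf.task_default_queue = "myproject-{}".format(ENVIRONMENT)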

Check if celery beat is up and running

In my Django project, I use Celery and RabbitMQ to run tasks in the background.
I am using the celery beat scheduler to run periodic tasks.
How can I check, programmatically, if celery beat is up and running?
Make a task that sends an HTTP request to a ping URL at regular intervals. When the URL is not pinged on time, the URL monitor will send you an alert.
import requests
from yourapp.celery_config import app

@app.task
def ping():
    print('[healthcheck] pinging alive status...')
    # healthchecks.io works for me:
    requests.post("https://hchk.io/6466681c-7708-4423-adf0-XXXXXXXXX")
This celery periodic task is scheduled to run every minute; if it doesn't hit the ping, your beat service is down* and the monitor will send you an email (or a webhook, so you can wire it up to Zapier and get mobile push notifications as well).
celery -A yourapp.celery_config beat -S djcelery.schedulers.DatabaseScheduler
*or overwhelmed. You should also track task saturation; this is a nightmare with Celery and should be detected and addressed properly. It happens frequently when the workers are busy with blocking tasks that need optimization.
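For reference, the every-minute schedule for the ping task above could be declared like this in the Django settings, using the classic CELERYBEAT_SCHEDULE dict (a sketch; the entry name and dotted task path are assumptions, and with djcelery's DatabaseScheduler you can also add the entry through the Django admin instead):
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    "healthcheck-ping-every-minute": {
        "task": "yourapp.tasks.ping",  # assumed dotted path to the ping task above
        "schedule": timedelta(minutes=1),
    },
}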
Are you using upstart or supervisord or something else to run celery workers + celery beat as background tasks? In production you should use one of them to run celery workers + celery beat in the background.
The simplest way to check that celery beat is running: ps aux | grep -i '[c]elerybeat'. If you get a text string with a pid, it's running. You can also make the output of this command prettier: ps aux | grep -i '[c]elerybeat' | awk '{print $2}'. If you get a number, it's working; if you get nothing, it's not.
You can also check the celery workers' status: celery -A projectname status.
If you are interested in advanced celery monitoring you can read the official documentation's monitoring guide.
If you have daemonized celery following the celery docs' daemonization tutorial, checking whether it's running can be done with:
sudo /etc/init.d/celeryd status
sudo /etc/init.d/celerybeat status
You can use the return code of such commands in a Python module; a sketch follows.
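A minimal sketch of wrapping one of those commands from Python (assuming the init.d scripts above are installed; the function name is illustrative):
import subprocess

def celerybeat_is_running():
    # init.d "status" conventionally exits 0 when the service is running.
    result = subprocess.run(
        ["sudo", "/etc/init.d/celerybeat", "status"],
        capture_output=True,
    )
    return result.returncode == 0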
You can probably look up supervisor.
It provides a celerybeat conf which logs everything related to beat in /var/log/celery/beat.log.
Another way of going about this is to use Flower. You can set it up for your server (make sure it's password protected); it becomes somewhat easier to notice in the GUI which tasks are being queued and at what time, thus verifying that your beat is running fine.
I have recently used a solution similar to what @panchicore suggested, for the same problem.
The problem in my workplace was an important system working with celery beat: once in a while, either due to a RabbitMQ outage or some connectivity issue between our servers and the RabbitMQ server, celery beat simply stopped triggering crons until it was restarted.
As we didn't have any tool handy to monitor keep-alive calls sent over HTTP, we used statsd for the same purpose. A counter is incremented on the statsd server every minute (done by a celery task), and we set up email & Slack channel alerts on the Grafana metrics (no updates for 10 minutes == outage).
I understand it's not a purely programmatic approach, but no production-level monitoring/alerting is complete without a separate monitoring entity.
The programming part is as simple as it can be. A tiny celery task running every minute.
@periodic_task(run_every=timedelta(minutes=1))
def update_keep_alive():
    logger.info("running keep alive task")
    statsd.incr(statsd_tags.CELERY_BEAT_ALIVE)
A problem that I have faced with this approach is statsd packet loss over UDP, so use a TCP connection to statsd for this purpose if possible.
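If you go the TCP route, the statsd PyPI package exposes a TCP client; a minimal sketch (the host, port and metric name are assumptions):
from statsd import TCPStatsClient

# TCP avoids the silent packet loss you can get with the default UDP client.
statsd = TCPStatsClient(host="statsd.internal", port=8125)
statsd.incr("celery.beat.alive")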
You can check whether the scheduler is running or not with the following command:
python manage.py celery worker --beat
While working on a project recently, I used this:
HEALTHCHECK CMD stat celerybeat.pid || exit 1
Essentially, the beat process writes a pid file in some location (usually the home directory), and all you have to do is stat it to check the file is there.
Note: this worked while launching a standalone celery beat process in a Docker container.
The goal of liveness for celery beat/scheduler is to check if the celery beat/scheduler is able to send the job to the message broker so that it can be picked up by the respective consumer. [Is it still working or in a hung state]. The celery worker and celery scheduler/beat may or may not be running in the same pod or instance.
To handle such scenarios, we can create a method update_scheduler_liveness with the decorator @after_task_publish.connect, which will be called every time the scheduler successfully publishes a message/task to the message broker (see the sketch below).
The method update_scheduler_liveness will write the current timestamp to a file every time a task is published successfully.
In the liveness probe, we need to check the last-updated timestamp of the file, either using
the stat --printf="%Y" celery_beat_schedule_liveness.stat command,
or by explicitly reading the file, extracting the timestamp and comparing whether it is recent enough for the liveness probe criteria.
In this approach, the finer the liveness criteria you need, the more frequently a job must be triggered from celery beat. So, for cases where the interval between jobs is quite large, a custom/dedicated liveness heartbeat job can be scheduled every 2-5 minutes and the consumer can simply process it. The @after_task_publish.connect decorator provides multiple arguments that can also be used to filter for the liveness-specific job that was triggered.
If we don't want to go for a file-based approach, we can instead rely on a data source like Redis, with an instance-specific Redis key, implemented along the same lines.
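A minimal sketch of the signal handler described above (the file path is an assumption, and the handler/file names are illustrative):
import time
from celery.signals import after_task_publish

LIVENESS_FILE = "/tmp/celery_beat_schedule_liveness.stat"  # assumed path

@after_task_publish.connect
def update_scheduler_liveness(sender=None, headers=None, body=None, **kwargs):
    # Fires after any task is successfully published to the broker;
    # record the current timestamp so the liveness probe can check freshness.
    with open(LIVENESS_FILE, "w") as f:
        f.write(str(time.time()))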

Issues with celery daemon

We're having issues with our celery daemon being very flaky. We use a fabric deployment script to restart the daemon whenever we push changes, but for some reason this is causing massive issues.
Whenever the deployment script is run, the celery processes are left in some pseudo-dead state. They will (unfortunately) still consume tasks from RabbitMQ, but they won't actually do anything. Confusingly, a brief inspection would indicate everything seems to be "fine" in this state: celeryctl status shows one node online and ps aux | grep celery shows 2 running processes.
However, attempting to run /etc/init.d/celeryd stop manually results in the following error:
start-stop-daemon: warning: failed to kill 30360: No such process
While in this state attempting to run celeryd start appears to work correctly, but in fact does nothing. The only way to fix the issue is to manually kill the running celery processes and then start them again.
Any ideas what's going on here? We also don't have complete confirmation, but we think the problem also develops on its own after a few days with no deployment (and no activity; this is currently a test server).
I can't say that I know what's ailing your setup, but I've always used supervisord to run celery -- maybe the issue has to do with upstart? Regardless, I've never experienced this with celery running on top of supervisord.
For good measure, here's a sample supervisor config for celery:
[program:celeryd]
directory=/path/to/project/
command=/path/to/project/venv/bin/python manage.py celeryd -l INFO
user=nobody
autostart=true
autorestart=true
startsecs=10
numprocs=1
stdout_logfile=/var/log/sites/foo/celeryd_stdout.log
stderr_logfile=/var/log/sites/foo/celeryd_stderr.log
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
Restarting celeryd in my fab script is then as simple as issuing a sudo supervisorctl restart celeryd.
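For completeness, the fab task behind that one-liner might look something like this (a sketch assuming Fabric 1.x; the task name is illustrative):
from fabric.api import sudo, task

@task
def restart_celeryd():
    # supervisord handles the warm shutdown (see stopwaitsecs above) and restart.
    sudo("supervisorctl restart celeryd")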