Celery --beat on Heroku vs Worker and Clock processes - django

I have a periodic task that I am running on Heroku as a worker process in my Procfile:
Procfile
web: gunicorn voltbe2.wsgi --log-file - --log-level debug
worker: celery -A voltbe2 worker --beat --events --loglevel INFO
tasks.py
from datetime import timedelta
from celery.task import PeriodicTask
from .models import MyModel  # wherever MyModel is defined

class PullXXXActivityTask(PeriodicTask):
    """
    A periodic task that fetches data every 1 minute.
    """
    run_every = timedelta(minutes=1)

    def run(self, **kwargs):
        abc = MyModel.objects.all()
        for rk in abc:
            rk.pull()
        logger = self.get_logger(**kwargs)
        logger.info("Running periodic task for XXX.")
        return True
For this periodic task I need the --beat (I checked by turning it off, and the task does not repeat). So, in effect, --beat does the work of a clock process (https://devcenter.heroku.com/articles/scheduled-jobs-custom-clock-processes).
My concern is: if I scale the worker to two dynos with heroku ps:scale worker=2, I can see from the logs that two beats are running, one on worker.1 and one on worker.2:
Aug 25 09:38:11 emstaging app/worker.2: [2014-08-25 16:38:11,580: INFO/Beat] Scheduler: Sending due task apps.notification.tasks.SendPushNotificationTask (apps.notification.tasks.SendPushNotificationTask)
Aug 25 09:38:20 emstaging app/worker.1: [2014-08-25 16:38:20,239: INFO/Beat] Scheduler: Sending due task apps.notification.tasks.SendPushNotificationTask (apps.notification.tasks.SendPushNotificationTask)
The log shown is for a different periodic task, but the key point is that both worker dynos receive the same due task from their own beat schedulers, whereas there should be a single clock that ticks, decides every XX seconds what is due, and hands that task to the least loaded worker.n dyno.
More on why a single clock is essential is here: https://devcenter.heroku.com/articles/scheduled-jobs-custom-clock-processes#custom-clock-processes
Is this a problem, and if so, how do I avoid it?

You should have a separate worker for the beat process.
web: gunicorn voltbe2.wsgi --log-file - --log-level debug
worker: celery -A voltbe2 worker --events --loglevel INFO
beat: celery -A voltbe2 beat
Now you can scale the worker process without affecting the beat one.
Alternatively, if you won't always need the extra process, you can keep -B on the worker entry but also add a second entry, say extra_worker, which is normally scaled to 0 dynos but which you can scale up as necessary (as sketched below). The important thing is to always keep the entry that runs beat at exactly 1 process.
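For illustration, that alternative Procfile might look roughly like this (a sketch; extra_worker is just an example process name, and the flags simply mirror the ones already used above):
web: gunicorn voltbe2.wsgi --log-file - --log-level debug
worker: celery -A voltbe2 worker --beat --events --loglevel INFO
extra_worker: celery -A voltbe2 worker --events --loglevel INFO
You would then scale with something like heroku ps:scale worker=1 extra_worker=3, keeping worker pinned to 1 so only one beat is ever running.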

Related

Revoking Celery task causes Heroku dyno worker timeout

My Django API app has a view class that defines a post and a patch method. The post method creates a reservation resource with a user-defined expiration time and creates a Celery task which changes said resource once the user-defined expiration time has expired. The patch method gives the user the ability to update the expiration time. Changing the expiration time also requires changing the Celery task. In my patch method, I use app.control.revoke(task_id=task_id, terminate=True, signal="SIGKILL") to revoke the existing task, followed by creating a new task with the amended expiration time.
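For context, the revoke-and-reschedule step looks roughly like this (a minimal sketch, not the actual code; expire_reservation and the reservations module are hypothetical stand-ins, and the Celery app is assumed to be exposed by secure_my_spot.celeryconf, matching the compose file further below):
from secure_my_spot.celeryconf import app          # assumed location of the Celery app
from reservations.tasks import expire_reservation  # hypothetical task that flips the boolean field

def reschedule_expiration(reservation, new_expiration):
    # cancel the task queued by the POST handler
    app.control.revoke(task_id=reservation.task_id, terminate=True, signal="SIGKILL")
    # queue a replacement task for the new expiration time
    result = expire_reservation.apply_async(args=[reservation.id], eta=new_expiration)
    reservation.task_id = result.id
    reservation.save(update_fields=["task_id"])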
All of this works just fine in my local Docker setup. But once I deploy to Heroku, the Celery worker appears to be terminated: the execution of the above app.control.revoke(...) causes a request timeout with an H12 error code.
2021-11-21T16:39:35.509103+00:00 heroku[router]: at=error code=H12 desc="Request timeout" method=PATCH path="/update-reservation-unauth/30/emailh#email.com/" host=secure-my-spot-api.herokuapp.com request_id=e0a304f8-3901-4ca1-a5eb-ba3dc06df999 fwd="108.21.217.120" dyno=web.1 connect=0ms service=30001ms status=503 bytes=0 protocol=https
2021-11-21T16:39:35.835796+00:00 app[web.1]: [2021-11-21 16:39:35 +0000] [3] [CRITICAL] WORKER TIMEOUT (pid:5)
2021-11-21T16:39:35.836627+00:00 app[web.1]: [2021-11-21 16:39:35 +0000] [5] [INFO] Worker exiting (pid: 5)
I would like to emphasise that the timeout cannot possibly be caused by calculation complexity as the Celery task only changes a boolean field on the reservation resource.
For reference, the Celery service in my docker-compose.yml looks like this:
worker:
  container_name: celery
  build:
    context: .
    dockerfile: Dockerfile
  command: celery -A secure_my_spot.celeryconf worker --loglevel=INFO
  depends_on:
    - db
    - broker
    - redis
  restart: always
I have looked everywhere for a solution but have not found anything that helps, and I am drawing a blank as to why the deployed Heroku app has this issue while the local development containers work just fine.

How do we completely stop tasks in celery periodic_task?

Revoking a task registered with @periodic_task still results in "Discarding revoked task" and "Sending due task" messages from the workers:
[2018-09-17 12:23:50,864: INFO/MainProcess] Received task: cimexapp.tasks.add[xxxxxxx]
[2018-09-17 12:23:50,864: INFO/MainProcess] Discarding revoked task: cimexapp.tasks.add[xxxxxxx]
[2018-09-17 12:24:00,865: INFO/Beat] Scheduler: Sending due task cimexapp.tasks.add (cimexapp.tasks.add)
[2018-09-17 12:24:00,869: INFO/MainProcess] Received task: cimexapp.tasks.add[xxxxxxx]
[2018-09-17 12:24:00,869: INFO/MainProcess] Discarding revoked task: cimexapp.tasks.add[xxxxxxx]
[2018-09-17 12:24:10,865: INFO/Beat] Scheduler: Sending due task cimexapp.tasks.add (cimexapp.tasks.add)
[2018-09-17 12:24:10,868: INFO/MainProcess] Received task: cimexapp.tasks.add[xxxxxxx]
[2018-09-17 12:24:10,869: INFO/MainProcess] Discarding revoked task: cimexapp.tasks.add[xxxxxxx]
tasks.py
from datetime import timedelta
from subprocess import call
# periodic_task and revoke come from Celery; the exact import path depends on your Celery version

@periodic_task(run_every=timedelta(seconds=10), options={"task_id": "xxxxxxx"})
def add():
    call(["ping", "-c10", "google.com"])

def stop():
    x = revoke("xxxxxxx", terminate=True, signal="KILL")
    print(x)
    print('DONE')
I gave the task a fixed task_id so that it would be easy to kill it by that id.
How do I completely stop it from sending tasks?
I don't want to kill all workers with
pkill -9 -f 'celery worker'
celery -A PROJECTNAME control shutdown
I just want to stop the tasks/workers for the add() function.
One possible way is to store the tasks in the database and add/remove them dynamically. You can use the database-backed celery beat scheduler for this; see https://django-celery-beat.readthedocs.io/en/latest/. The PeriodicTask model stores the periodic tasks, and you can manipulate them with ordinary database commands (the Django ORM).
This is how I handled dynamic tasks (creating and stopping tasks dynamically).
from django_celery_beat.models import PeriodicTask, IntervalSchedule, CrontabSchedule

# A cron schedule (runs at 08:40 every day).
chon_schedule = CrontabSchedule.objects.create(minute='40', hour='08', day_of_week='*', day_of_month='*', month_of_year='*')
# An interval schedule that runs every 10 seconds.
schedule = IntervalSchedule.objects.create(every=10, period=IntervalSchedule.SECONDS)
# A database entry describing a periodic task on the cron schedule.
PeriodicTask.objects.create(crontab=chon_schedule, name='name_to_identify_task', task='name_of_task')
# A periodic task on the interval schedule.
task = PeriodicTask.objects.create(interval=schedule, name='run for every 10 min', task='for_each_ten_min')
Whenever you update a PeriodicTask a counter in this table is also incremented, which tells the celery beat service to reload the schedule from the database.
So you don't need to restart or kill the beat process.
If you want to stop a task when a particular criterion is met, then:
periodic_task = PeriodicTask.objects.get(name='run for every 10 min')
periodic_task.enabled = False
periodic_task.save()
When enabled is False, the periodic task becomes idle. You can make it active again by setting enabled = True.
If you no longer need the task, you can simply delete the entry.
When you create your Project model object, create the periodic task too: build a cron schedule or an interval schedule depending on your scenario, then create the PeriodicTask object. You can use Project.name as the PeriodicTask name, so you can easily relate the Project object to the PeriodicTask object (see the sketch below). That's it; from that moment onwards the task is handled by celery beat.
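A rough sketch of that pattern (Project, the task path, and the schedule are placeholders for your own):
import json
from django_celery_beat.models import IntervalSchedule, PeriodicTask
from myapp.models import Project  # placeholder: your own model

def create_project_with_task(name):
    project = Project.objects.create(name=name)
    schedule, _ = IntervalSchedule.objects.get_or_create(every=10, period=IntervalSchedule.MINUTES)
    PeriodicTask.objects.create(
        interval=schedule,
        name=project.name,                # reuse the project name so the two objects are easy to relate
        task='myapp.tasks.sync_project',  # placeholder: dotted path to your registered task
        args=json.dumps([project.id]),
    )
    return project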
If you want to disable or enable a periodic task dynamically, just set the enabled flag on the PeriodicTask as follows:
task = PeriodicTask.objects.get(name='task_name')
task.enabled = False
task.save()
In Celery 4.2.2, you can discard all task messages waiting in the queue by executing the following command
celery -A YourProjectName purge
Remember to replace YourProjectName.
For more info, check out this page

docker-compose and graceful Celery shutdown

I've been searching for solutions to this and haven't found any.
I'm running Celery in a container built with docker-compose. My container is configured like this:
celery:
  build: .
  container_name: cl01
  env_file: ./config/variables.env
  entrypoint:
    - /celery-entrypoint.sh
  volumes:
    - ./django:/django
  depends_on:
    - web
    - db
    - redis
  stop_grace_period: 1m
And my entrypoint script looks like this:
#!/bin/sh
# Wait for django
sleep 10
su -m dockeruser -c "celery -A myapp worker -l INFO"
Now, if I run docker-compose stop, I would like to have a warm (graceful) shutdown, giving Celery the provided 1 minute (stop_grace_period) to finish already started tasks. However docker-compose stop seems to kill Celery straight away. Celery should also log that it is asked to shut down gracefully, but I don't see anything but an abrupt stop to my task logs.
What am I doing wrong or what do I need to change to make Celery shut down gracefully?
Edit:
The suggested answer below about providing the --timeout parameter to docker-compose stop does not solve my issue.
You need to start the celery process with exec; this way celery replaces the entrypoint shell and runs as the container's main process, so Docker can deliver the SIGTERM signal directly to it and Celery can shut down gracefully.
# should be the last command in script
exec celery -A myapp worker -l INFO
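Applied to the entrypoint script from the question, that would look roughly like this (a sketch; note that keeping the su -m dockeruser wrapper would again put Celery in a child process of su, so here the worker is assumed to run as the container's user directly):
#!/bin/sh
# Wait for django
sleep 10
# exec replaces this shell, so Celery becomes the container's main process and receives docker's SIGTERM
exec celery -A myapp worker -l INFO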
From the docs:
Usage: stop [options] [SERVICE...]
Options:
-t, --timeout TIMEOUT Specify a shutdown timeout in seconds (default: 10).
Try with timeout set to 60 seconds at least.
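For example, assuming the celery service name from the compose file above:
docker-compose stop --timeout 60 celery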
Here is my experience with implementing graceful shutdown for celery workers spawned by supervisord inside a docker container.
Supervisord part
supervisord.conf
...
[supervisord]
...
nodaemon=true # run supervisord in the foreground
[include]
files=celery.conf # path to the celery config file
Set nodaemon=true so that supervisord runs in the foreground; we will background it ourselves from the entrypoint script later.
celery.conf
[group:celery_workers]
programs=one, two
[program:one]
...
command=celery -A backend --config=celery.py worker -n worker_one --pidfile=/var/log/celery/worker_one.pid --pool=gevent --concurrency=10 --loglevel=INFO
killasgroup=true
stopasgroup=true
stopsignal=TERM
stopwaitsecs=600
[program:two]
...
# similar to the previous one
The configuration file above is responsible for starting a group of workers, each running in a separate process within the group. I'd like to focus on the stopwaitsecs setting. Let's see what the documentation says about it:
This parameter sets the number of seconds to wait for the OS to return a SIGCHLD to supervisord after the program has been sent a stopsignal. If this number of seconds elapses before supervisord receives a SIGCHLD from the process, supervisord will attempt to kill it with a final SIGKILL.
If stopwaitsecs > stop_grace_period specified for your service in the docker-compose file, you'll get a SIGKILL from Docker. Make sure stopwaitsecs < stop_grace_period; otherwise any still-running tasks get interrupted by Docker.
Entrypoint script part
entrypoint.sh
#!/bin/bash
# safety switch, exit script if there's error.
set -e

on_close(){
    echo "Signal caught..."
    echo "Supervisor is stopping processes gracefully..."
    # cleanup all pid files
    rm worker_one.pid
    rm worker_two.pid
    supervisorctl stop celery_workers:
    echo "All processes have been stopped. Exiting..."
    exit 1
}

start_supervisord(){
    supervisord -c /etc/supervisor/supervisord.conf
}

# start trapping signals (docker sends `SIGTERM` for shutdown)
trap on_close SIGINT SIGTERM SIGKILL

start_supervisord &     # start supervisord in a background
SUPERVISORD_PID=$!      # PID of the last background process started
wait $SUPERVISORD_PID
EXIT_STATUS=$?          # the exit status of the last command executed
The script above consists of:
registering a cleanup function on_close
starting supervisord's process group in the background
registering the last background process's PID and waiting for it to finish
Docker part
docker-compose.yml
...
services:
  celery:
    ...
    stop_grace_period: 15m30s
    entrypoint: [/entrypoints/entrypoint.sh]
The only setting worth mentioning here is the entrypoint form. In our case it is better to use the exec form: it starts the executable script as the container's main process (PID 1) and doesn't wrap it in a shell subprocess as the shell form does. The SIGTERM from docker stop <container> therefore gets delivered to the script itself, which traps it and performs all the cleanup and shutdown logic.
Try using this:
docker-compose down

Supervisord / Celery output logs

I am using Supervisor to manage celery. The supervisor config file for celery contains the following 2 entries:
stdout_logfile = /var/log/supervisor/celery.log
stderr_logfile = /var/log/supervisor/celery_err.log
What is confusing me is that although Celery is working properly and all tasks are being successfully completed, they are all written to celery_err.log. I thought that would only be for errors. The celery.log file only shows the normal celery startup info. Is the behaviour of writing successful task completion to the error log correct?
Note - tasks are definitely completing successfully (emails being sent, db entries made etc).
I ran into the same phenomenon. It is caused by Celery's logging mechanism; see the setup_task_loggers method in Celery's logging source:
If logfile is not specified, then sys.stderr is used.
So, clear enough? Celery uses sys.stderr when logfile is not specified.
Solution:
Either use the supervisord redirect_stderr = true flag to combine the two log files into one (this is what I use), or configure Celery's logfile option.
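A sketch of both options, with a placeholder app name and placeholder paths:
[program:celery]
command=celery -A myapp worker -l INFO
; option 1: merge stderr into stdout so everything lands in one file
redirect_stderr=true
stdout_logfile=/var/log/supervisor/celery.log
; option 2 (instead of option 1): have celery write its own log file
; command=celery -A myapp worker -l INFO --logfile=/var/log/celery/worker.log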
Is the behaviour of writing successful task completion to the error log correct?
No, it's not. I have the same setup and logging works fine.
celery.log has task info
[2015-07-23 11:40:07,066: INFO/MainProcess] Received task: foo[b5a6e0e8-1027-4005-b2f6-1ea032c73d34]
[2015-07-23 11:40:07,494: INFO/MainProcess] Task foo[b5a6e0e8-1027-4005-b2f6-1ea032c73d34] succeeded in 0.424549156s: 1
celery_err.log has only some warnings/errors. Try restarting the supervisor process.
One issue you could be having is that Python buffers output by default.
In your supervisor file, you can disable this by setting the PYTHONUNBUFFERED environment variable; see my Django example supervisor file below:
[program:celery-myapp]
environment=DJANGO_SETTINGS_MODULE="myapp.settings.production",PYTHONUNBUFFERED=1
command=/home/me/.virtualenvs/myapp/bin/celery -A myapp worker -B -l DEBUG
directory=/home/me/www/saleor
user=me
stdout_logfile=/home/me/www/myapp/log/supervisor-celery.log
stderr_logfile=/home/me/www/myapp/log/supervisor-celery-err.log
autostart=true
autorestart=true
startsecs=10

Celery: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks and I daemonized it using Supervisor.
The problem now is that one of the periodic tasks I defined is failing (after a week in which it worked properly); the error I get is:
[01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
All the processes managed by supervisor are up and running properly (supervisorctl status says RUNNING).
I tried to read several logs on my EC2 instance, but none of them helped me find out what is causing the SIGKILL. What should I do? How can I investigate?
These are my celery settings:
CELERY_TIMEZONE = 'UTC'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
BROKER_URL = os.environ['RABBITMQ_URL']
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = False
CELERYD_HIJACK_ROOT_LOGGER = False
And this is my supervisord.conf:
[program:celery_worker]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
startsecs=10
stopwaitsecs=60
stopasgroup=true
killasgroup=true
autostart=true
autorestart=true
stdout_logfile=/opt/python/log/celery_worker.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_worker.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1
[program:celery_beat]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
startsecs=10
stopwaitsecs=300
stopasgroup=true
killasgroup=true
autostart=false
autorestart=true
stdout_logfile=/opt/python/log/celery_beat.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_beat.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1
Edit 1
After restarting celery beat the problem remains.
Edit 2
Changed killasgroup=true to killasgroup=false and the problem remains.
The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and killasgroup would only affect a supervisor-initiated kill (e.g. via supervisorctl or a plugin); without that setting it would have sent the signal to the dispatcher anyway, not the child.
Most likely you have a memory leak and the OS's oomkiller is assassinating your process for bad behavior.
grep oom /var/log/messages. If you see messages, that's your problem.
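If your distribution doesn't use /var/log/messages (e.g. Ubuntu/Debian), roughly equivalent checks are:
grep -i oom /var/log/syslog
dmesg -T | grep -i 'killed process'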
If you don't find anything, try running the periodic process manually in a shell:
MyPeriodicTask().run()
And see what happens. I'd monitor system and process metrics with top in another terminal, if you don't have good instrumentation like Cacti, Ganglia, etc. for this host.
You see this kind of error when an asynchronous task (run through Celery) or the script you are using stores a lot of data in memory because of a leak.
In my case, I was getting data from another system and saving it in a variable, so I could export all of it (into a Django model / Excel file) after finishing the process.
Here is the catch: my script was gathering 10 million records, and it was leaking memory while gathering them. This resulted in the exception above.
To overcome the issue, I divided the 10 million records into 20 parts (half a million each). Every time the gathered data reached 500,000 items, I stored it in my preferred local file / Django model, then moved on to the next batch of 500k items.
You don't need to use exactly that number of partitions. The idea is to solve a complex problem by splitting it into multiple subproblems and solving them one by one, as in the sketch below. :D
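A minimal sketch of that batching idea (fetch_records and save_batch are hypothetical stand-ins for your own data source and storage):
BATCH_SIZE = 500_000  # roughly half a million items per part, as described above

def import_in_batches():
    batch = []
    for record in fetch_records():   # hypothetical generator over the remote data
        batch.append(record)
        if len(batch) >= BATCH_SIZE:
            save_batch(batch)        # hypothetical: write to a file or Django model
            batch = []               # drop the reference so the memory can be reclaimed
    if batch:
        save_batch(batch)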