docker-compose and graceful Celery shutdown - django

I've been wondering about and searching for solutions for this and I didn't find any.
I'm running Celery in a container built with docker-compose. My container is configured like this:
celery:
  build: .
  container_name: cl01
  env_file: ./config/variables.env
  entrypoint:
    - /celery-entrypoint.sh
  volumes:
    - ./django:/django
  depends_on:
    - web
    - db
    - redis
  stop_grace_period: 1m
And my entrypoint script looks like this:
#!/bin/sh
# Wait for django
sleep 10
su -m dockeruser -c "celery -A myapp worker -l INFO"
Now, if I run docker-compose stop, I would like to have a warm (graceful) shutdown, giving Celery the provided 1 minute (stop_grace_period) to finish already started tasks. However docker-compose stop seems to kill Celery straight away. Celery should also log that it is asked to shut down gracefully, but I don't see anything but an abrupt stop to my task logs.
What am I doing wrong or what do I need to change to make Celery shut down gracefully?
edit:
Suggested answer below about providing the --timeout parameter to docker-compose stop does not solve my issue.

You need to launch the celery process with exec. That way celery replaces the shell running the entrypoint script and keeps its process ID, so it is the process Docker signals. Docker can then deliver SIGTERM to it and celery shuts down gracefully.
# should be the last command in script
exec celery -A myapp worker -l INFO
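The effect of exec can be checked outside Docker. Below is a minimal Python sketch, assuming a POSIX sh is on the PATH; the helper name shell_pids is made up for illustration. It runs two tiny shell scripts and compares the PID printed by the shell with the PID printed by the command the shell launches.

```python
import subprocess


def shell_pids(script):
    """Run `script` with sh and return the PIDs it prints, one per line."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return [int(line) for line in result.stdout.split()]


# Without exec the inner command is a child of the shell and gets its own PID.
# The trailing `true` keeps the shell from quietly exec-ing its last command.
without_exec = shell_pids("echo $$; sh -c 'echo $$'; true")

# With exec the inner command replaces the shell and keeps the shell's PID.
with_exec = shell_pids("echo $$; exec sh -c 'echo $$'")

print("PIDs differ without exec:", without_exec[0] != without_exec[1])
print("PIDs match with exec:    ", with_exec[0] == with_exec[1])
```

In a container the shell running the entrypoint script is PID 1, so only a command started with exec inherits that position and receives the stop signal directly.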

Via docs
Usage: stop [options] [SERVICE...]
Options:
-t, --timeout TIMEOUT Specify a shutdown timeout in seconds (default: 10).
Try with timeout set to 60 seconds at least.

My experience implementing graceful shutdown for celery workers spawned by supervisord inside a docker container.
Supervisord part
supervisord.conf
...
[supervisord]
...
nodaemon=true ; run supervisord in the foreground
[include]
files=celery.conf ; path to the celery config file
Set nodaemon=true so that supervisord stays in the foreground. The entrypoint script later starts it as a background job of the shell and waits on it.
celery.conf
[group:celery_workers]
programs=one, two
[program:one]
...
command=celery -A backend --config=celery.py worker -n worker_one --pidfile=/var/log/celery/worker_one.pid --pool=gevent --concurrency=10 --loglevel=INFO
killasgroup=true
stopasgroup=true
stopsignal=TERM
stopwaitsecs=600
[program:two]
...
# similar to the previous one
The configuration file above starts a group of workers, each running in its own process within the group. I'd like to dwell on the stopwaitsecs value. Let's see what the documentation tells us about it:
This parameter sets the number of seconds to wait for the OS to return
a SIGCHLD to supervisord after the program has been sent a
stopsignal. If this number of seconds elapses before supervisord
receives a SIGCHLD from the process, supervisord will attempt to kill
it with a final SIGKILL.
If stopwaitsecs > stop_grace_period (the value set for your service in the docker-compose file), Docker will send SIGKILL to the container before supervisord has finished waiting. Make sure stopwaitsecs < stop_grace_period, otherwise all running tasks get interrupted by Docker.
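The comparison can be automated as a deployment check. This is a rough sketch, not part of supervisord or Compose: the function names are hypothetical, and the parser only understands durations built from h, m and s units such as 15m30s.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def compose_duration_to_seconds(value):
    """Convert a duration such as '15m30s' or '1m' to seconds (h/m/s only)."""
    parts = re.findall(r"(\d+)([hms])", value)
    # Reject anything the pattern did not fully consume, e.g. '500ms'.
    if not parts or "".join(number + unit for number, unit in parts) != value:
        raise ValueError("unsupported duration: %r" % value)
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def supervisor_finishes_first(stopwaitsecs, stop_grace_period):
    """True when supervisord's wait ends before Docker's grace period does."""
    return stopwaitsecs < compose_duration_to_seconds(stop_grace_period)


# Values taken from the configuration in this answer.
print(supervisor_finishes_first(600, "15m30s"))
```

With stopwaitsecs=600 and stop_grace_period: 15m30s the grace period is 930 seconds, so supervisord gets to finish its own wait first.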
Entrypoint script part
entrypoint.sh
#!/bin/bash
# safety switch, exit script if there's error.
set -e
on_close(){
    echo "Signal caught..."
    echo "Supervisor is stopping processes gracefully..."
    # cleanup all pid files (-f so a missing file does not abort the script under `set -e`)
    rm -f worker_one.pid
    rm -f worker_two.pid
    supervisorctl stop celery_workers:
    echo "All processes have been stopped. Exiting..."
    exit 1
}
start_supervisord(){
    supervisord -c /etc/supervisor/supervisord.conf
}
# start trapping signals (docker sends `SIGTERM` for shutdown; SIGKILL cannot be trapped)
trap on_close SIGINT SIGTERM
start_supervisord & # start supervisord in the background
SUPERVISORD_PID=$! # PID of the last background process started
wait $SUPERVISORD_PID
EXIT_STATUS=$? # the exit status of the last command executed
The script above consists of:
registering a cleanup function on_close
starting supervisord (and with it the worker group) in the background
recording the PID of that background process and waiting for it to finish
Docker part
docker-compose.yml
...
services:
  celery:
    ...
    stop_grace_period: 15m30s
    entrypoint: [/entrypoints/entrypoint.sh]
The only setting worth mentioning here is the form of the entrypoint declaration. In our case it is better to use the exec form. It starts the executable script as the process with PID 1 and doesn't wrap it in an extra shell process the way the shell form does. SIGTERM from docker stop <container> is then delivered to the script, which traps it and performs all the cleanup and shutdown logic.

Try using this:
docker-compose down

Related

Symfony Messenger not shutting down gracefully when using APP_ENV=prod

We are using Symfony Messenger in combination with supervisor running in a Docker container on AWS ECS. We noticed the worker is not shut down gracefully. After debugging it appears it does work as expected when using APP_ENV=dev, but not when APP_ENV=prod.
I made a simple sleepMessage, which sleeps for 1 second and then prints a message for 60 seconds. This is when running with APP_ENV=dev
As you can see it's clearly waiting for the program to stop running.
Now with APP_ENV=prod:
It stops immediately without waiting.
In the Dockerfile we have configured the following to start supervisor. It's based on php:8.1-apache, so that's why STOPSIGNAL has been configured
RUN apt-get update && apt-get install -y --no-install-recommends \
# for supervisor
python \
supervisor
The start-worker.sh script contains this
#!/usr/bin/env bash
cp config/worker/messenger-worker.conf ../../../etc/supervisor/supervisord.conf
exec /usr/bin/supervisord
We do this because certain env variables are only available when starting up.
For debugging purposes the config has been hardcoded to test.
Below is the messenger-worker.conf
[unix_http_server]
file=/tmp/supervisor.sock
[supervisord]
nodaemon=true ; start in foreground if true; default false
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[program:messenger-consume]
stderr_logfile_maxbytes=0
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
command=bin/console messenger:consume async -vv --env=prod --time-limit=129600
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
numprocs=1
environment=
    MESSENGER_TRANSPORT_DSN="https://sqs.eu-central-1.amazonaws.com/{id}/dev-symfony-messenger-queue"
So in short, when using --env=prod in the config above it doesn't wait for the worker to stop, while with --env=dev it does. Does anybody know how to solve this?
I don't know why there would be a difference between dev & prod environment but it seems you have no grace period set (at least for Supervisor). As I added in the docs:
the workers will be able to handle the SIGTERM signal if you have the PCNTL PHP extension
you need to add stopwaitsecs to your Supervisor program configuration
As you use Docker too, you can also set the graceful period at the service level which defaults to 10s:
services:
  my_app:
    stop_grace_period: 20s
    # ...
With this configuration, running docker-compose down (just an example):
Docker sends a SIGTERM signal to the service entrypoint (Supervisor) and waits 20s for it to exit
Supervisor sends a SIGTERM signal to its programs (messenger:consume commands) and waits 20s for them to exit
the messenger:consume processes will "catch" the signal, finish handling the current message and stop
every program stopped, Supervisor can stop, then the Docker Compose stack
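The "catch the signal, finish the current message, then stop" step is a general pattern rather than something specific to Symfony. Here is a minimal Python sketch of it, not the Messenger implementation: the GracefulWorker class and the message list are invented for illustration, and the SIGTERM is sent to the process itself to simulate Supervisor's stop signal arriving in the middle of a message.

```python
import os
import signal


class GracefulWorker:
    """Consumes messages until a stop is requested, never mid-message."""

    def __init__(self):
        self.should_stop = False
        self.handled = []

    def request_stop(self, signum, frame):
        # Only set a flag; the message in progress is allowed to finish.
        self.should_stop = True

    def run(self, messages, handle):
        for message in messages:
            if self.should_stop:
                break
            handle(message)
            self.handled.append(message)


worker = GracefulWorker()
signal.signal(signal.SIGTERM, worker.request_stop)


def handle(message):
    if message == "b":
        # Simulate the stop signal arriving while "b" is being handled.
        os.kill(os.getpid(), signal.SIGTERM)


worker.run(["a", "b", "c"], handle)
print(worker.handled)
```

Message "b" is still completed after the signal arrives, and "c" is left in the queue for the next worker. The grace periods discussed above only need to be long enough to cover one such message.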
It turns out the problem was related to the wait_time option of the SQS transport. It probably caused a request that was started just before the container exited, with the response arriving when the container no longer existed. Setting wait_time to 0 fixed the problem.
Then there was this which could lead to the same issue

Running celery task on Ubuntu with supervisor

I have a test Django site using a mod_wsgi daemon process, and have set up a simple Celery task to email a contact form, and have installed supervisor.
I know that the Django code is correct.
The problem I'm having is that when I submit the form, I am only getting one message - the first one. Subsequent completions of the contact form do not send any message at all.
On my server, I have another test site with a configured supervisor task running which uses the Django server (ie it's not using mod_wsgi). Both of my tasks are running fine if I do
sudo supervisorctl status
Here is my conf file for the task I've described above which is saved at
/etc/supervisor/conf.d
the user in this instance is called myuser
[program:test_project]
command=/home/myuser/virtualenvs/test_project_env/bin/celery -A test_project worker --loglevel=INFO --concurrency=10 -n worker2@%%h
directory=/home/myuser/djangoprojects/test_project
user=myuser
numprocs=1
stdout_logfile=/var/log/celery/test_project.out.log
stderr_logfile=/var/log/celery/test_project.err.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
stopasgroup=true
; Set Celery priority higher than default (999)
; so, if rabbitmq is supervised, it will start first.
priority=1000
My other test site has this set as the command - note worker1@%%h
command=/home/myuser/virtualenvs/another_test_project_env/bin/celery -A another_test_project worker --loglevel=INFO --concurrency=10 -n worker1@%%h
I'm obviously doing something wrong, in that only the first form submission is processed. If I look at the out.log file referred to above, I only see the first task; nothing is visible for the other form submissions.
Many thanks in advance.
UPDATE
I submitted the first form at 8.32 am (GMT) which was received, and then as described above, another one shortly thereafter for which a task was not created. Just after finishing the question, I submitted the form again at 9.15, and for this a task was created and the message received! I then submitted the form again, but no task was created again. Hope this helps!
Use ps auxf|grep celery to see how many workers you have started. If a worker you started earlier is still running because you never killed it, that old worker will consume some of the tasks, so only one task in every two or three (or more) is received by the worker you are watching.
You also need to stop celery with:
sudo supervisorctl -c /etc/supervisord/supervisord.conf stop all
every time, and set this in supervisord.conf:
stopasgroup=true ; send stop signal to the UNIX process group (default false)
Otherwise it will cause memory leaks and regular task loss.
If you have multiple Django sites, here is a setup using RabbitMQ:
you need to add a RabbitMQ vhost and give the user permissions on that vhost:
sudo rabbitmqctl add_vhost {vhost_name}
sudo rabbitmqctl set_permissions -p {vhost_name} {username} ".*" ".*" ".*"
Each site then uses a different vhost (they can share the same user).
Add this to your Django settings.py:
BROKER_URL = 'amqp://username:password@localhost:5672/vhost_name'
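As an illustration of the one-vhost-per-site idea, the sketch below builds a broker URL for each site. The broker_url helper and the site and vhost names are hypothetical; the only part taken from the answer is the amqp://user:password@host:port/vhost shape. Percent-encoding is applied so that a password or vhost containing characters such as @ or / does not break the URL.

```python
from urllib.parse import quote


def broker_url(username, password, vhost, host="localhost", port=5672):
    """Build an AMQP broker URL, percent-encoding the variable parts."""
    return "amqp://{}:{}@{}:{}/{}".format(
        quote(username, safe=""),
        quote(password, safe=""),
        host,
        port,
        quote(vhost, safe=""),
    )


# Hypothetical sites, each mapped to its own vhost but sharing one user.
SITE_VHOSTS = {"site_one": "site_one_vhost", "site_two": "site_two_vhost"}

urls = {
    site: broker_url("username", "password", vhost)
    for site, vhost in SITE_VHOSTS.items()
}
for site, url in urls.items():
    print(site, url)
```

Each site's settings.py would then use its own entry as BROKER_URL, so the workers of one site never see the tasks of another.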
some info here:
Using celeryd as a daemon with multiple django apps?
Running multiple Django Celery websites on same server
Run Multiple Django Apps With Celery On One Server With Rabbitmq VHosts

2 virtual env + celery + supervisor unix:///var/run/supervisor.sock no such file

I'm using redis as the backend and I have 2 virtual envs, each with its own celery workers.
I'm having a weird issue when I add the celery supervisord conf of the second virtual env.
This is the error I'm getting after reloading supervisord:
unix:///var/run/supervisor.sock no such file
This is the supervisord conf file:
[program:shopify-celery]
command=dir/bin/celery worker --app=app -l warning -Q queue -n worker -P eventlet -c 3
directory=/dir
user=user
group=webapps
numprocs=1
stdout_logfile=/dir/logs/celery-worker.log
stderr_logfile=/dir/logs/celery-worker.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
;killasgroup=true
; if rabbitmq is supervised, set its priority higher
; so it starts first
;priority=998
I couldn't find what might have caused this.
Do you know what went wrong?
After hours of struggling with this, it turned out that a missing log file was the cause of all of it.
I created the needed log file and things work perfectly now.
Maybe this will help someone else.

Gunicorn sync workers spawning processes

We're using Django + Gunicorn + Nginx on our server. The problem is that after a while we see lots of gunicorn worker processes that have become orphans, and a lot of others that have become zombies. We can also see that some of the Gunicorn worker processes spawn other Gunicorn workers. Our best guess is that these workers become orphans after their parent workers have died.
Why do Gunicorn workers spawn child workers? Why do they die? And how can we prevent this?
I should also mention that we've set the Gunicorn log level to debug and we still don't see anything significant, other than the periodic log of the worker count, which reports the number of workers we asked for.
UPDATE
This is the line we used to run gunicorn:
gunicorn --env DJANGO_SETTINGS_MODULE=proj.settings proj.wsgi --name proj --workers 10 --user proj --group proj --bind 127.0.0.1:7003 --log-level=debug --pid gunicorn.pid --timeout 600 --access-logfile /home/proj/access.log --error-logfile /home/proj/error.log
In my case I deploy on Ubuntu servers (LTS releases, now almost all 14.04 LTS) and I have never had problems with gunicorn daemons. I create a gunicorn.conf.py and launch gunicorn with this config from upstart, with a script like this in /etc/init/djangoapp.conf
description "djangoapp website"
start on startup
stop on shutdown
respawn
respawn limit 10 5
script
cd /home/web/djangoapp
exec /home/web/djangoapp/bin/gunicorn -c gunicorn.conf.py -u web -g web djangoapp.wsgi
end script
I configure gunicorn with a .py config file, set some options (details below) and deploy my app (with virtualenv) in /home/web/djangoapp, and I have no problems with zombie or orphaned gunicorn processes.
I checked your options: timeout can be a problem, but another one is that you don't set max-requests in your config. It defaults to 0, which means workers are never restarted automatically, so memory leaks can build up (http://gunicorn-docs.readthedocs.org/en/latest/settings.html#max-requests)
We will use a .sh file to start the gunicorn process, and later a supervisord configuration file to manage it. For background on what supervisord is and how to install it with Django, Nginx and Gunicorn, see the external guide linked here.
gunicorn_start.sh (remember to make the file executable with chmod +x):
#!/bin/sh
NAME="myDjango"
DJANGODIR="/var/www/html/myDjango"
NUM_WORKERS=3
echo "Starting myDjango -- Django Application"
cd $DJANGODIR
exec gunicorn -w $NUM_WORKERS $NAME.wsgi:application --bind 127.0.0.1:8001
mydjango_django.conf: remember to install supervisord on your OS, and
copy this file into its configuration folder.
[program:myDjango]
command=/var/www/html/myDjango/gunicorn_start.sh
user=root
autorestart=true
redirect_stderr=true
Later on use the command:
Reload the daemon's configuration files, without add/remove (no restarts)
supervisorctl reread
Start all processes. Note: restart does not reread config files. For that, see reread and update.
supervisorctl start all
Get all process status info.
supervisorctl status
This sounds like a timeout issue.
You have multiple timeouts going on and they all need to be in a descending order. It seems they may not be.
For example:
Nginx has a default timeout of 60 seconds
Gunicorn has a default timeout of 30 seconds
Django has a default timeout of 300 seconds
Postgres default timeout is complicated but let's pose 60 seconds for this example.
In this example, when 30 seconds have passed and Django is still waiting for Postgres to respond, Gunicorn tells Django to stop, which in turn should tell Postgres to stop. Gunicorn will wait a certain amount of time for this to happen before it kills Django, leaving the Postgres process as an orphan query. The user will re-initiate their query, and this time the query will take longer because the old one is still running.
I see that you have set your Gunicorn timeout to 600 seconds.
This would probably mean that Nginx tells Gunicorn to stop after 60 seconds, Gunicorn may wait for Django, which waits for Postgres or any other underlying processes, and when Nginx gets tired of waiting, it kills Gunicorn, leaving Django hanging.
This is still just a theory, but it is a very common problem and hopefully leads you and any others experiencing similar problems, to the right place.
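One way to read "descending order" is that each layer's timeout should be no longer than that of the layer in front of it. The sketch below checks a stack under that assumption only. The timeout_violations function is hypothetical, and the numbers are the illustrative values from this answer, not verified defaults.

```python
def timeout_violations(layers):
    """Return (outer, inner) pairs where the inner layer may outlive the outer.

    `layers` is a list of (name, timeout_seconds), outermost layer first.
    """
    problems = []
    for (outer, outer_timeout), (inner, inner_timeout) in zip(layers, layers[1:]):
        if inner_timeout > outer_timeout:
            problems.append((outer, inner))
    return problems


# Example values from the answer above, outermost first.
example = [("nginx", 60), ("gunicorn", 30), ("django", 300), ("postgres", 60)]
print(timeout_violations(example))
```

For the example values the check reports the gunicorn/django pair, which is the case described above where Gunicorn gives up while Django is still waiting.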

running celery as daemon using supervisor is not working

I have a Django app that uses Celery, and I am able to run celery successfully like below
celery -A tasks worker --loglevel=info
but since we need to run it as a daemon, I have written the celery.conf file below inside the /etc/supervisor/conf.d/ folder
; ==================================
; celery worker supervisor example
; ==================================
[program:celery]
; Set full path to celery program if using virtualenv
command=/root/Envs/proj/bin/celery -A app.tasks worker --loglevel=info
user=root
environment=C_FORCE_ROOT="yes"
environment=HOME="/root",USER="root"
directory=/root/apps/proj/structure
numprocs=1
stdout_logfile=/var/log/celery/worker.log
stderr_logfile=/var/log/celery/worker.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true
; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=998
but when I tried to update supervisor with supervisorctl reread and supervisorctl update, I got this message from supervisorctl status
celery FATAL Exited too quickly (process log may have details)
So I looked at the worker.log file and saw the error message below
Running a worker with superuser privileges when the
worker accepts messages serialized with pickle is a very bad idea!
If you really want to continue then you have to set the C_FORCE_ROOT
environment variable (but please think about this before you do).
User information: uid=0 euid=0 gid=0 egid=0
So why is it complaining about C_FORCE_ROOT even though we have set it as an environment variable in the supervisor conf file? What am I doing wrong in the above conf file?
I had the same problem, so I added
environment=C_FORCE_ROOT="yes"
to my program config, but it didn't work,
so I used
environment=C_FORCE_ROOT="true"
and it works.
You'll need to run celery with a non-superuser account. Please remove the following lines from your config:
user=root
environment=C_FORCE_ROOT="yes"
environment=HOME="/root",USER="root"
Then add these lines to your config. I assume that you use django as the non-superuser account and developers as the user group:
user=django
group=developers
Note that subprocesses will inherit the environment variables of the
shell used to start supervisord except for the ones overridden here
and within the program’s environment option. See supervisord documents.
So please note that when you change environment variables via supervisor config files, the changes won't apply by running supervisorctl reread and supervisorctl reload. You should restart supervisor from scratch with the following command:
supervisord -c /path/to/config/file.conf
Following another thread on Stack Overflow, I added the following settings and it worked for me.
app.conf.update(
    CELERY_ACCEPT_CONTENT=['json'],
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
)
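The warning in the log singles out pickle for a reason. The sketch below uses only the standard library, and the Payload class is invented for illustration. It shows that unpickling can call a function chosen by whoever produced the message, whereas JSON only ever yields plain data. A harmless len call stands in for what could be any callable.

```python
import json
import pickle


class Payload:
    """Stand-in for a crafted message; pickle consults __reduce__ when dumping."""

    def __reduce__(self):
        # Instructs pickle to rebuild this object by calling len("abc").
        return (len, ("abc",))


blob = pickle.dumps(Payload())

# Loading the bytes runs the call embedded by the sender.
restored = pickle.loads(blob)
print(restored)

# JSON round-trips data only; nothing is executed while decoding.
task_args = json.loads(json.dumps({"to": "user@example.com"}))
print(task_args)
```

The value that comes back from pickle.loads is the result of the embedded call, not a Payload object. A worker running as root that accepts pickle would run such calls with root privileges, which is why restricting the accepted content to JSON makes the warning go away.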