Celery: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

Celery: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) - django

I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks and I daemonized it using Supervisor.
The problem now, is that one of the period task that I defined is failing (after a week in which it worked properly), the error I've got is:
[01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
All the processes managed by supervisor are up and running properly (supervisorctl status says RUNNNING).
I tried to read several logs on my ec2 instance but no one seems to help me in finding out what is the cause of the SIGKILL. What should I do? How can I investigate?
These are my celery settings:
CELERY_TIMEZONE = 'UTC'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
BROKER_URL = os.environ['RABBITMQ_URL']
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = False
CELERYD_HIJACK_ROOT_LOGGER = False
And this is my supervisord.conf:
[program:celery_worker]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
startsecs=10
stopwaitsecs=60
stopasgroup=true
killasgroup=true
autostart=true
autorestart=true
stdout_logfile=/opt/python/log/celery_worker.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_worker.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1
[program:celery_beat]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
startsecs=10
stopwaitsecs=300
stopasgroup=true
killasgroup=true
autostart=false
autorestart=true
stdout_logfile=/opt/python/log/celery_beat.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_beat.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1
Edit 1
After restarting celery beat the problem remains.
Edit 2
Changed killasgroup=true to killasgroup=false and the problem remains.

The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and the killasgroup would only affect a supervisor initiated kill (e.g. the ctl or a plugin) - and without that setting it would have sent the signal to the dispatcher anyway, not the child.
Most likely you have a memory leak and the OS's oomkiller is assassinating your process for bad behavior.
grep oom /var/log/messages. If you see messages, that's your problem.
If you don't find anything, try running the periodic process manually in a shell:
MyPeriodicTask().run()
And see what happens. I'd monitor system and process metrics from top in another terminal, if you don't have good instrumentation like cactus, ganglia, etc for this host.

One sees this kind of error when an asynchronous task (through celery) or the script you are using is storing a lot of data in memory because it leaks.
In my case, I was getting data from another system and saving it on a variable, so I could export all data (into Django model / Excel file) after finishing the process.
Here is the catch. My script was gathering 10 Million data; it was leaking memory while I was gathering data. This resulted in the raised Exception.
To overcome the issue, I divided 10 million pieces of data into 20 parts (half a million on each part). I stored the data in my own preferred local file / Django model every time the length of data reached 500,000 items. I repeated this for every batch of 500k items.
No need to do the exact number of partitions. It is the idea of solving a complex problem by splitting it into multiple subproblems and solving the subproblems one by one. :D

Related

Symfony Messenger not shutting down gracefully when using APP_ENV=prod

We are using Symfony Messenger in combination with supervisor running in a Docker container on AWS ECS. We noticed the worker is not shut down gracefully. After debugging it appears it does work as expected when using APP_ENV=dev, but not when APP_ENV=prod.
I made a simple sleepMessage, which sleeps for 1 second and then prints a message for 60 seconds. This is when running with APP_ENV=dev
As you can see it's clearly waiting for the program to stop running.
Now with APP_ENV=prod:
It stops immediately without waiting.
In the Dockerfile we have configured the following to start supervisor. It's based on php:8.1-apache, so that's why STOPSIGNAL has been configured
RUN apt-get update && apt-get install -y --no-install-recommends \
# for supervisor
python \
supervisor
The start-worker.sh script contains this
#!/usr/bin/env bash
cp config/worker/messenger-worker.conf ../../../etc/supervisor/supervisord.conf
exec /usr/bin/supervisord
We do this because certain env variables are only available when starting up.
For debugging purposes the config has been hardcoded to test.
Below is the messenger-worker.conf
[unix_http_server]
file=/tmp/supervisor.sock
[supervisord]
nodaemon=true ; start in foreground if true; default false
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[program:messenger-consume]
stderr_logfile_maxbytes=0
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
command=bin/console messenger:consume async -vv --env=prod --time-limit=129600
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
numprocs=1
environment=
MESSENGER_TRANSPORT_DSN="https://sqs.eu-central-1.amazonaws.com/{id}/dev-
symfony-messenger-queue"
So in short, when using --env=prod in the config above it doesn't wait for the worker to stop, while with --env=dev it does. Does anybody know how to solve this?

I don't know why there would be a difference between dev & prod environment but it seems you have no grace period set (at least for Supervisor). As I added in the docs:
the workers will be able to handle the SIGTERM signal if you have the PCNTL PHP extension
you need to add stopwaitsecs to your Supervisor program configuration
As you use Docker too, you can also set the graceful period at the service level which defaults to 10s:
services:
my_app:
stop_grace_period: 20s
# ...
With this configuration, running docker-compose down (just an example):
Docker sends a SIGTERM signal to the service entrypoint (Supervisor) and waits 20s for it to exit
Supervisor sends a SIGTERM signal to its programs (messenger:consume commands) and waits 20s for them to exit
the messenger:consume processes will "catch" the signal, finish handling the current message and stop
every program stopped, Supervisor can stop, then the Docker Compose stack

Turns out it was related to the wait_time option related to SQS transports. It probably caused a request that was started just before the container exited and was sent back when the container did not exist anymore. So, wait_time to 0 fixed that problem.
Then there was this which could lead to the same issue

Running celery task on Ubuntu with supervisor

I have a test Django site using a mod_wsgi daemon process, and have set up a simple Celery task to email a contact form, and have installed supervisor.
I know that the Django code is correct.
The problem I'm having is that when I submit the form, I am only getting one message - the first one. Subsequent completions of the contact form do not send any message at all.
On my server, I have another test site with a configured supervisor task running which uses the Django server (ie it's not using mod_wsgi). Both of my tasks are running fine if I do
sudo supervisorctl status
Here is my conf file for the task I've described above which is saved at
/etc/supervisor/conf.d
the user in this instance is called myuser
[program:test_project]
command=/home/myuser/virtualenvs/test_project_env/bin/celery -A test_project worker --loglevel=INFO --concurrency=10 -n worker2#%%h
directory=/home/myuser/djangoprojects/test_project
user=myuser
numprocs=1
stdout_logfile=/var/log/celery/test_project.out.log
stderr_logfile=/var/log/celery/test_project.err.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
stopasgroup=true
; Set Celery priority higher than default (999)
; so, if rabbitmq is supervised, it will start first.
priority=1000
My other test site has this set as the command - note worker1#%%h
command=/home/myuser/virtualenvs/another_test_project_env/bin/celery -A another_test_project worker --loglevel=INFO --concurrency=10 -n worker1#%%h
I'm obviously doing something wrong in that my form is only submitted. If I look at the out.log file referred to above, I only see the first task, nothing is visible for the other form submissions.
Many thanks in advance.
UPDATE
I submitted the first form at 8.32 am (GMT) which was received, and then as described above, another one shortly thereafter for which a task was not created. Just after finishing the question, I submitted the form again at 9.15, and for this a task was created and the message received! I then submitted the form again, but no task was created again. Hope this helps!

use ps auxf|grep celery to see how many worker you started,if there is any other worker you start before and you don't kill it ,the worker you create before will consume the task,result in you every two or three(or more) times there is only one task is received.
and you need to stop celery by:
sudo supervisorctl -c /etc/supervisord/supervisord.conf stop all
everytime, and set this in supervisord.conf:
stopasgroup=true ; send stop signal to the UNIX process group (default false)
Otherwise it will causes memory leaks and regular task loss.
If you has multi django site,here is a demo support by RabbitMQ:
you need add rabbitmq vhost and set user to vhost:
sudo rabbitmqctl add_vhost {vhost_name}
sudo rabbitmqctl set_permissions -p {vhost_name} {username} ".*" ".*" ".*"
and different site use different vhost(but can use same user).
add this to your django settings.py:
BROKER_URL = 'amqp://username:password#localhost:5672/vhost_name'
some info here:
Using celeryd as a daemon with multiple django apps?
Running multiple Django Celery websites on same server
Run Multiple Django Apps With Celery On One Server With Rabbitmq VHosts
Run Multiple Django Apps With Celery On One Server With Rabbitmq VHosts

2 virtual env + celery + supervisor unix:///var/run/supervisor.sock no such file

I'm using redis as the backend and I have 2 virtual env each with it's own celery workers.
I'm having a weird issue when I'm adding the celery supervisord conf of the second virtual env.
This is the error I'm getting after reloading supervisord:
unix:///var/run/supervisor.sock no such file
This is the supervisord conf file:
[program:shopify-celery]
command=dir/bin/celery worker --app=app -l warning -Q queue -n worker -P eventlet -c 3
directory=/dir
user=user
group=webapps
numprocs=1
stdout_logfile=/dir/logs/celery-worker.log
stderr_logfile=/dir/logs/celery-worker.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600pip freez
; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
;killasgroup=true
; if rabbitmq is supervised, set its priority higher
; so it starts first
;priority=998
I couldn't find what might caused this.
Do you know what went wrong?

After hours of struggling with this it turned out that a missing log file did all of this.
I created the needed log file and things work perfectly now.
May be this will help someone else.

Supervisord / Celery output logs

I am using Supervisor to manage celery. The supervisor config file for celery contains the following 2 entries:
stdout_logfile = /var/log/supervisor/celery.log
stderr_logfile = /var/log/supervisor/celery_err.log
What is confusing me is that although Celery is working properly and all tasks are being successfully completed, they are all written to celery_err.log. I thought that would only be for errors. The celery.log file only shows the normal celery startup info. Is the behaviour of writing successful task completion to the error log correct?
Note - tasks are definitely completing successfully (emails being sent, db entries made etc).

I met the same phenomenon just like you.This is because of celery's loging mechanism. Just see the setup_task_loggers method of celery logger source.
If logfile is not specified, then sys.stderr is used.
So, clear enough? Celery uses sys.stderr when logfile is not specified.
Solution:
You can use supervisord redirect_stderr = true flag to combine two log files into one. I'm using this one.
Config the celery logfile option.

Is the behaviour of writing successful task completion to the error log correct?
No, its not. I am having same setup and logging is working fine.
celery.log has task info
[2015-07-23 11:40:07,066: INFO/MainProcess] Received task: foo[b5a6e0e8-1027-4005-b2f6-1ea032c73d34]
[2015-07-23 11:40:07,494: INFO/MainProcess] Task foo[b5a6e0e8-1027-4005-b2f6-1ea032c73d34] succeeded in 0.424549156s: 1
celery_err.log has some warnings/errors. Try restarting supervisor process.

One issue you could be having is that python buffers output by default
In your supervisor file, you can disable this by setting the PYTHONUNBUFFERED environment variable, see below my Django example supervisor file
[program:celery-myapp]
environment=DJANGO_SETTINGS_MODULE="myapp.settings.production",PYTHONUNBUFFERED=1
command=/home/me/.virtualenvs/myapp/bin/celery -A myapp worker -B -l DEBUG
directory=/home/me/www/saleor
user=me
stdout_logfile=/home/me/www/myapp/log/supervisor-celery.log
stderr_logfile=/home/me/www/myapp/log/supervisor-celery-err.log
autostart=true
autorestart=true
startsecs=10

running celery as daemon using supervisor is not working

I have a django app in which it has a celery functionality, so i can able to run the celery sucessfully like below
celery -A tasks worker --loglevel=info
but as a known fact that we need to run it as a daemon and so i have written the below celery.conf file inside /etc/supervisor/conf.d/ folder
; ==================================
; celery worker supervisor example
; ==================================
[program:celery]
; Set full path to celery program if using virtualenv
command=/root/Envs/proj/bin/celery -A app.tasks worker --loglevel=info
user=root
environment=C_FORCE_ROOT="yes"
environment=HOME="/root",USER="root"
directory=/root/apps/proj/structure
numprocs=1
stdout_logfile=/var/log/celery/worker.log
stderr_logfile=/var/log/celery/worker.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true
; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=998
but when i tried to update the supervisor like supervisorctl reread and supervisorctl update i was getting the message from supervisorctl status
celery FATAL Exited too quickly (process log may have details)
So i went to worker.log file and seen the error message as below
Running a worker with superuser privileges when the
worker accepts messages serialized with pickle is a very bad idea!
If you really want to continue then you have to set the C_FORCE_ROOT
environment variable (but please think about this before you do).
User information: uid=0 euid=0 gid=0 egid=0
So why it was complaining about C_FORCE_ROOT even though we had set it as environment variable inside supervisor conf file ? what am i doing wrong in the above conf file ?

I had the same problem,so I added
environment=C_FORCE_ROOT="yes"
in my program config,but It didn't work
so I used
environment=C_FORCE_ROOT="true"
it's working

You'll need to run celery with a non superuser account, Please remove following lines from your config:
user=root
environment=C_FORCE_ROOT="yes"
environment=HOME="/root",USER="root"
And the add these lines to your config, I assume that you use django as a non superuser and developers as the user group:
user=django
group=developers
Note that subprocesses will inherit the environment variables of the
shell used to start supervisord except for the ones overridden here
and within the program’s environment option. See supervisord documents.
So Please note that when you change environment variables via supervisor config files, Changes won't apply by running supervisorctl reread and supervisorctl reload . You should run supervisor from the very start by following command:
supervisord -c /path/to/config/file.conf

From this other thread on stackoverflow. I managed to add the following settings and it worked for me.
app.conf.update(
CELERY_ACCEPT_CONTENT = ['json'],
CELERY_TASK_SERIALIZER = 'json',
CELERY_RESULT_SERIALIZER = 'json',
)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js