Celery and Redis keep running out of memory

Celery and Redis keep running out of memory - django

I have a Django app deployed to Heroku, with a worker process running celery (+ celerycam for monitoring). I am using RedisToGo's Redis database as a broker. I noticed that Redis keeps running out of memory.
This is what my procfile looks like:
web: python app/manage.py run_gunicorn -b "0.0.0.0:$PORT" -w 3
worker: python lipo/manage.py celerycam & python app/manage.py celeryd -E -B --loglevel=INFO
Here's the output of KEYS '*':
"_kombu.binding.celeryd.pidbox"
"celeryev.643a99be-74e8-44e1-8c67-fdd9891a5326"
"celeryev.f7a1d511-448b-42ad-9e51-52baee60e977"
"_kombu.binding.celeryev"
"celeryev.d4bd2c8d-57ea-4058-8597-e48f874698ca"
`_kombu.binding.celery"
celeryev.643a99be-74e8-44e1-8c67-fdd9891a5326 is getting filled up with these messages:
{"sw_sys": "Linux", "clock": 1, "timestamp": 1325914922.206671, "hostname": "064d9ffe-94a3-4a4e-b0c2-be9a85880c74", "type": "worker-online", "sw_ident": "celeryd", "sw_ver": "2.4.5"}
Any idea what I can do to purge these messages periodically?

Is that a solution?
in addition to _kombu.bindings.celeryev set there will be e.g. celeryev.i-am-alive. keys with TTL set (e.g. 30sec);
celeryev process adds itself to bindings and periodically (e.g. every 5 sec) updates the celeryev.i-am-alive. key to reset the TTL;
before sending the event worker process checks not only smembers on _kombu.bindings.celeryev but the individual celeryev.i-am-alive. keys as well and if key is not found (expired) then it gets removed from _kombu.bindings.celeryev (and maybe the del celeryev. or expire celeryev. commands are executed).
we can't just use keys command because it is O(N) where N is the total number of keys in DB. TTLs can be tricky on redis < 2.1 though.
expire celeryev. instead of del celeryev. can be used in order to allow temporary offline celeryev consumer to revive, but I don't know if it worths it.
author

Related

Docker/Django - How to make sure that all migrations are completed bofor application start?

at my dockerized Django application I have the following bash function at my docker-entrypoint.sh. This basically only checks if the database is available:
function check_mariadb {
while ! mysqladmin --user=$MYSQL_USER --password=$MYSQL_PASSWORD --host $MYSQL_HOST ping --silent &> /dev/null; do
echo "Waiting for MariaDB service to become available"
sleep 3
done
echo "MariaDB is up and available"
}
As my application can start in 3 modes (as application, Celery_worker or Celery_beat) I somehow have to make sure that all migration are done before celery starts. Otherwise I'm running into issues that celery is missing one of these tables:
django_celery_results_chordcounter
django_celery_results_groupresult
django_celery_results_taskresult
Can somebody give me a hint what might be the best practices to check for open migration in this context? And only let celery start if all migrations are done?!... Would be awesome if this could also be handled in a simple bash function like the one above.
Would be awesome If I could do more than just:
python manage.py showmigrations | grep '\[ \]'
Thanks in advance.

In your docker-compose.yaml, you can add a healthcheck to the Django container:
healthcheck:
test: ["CMD", "curl --fail http://localhost:8000/ || exit 1"]
interval: 10s
timeout: 5s
retries: 5
Then you can add depends_on to your celery/celerybeat container:
depends_on:
django:
condition: service_healthy
This will start the celery container only after the django healthcheck passes. In the healthcheck we simply poll localhost:8000, because when the server's returning responses, we can be sure the migrations have been applied.

There is a debian package called wait-for-it that this related thread discusses :
How to use wait-for-it in docker-compose file?
For example, I have set up a short celery.sh script file that I set as entrypoint for my beat and worker celery service in my compose file:
#!/bin/bash
set -o errexit
set -o pipefail
set -o nounset
wait-for-it web:8000
exec "$#"
Where "web" is the host name of my django service and 8000 the host port for the web service.

Wsgi number of process and threads setting in AWS Beanstalk

I have an AWS beanstalk env and have old setting of wsgi (given below), I do not have idea how does this work internally, can anybody guide me?
NumProcesses:7 -- number of process
NumThreads:5 -- number of thread in each process
How memory and cpu are being used with this configuration because there is no memory and cpu settings in AWS beanstalk level.

These parameters are part of configuration option for Python environment:
aws:elasticbeanstalk:application:environment.
They mean (from docs):
NumProcesses: The number of daemon processes that should be started for the process group when running WSGI applications (default value 1).
NumThreads: The number of threads to be created to handle requests in each daemon process within the process group when running WSGI applications (default value 15).
Internally, these values map to uwsgi or gunicorn configuration options in your EB environment. For example:
uwsgi --http :8000 --wsgi-file application.py --master --processes 4 --threads 2
Their impact on memory and cpu usage of your instance(s) is based on your application and how resource intensive it is. If you are not sure how to set them up, maybe keeping them at default values would be a good start.
The settings are also available in the EB console, under Software category:

To add on to #Marcin
Amazon linux 2 uses gunicorn
workers are processes in gunicorn
Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally, we (gunicorn creators) recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
To see how the settings in the option settings map to gunicorn you can ssh into your eb instance, go
$ eb ssh
$ cd cd /var/app/current/
$ cat Procfile
web: gunicorn --bind 127.0.0.1:8000 --workers=3 --threads=20 api.wsgi:application
--threads
A positive integer generally in the 2-4 x $(NUM_CORES) range. You’ll want to vary this a bit to find the best for your particular application’s work load.
The threads option only applies to gthread worker type. gunicons default worker class is sync, If you try to use the sync worker type and set the threads setting to more than 1, the gthread worker type will be used instead automatically
based on all the above I would personally choose
workers = (2 x $NUM_CORES ) + 1
threads = 4 x $NUM_CORES
for a t3.medum instance that has 2 cores that translates to
workers = 5
threads = 8
obviously, you need to tweak this for your use case, and treat these as defaults that could very well not be right for your particular application use case, read the refs below to see how to choose the right setup for you use case
References:
REF: Gunicorn Workers and Threads
REF: https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
REF: https://docs.gunicorn.org/en/stable/settings.html#worker-class

Scheduled Celery Task Lost From Redis

I'm using Celery in Django with Redis as the Broker.
Tasks are being scheduled for the future using the eta argument in apply_async.
After scheduling the task, I can run celery -A MyApp inspect scheduled and I see the task with the proper eta for the future (24 hours in the future).
Before the scheduled time, if I restart Redis (with service redis restart) or the server reboots, running celery -A MyApp inspect scheduled again shows "- empty -".
All scheduled tasks are lost after Redis restarts.
Redis is setup with AOF, so it shouldn't be losing DB state after restarting.
EDIT
After some more research, I found out that running redis-cli -n 0 hgetall unacked both before and after the redis restart shows the tasked in the queue. So redis still has knowledge of the task, but for some reason when redis restarts, the task is removed from the worker? And then never sent again and it just stays indefinitely in the unakced queue.

Running celery task on Ubuntu with supervisor

I have a test Django site using a mod_wsgi daemon process, and have set up a simple Celery task to email a contact form, and have installed supervisor.
I know that the Django code is correct.
The problem I'm having is that when I submit the form, I am only getting one message - the first one. Subsequent completions of the contact form do not send any message at all.
On my server, I have another test site with a configured supervisor task running which uses the Django server (ie it's not using mod_wsgi). Both of my tasks are running fine if I do
sudo supervisorctl status
Here is my conf file for the task I've described above which is saved at
/etc/supervisor/conf.d
the user in this instance is called myuser
[program:test_project]
command=/home/myuser/virtualenvs/test_project_env/bin/celery -A test_project worker --loglevel=INFO --concurrency=10 -n worker2#%%h
directory=/home/myuser/djangoprojects/test_project
user=myuser
numprocs=1
stdout_logfile=/var/log/celery/test_project.out.log
stderr_logfile=/var/log/celery/test_project.err.log
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
stopasgroup=true
; Set Celery priority higher than default (999)
; so, if rabbitmq is supervised, it will start first.
priority=1000
My other test site has this set as the command - note worker1#%%h
command=/home/myuser/virtualenvs/another_test_project_env/bin/celery -A another_test_project worker --loglevel=INFO --concurrency=10 -n worker1#%%h
I'm obviously doing something wrong in that my form is only submitted. If I look at the out.log file referred to above, I only see the first task, nothing is visible for the other form submissions.
Many thanks in advance.
UPDATE
I submitted the first form at 8.32 am (GMT) which was received, and then as described above, another one shortly thereafter for which a task was not created. Just after finishing the question, I submitted the form again at 9.15, and for this a task was created and the message received! I then submitted the form again, but no task was created again. Hope this helps!

use ps auxf|grep celery to see how many worker you started,if there is any other worker you start before and you don't kill it ,the worker you create before will consume the task,result in you every two or three(or more) times there is only one task is received.
and you need to stop celery by:
sudo supervisorctl -c /etc/supervisord/supervisord.conf stop all
everytime, and set this in supervisord.conf:
stopasgroup=true ; send stop signal to the UNIX process group (default false)
Otherwise it will causes memory leaks and regular task loss.
If you has multi django site,here is a demo support by RabbitMQ:
you need add rabbitmq vhost and set user to vhost:
sudo rabbitmqctl add_vhost {vhost_name}
sudo rabbitmqctl set_permissions -p {vhost_name} {username} ".*" ".*" ".*"
and different site use different vhost(but can use same user).
add this to your django settings.py:
BROKER_URL = 'amqp://username:password#localhost:5672/vhost_name'
some info here:
Using celeryd as a daemon with multiple django apps?
Running multiple Django Celery websites on same server
Run Multiple Django Apps With Celery On One Server With Rabbitmq VHosts
Run Multiple Django Apps With Celery On One Server With Rabbitmq VHosts

Gunicorn sync workers spawning processes

We're using Django + Gunicorn + Nginx in our server. The problem is that after a while we see lot's of gunicorn worker processes that have became orphan, and a lot other ones that have became zombie. Also we can see that some of Gunicorn worker processes spawn some other Gunicorn workers. Our best guess is that these workers become orphans after their parent workers have died.
Why Gunicorn workers spawn child workers? Why do they die?! And how can we prevent this?
I should also mention that we've set Gunicorn log level to debug and still we don't see any thing significant, other than periodical log of workers number, which reports count of workers we wanted from it.
UPDATE
This is the line we used to run gunicorn:
gunicorn --env DJANGO_SETTINGS_MODULE=proj.settings proj.wsgi --name proj --workers 10 --user proj --group proj --bind 127.0.0.1:7003 --log-level=debug --pid gunicorn.pid --timeout 600 --access-logfile /home/proj/access.log --error-logfile /home/proj/error.log

In my case I deploy in Ubuntu servers (LTS releases, now almost are 14.04 LTS servers) and I never did have problems with gunicorn daemons, I create a gunicorn.conf.py and launch gunicorn with this config from upstart with an script like this in /etc/init/djangoapp.conf
description "djangoapp website"
start on startup
stop on shutdown
respawn
respawn limit 10 5
script
cd /home/web/djangoapp
exec /home/web/djangoapp/bin/gunicorn -c gunicorn.conf.py -u web -g web djangoapp.wsgi
end script
I configure gunicorn with a .py file config and i setup some options (details below) and deploy my app (with virtualenv) in /home/web/djangoapp and no problems with zombie and orphans gunicorn processes.
i verified your options, timeout can be a problem but another one is that you don't setup max-requests in your config, by default is 0, so, no automatic worker restart in your daemon and can generate memory leaks (http://gunicorn-docs.readthedocs.org/en/latest/settings.html#max-requests)

We will use a .sh file to start the gunicorn process. Later you will use a supervisord configuration file. what is supervisord? some external know how information link about how to install supervisord with Django,Nginx,Gunicorn Here
gunicorn_start.sh remember to give chmod +x to the file.
#!/bin/sh
NAME="myDjango"
DJANGODIR="/var/www/html/myDjango"
NUM_WORKERS=3
echo "Starting myDjango -- Django Application"
cd $DJANGODIR
exec gunicorn -w $NUM_WORKERS $NAME.wsgi:application --bind 127.0.0.1:8001
mydjango_django.conf : Remember to install supervisord on your OS. and
Copy this on the configuration folder.
[program:myDjango]
command=/var/www/html/myDjango/gunicorn_start.sh
user=root
autorestart=true
redirect_sderr=true
Later on use the command:
Reload the daemon’s configuration files, without add/remove (no restarts)
supervisordctl reread
Restart all processes Note: restart does not reread config files. For that, see reread and update.
supervisordctl start all
Get all process status info.
supervisordctl status

This sounds like a timeout issue.
You have multiple timeouts going on and they all need to be in a descending order. It seems they may not be.
For example:
Nginx has a default timeout of 60 seconds
Gunicorn has a default timeout of 30 seconds
Django has a default timeout of 300 seconds
Postgres default timeout is complicated but let's pose 60 seconds for this example.
In this example, when 30 seconds has passed and Django is still waiting for Postgres to respond. Gunicorn tells Django to stop, which in turn should tell Postgres to stop. Gunicorn will wait a certain amount of time for this to happen before it kills django, leaving the postgres process as an orphan query. The user will re-initiate their query and this time the query will take longer because the old one is still running.
I see that you have set your Gunicorn tiemeout to 300 seconds.
This would probably mean that Nginx tells Gunicorn to stop after 60 seconds, Gunicorn may wait for Django who waits for Postgres or any other underlying processes, and when Nginx gets tired of waiting, it kills Gunicorn, leaving Django hanging.
This is still just a theory, but it is a very common problem and hopefully leads you and any others experiencing similar problems, to the right place.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js