H13 (Connection closed without response) errors on Heroku scale down

H13 (Connection closed without response) errors on Heroku scale down - django

I am running Django app in Docker image with uWSGI, supervisor and nginx on Heroku.
I am often getting H13 (Connection closed without response) errors when the app is scaling down:
This problem generates following log events:
2022-10-12T09:35:13.231318+00:00 heroku web.3 - - State changed from up to down
2022-10-12T09:35:13.774228+00:00 heroku web.3 - - Stopping all processes with SIGTERM
2022-10-12T09:35:14.028602+00:00 heroku router - - at=error code=H13 desc="Connection closed without response" method=GET path="/comments/api/assets-uuidasset/xxxx-xxxx-xxxx-xxxx-xxxxx/count/?_=1665564563"
I expect the problem lies in either the socket not closing on SIGTERM signal or nginx closing ungracefully with SIGTERM signal (it should receive SIGQUIT for graceful shutdown) or something similar.
The first case is described in this article regarding Puma and Ruby: https://www.schneems.com/2019/07/12/puma-4-hammering-out-h13sa-debugging-story/
The second case is described here: https://canonical.com/blog/avoiding-dropped-connections-in-nginx-containers-with-stopsignal-sigquit

After three weeks of work I had been finally able to fix this issue.
Short answer:
Avoid using Heroku to run Docker images if you can.
Heroku sends SIGTERM to ALL processes in the dyno, which is something that is very hard to deal with. You will need patch almost every process inside the Docker container to count with SIGTERM and terminate nicely.
Standard way of terminating Docker container is with docker stop command which sends SIGTERM ONLY to root process (entrypoint), where it can be dealt with.
Heroku has very arbitrary process of terminating the instance incompatible with existing applications as well as existing Docker image deployments. And according to my communication with Heroku they are unable to change this in the future.
Long answer:
There was not one single issue but 5 separate issues.
In order to terminate the instance successfully following conditions needs to be fulfilled:
Nginx has to be terminate first and start last (so the Heroku router stops sending requests, this is similar to Puma) and it has to be graceful, which is usually done with SIGQUIT signal.
Other applications needs to terminate gracefully in correct order - in my case first Nginx, than Gunicorn and PGBouncer as the last. The order of terminating the applications is important - e.g. PGBouncer must terminate after Gunicorn to not interrupt running SQL queries.
The docker-entrypoint.sh needs to catch the SIGTERM signal. This didn't show up when I was testing locally.
In order to achieve this I had to deal with every application separately:
Nginx:
I had to patch Nginx to swich SIGTERM and SIGQUIT signals, so I run following command in my Dockerfile:
# Compile nginx and patch it to switch SIGTERM and SIGQUIT signals
RUN curl -L http://nginx.org/download/nginx-1.22.0.tar.gz -o nginx.tar.gz \
&& tar -xvzf nginx.tar.gz \
&& cd nginx-1.22.0 \
&& sed -i "s/ QUIT$/TIUQ/g" src/core/ngx_config.h \
&& sed -i "s/ TERM$/QUIT/g" src/core/ngx_config.h \
&& sed -i "s/ TIUQ$/TERM/g" src/core/ngx_config.h \
&& ./configure --without-http_rewrite_module \
&& make \
&& make install \
&& cd .. \
&& rm nginx-1.22.0 -rf \
&& rm nginx.tar.gz
Issue I created
uWSGI/Gunicorn:
I gave up on uWSGI and swiched to Gunicorn (which terminates gracefully on SIGTERM), but I had to patch it anyways in the end, because it needs to terminate later than Nginx. I disabled SIGTERM signal and mapped it's function on SIGUSR1
My patched version is here: https://github.com/PetrDlouhy/gunicorn/commit/1414112358f445ce714c5d4f572d78172b993b79
I install it with:
RUN poetry run pip install -e git+https://github.com/PetrDlouhy/gunicorn#no_sigterm#egg=gunicorn[gthread] \
&& cd `poetry env info -p`/src/gunicorn/ \
&& git config core.repositoryformatversion 0 # Needed for Dockerfile.test only untill next version of Dulwich is released \
&& cd /project
Issue I created
PGBouncer:
I also deployed PGBouncer which I had to modify to not react on SIGTERM with:
# Compile pgbouncer and patch it to switch SIGTERM and SIGQUIT signals
RUN curl -L https://github.com/pgbouncer/pgbouncer/releases/download/pgbouncer_1_17_0/pgbouncer-1.17.0.tar.gz -o pgbouncer.tar.gz \
&& tar -xvzf pgbouncer.tar.gz \
&& cd pgbouncer-1.17.0 \
&& sed -i "s/got SIGTERM, fast exit/PGBouncer got SIGTERM, do nothing/" src/main.c \
&& sed -i "s/ exit(1);$//g" src/main.c \
&& ./configure \
&& make \
&& make install \
&& cd .. \
&& rm pgbouncer-1.17.0 -rf \
&& rm pgbouncer.tar.gz
It still can be brought down gracefully with SIGINT.
Issue I created
docker-entrypoint.sh
I had to trap SIGTERM in my docker-entrypoint.sh with:
_term() {
echo "Caught SIGTERM signal. Do nothing here, because Heroku already sent signal everywhere."
}
trap _term SIGTERM
supervisor
In order to not receive R12 errors all processes needs to terminate before 30 second Heroku graceful period. I achieved it by setting priorities in supervisord.conf:
[supervisord]
nodaemon=true
[program:gunicorn]
command=poetry run newrelic-admin run-program gunicorn wsgi:application -c /etc/gunicorn/gunicorn.conf.py
priority=2
stopsignal=USR1
...
[program:nginx]
command=/usr/local/nginx/sbin/nginx -c /etc/nginx/nginx.conf
priority=3
...
[program:pgbouncer]
command=/usr/local/bin/pgbouncer /project/pgbouncer/pgbouncer.ini
priority=1
stopsignal=INT
...
Testing the solutions:
In order to test what was going on, I had to develop some testing techniques which might come handy in different but similar cases.
I created a view which waits 10 seconds before answer and bind it on /slow_view url.
Then I started the server in Docker instance, made query to the slow view with curl -I "http://localhost:8080/slow_view" and made second connection to the Docker instance and executed kill command with pkill -SIGTERM . or e.g. pkill -SIGTERM gunicorn.
I could also run the kill command on testing Heroku dyno where I connected with heroku ps:exec --dyno web.1 --app my_app.

Related

Symfony Messenger not shutting down gracefully when using APP_ENV=prod

We are using Symfony Messenger in combination with supervisor running in a Docker container on AWS ECS. We noticed the worker is not shut down gracefully. After debugging it appears it does work as expected when using APP_ENV=dev, but not when APP_ENV=prod.
I made a simple sleepMessage, which sleeps for 1 second and then prints a message for 60 seconds. This is when running with APP_ENV=dev
As you can see it's clearly waiting for the program to stop running.
Now with APP_ENV=prod:
It stops immediately without waiting.
In the Dockerfile we have configured the following to start supervisor. It's based on php:8.1-apache, so that's why STOPSIGNAL has been configured
RUN apt-get update && apt-get install -y --no-install-recommends \
# for supervisor
python \
supervisor
The start-worker.sh script contains this
#!/usr/bin/env bash
cp config/worker/messenger-worker.conf ../../../etc/supervisor/supervisord.conf
exec /usr/bin/supervisord
We do this because certain env variables are only available when starting up.
For debugging purposes the config has been hardcoded to test.
Below is the messenger-worker.conf
[unix_http_server]
file=/tmp/supervisor.sock
[supervisord]
nodaemon=true ; start in foreground if true; default false
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[program:messenger-consume]
stderr_logfile_maxbytes=0
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
command=bin/console messenger:consume async -vv --env=prod --time-limit=129600
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
numprocs=1
environment=
MESSENGER_TRANSPORT_DSN="https://sqs.eu-central-1.amazonaws.com/{id}/dev-
symfony-messenger-queue"
So in short, when using --env=prod in the config above it doesn't wait for the worker to stop, while with --env=dev it does. Does anybody know how to solve this?

I don't know why there would be a difference between dev & prod environment but it seems you have no grace period set (at least for Supervisor). As I added in the docs:
the workers will be able to handle the SIGTERM signal if you have the PCNTL PHP extension
you need to add stopwaitsecs to your Supervisor program configuration
As you use Docker too, you can also set the graceful period at the service level which defaults to 10s:
services:
my_app:
stop_grace_period: 20s
# ...
With this configuration, running docker-compose down (just an example):
Docker sends a SIGTERM signal to the service entrypoint (Supervisor) and waits 20s for it to exit
Supervisor sends a SIGTERM signal to its programs (messenger:consume commands) and waits 20s for them to exit
the messenger:consume processes will "catch" the signal, finish handling the current message and stop
every program stopped, Supervisor can stop, then the Docker Compose stack

Turns out it was related to the wait_time option related to SQS transports. It probably caused a request that was started just before the container exited and was sent back when the container did not exist anymore. So, wait_time to 0 fixed that problem.
Then there was this which could lead to the same issue

Thousands of extraneous gunicorn workers

I'm using gunicorn 19.7.1 appserver with nginx reverse proxy for a Django project (Ubuntu 14.04 machine).
ps aux | grep gunicorn | grep -v grep | wc -l yields 3043 at the moment.
Whereas in /etc/init/gunicorn.conf, I've always had -w 33. Yet these extra workers persist even if I do sudo service gunicorn stop and sudo service gunicorn start.
How do I kill the extraneous workers?
How did this happen?
The worker count of 33 has always been properly configured on my busy production system.
However a few hours ago, I was trying python's multiprocessing on the server and things went south. Gunicorn workers ate up all the memory and took out the resident redis instances as well.
I reverted the change and have managed to get everything back online, except the memory hasn't been released and I've had to cope with these legacy gunicorn workers. What's going on?

Yet these extra workers persist even if I do sudo service gunicorn stop and sudo service gunicorn start.
service only manages service-initiated processes, so if you started Gunicorn workers outside of the service framework, these workers will continue to live even if you stop.
How do I kill the extraneous workers?
The fast way:
Run this command to list all gunicorn process IDs and terminate them, and then restart Gunicorn:
$ pkill gunicorn
$ sudo service gunicorn start
The better way:
Identify your "desired" Gunicorn workers by finding the parent:
$ sudo service gunicorn status
Note the parent process ID. Let's say it's 123.
Save a list of all the "desired" workers' PIDs:
$ echo 123 > desired_workers
$ pgrep -P 123 >> desired_workers
Save a list of all workers' PIDs:
$ pgrep gunicorn > all_workers
Terminate the "undesired" workers:
$ cat desired_workers all_workers | sort | uniq -u | xargs kill

docker-compose and graceful Celery shutdown

I've been wondering about and searching for solutions for this and I didn't find any.
I'm running Celery in a container built with docker-compose. My container is configured like this:
celery:
build: .
container_name: cl01
env_file: ./config/variables.env
entrypoint:
- /celery-entrypoint.sh
volumes:
- ./django:/django
depends_on:
- web
- db
- redis
stop_grace_period: 1m
And my entrypoint script looks like this:
#!/bin/sh
# Wait for django
sleep 10
su -m dockeruser -c "celery -A myapp worker -l INFO"
Now, if I run docker-compose stop, I would like to have a warm (graceful) shutdown, giving Celery the provided 1 minute (stop_grace_period) to finish already started tasks. However docker-compose stop seems to kill Celery straight away. Celery should also log that it is asked to shut down gracefully, but I don't see anything but an abrupt stop to my task logs.
What am I doing wrong or what do I need to change to make Celery shut down gracefully?
edit:
Suggested answer below about providing the --timeout parameter to docker-compose stop does not solve my issue.

You need to mark celery process with exec, this way celery process will have the same ID as docker command and docker will be able to send a SIGTERM signal to it and gracefully close celery process.
# should be the last command in script
exec celery -A myapp worker -l INFO

Via docs
Usage: stop [options] [SERVICE...]
Options:
-t, --timeout TIMEOUT Specify a shutdown timeout in seconds (default: 10).
Try with timeout set to 60 seconds at least.

My experience implementing graceful shutdown for celery workers spawned by supervisord inside a docker container.
Supervisord part
supervisord.conf
...
[supervisord]
...
nodaemon=true # run supervisord in the foreground
[include]
files=celery.conf # path to the celery config file
Set nodaemon=true so that we can start it as a background process from the entrypoint script later.
celery.conf
[group:celery_workers]
programs=one, two
[program:one]
...
command=celery -A backend --config=celery.py worker -n worker_one --pidfile=/var/log/celery/worker_one.pid --pool=gevent --concurrency=10 --loglevel=INFO
killasgroup=true
stopasgroup=true
stopsignal=TERM
stopwaitsecs=600
[program:two]
...
# similar to the previous one
The configuration file above is responsible for starting a group of workers each running in a separate process within a group. I'd like to stop on a stopwaitsecs section value. Let's see what the documentation tells us about it:
This parameter sets the number of seconds to wait for the OS to return
a SIGCHLD to supervisord after the program has been sent a
stopsignal. If this number of seconds elapses before supervisord
receives a SIGCHLD from the process, supervisord will attempt to kill
it with a final SIGKILL.
If stopwaitsecs>stop_grace_period specified for your service in a docker-compose file then you'll be getting SIGKILL from your docker. Make sure
stopwaitsecs<stop_grace_period, otherwise all running tasks get interrupted by docker.
Entrypoint script part
entrypoint.sh
#!/bin/bash
# safety switch, exit script if there's error.
set -e
on_close(){
echo "Signal caught..."
echo "Supervisor is stopping processes gracefully..."
# cleanup all pid files
rm worker_one.pid
rm worker_two.pid
supervisorctl stop celery_workers:
echo "All processes have been stopped. Exiting..."
exit 1
}
start_supervisord(){
supervisord -c /etc/supervisor/supervisord.conf
}
# start trapping signals (docker sends `SIGTERM` for shutdown)
trap on_close SIGINT SIGTERM SIGKILL
start_supervisord & # start supervisord in a background
SUPERVISORD_PID=$! # PID of the last background process started
wait $SUPERVISORD_PID
EXIT_STATUS=$? # the exit status of the last command executed
The script above consists of:
registering a cleanup function on_close
starting supervisord's process group in a background
registering the last background process's PID and waiting for it to finish
Docker part
docker-compose.yml
...
services:
celery:
...
stop_grace_period: 15m30s
entrypoint: [/entrypoints/entrypoint.sh]
The only setting worth mentioning here is entrypoint form declaration. In our case better to use exec form. It starts an executable script in a process with PID 1 and doesn't create any subprocesses as shell form does. SIGTERM from docker stop <container> gets propagated to an executable which traps it and performs all cleaning and closing logic.

Try using this:
docker-compose down

How do I restart airflow webserver?

I am using airflow for my data pipeline project. I have configured my project in airflow and start the airflow server as a backend process using following command
airflow webserver -p 8080 -D True
Server running successfully in backend. Now I want to enable authentication in airflow and done configuration changes in airflow.cfg, but authentication functionality is not reflected in server. when I stop and start airflow server in my local machine it works.
So How can I restart my daemon airflow webserver process in my server??

I advice running airflow in a robust way, with auto-recovery with systemd
so you can do:
- to start systemctl start airflow
- to stop systemctl stop airflow
- to restart systemctl restart airflow
For this you'll need a systemd 'unit' file.
As a (working) example you can use the following:
put it in /lib/systemd/system/airflow.service
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
PIDFile=/run/airflow/webserver.pid
EnvironmentFile=/home/airflow/airflow.env
User=airflow
Group=airflow
Type=simple
ExecStart=/bin/bash -c 'export AIRFLOW_HOME=/home/airflow ; airflow webserver --pid /run/airflow/webserver.pid'
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=42s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
P.S: change AIRFLOW_HOME to where your airflow folder with the config

Can you check $AIRFLOW_HOME/airflow-webserver.pid for the process id of your webserver daemon?
Then pass it a kill signal to kill it
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
Then clear the pid file
cat /dev/null > $AIRFLOW_HOME/airflow-webserver.pid
Then just run
airflow webserver -p 8080 -D True
to restart the daemon.

This worked for me (multiple times! :D )
find the process id: (assuming 8080 is the port)
lsof -i tcp:8080
kill it
kill <pid>

Use Airflow webserver's (gunicorn) signal handling
Airflow uses gunicorn as it's HTTP server, so you can send it standard POSIX-style signals. A signal commonly used by daemons to restart is HUP.
You'll need to locate the pid file for the airflow webserver daemon in order to get the right process id to send the signal to. This file could be in $AIRFLOW_HOME or also /var/run, which is where you'll find a lot of pids.
Assuming the pid file is in /var/run, you could run the command:
cat /var/run/airflow-webserver.pid | xargs kill -HUP
gunicorn uses a preforking model, so it has master and worker processes. The HUP signal is sent to the master process, which performs these actions:
HUP: Reload the configuration, start the new worker processes with a new configuration and gracefully shutdown older workers. If the application is not preloaded (using the preload_app option), Gunicorn will also load the new version of it.
More information in the gunicorn signal handling docs.
This is mostly an expanded version of captaincapsaicin's answer, but using HUP (SIGHUP) instead of KILL (SIGKILL) to reload the process instead of actually killing it and restarting it.

In my case i want to kill previous airflow process and start.
for that following command did the magic
killall -9 airflow

As the question was related to webserver, this is something that worked in my case:
systemctl restart airflow-webserver

Just run:
airflow webserver -p 8080 -D

Find pid with:
airflow webserver
will give: "The webserver is already running under PID 21250."
Than kill web server process with:
kill 21250

None of these worked for me. I had to delete the $AIRFLOW_HOME/airflow-webserver.pid file and then running airflow webserver worked.

Create a init script and use the command "daemon" to run this as service.
daemon --user="${USER}" --pidfile="${PID_FILE}" airflow webserver -p 8090 >> "${LOG_FILE}" 2>&1 &

The recommended approach is to create and enable the airflow webserver as a service. If you named the webserver as 'airflow-webserver', run the following command to restart the service:
systemctl restart airflow-webserver
You can use a ready-made AMI (namely, LightningFLow) from AWS Marketplace which provides Airflow services (webserver, scheduler, worker) which are enabled at startup.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and local Spark cluster.
Link for AWS Marketplace: https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V

Just by killing processes!!
Assuming the default airflow home directory is ~/airflow/
List the 3 parent processes running the airflow (PID):
cat ~/airflow/airflow-scheduler.pid
cat ~/airflow/airflow-webserver.pid
cat ~/airflow/airflow-webserver-monitor.pid
Get their PGID using:
ps -xjf
And finally run loop to kill all tree of each parent (PID):
for child in $(ps x -o "%P %p %r"| awk '{ if ( $1 == $your_first_PID || $3 == $your_first_PGID) { print $2 }}'); do kill $child; done

To restart Airflow you need to restart Airflow webserver and Airflow scheduler.
Check if Airflow servers are running:
ps -aux | grep airflow
if you see in list of running processes entries like:
ubuntu 49601 0.1 1.6 266668 135520 ? S 12:19 0:00 [ready] gunicorn: worker [airflow-webserver]
This means that Airflow webserver is running.
If you see entries like this:
ubuntu 49653 0.6 2.3 308912 187596 ? S 12:19 0:00 airflow scheduler -- DagFileProcessorManager
That means that Airflow scheduler is running.
Stop Airflow servers (webserver and scheduler):
pkill -f "airflow scheduler"
pkill -f "airflow webserver"
Now use again ps -aux | grep airflow to check if they are really shut down.
Start Airflow servers in background (daemon):
airflow webserver -D
airflow scheduler -D

How do I restart gunicorn hup , i dont know masterpid or location of PID file

I want to restart a Django server which is running using gunicorn.
I know how to use gunicorn in my system. But now I need to restart a remote server which is not set up by me.
I don't know masterpid to restart the server how can I get the masterPID.
Usually I HUP gunicorn with sudo kill -s HUP masterpid.
I tried with ps aux|grep gunicorn
and I did not find the gunicorn.pid file anywhere.
How can I get the masterpid?

the one liner below, gets the job perfectly done:
kill -HUP `ps -C gunicorn fch -o pid | head -n 1`
Explanation
pc -C gunicorn only lists the processes with gunicorn command, i.e., workers and master process. Workers are children of master as can be seen using ps -C gunicorn fc -o ppid,pid,cmd. We only need the pid of the master, therefore h flag is used to remove the first line which is PID text. Note that, f flag assures that master is printed above workers.
The correct procedure is to send HUP signal only to the master. In this way gunicorn is gracefully restarted, only the workers, not master, are recreated.

You can run gunicorn with option '-p', so you can get the pid of the master process from the pid file.
For example:
gunicorn -p app.pid your_app.wsgi.app
You can get the pid of the master by:
cat app.pid

This should also work to restart gunicorn:
ps aux |grep gunicorn |grep yourapp | awk '{ print $2 }' |xargs kill -HUP

Step 1:
Go to /etc/systemd/system/gunicorn.service and open file
add bellow line
PIDFile=/run/gunicorn/gunicorn.pid
--pid /run/gunicorn/gunicorn.pid
Example:
[Service]
PIDFile=/run/gunicorn/gunicorn.pid
WorkingDirectory=/home/django/django_project
ExecStart=/usr/bin/gunicorn --pid /run/gunicorn/gunicorn.pid --name=django_project.....
User=django
Group=django
Step 2:
Go to /etc/tmpfiles.d/ and create new file gunicorn.conf if not exist
add Bellow line
d /run/gunicorn 0755 django django -
where django = user and group name
Step 3:
Reboot your server or /etc/init.d/gunicorn restart to restart gunicorn to take effect
your pid file location is /run/gunicorn/gunicorn.pid check now..

Building on krizex's answer answer, when your master pid is stored in a file, you can gracefully reload your app in one command like this
$ cat app.pid |xargs kill -HUP
I would have liked to comment on the answer itself but I don't have enough reputation to comment yet 😢.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

H13 (Connection closed without response) errors on Heroku scale down - django

Related

Symfony Messenger not shutting down gracefully when using APP_ENV=prod

Thousands of extraneous gunicorn workers

docker-compose and graceful Celery shutdown

How do I restart airflow webserver?

How do I restart gunicorn hup , i dont know masterpid or location of PID file

Categories

Resources