I have two Postgres servers running on CentOS 7 with repmgr 4.1.0-1. So far I have automated promoting the standby to primary after the primary server fails, but when the old primary comes back both nodes act as primary, and I don't think the follow_command from repmgr.conf is executed. I can fix it manually by deleting the data folder, cloning it from the new primary server, and then registering the node as a standby.
repmgr.conf on server 1
node_id=1
node_name=pgdb1
conninfo='host=192.168.0.105 user=repmgr dbname=repmgr'
pg_bindir=/usr/pgsql-9.6/bin/
master_response_timeout=5
reconnect_attempts=2
reconnect_interval=2
failover=automatic
promote_command='/usr/pgsql-9.6/bin/repmgr standby promote -f /var/lib/pgsql/9.6/repmgr/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-9.6/bin/repmgr standby follow -f /var/lib/pgsql/9.6/repmgr/repmgr.conf --log-to-file --upstream-node-id=2'
data_directory='/var/lib/pgsql/9.6/data'
log_file='/var/log/repmgr/repmgr.log'
log_level=DEBUG
service_start_command = 'sudo systemctl start postgresql-9.6'
service_stop_command = 'sudo systemctl stop postgresql-9.6'
service_restart_command = 'sudo systemctl restart postgresql-9.6'
service_reload_command = 'sudo systemctl reload postgresql-9.6'
repmgr.conf on server 2
node_id=2
node_name=pgdb2
conninfo='host=192.168.0.106 user=repmgr dbname=repmgr'
pg_bindir=/usr/pgsql-9.6/bin/
master_response_timeout=5
reconnect_attempts=2
reconnect_interval=2
failover=automatic
promote_command='/usr/pgsql-9.6/bin/repmgr standby promote -f /var/lib/pgsql/9.6/repmgr/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-9.6/bin/repmgr standby follow -f /var/lib/pgsql/9.6/repmgr/repmgr.conf --log-to-file --upstream-node-id=1'
data_directory='/var/lib/pgsql/9.6/data'
log_file='/var/log/repmgr/repmgr.log'
log_level=DEBUG
service_start_command = 'sudo systemctl start postgresql-9.6'
service_stop_command = 'sudo systemctl stop postgresql-9.6'
service_restart_command = 'sudo systemctl restart postgresql-9.6'
service_reload_command = 'sudo systemctl reload postgresql-9.6'
After the failed server starts again, it connects to itself and resumes monitoring as primary. Here is the log:
[2018-08-16 21:29:56] [DEBUG] connecting to: "user=repmgr dbname=repmgr host=192.168.0.105 connect_timeout=2 fallback_application_name=repmgr"
[2018-08-16 21:29:56] [NOTICE] reconnected to primary node after 22 seconds, resuming monitoring
[2018-08-16 21:31:33] [INFO] monitoring primary node "pgdb1" (node ID: 1) in normal state
Is there a way to automatically demote the primary to a standby when it starts again, or to have the original standby that was promoted go back to being a standby? Or maybe I can point the follow_command at a script that will do that, for example:
follow_command='change-to-standby.sh'
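For example, here is a rough (untested) sketch of what change-to-standby.sh could look like, simply automating the manual steps I described above; the new primary's address and the paths are assumptions based on my configs:
#!/usr/bin/env bash
# change-to-standby.sh -- re-clone this node from the new primary and
# re-register it as a standby (untested sketch)
set -e
NEW_PRIMARY=192.168.0.106   # assumption: the other node has been promoted
REPMGR_CONF=/var/lib/pgsql/9.6/repmgr/repmgr.conf

sudo systemctl stop postgresql-9.6
/usr/pgsql-9.6/bin/repmgr -h "$NEW_PRIMARY" -U repmgr -d repmgr \
    -f "$REPMGR_CONF" standby clone --force
sudo systemctl start postgresql-9.6
/usr/pgsql-9.6/bin/repmgr -f "$REPMGR_CONF" standby register --force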
I would appreciate any help.
We are using Symfony Messenger in combination with supervisor running in a Docker container on AWS ECS. We noticed the worker is not shut down gracefully. After debugging it appears it does work as expected when using APP_ENV=dev, but not when APP_ENV=prod.
I made a simple sleepMessage, which sleeps for 1 second and then prints a message, repeating for 60 seconds. This is what happens when running with APP_ENV=dev:
As you can see it's clearly waiting for the program to stop running.
Now with APP_ENV=prod:
It stops immediately without waiting.
In the Dockerfile we have configured the following to start supervisor. It's based on php:8.1-apache, which is why STOPSIGNAL has been configured.
RUN apt-get update && apt-get install -y --no-install-recommends \
# for supervisor
python \
supervisor
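For reference, the STOPSIGNAL override mentioned above would be a single line like the one below; as far as I know the php:8.1-apache base image declares STOPSIGNAL SIGWINCH for Apache's graceful stop, which supervisord does not treat as a stop request:
# override the base image's SIGWINCH so supervisord receives a signal it handles
STOPSIGNAL SIGTERM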
The start-worker.sh script contains this
#!/usr/bin/env bash
cp config/worker/messenger-worker.conf ../../../etc/supervisor/supervisord.conf
exec /usr/bin/supervisord
We do this because certain env variables are only available when starting up.
For debugging purposes the config has been hardcoded to test.
Below is the messenger-worker.conf
[unix_http_server]
file=/tmp/supervisor.sock
[supervisord]
nodaemon=true ; start in foreground if true; default false
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[program:messenger-consume]
stderr_logfile_maxbytes=0
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
command=bin/console messenger:consume async -vv --env=prod --time-limit=129600
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
numprocs=1
environment=MESSENGER_TRANSPORT_DSN="https://sqs.eu-central-1.amazonaws.com/{id}/dev-symfony-messenger-queue"
So in short, when using --env=prod in the config above it doesn't wait for the worker to stop, while with --env=dev it does. Does anybody know how to solve this?
I don't know why there would be a difference between the dev & prod environments, but it seems you have no grace period set (at least for Supervisor). As I added in the docs:
the workers will be able to handle the SIGTERM signal if you have the PCNTL PHP extension
you need to add stopwaitsecs to your Supervisor program configuration
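For example, in the [program:messenger-consume] section shown earlier (the 20-second value is just an assumption; pick something longer than your slowest message):
[program:messenger-consume]
; ...existing options...
stopwaitsecs=20 ; how long Supervisor waits after SIGTERM before resorting to SIGKILL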
As you use Docker too, you can also set the graceful period at the service level which defaults to 10s:
services:
  my_app:
    stop_grace_period: 20s
    # ...
With this configuration, running docker-compose down (just an example):
Docker sends a SIGTERM signal to the service entrypoint (Supervisor) and waits 20s for it to exit
Supervisor sends a SIGTERM signal to its programs (messenger:consume commands) and waits 20s for them to exit
the messenger:consume processes will "catch" the signal, finish handling the current message and stop
once every program has stopped, Supervisor can stop, and then the Docker Compose stack can go down
It turns out it was related to the wait_time option of the SQS transport. It probably caused a poll request that was started just before the container exited and was answered when the container no longer existed. So, setting wait_time to 0 fixed that problem.
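For reference, a sketch of where that option would live in config/packages/messenger.yaml (the transport name is taken from the consume command above):
framework:
    messenger:
        transports:
            async:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%'
                options:
                    wait_time: 0 # disable SQS long polling so no in-flight poll outlives the container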
Then there was this which could lead to the same issue
I am trying to install OpenCPU 2.1 on a fresh free-tier AWS server. I followed [https://aws.amazon.com/getting-started/tutorials/?awsf.getting-started-content=use-case-tmt%23websites-apps], launched an Ubuntu Server 18.04 LTS (HVM), x86, free-tier server, and obtained public and private IPs. Then I followed [https://opencpu.github.io/server-manual/opencpu-server.pdf], section 2.2:
sudo apt-get update
sudo apt-get upgrade
I responded to the grub prompt by keeping the locally installed version.
sudo add-apt-repository ppa:opencpu/opencpu-2.1 -y
sudo apt-get update
sudo apt-get install opencpu-server
and accepted the defaults at the mailname and smarthost pop-up prompts.
The results all look OK. The last section reads:
To activate the new configuration, you need to run:
systemctl restart apache2
Enabling opencpu in apache...
Reloading apparmor...
Restarting apache...
Installation done!
Setting up libxml-twig-perl (1:3.50-1) ...
Setting up libnet-dbus-perl (1.1.0-4build2) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Processing triggers for systemd (237-3ubuntu10.13) ...
Processing triggers for ureadahead (0.100.0-20) ...
Processing triggers for ufw (0.35-5) ...
When I then point my browser to http(s)://your.server.com/ocpu (of course with the hostname replaced by the public IP I got from AWS, and using either http:// or https://), I get a time-out in the browser window after a minute or so.
Checking sudo systemctl status apache2.service provides
● apache2.service - The Apache HTTP Server
   Loaded: loaded (/lib/systemd/system/apache2.service; enabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/apache2.service.d
           └─apache2-systemd.conf
   Active: active (running) since Thu 2019-02-28 09:41:19 UTC; 1min 14s ago
  Process: 30750 ExecStop=/usr/sbin/apachectl stop (code=exited, status=0/SUCCESS)
  Process: 30755 ExecStart=/usr/sbin/apachectl start (code=exited, status=0/SUCCESS)
 Main PID: 30771 (apache2)
    Tasks: 6 (limit: 1152)
   CGroup: /system.slice/apache2.service
           ├─30771 /usr/sbin/apache2 -k start
           ├─30773 /usr/sbin/apache2 -k start
           ├─30774 /usr/sbin/apache2 -k start
           ├─30775 /usr/sbin/apache2 -k start
           ├─30776 /usr/sbin/apache2 -k start
           └─30777 /usr/sbin/apache2 -k start
Feb 28 09:41:19 ip-zzz-zz-zz-zz systemd[1]: Stopped The Apache HTTP Server.
Feb 28 09:41:19 ip-zzz-zzz-zz-zz systemd[1]: Starting The Apache HTTP Server...
Feb 28 09:41:19 ip-zzz-zzz-zz-zz systemd[1]: Started The Apache HTTP Server.
which seems OK. Also, re-enabling the site:
sudo a2ensite opencpu
Site opencpu already enabled
does not activate the welcome page. Is there something else that needs to be activated or set?
First try to connect to the server locally to test whether it is running. On the server, run:
curl --insecure http://localhost/ocpu/info
If you get a response with some info about the server, OpenCPU is running and the issue is likely that the Amazon security group is blocking HTTP traffic. See the section below on how to enable this.
On the other hand, if the curl command above did not work (it gave a timeout error), there is a problem with the server and you need to check /var/log/apache2/error.log.
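For example, to see the most recent entries:
sudo tail -n 50 /var/log/apache2/error.log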
Enable HTTP(S) in the amazon security group (firewall)
If you still cannot connect from your browser, the issue is likely that you have not opened the HTTP ports in your EC2 firewall (security group). To check this, open the EC2 management console in your browser and look up the security group associated with your EC2 instance. Then add inbound rules to this security group to allow ports 80 and 443 from any host.
First look up the security group that is associated with your instance:
And then add an inbound rule to allow port 80 (HTTP) and 443 (HTTPS):
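Alternatively, if you prefer the command line over the console, rules like these should work with the AWS CLI (the security group ID below is a placeholder):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 443 --cidr 0.0.0.0/0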
What is the ideal Ansible way to do an Apache graceful restart?
- name: Restart Apache gracefully
  command: apachectl -k graceful
Does the Ansible systemd module do the same? If not, what is the difference? Thanks!
- name: Restart apache service.
  systemd:
    name: apache2
    daemon_reload: yes
    state: restarted
What you can do with Ansible is to ensure that all established connections to Apache are closed (drained in Ansible lingo).
Use the wait_for module to wait for connections on the particular host and port to drain, with state set to drained. See below:
- name: wait until apache2 connections are drained.
  wait_for:
    host: 0.0.0.0
    port: 80
    state: drained
Note: You can use this for all your Linux network services, which becomes very handy if you want to shut down services in a particular order in your Ansible playbook, as sketched below. The wait_for directive is useful for ensuring that Ansible does not continue your playbook until specific steps are completed.
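For example, a minimal sketch that drains existing connections before stopping the service (same host and port as above):
- name: wait until apache2 connections are drained
  wait_for:
    host: 0.0.0.0
    port: 80
    state: drained

- name: stop apache2 only once existing connections have closed
  systemd:
    name: apache2
    state: stopped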
There is no support for a graceful state in the service or systemd modules at this moment, because it is quite specific to certain services; state is limited to started, stopped, restarted, reloaded, and running.
So for now you need to use the command module, as you wrote in the question, to perform a graceful restart; this is the only proper solution.
However, there is an open issue about supporting custom states, so perhaps someone will implement that soon.
The documentation for the Ansible service module does not clearly state what the "reloaded" state does, but I found that on a standard Red Hat 7 install, using the service module's "reloaded" state results in a graceful restart.
I was led to this solution by this Server Fault Q&A.
You can verify this by getting the process list of the httpd processes prior to running the playbook that triggers your handler:
ps -ef | grep httpd | grep -v grep
After your playbook runs and the handler's reloaded state for the httpd service shows "changed", re-examine the process list.
You should see that the start times for all the child httpd (non-root) processes have updated, while the root-owned parent process's start time has stayed the same.
If you also look in the error log you should see an entry containing:
"... configured -- resuming normal operations ... "
And, finally, you can see this by examining the output of systemctl status for httpd.service and seeing that the apachectl graceful option was called:
sudo systemctl status httpd.service
My handler now looks like:
- name: "{{ service_name }} restart handler"
become: yes
ansible.builtin.service:
service: "{{ service_name }}"
# state: restarted
state: reloaded
I have a startup script set by an instance template that initializes the server for google compute. After installing postgres, I manually call for it to start using :
/etc/init.d/postgresql start
This completes successfully, but the server is not listening on 5432 when it is run by the startup script (postgres isn't started, even though that service start call completes successfully). After startup completes and I log in, I can start it successfully. Does anyone know why that won't work within the startup script? I need to load data during startup, so I need to start postgres during initialization.
I solved it by using a newer Debian image.
I had the same problem as you (installing postgresql in GCE startup-script results in the package being installed, but the server is not running), and I think I figured out the root cause.
Normally, the postgresql-11 package is supposed to start the PostgreSQL server after installation. Here is a snippet from its postinst script:
if [ "$1" = configure ]; then
. /usr/share/postgresql-common/maintscripts-functions
configure_version $VERSION "$2"
fi
Taking a look at /usr/share/postgresql-common/maintscripts-functions, we see:
configure_version() {
    ...
    # reload systemd to let the generator pick up the new unit
    if [ -d /run/systemd/system ]; then
        systemctl daemon-reload
    fi
    invoke-rc.d postgresql start $VERSION # systemd: argument ignored, starts all versions
}
My debian installation comes with init-system-helpers version "1.56+nmu1", which contains this bit of code in invoke-rc.d:
# avoid deadlocks during bootup and shutdown from units/hooks
# which call "invoke-rc.d service reload" and similar, since
# the synchronous wait plus systemd's normal behaviour of
# transactionally processing all dependencies first easily
# causes dependency loops
if ! systemctl --quiet is-active multi-user.target; then
    sctl_args="--job-mode=ignore-dependencies"
fi

case $saction in
    start|restart|try-restart)
        [ "$_state" != "LoadState=masked" ] || exit 0
        systemctl $sctl_args "${saction}" "${UNIT}" && exit 0
        ;;
The Debian postgresql-11 package makes use of templated systemd units. The main one is called postgresql.service, but this is a dummy service that doesn't actually do anything. The PostgreSQL server is actually started by a templated unit named postgresql@11-main, which is usually started alongside the main service because it has ReloadPropagatedFrom=postgresql.service.
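You can check that relationship with systemctl show, which should print something like:
$ systemctl show postgresql@11-main.service -p ReloadPropagatedFrom
ReloadPropagatedFrom=postgresql.service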
Note that when this issue occurs, the main unit is started but the templated one is not:
$ sudo systemctl status postgresql
● postgresql.service - PostgreSQL RDBMS
   Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled)
   Active: active (exited) since Fri 2021-04-02 05:40:48 UTC; 32min ago
 Main PID: 1663 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4665)
   Memory: 0B
   CGroup: /system.slice/postgresql.service

Apr 02 05:40:48 hubnext-west-r21r systemd[1]: Starting PostgreSQL RDBMS...
Apr 02 05:40:48 hubnext-west-r21r systemd[1]: Started PostgreSQL RDBMS.

$ sudo systemctl status postgresql@11-main
● postgresql@11-main.service - PostgreSQL Cluster 11-main
   Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
   Active: inactive (dead)
That's because when --job-mode=ignore-dependencies is specified, this link is ignored.
The GCE startup script runs as a systemd unit, which starts before multi-user.target is up:
$ find /etc/systemd | grep startup
/etc/systemd/system/multi-user.target.wants/google-startup-scripts.service
Therefore, invoke-rc.d notices that systemctl --quiet is-active multi-user.target is false and adds --job-mode=ignore-dependencies, which results in the PostgreSQL server not starting.
One possible workaround is to explicitly run systemctl start postgresql@11-main.service from your startup script after installing postgres, as sketched below.
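For example, in the startup script (a sketch assuming PostgreSQL 11 from the Debian packages, as above):
apt-get install -y postgresql-11
# the postinst's invoke-rc.d call may have skipped starting the actual cluster,
# so start the templated unit explicitly
systemctl start postgresql@11-main.service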
By the way, I noticed that a recent commit (Nov 2020) changed this invoke-rc.d behavior so that it no longer uses --job-mode=ignore-dependencies. That would help avoid this issue.
I am using Airflow for my data pipeline project. I have configured my project in Airflow and started the Airflow server as a backend process using the following command:
airflow webserver -p 8080 -D True
The server runs successfully in the background. Now I want to enable authentication in Airflow; I have made the configuration changes in airflow.cfg, but the authentication functionality is not reflected in the server. When I stop and start the Airflow server on my local machine, it works.
So how can I restart my daemonized Airflow webserver process on my server?
I advise running Airflow in a robust way, with auto-recovery, under systemd,
so you can do:
- to start: systemctl start airflow
- to stop: systemctl stop airflow
- to restart: systemctl restart airflow
For this you'll need a systemd 'unit' file.
As a (working) example you can use the following:
put it in /lib/systemd/system/airflow.service
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
PIDFile=/run/airflow/webserver.pid
EnvironmentFile=/home/airflow/airflow.env
User=airflow
Group=airflow
Type=simple
ExecStart=/bin/bash -c 'export AIRFLOW_HOME=/home/airflow ; airflow webserver --pid /run/airflow/webserver.pid'
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=42s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
P.S.: change AIRFLOW_HOME to wherever your airflow folder with the config lives.
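Then reload systemd so it picks up the new unit file, and enable the service so it starts at boot:
sudo systemctl daemon-reload
sudo systemctl enable airflow
sudo systemctl start airflow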
Can you check $AIRFLOW_HOME/airflow-webserver.pid for the process id of your webserver daemon?
Then pass it a kill signal to kill it
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
Then clear the pid file
cat /dev/null > $AIRFLOW_HOME/airflow-webserver.pid
Then just run
airflow webserver -p 8080 -D True
to restart the daemon.
This worked for me (multiple times! :D )
Find the process id (assuming 8080 is the port):
lsof -i tcp:8080
kill it
kill <pid>
Use Airflow webserver's (gunicorn) signal handling
Airflow uses gunicorn as its HTTP server, so you can send it standard POSIX-style signals. A signal commonly used by daemons to restart is HUP.
You'll need to locate the pid file for the airflow webserver daemon in order to get the right process id to send the signal to. This file could be in $AIRFLOW_HOME or also /var/run, which is where you'll find a lot of pids.
Assuming the pid file is in /var/run, you could run the command:
cat /var/run/airflow-webserver.pid | xargs kill -HUP
gunicorn uses a preforking model, so it has master and worker processes. The HUP signal is sent to the master process, which performs these actions:
HUP: Reload the configuration, start the new worker processes with a new configuration and gracefully shutdown older workers. If the application is not preloaded (using the preload_app option), Gunicorn will also load the new version of it.
More information in the gunicorn signal handling docs.
This is mostly an expanded version of captaincapsaicin's answer, but using HUP (SIGHUP) instead of KILL (SIGKILL) to reload the process instead of actually killing it and restarting it.
In my case I wanted to kill the previous Airflow process and start over.
The following command did the magic:
killall -9 airflow
As the question was related to the webserver, this is something that worked in my case:
systemctl restart airflow-webserver
Just run:
airflow webserver -p 8080 -D
Find pid with:
airflow webserver
will give: "The webserver is already running under PID 21250."
Then kill the webserver process with:
kill 21250
None of these worked for me. I had to delete the $AIRFLOW_HOME/airflow-webserver.pid file and then running airflow webserver worked.
Create an init script and use the "daemon" command to run this as a service.
daemon --user="${USER}" --pidfile="${PID_FILE}" airflow webserver -p 8090 >> "${LOG_FILE}" 2>&1 &
The recommended approach is to create and enable the Airflow webserver as a service. If you named the webserver service 'airflow-webserver', run the following command to restart it:
systemctl restart airflow-webserver
You can use a ready-made AMI (namely, LightningFLow) from AWS Marketplace which provides Airflow services (webserver, scheduler, worker) which are enabled at startup.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and local Spark cluster.
Link for AWS Marketplace: https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
Just by killing processes!!
Assuming the default airflow home directory is ~/airflow/
List the 3 parent processes running Airflow (PIDs):
cat ~/airflow/airflow-scheduler.pid
cat ~/airflow/airflow-webserver.pid
cat ~/airflow/airflow-webserver-monitor.pid
Get their PGID using:
ps -xjf
And finally run a loop to kill the whole process tree of each parent (PID):
for child in $(ps x -o "%P %p %r" | awk -v pid=$your_first_PID -v pgid=$your_first_PGID '{ if ($1 == pid || $3 == pgid) print $2 }'); do kill $child; done
To restart Airflow you need to restart Airflow webserver and Airflow scheduler.
Check if Airflow servers are running:
ps -aux | grep airflow
If you see entries like this in the list of running processes:
ubuntu 49601 0.1 1.6 266668 135520 ? S 12:19 0:00 [ready] gunicorn: worker [airflow-webserver]
This means that the Airflow webserver is running.
If you see entries like this:
ubuntu 49653 0.6 2.3 308912 187596 ? S 12:19 0:00 airflow scheduler -- DagFileProcessorManager
That means that the Airflow scheduler is running.
Stop Airflow servers (webserver and scheduler):
pkill -f "airflow scheduler"
pkill -f "airflow webserver"
Now run ps -aux | grep airflow again to check that they have really shut down.
Start the Airflow servers in the background (as daemons):
airflow webserver -D
airflow scheduler -D
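If you restart them often, a small wrapper script (a sketch using only the commands above; the sleep is a crude wait) reduces it to one step:
#!/usr/bin/env bash
# restart-airflow.sh: stop the webserver and scheduler, then start them again as daemons
pkill -f "airflow scheduler"
pkill -f "airflow webserver"
sleep 5   # give the processes time to exit; verify with: ps -aux | grep airflow
airflow webserver -D
airflow scheduler -D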