GCloud compute instance halting but not stopping - google-cloud-platform

I want to run a Google Cloud compute instance only for the duration of my startup script. My startup script calls a Node.js script which spawns sudo halt or shutdown -h 0. Observing the serial port, the system comes to a halt but remains in the RUNNING state, never going into STOPPING or TERMINATED:
Starting Halt...
[  OK  ] Stopped Monitoring of LVM2 mirrors,…sing dmeventd or progress polling.
Stopping LVM2 metadata daemon...
[  OK  ] Stopped LVM2 metadata daemon.
[ 34.560467] reboot: System halted
How is it possible that the system can halt completely but the instance doesn't register as such?

As @DazWilkin said in the comments, you can use poweroff to stop a VM instance in Compute Engine. halt only stops the kernel (the serial log's "reboot: System halted" shows exactly that) without powering off the virtual hardware, so the hypervisor still reports the instance as RUNNING; poweroff triggers an actual power-off, which Compute Engine detects and the instance moves through STOPPING to TERMINATED. Per the official Compute Engine documentation, running sudo shutdown -h now or sudo poweroff will both accomplish what you are trying to do.
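For example, the tail end of a startup script that runs a job and then powers the VM off could look like this (a minimal sketch; /opt/app/your-workload.js is a placeholder for the actual Node.js job):
#!/bin/bash
# GCE startup scripts run as root, so no sudo is needed here
node /opt/app/your-workload.js

# Power off rather than halt, so the hypervisor sees the instance stop
poweroff        # or: shutdown -h now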

Related

Symfony Messenger not shutting down gracefully when using APP_ENV=prod

We are using Symfony Messenger in combination with supervisor running in a Docker container on AWS ECS. We noticed the worker is not shut down gracefully. After debugging, it appears it works as expected when using APP_ENV=dev, but not when APP_ENV=prod.
I made a simple sleepMessage, which sleeps for 1 second and then prints a message, for 60 seconds. This is the output when running with APP_ENV=dev:
As you can see, it is clearly waiting for the program to stop running.
Now with APP_ENV=prod:
It stops immediately without waiting.
In the Dockerfile we have configured the following to start supervisor. It's based on php:8.1-apache, which is why STOPSIGNAL has been configured:
RUN apt-get update && apt-get install -y --no-install-recommends \
    # for supervisor
    python \
    supervisor
The start-worker.sh script contains this
#!/usr/bin/env bash
cp config/worker/messenger-worker.conf ../../../etc/supervisor/supervisord.conf
exec /usr/bin/supervisord
We do this because certain env variables are only available when starting up.
For debugging purposes, the config has been hardcoded for testing.
Below is the messenger-worker.conf
[unix_http_server]
file=/tmp/supervisor.sock
[supervisord]
nodaemon=true ; start in foreground if true; default false
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[program:messenger-consume]
stderr_logfile_maxbytes=0
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
command=bin/console messenger:consume async -vv --env=prod --time-limit=129600
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
numprocs=1
environment=
    MESSENGER_TRANSPORT_DSN="https://sqs.eu-central-1.amazonaws.com/{id}/dev-symfony-messenger-queue"
So in short, when using --env=prod in the config above it doesn't wait for the worker to stop, while with --env=dev it does. Does anybody know how to solve this?
I don't know why there would be a difference between the dev and prod environments, but it seems you have no grace period set (at least for Supervisor). As I added in the docs:
the workers will be able to handle the SIGTERM signal if you have the PCNTL PHP extension
you need to add stopwaitsecs to your Supervisor program configuration
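For example, a sketch of the relevant program section with a 20-second grace period (the 20s value is just an illustration):
[program:messenger-consume]
command=bin/console messenger:consume async -vv --time-limit=129600
stopwaitsecs=20 ; give the worker up to 20s to finish the current message before supervisord sends SIGKILL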
As you use Docker too, you can also set the grace period at the service level, which defaults to 10s:
services:
  my_app:
    stop_grace_period: 20s
    # ...
With this configuration, running docker-compose down (just an example):
Docker sends a SIGTERM signal to the service entrypoint (Supervisor) and waits 20s for it to exit
Supervisor sends a SIGTERM signal to its programs (messenger:consume commands) and waits 20s for them to exit
the messenger:consume processes will "catch" the signal, finish handling the current message and stop
once every program has stopped, Supervisor can stop, and then the Docker Compose stack can come down
It turned out to be related to the wait_time option of the SQS transport. A long-polling request started just before the container exited was probably answered after the container no longer existed. Setting wait_time to 0 fixed the problem.
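For reference, wait_time is set on the transport DSN; a sketch of the environment line from the config above with long polling disabled (the queue URL is the question's placeholder):
environment=
    MESSENGER_TRANSPORT_DSN="https://sqs.eu-central-1.amazonaws.com/{id}/dev-symfony-messenger-queue?wait_time=0"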
Then there was this which could lead to the same issue

aws ec2 ssh fails after creating an image of the instance

I regularly create an image of a running instance without stopping it first. That has worked for years without any issues. Tonight, I created another image of the instance (without any changes to the virtual server settings except for a "sudo yum update -y") and noticed my ssh session was closed. It looked like it was rebooted after the image was created. Then the web console showed 1/2 status checks passed. I rebooted it a few times and the status remained the same. The log showed:
Setting hostname localhost.localdomain: [ OK ]
Setting up Logical Volume Management: [ 3.756261] random: lvm: uninitialized urandom read (4 bytes read)
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
[ OK ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext4 (1) -- /] fsck.ext4 -a /dev/xvda1
/: clean, 437670/1048576 files, 3117833/4193787 blocks
[/sbin/fsck.xfs (1) -- /mnt/pgsql-data] fsck.xfs -a /dev/xvdf
[/sbin/fsck.ext2 (2) -- /mnt/couchbase] fsck.ext2 -a /dev/xvdg
/sbin/fsck.xfs: XFS file system.
fsck.ext2: Bad magic number in super-block while trying to open /dev/xvdg
/dev/xvdg:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
[ 3.811304] random: crng init done
filesystem. If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
or
e2fsck -b 32768 <device>
[FAILED]
*** An error occurred during the file system check.
*** Dropping you to a shell; the system will reboot
*** when you leave the shell.
/dev/fd/9: line 2: plymouth: command not found
Give root password for maintenance
(or type Control-D to continue):
It looked like /dev/xvdg failed the disk check. I detached the volume from the instance and rebooted. I still couldn't ssh in. I re-attached it and rebooted. Now it says 2/2 status checks passed, but I still can't ssh back in, and the log still shows the same issues with /dev/xvdg as above.
Any help would be appreciated. Thank you!
Thomas
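For context, the e2fsck hint in that boot log refers to checking the ext filesystem against a backup superblock. A sketch of that check, run from a rescue instance with the problem volume attached (the device name /dev/xvdg is taken from the log; adjust it to however the volume shows up on the rescue instance, and only if the volume really is ext2/3/4):
# try a backup superblock, as the maintenance shell suggests
e2fsck -b 8193 /dev/xvdg
# or, for filesystems with 4 KiB blocks
e2fsck -b 32768 /dev/xvdg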

AWS EC2 - all of a sudden lost access due to "No /sbin/init, trying fallback"

My AWS EC2 instance was locked and I lost access on December 6th for an unknown reason. It cannot have been an action I took on the EC2 instance, because I was overseas on holiday from December 1st and came back on January 1st; I realized the server had lost its connection on December 6th, and I have had no way to connect to the EC2 instance since.
The EC2 instance runs CentOS 7 with a PHP, NGINX, and SSHD setup.
When I checked the system log I saw the following:
[  OK  ] Started Cleanup udevd DB.
[  OK  ] Reached target Switch Root.
Starting Switch Root...
[ 6.058942] systemd-journald[99]: Received SIGTERM from PID 1 (systemd).
[ 6.077915] systemd[1]: No /sbin/init, trying fallback
[ 6.083729] systemd[1]: Failed to execute /bin/sh, giving up: No such file or directory
[ 180.596117] random: crng init done
Any idea what the issue is would be much appreciated.
In brief, I had to do the following to recover; the root cause was that the disk was completely full. A rough sketch of the commands follows the list:
1) Problem mounting the slaved volume (xfs_admin)
2) Not able to chroot the environment (ln -s)
3) Disk at 100% (df -h): removed /var/log files
4) Rebuilt the initramfs (dracut -f)
5) Renamed /etc/fstab
6) Switched the slaved volume back to its original UUID (xfs_admin)
7) Configured GRUB to boot the latest version of the kernel/initramfs
8) Rebuilt the initramfs and GRUB
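A rough sketch of those steps, assuming the broken root volume is attached to a rescue instance as /dev/xvdf1 and mounted at /mnt/rescue (device names, paths and the kernel version are illustrative):
# 1) XFS refuses to mount a volume whose UUID duplicates the rescue instance's root,
#    so give the slaved volume a temporary UUID and mount it
xfs_admin -U generate /dev/xvdf1
mkdir -p /mnt/rescue
mount /dev/xvdf1 /mnt/rescue

# 2) Prepare and enter a chroot
mount --bind /dev  /mnt/rescue/dev
mount --bind /proc /mnt/rescue/proc
mount --bind /sys  /mnt/rescue/sys
chroot /mnt/rescue

# 3) Disk at 100%: free space, e.g. by cleaning /var/log
df -h
rm -f /var/log/*.gz          # illustrative cleanup only

# 4)/5) Rebuild the initramfs; the answer also renamed a broken /etc/fstab
mv /etc/fstab /etc/fstab.bak                   # commenting out broken entries also works
dracut -f /boot/initramfs-3.10.0-1160.el7.x86_64.img 3.10.0-1160.el7.x86_64

# 7)/8) Point GRUB at the latest kernel/initramfs and regenerate its config
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg

# 6) Leave the chroot, unmount, and restore the volume's original UUID before
#    attaching it back to the original instance
exit
umount -R /mnt/rescue
xfs_admin -U <original-uuid> /dev/xvdf1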

EC2, 16.04, Systemd, Supervisord, & Python

I have a service written in python 2.7 and managed by supervisord on an Ubuntu 16.04 EC2 spot instance.
On system startup I have a number of systemd tasks that need to take place and finish prior to supervisord starting the service.
When the instance is about to shutdown, I need supervisord to capture the event and tell the service to gracefully halt. The service will need to stop processing and return any workloads to the queue prior to exiting gracefully.
What would be the optimal way to manage system startup in this scenario?
What would be the optimal way to manage system shutdown in this scenario?
How do I best handle the interaction between supervisord and the service?
First, we need to install a systemd task that we want to run prior to supervisor starting up. Let's create a script, /usr/bin/pre-supervisor.sh, that will handle performing that work for us (a placeholder sketch of it follows the unit below), and create the /lib/systemd/system/pre-supervisor.service unit for systemd:
[Unit]
Description=Task to run prior to supervisor Starting up
After=cloud-init.service
Before=supervisor.service
Requires=cloud-init.service
[Service]
Type=oneshot
WorkingDirectory=/usr/bin
ExecStart=/usr/bin/pre-supervisor.sh
RemainAfterExit=no
TimeoutSec=90
User=ubuntu
# Output needs to appear in instance console output
StandardOutput=journal+console
[Install]
WantedBy=multi-user.target
As you can see, this will run after the ec2 cloud-init.service completes, and prior to the supervisor.service.
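The script itself is just whatever startup work you need; a minimal placeholder for /usr/bin/pre-supervisor.sh (its contents are entirely hypothetical here, only the shape matters):
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical pre-supervisor work: fetch configuration, warm caches,
# wait for dependencies, etc.
echo "Running pre-supervisor tasks..."

# ... your startup tasks here ...

echo "Pre-supervisor tasks complete."
Remember to make it executable (chmod +x /usr/bin/pre-supervisor.sh).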
Next, let us modify the /lib/systemd/system/supervisor.service to run After the pre-supervisor.service completes, instead of after network.target.
[Unit]
Description=Supervisor process control system for UNIX
Documentation=http://supervisord.org
After=pre-supervisor.service
[Service]
ExecStart=/usr/bin/supervisord -n -c /etc/supervisor/supervisord.conf
ExecStop=/usr/bin/supervisorctl $OPTIONS shutdown
ExecReload=/usr/bin/supervisorctl -c /etc/supervisor/supervisord.conf $OPTIONS reload
KillMode=process
Restart=on-failure
RestartSec=50s
[Install]
WantedBy=multi-user.target
That will ensure that our pre-supervisor tasks run prior to supervisor starting up.
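After editing the unit files, reload systemd and enable the new unit so the ordering takes effect (standard systemd commands):
sudo systemctl daemon-reload
sudo systemctl enable pre-supervisor.service
sudo systemctl restart supervisor.service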
Because these are spot instances, AWS exposes the termination notice in the instance metadata URL, so I simply need to inject something like:
if requests.get("http://169.254.169.254/latest/meta-data/spot/termination-time").status_code == 200:
into my Python service, have it check every five seconds or so, and gracefully shut down as soon as the termination notice appears.
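A minimal sketch of that polling loop (shutdown_gracefully is a stand-in for whatever stops processing and returns workloads to the queue; the metadata endpoint is the one quoted above):
import time
import requests

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def termination_notice_issued():
    # AWS returns 404 here until the spot instance is scheduled for termination
    try:
        return requests.get(TERMINATION_URL, timeout=2).status_code == 200
    except requests.exceptions.RequestException:
        return False

def shutdown_gracefully():
    # Hypothetical: stop accepting work and return in-flight workloads to the queue
    pass

while True:
    if termination_notice_issued():
        shutdown_gracefully()
        break
    time.sleep(5)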

Get VirtualBox server booting status

I have a database server that is started from a VirtualBox VM. I can start my development only after the database VM boots up. How can I check whether the database server has booted up successfully? I am well aware of the command VBoxManage showvminfo,
but it shows the State as 'running' even while the VM is still booting up.
Is there a way to check the booting status?
I suspect not, as even the Vagrant project seems to use a constantly-retrying SSH connection with a large timeout to determine if the machine is ready.
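The same approach works here: poll the guest until the database itself answers, rather than trusting the VM state. A sketch that waits for the database port to accept TCP connections (the host address, port and timeout are assumptions; substitute your VM's address and your database's port):
#!/bin/bash
HOST=192.168.56.10   # assumed host-only adapter address of the VM
PORT=5432            # assumed database port
for attempt in $(seq 1 60); do
    if nc -z -w 2 "$HOST" "$PORT"; then
        echo "Database VM is ready"
        exit 0
    fi
    sleep 5          # retry every 5 seconds, up to about 5 minutes
done
echo "Timed out waiting for the database VM" >&2
exit 1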