AWS EC2 ssh fails after creating an image of the instance

I regularly create an image of a running instance without stopping it first. That has worked for years without any issues. Tonight, I created another image of the instance (without any changes to the virtual server settings except for a "sudo yum update -y") and noticed my ssh session was closed. It looked like it was rebooted after the image was created. Then the web console showed 1/2 status checks passed. I rebooted it a few times and the status remained the same. The log showed:
Setting hostname localhost.localdomain: [ OK ]
Setting up Logical Volume Management: [ 3.756261] random: lvm: uninitialized urandom read (4 bytes read)
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
[ OK ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext4 (1) -- /] fsck.ext4 -a /dev/xvda1
/: clean, 437670/1048576 files, 3117833/4193787 blocks
[/sbin/fsck.xfs (1) -- /mnt/pgsql-data] fsck.xfs -a /dev/xvdf
[/sbin/fsck.ext2 (2) -- /mnt/couchbase] fsck.ext2 -a /dev/xvdg
/sbin/fsck.xfs: XFS file system.
fsck.ext2: Bad magic number in super-block while trying to open /dev/xvdg
/dev/xvdg:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
[ 3.811304] random: crng init done
filesystem. If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
or
e2fsck -b 32768 <device>
[FAILED]
*** An error occurred during the file system check.
*** Dropping you to a shell; the system will reboot
*** when you leave the shell.
/dev/fd/9: line 2: plymouth: command not found
Give root password for maintenance
(or type Control-D to continue):
It looked like /dev/xvdg failed the disk check. I detached the volume from the instance and rebooted, but I still couldn't ssh in. I re-attached it and rebooted again. Now the console says 2/2 status checks passed, but I still can't ssh back in, and the log still shows the same issues with /dev/xvdg as above.
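Based on the console output, I assume the fix involves attaching the volume to a rescue instance and trying e2fsck with an alternate superblock, roughly like this (the device name is taken from the log above and may differ on a rescue instance):
sudo file -s /dev/xvdg              # check what is actually on the volume first
sudo e2fsck -b 32768 /dev/xvdg      # try the alternate superblock the log suggests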
Any help would be appreciated. Thank you!
Thomas

Related

AWS EC2 - suddenly lost access due to "No /sbin/init, trying fallback"

My AWS EC2 instance was locked and I lost access on December 6th for an unknown reason. It cannot have been anything I did on the instance, because I was overseas on holiday from December 1st to January 1st. When I came back, I realized the server had lost connectivity on December 6th, and I have had no way to connect to the EC2 instance since.
The EC2 instance runs CentOS 7 with PHP, NGINX, and SSHD.
When I checked the system log, I saw the following:
[ OK ] Started Cleanup udevd DB.
[ OK ] Reached target Switch Root.
Starting Switch Root...
[ 6.058942] systemd-journald[99]: Received SIGTERM from PID 1 (systemd).
[ 6.077915] systemd[1]: No /sbin/init, trying fallback
[ 6.083729] systemd[1]: Failed to execute /bin/sh, giving up: No such file or directory
[ 180.596117] random: crng init done
Any idea on what is the issue will be much appreciated
In brief, here is what I had to do to recover; the root cause was that the disk was completely full.
1) Worked around a problem mounting the slaved volume (xfs_admin)
2) Worked around not being able to chroot into the environment (ln -s)
3) Found the disk at 100% (df -h) and removed /var/log files
4) Rebuilt the initramfs (dracut -f)
5) Renamed /etc/fstab
6) Switched the slaved volume back to its original UUID (xfs_admin)
7) Configured GRUB to boot the latest version of the kernel/initramfs
8) Rebuilt the initramfs and GRUB
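Compressed into a sketch, the recovery looked roughly like this (device names, mount points, and paths are placeholders for my actual values):
# on a rescue instance, with the broken root volume attached as /dev/xvdf (assumed)
sudo xfs_admin -U generate /dev/xvdf1          # temporary UUID so the slaved volume can mount
sudo mount /dev/xvdf1 /mnt/rescue
sudo mount --bind /dev /mnt/rescue/dev         # bind mounts so chroot works
sudo mount --bind /proc /mnt/rescue/proc
sudo mount --bind /sys /mnt/rescue/sys
sudo chroot /mnt/rescue
df -h                                          # confirmed the disk was at 100%
rm -rf /var/log/*.log                          # freed space by removing old logs
dracut -f                                      # rebuilt the initramfs
grub2-mkconfig -o /boot/grub2/grub.cfg         # pointed GRUB at the latest kernel/initramfs
# afterwards: restore the original UUID with xfs_admin -U <original-uuid> and re-attach the volume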

AWS-EC2 Redis-server RDB snapshot write error

I have a web application running on the Laravel 5.2 framework, with the session driver set to redis and the following AWS setup.
Instance-1: Runs the web application, with the Redis configuration in the .env file as follows:
Redis-host: aws-private-ip-of-instance-2
Redis-password: NULL
Redis-port: 6379
Instance-2: Runs the Redis server with the following configuration:
Bind aws-private-ip-of-instance-2 and 127.0.0.1
Working directory /var/lib/redis with 775 permissions; owner and group are redis.
RDB snapshot name dump.rdb with 660 permissions; owner and group are redis.
NOTE: In AWS, an inbound rule for port 6379 is configured for Instance-2.
Everything works fine until Redis tries to write the data to the RDB file. Then the following error shows on the front end:
MISCONF Redis is configured to save RDB snapshots, but is currently
not able to persist on disk. Commands that may modify the data set are
disabled. Please check Redis logs for details about the error.
Meanwhile, the Redis server logs contained the following:
4873:M 23 Sep 10:08:15.028 * 1 changes in 900 seconds. Saving...
4873:M 23 Sep 10:08:15.028 * Background saving started by pid 7392
7392:C 23 Sep 10:08:15.028 # Failed opening .rdb for saving: Read-only file system
4873:M 23 Sep 10:08:15.128 # Background saving error
Things I have tried
Added vm.overcommit_memory = 1 to /etc/sysctl.conf, as suggested in the Redis administration blog.
Changed the path of the dump.rdb file to the tmp folder and changed its permissions to 777.
This other Stack Exchange thread might help, since you are using a custom /tmp dir for data:
The simple way to do this is to run systemctl edit redis. This will create an override drop-in file /etc/systemd/system/redis.service.d/override.conf, in which you can place your changes (and the proper section):
[Service]
ReadWriteDirectories=-/my/custom/data/dir
You may also create that directory and place files ending in .conf in it manually. But do not leave the directory empty, as this will disable the service.
In either case, run systemctl daemon-reload and you are ready to restart your service.
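Put together, the whole change is only a few commands (the unit name redis is assumed here; on some distributions it is redis-server):
sudo systemctl edit redis        # creates the override.conf drop-in and opens it in an editor
# add the two lines from above ([Service] / ReadWriteDirectories=...) and save
sudo systemctl daemon-reload
sudo systemctl restart redis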
Many threads also point to filesystem inconsistency as the root cause. Since you are using EC2, check this AWS forums post:
To fix this, you will have to:
Stop the instance
Detach the root volume of your instance
Attach the volume as a data volume to any running Linux instance in the same availability zone
Perform a filesystem check (fsck) on the volume and fix the issues
Detach the volume and attach it back to your instance as its root volume
Boot the instance back up and verify that the volume mounts successfully
As a last resort, terminate the instance if possible.
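With the AWS CLI, those steps look roughly like this (all instance IDs, volume IDs, and device names below are placeholders):
aws ec2 stop-instances --instance-ids i-broken
aws ec2 detach-volume --volume-id vol-root
aws ec2 attach-volume --volume-id vol-root --instance-id i-rescue --device /dev/sdf
# on the rescue instance: check and repair the filesystem
sudo fsck -y /dev/xvdf1
# move the volume back and boot
aws ec2 detach-volume --volume-id vol-root
aws ec2 attach-volume --volume-id vol-root --instance-id i-broken --device /dev/sda1
aws ec2 start-instances --instance-ids i-broken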
Hope it helps!
Well, it is very embarrassing to post an answer to my own question, which came down to a really stupid mistake, but I hope new folks here learn from my mistake too.
The first thing I did was enable detailed logging for the redis-server in the /etc/redis/redis.conf file by changing the loglevel option to debug.
Observing the logs, I realized that my Redis port 6379 was open to everyone on the internet.
From the logs I saw that someone else's server was connecting to my Redis server and making it a slave. And since my Redis server is configured so that slaves are read-only, any attempt to write to my redis-server threw the read-only error.
After putting a firewall in front of the Redis server port, I have not encountered this issue anymore.
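For anyone else hitting this: the fix amounts to closing port 6379 to the internet. With the AWS CLI it could look like this (the security group IDs are placeholders):
# drop the world-open rule, then allow only the web server's security group
aws ec2 revoke-security-group-ingress --group-id sg-redis --protocol tcp --port 6379 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-redis --protocol tcp --port 6379 --source-group sg-web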

"vagrant up" failing: Vagrant VM failed to remain in the running state

The command vagrant up is failing and I don't know why.
$ egrep -v '^ *(#|$)' Vagrantfile
VAGRANTFILE_API_VERSION = "2"
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
config.vm.box = "precise32"
end
$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
[default] Importing base box 'precise32'...
[default] Matching MAC address for NAT networking...
[default] Setting the name of the VM...
[default] Clearing any previously set forwarded ports...
[default] Creating shared folders metadata...
[default] Clearing any previously set network interfaces...
[default] Preparing network interfaces based on configuration...
[default] Forwarding ports...
[default] -- 22 => 2222 (adapter 1)
[default] Booting VM...
[default] Waiting for VM to boot. This can take a few minutes.
The VM failed to remain in the "running" state while attempting to boot.
This is normally caused by a misconfiguration or host system incompatibilities.
Please open the VirtualBox GUI and attempt to boot the virtual machine
manually to get a more informative error message.
$ vagrant status
Current machine states:
default poweroff (virtualbox)
The VM is powered off. To restart the VM, simply run `vagrant up`
$ VBoxManage list runningvms
$
Here are the messages in the VirtualBox log file, VBoxSVC.log:
$ cat ~/.VirtualBox/VBoxSVC.log
VirtualBox XPCOM Server 4.2.16 r86992 linux.amd64 (Jul 4 2013 16:29:59) release log
00:00:00.000499 main Log opened 2013-08-13T18:40:45.907580000Z
00:00:00.000508 main OS Product: Linux
00:00:00.000509 main OS Release: 3.6.11-4.fc16.x86_64
00:00:00.000510 main OS Version: #1 SMP Tue Jan 8 20:57:42 UTC 2013
00:00:00.000537 main DMI Product Name: X8DA3
00:00:00.000547 main DMI Product Version: 1234567890
00:00:00.000647 main Host RAM: 24103MB total, 17127MB available
00:00:00.000654 main Executable: /usr/local/VirtualBox/VBoxSVC
00:00:00.000655 main Process ID: 9417
00:00:00.000656 main Package type: LINUX_64BITS_GENERIC
00:00:00.110125 nspr-2 Loading settings file "/opt/tomcat/.VirtualBox/VirtualBox.xml" with version "1.12-linux"
00:00:00.110817 nspr-2 Failed to retrive disk info: getDiskName(/dev/md126p1) --> md126p1
00:00:00.264367 nspr-2 VDInit finished
00:00:00.275173 nspr-2 Loading settings file "/opt/tomcat/VirtualBox VMs/vagrant_getting_started_default_1376419129/vagrant_getting_started_default_1376419129.vbox" with version "1.12-linux"
00:00:05.288923 main ERROR [COM]: aRC=VBOX_E_OBJECT_IN_USE (0x80bb000c) aIID={29989373-b111-4654-8493-2e1176cba890} aComponent={Medium} aText={Medium '/opt/tomcat/VirtualBox VMs/vagrant_getting_started_default_1376419129/box-disk1.vmdk' cannot be closed because it is still attached to 1 virtual machines}, preserve=false
00:00:05.290229 Watcher ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={3b2f08eb-b810-4715-bee0-bb06b9880ad2} aComponent={VirtualBox} aText={The object is not ready}, preserve=false
$
Any advice would be greatly appreciated.
Had the same error on OSX. Restarting VirtualBox fixed it :S
sudo /Library/StartupItems/VirtualBox/VirtualBox restart
Also see: https://forums.virtualbox.org/viewtopic.php?t=5489
I solved the problem by re-installing VirtualBox and adding myself to the vboxusers group. The re-installation process printed a message indicating that VM users had to be a member of that group. I don't know if the re-installation was necessary or if being added to the group would have sufficed.
The host machine was 32-bit (Ubuntu) and the guest was 64-bit. I changed the guest to 32-bit and that solved the problem.
My understanding is that the vboxusers group is related to accessing USB devices from within the guest. I am not sure why it would cause this issue; normally, as a Vagrant base box build guideline, audio and USB are both disabled.
As per the VirtualBox Manual => The vboxusers group
The Linux installers create the system user group vboxusers during installation. Any system user who is going to use USB devices from VirtualBox guests must be a member of that group. A user can be made a member of the group vboxusers through the GUI user/group management or at the command line with sudo usermod -a -G vboxusers username
Note that adding an active user to that group will require that user to log out and back in again. This should be done manually after successful installation of the package.
I had the same problem. It was because of a wrong configuration in the provider section of my Vagrantfile: I had tried to make my VM more powerful, giving it 2 CPUs when my host machine has just one.
This often happens when you try to give your VM more hardware than your host machine can actually provide.
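For reference, the provider section in question looks something like this; keeping the CPU and memory values within what the host actually has avoids the failure (the values below are illustrative):
Vagrant.configure("2") do |config|
  config.vm.box = "precise32"
  config.vm.provider "virtualbox" do |vb|
    # must not exceed the host's physical CPU count and available RAM
    vb.customize ["modifyvm", :id, "--cpus", "1"]
    vb.customize ["modifyvm", :id, "--memory", "1024"]
  end
end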

Two-machine GDB debugging between Macs over Ethernet - transaction timed out

I am trying to debug a device driver which is crashing the kernel on a Mac using a remote machine running gdb (trying to follow the instructions here). Both machines are connected to the same network by Ethernet (same router even, and both can access the network). I have also set nvram boot-args="debug=0x144" on the target and restarted.
I then load the kernel extension on the target as usual. On the host machine I start gdb like this:
$ gdb -arch i386 /Volumes/KernelDebugKit/mach_kernel
Once in gdb, I load the kernel macros and set up for remote attachment
(gdb) source /Volumes/KernelDebugKit/kgmacros
(gdb) target remote-kdp
(gdb) kdp-reattach 11.22.33.44
However, the last command then does not make a connection and I get an endless spool of
kdp_reply_wait: error from kdp_receive: receive timeout exceeded
kdp_transaction (remote_connect): transaction timed out
kdp_transaction (remote_connect): re-sending transaction
What is the correct way to get gdb connected to the target machine?
There are a number of ways to break into the target, including:
Kernel panic, as stated in your answer above.
Non-maskable interrupt, which is triggered by the cmd-option-ctrl-shift-esc key combination.
Code a break in your kernel extension using PE_enter_debugger(), which is declared in pexpert/pexpert.h
Halt at boot by setting DB_HALT (0x01) in the NVRAM boot-args value.
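For example, to combine the halt-at-boot flag with the flags already used in the question, OR DB_HALT into the existing value (0x144 | 0x01 = 0x145 is my arithmetic, not a value from the original docs):
sudo nvram boot-args="debug=0x145"    # 0x144 from the question plus DB_HALT (0x01)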
Additionally, you may need to set a persistent ARP table entry, as the target is unable to respond to ARP requests while stopped in the debugger. I use the following in my debugger-launch shell script to set the ARP entry if it doesn't already exist:
if ! (arp -a -n -i en0 | grep '10\.211\.55\.10[)] at 0:1c:42:d7:29:47 on en0 permanent' > /dev/null) ; then
    echo "Adding arp entry"
    sudo arp -s 10.211.55.10 00:1c:42:d7:29:47
fi
Someone more expert could probably improve on my bit of shell script.
All of the above is documented in http://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KernelProgramming/KernelProgramming.pdf.
The answer is simply to make sure the target has hit a kernel panic before you try to attach gdb from the host.

How to figure out why ssh session does not exit sometimes?

I have a C++ application that uses ssh to summon a connection to the server. I find that sometimes the ssh session is left lying around long after the command to summon the server has exited. Looking at the CentOS 4 man page for ssh, I see the following:
The session terminates when the command or shell on the remote machine
exits and all X11 and TCP/IP connections have been closed. The exit
status of the remote program is returned as the exit status of ssh.
I see that the command has exited, so I imagine not all of the X11 and TCP/IP connections have been closed. How can I figure out which of these ssh is waiting for, so that I can fix my summon command's C++ application to clean up whatever is being left behind that keeps the ssh session open?
I also wonder why this failure only occurs some of the time and not on every invocation; it seems to occur approximately 50% of the time. What could my C++ application be leaving around to trigger this?
More background: The server is a daemon, when launched, it forks and the parent exits, leaving the child running. The client summons by using:
popen("ssh -n -o ConnectTimeout=300 user#host \"sererApp argsHere\""
" 2>&1 < /dev/null", "r")
Use libssh or libssh2, rather than calling popen(3) from C only to invoke ssh(1), which is itself another C program. If you want my personal experience, I'd say try libssh2 - I've used it in a C++ program and it works.
I find some hints here:
http://www.snailbook.com/faq/background-jobs.auto.html
This problem is usually due to a feature of the OpenSSH server. When writing an SSH server, you have to answer the question, "When should the server close the SSH connection?" The obvious answer might seem to be: close it when the server-side user program started by client request (shell or remote command) exits. However, it's actually a bit more complicated; this simple strategy allows a race condition which can cause data loss (see the explanation below). To avoid this problem, sshd instead waits until it encounters end-of-file (eof) on the pipes connecting to the stdout and stderr of the user program.
@sienkiew: If you really want to execute a command or script via ssh and exit, have a look at the daemon tool from the libslack package. (Similar tools that can detach a command from its standard streams are screen, tmux, or detach.)
To inspect stdin, stdout & stderr of the command executed via ssh on the command line, you can, for example, use lsof.
# sample code to figure out why an ssh session does not exit
# sleep keeps its stdout open, so sshd only sees EOF after the command completes
ssh localhost 'sleep 10 &'        # blocks
ssh localhost 'sleep 10 1>&- &'   # does not block
# use lsof to see which descriptors the backgrounded sleep still holds open
ssh localhost 'sleep 10 & lsof -p ${!}'
ssh localhost 'sleep 10 1>&- & lsof -p ${!}'
ssh localhost 'sleep 10 1>/dev/null & lsof -p ${!}'
ssh localhost 'sleep 10 1>/dev/null 2>&1 & lsof -p ${!}'
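Applying the same idea to the summon command from the question: redirecting the daemon's output on the remote side should let sshd see EOF immediately (serverApp argsHere is the placeholder from the question):
ssh -n -o ConnectTimeout=300 user@host 'serverApp argsHere >/dev/null 2>&1 < /dev/null'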