Problems connecting ssh to GCP's compute engine - google-cloud-platform

I stopped the instance and changed the CPU (machine type) to improve the performance of my Compute Engine VM (Ubuntu 18.04).
However, after applying the new settings and starting it again, an SSH connection is not possible at all, neither from the console nor from VS Code.
When an SSH connection is attempted, the GCP serial port log shows the following.
May 25 02:07:52 nt-ddp-jpc GCEGuestAgent[1244]: 2021-05-25T02:07:52.4696Z GCEGuestAgent Info: Adding existing user root to google-sudoers group.
May 25 02:07:52 nt-ddp-jpc GCEGuestAgent[1244]: 2021-05-25T02:07:52.4730Z GCEGuestAgent Error non_windows_accounts.go:152: gpasswd: /etc/group.1540: No space left on device#012gpasswd: cannot lock /etc/group; try again later.#012.
Also, when I try to SSH from VS Code, I get a permission denied error.
What is the exact cause of the problem, and how can it be resolved?
Thanks as always for your help.

This is a "No space left on device" error: the boot disk is full.
To solve this issue, as John commented, you may follow the official GCP guide for increasing the size of a full boot disk. Once the boot disk has been resized, it will be possible to log in through SSH again.
As a best practice, create a snapshot first, and keep in mind that increasing the boot disk size and/or keeping a snapshot may slightly increase the cost of your project.
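If you prefer the command line, a minimal sketch of the resize looks roughly like this (the disk name, zone, and target size below are placeholders for your own values; depending on the OS image, the filesystem may still need to be grown afterwards as described in the guide):
# hypothetical disk name, size, and zone; substitute your own
gcloud compute disks resize my-boot-disk --size=50GB --zone=us-central1-a
Once the instance is reachable again, running df -h on the VM shows whether the extra space is actually available to the filesystem.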

Related

google-compute-engine: resized disk is now empty and has the wrong OS. Also SSH to the server is broken and three websites are down. How do I fix it?

Resizing the disk in GCloud resulted in an unusable server. The server's data can't be reached from Cloud Shell because Cloud Shell isn't a VM, and although port 22 is open, SSH won't work. The disk shows as twice as large as it should be and the OS is reported incorrectly.
I fixed it. There are tools available to read/write the disk that might be considered a back door. All three sites are back up. Thanks anyway.

My SSH session into my VM Cloud is suddenly lagging

Every day I log into an SSH session on a Google Cloud VM I maintain (Debian).
About a week ago I noticed the session lagging as I typed into the VM or did anything else. I mostly log into this VM to check the log files of scheduled scripts, and even "cat script.log", which used to take less than 2 seconds, now takes at least 5 to 7 seconds to load the log text.
Pinging different websites gives me a reasonable 10-15 ms. I'm pretty sure it's not my local connection either; everything else I do works fine on my local computer.
A warning has now started to appear in my session, saying
"Please consider adding the IAP-secured Tunnel User IAM role to start using Cloud IAP for TCP forwarding for better performance."
I've already configured the IAP-secured tunnel for my account, which is the owner account of the GCP project.
A coworker of mine is able to access the VM without any performance issues whatsoever.
In my opinion, your issue is with the ISP. For some reason the SSH sessions are lagging.
That's why even other computers using your home ISP lag in SSH sessions too. If a firewall rule were interfering, you wouldn't be able to connect at all.
You may try resetting all the network hardware in your home, and if that doesn't help,
run the tracert command in a Windows shell, then contact your ISP and pass along your findings. It's possible it's something on their end (and if not, maybe their upstream ISP, etc.).
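For example, from a Windows shell (the address below is a documentation placeholder; substitute your VM's external IP):
tracert 203.0.113.10
On Linux or macOS the equivalent is traceroute 203.0.113.10. Any hop where the latency jumps sharply is useful evidence to hand to the ISP.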
To solve the problem you need to grant the "IAP-secured Tunnel User" role at the project level in IAM for that user. See the instructions in a blog post I wrote about this. That should solve your problem.
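If you would rather do this from the command line than the IAM page, a sketch along these lines should work (the project ID and user email are placeholders; roles/iap.tunnelResourceAccessor is the role ID behind "IAP-secured Tunnel User"):
# placeholder project ID and user email; substitute your own
gcloud projects add-iam-policy-binding my-project-id --member="user:someone@example.com" --role="roles/iap.tunnelResourceAccessor"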

Is there any way to know which process is using memory in EC2 if I can't connect via ssh

Last night I received an error log (I use Rollbar) from my server with the message "NoMemoryError: failed to allocate memory".
When I was able to access my server it took a long time, but I could connect by SSH. Sadly, for every command I ran (free -m, top, ps, etc.) I got "cannot fork: Cannot allocate memory".
Now I can't even access the server; I get "ssh_exchange_identification: read: Connection reset by peer".
This happened before and I just rebooted the machine, but now I want to know what is happening in order to prevent it from happening again. It's an m3.medium (with Ubuntu) and hosts a staging environment, so I think it shouldn't have memory problems.
I wonder if there is any way, in the AWS Console, to see what is happening or to free some memory so that I can at least connect via SSH.
Any ideas?
If you really have no idea what the problem is, then write a script like this:
#!/bin/bash
# Append a timestamped snapshot of memory usage and the biggest memory consumers.
FILE=/var/log/memoryproblem.log
date +'%c' >> "$FILE"
# Overall memory and swap usage in MB
free -m >> "$FILE"
# Top processes, sorted by memory usage (%MEM, then VSZ)
ps axu | sort -rn -k 4,5 | head >> "$FILE"
Make cron run this at regular intervals.
This will log quite a lot of information, so clean it up on a regular basis.
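For instance, assuming you save the script as /usr/local/bin/memlog.sh (an arbitrary path) and make it executable, a crontab entry like this would run it every five minutes:
# run the logging script every 5 minutes (path is an assumed location)
*/5 * * * * /usr/local/bin/memlog.sh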
Oh, and another thing: there is one way of seeing log information on a host apart from SSH. In the AWS console view of EC2 instances, select the instance, right click, and choose Instance Settings -> System Log; that may be useful in this situation.
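The same output is also available from the AWS CLI, which can be handy when the instance itself is unresponsive; roughly:
# the instance ID below is a placeholder; substitute your own
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text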
Another thing to do is to temporarily increase the instance size. An m3.medium only has 3.75 GB of RAM. If you move up to an m3.xlarge with 15 GB of RAM, the problem may still occur, but the extra resources should let you see what is going on. Once you've fixed the issue you can go back to a smaller instance.
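Changing the type requires stopping and starting the instance; as a rough sketch with the AWS CLI, assuming an EBS-backed instance (the instance ID is a placeholder):
# stop, change type, start again; instance ID is a placeholder
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type "{\"Value\": \"m3.xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0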

Getting connection refused to EC2 by increasing the size of the primary volume

Here is what I do:
I choose Launch Instance from my ec2 dashboard
I select Ubuntu Server 12.04.3 LTS - 64-bit from the AMI list
I choose t1.micro as my instance type
I don't change anything on step 2 (Configure Instance Details)
On step 4, I increase the size of the volume from 8 to 200
I click on Review and Launch and run the instance.
Now I cannot SSH to the server, although its state is running and the status checks show 2/2 checks passed. I don't have any problem when I follow the same steps without increasing the volume size. Any idea why this happens?
New EC2 instances with an EBS root volume are started from bootable hard drive snapshots with some metadata, collectively known as an Amazon Machine Image (AMI).
The question prompted me to realize something: since we're starting from an existing hard drive image with an established filesystem on it, it logically follows that no matter how large a disk that filesystem is copied onto before being attached to our new machine, the filesystem was already created at its original size, and it wouldn't normally be aware of the extra space available on the disk beyond that size.
And yet, I've selected larger-than-default size values for the root disk with the Ubuntu 12.04 LTS AMI and never given a moment's thought to the fact that the amount of space I provisioned is "magically" available when, logically, it shouldn't be. The filesystem should still be 8 GB after bootup, because it's a copy of a filesystem that was originally 8 GB in size, and all of its internal structures should still indicate this.
The only possible conclusion is that the snapshot we're initially booting must actually contain code to automatically grow its own filesystem to fill the disk that it wakes up to find itself running on.
This... turns out to be true. From an early write-up about EC2/EBS by Eric Hammond, describing how to get a larger root volume:
There’s one step left. We need to resize the file system so that it fills up the entire 100 GB EBS volume. Here’s the magic command for ext3. In my early tests it took 2-3 minutes to run. [Update: For Ubuntu 11.04 and later, this step is performed automatically when the AMI is booted and you don’t need to run it manually.] [emphasis added]
— http://alestic.com/2009/12/ec2-ebs-boot-resize
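For reference, the "magic command" that write-up is talking about is the standard ext3/ext4 resizer; a rough sketch, assuming the root filesystem sits on /dev/xvda1 (device names vary by AMI and instance type, so check with df or lsblk first):
# grow the filesystem to fill the already-enlarged block device (device name is an assumption)
sudo resize2fs /dev/xvda1
On Ubuntu 11.04 and later this runs automatically at first boot, which is exactly the step that can take a long time on a large volume.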
But you apparently do need to wait for it, and the bigger the disk, the longer the wait. The instance reachability check sounds like a ping test. If so, it's conceivable that the network stack could be up and responsive while sshd is not yet available to accept connections during the resize operation, which would cause the "connection refused" response -- an active refusal by the IP stack because the destination socket isn't listening.
The instance reachability check doesn't so much mean the instance is "ready" as much as it means it is "on its way up, or up, and not on its way down."
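One crude way to watch for sshd to come up while the resize finishes is to poll port 22 from your own machine (the public IP below is a placeholder, and this assumes a netcat build that supports -z):
# retry every 10 seconds until port 22 accepts connections; IP is a placeholder
until nc -zv 203.0.113.10 22; do sleep 10; done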

My VMware ESX server console volume went readonly. How can I save my VMs?

Two RAID volumes, VMware kernel/console running on a RAID1, vmdks live on a RAID5. Entering a login at the console just results in SCSI errors, no password prompt. Praise be, the VMs are actually still running. We're thinking, though, that upon reboot the kernel may not start again and the VMs will be down.
We have database and disk backups of the VMs, but not backups of the vmdks themselves.
What are my options?
Our current best idea is:
1. Use VMware Converter to create live vmdks from the running VMs, as if it were a P2V migration.
2. Reboot the host server and run RAID diagnostics, figure out what in the "h" happened.
3. Attempt to start ESX again, possibly after rebuilding its RAID volume.
4. Possibly have to re-install ESX on its volume and re-attach the VMs.
5. If that doesn't work, attach the "live" vmdks created in step 1 to a different VM host.
It was the backplane. Both drives of the RAID1 and one drive of the RAID5 were inaccessible. Incredibly, the VMware hypervisor continued to run for three days from memory with no access to its host disk, keeping the VMs it managed alive.
At step 3 above we diagnosed the hardware problem and replaced the RAID controller, cables, and backplane. After restart, we re-initialized the RAID by instructing the controller to query the drives for their configurations. Both were degraded and both were repaired successfully.
At step 4, it was not necessary to reinstall ESX, although at bootup it did not want to register the VMs. We had to dig up some buried management commands to instruct the kernel to resignature the VMs. (Search the VMware docs for "resignature.")
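On ESX of that era, the buried knob was usually the LVM resignaturing advanced setting; treat the following as a sketch of the general idea rather than the exact commands we ran (vmhba0 is a placeholder adapter name):
# enable VMFS volume resignaturing, rescan the adapter, then turn it back off
esxcfg-advcfg -s 1 /LVM/EnableResignature
esxcfg-rescan vmhba0
esxcfg-advcfg -s 0 /LVM/EnableResignature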
I believe that our fallback plan would have worked: the VMware Converter images of the VMs that had been running "orphaned" were tested and ran fine with no data loss. I highly recommend making a VMware Converter image of any VM that gets into this state, after shutting down as many services as possible and getting the VM into as read-only a state as possible. Loading a vmdk, either elsewhere or on the original host as a repair, is usually going to be WAY faster than rebuilding a server from the ground up from backups.