Cannot SSH into the GCP VM instances that used to work - google-cloud-platform

I created a few GCP VM instances yesterday all using the same configuration but running different tasks.
I could SSH into those instances via the GCP console and they were all working fine.
Today I want to check whether the tasks are done, but I can no longer SSH into any of those instances via the browser. The error message reads:
Connection via Cloud Identity-Aware Proxy Failed
Code: 4010
Reason: destination read failed
You may be able to connect without using the Cloud Identity-Aware Proxy.
So I retried with Cloud Identity-Aware Proxy disabled, but then it reads:
Connection Failed
An error occurred while communicating with the SSH server. Check the server and the network configuration.
Running
gcloud compute instances list
displayed all my instances, and their status is RUNNING.
But when I ran
gcloud compute instances get-serial-port-output [instance-name]
using the [instance-name] returned by the command above (this was to check whether the instance's boot disk had run out of free space), it returned
(gcloud.compute.instances.get-serial-port-output) Could not fetch serial port output: The resource '...' was not found
Some extra info:
I'm accessing the VM instances from the same network (my home internet), and everything else is the same
I'm the owner of the project
My account is using a GCP free trial with $300 credit
The instances have machine type c2-standard-4 and use a Deep Learning on Linux image
The gcloud config looks right to me:
$ gcloud config list
[component_manager]
disable_update_check = True
[compute]
gce_metadata_read_timeout_sec = 5
[core]
account = [my_account]
disable_usage_reporting = True
project = [my_project]
[metrics]
environment = devshell
Update:
I reset one of the instances and now I can successfully SSH into it. However, the job running on that instance stopped after the reset.
I want to keep the jobs running on the other instances. Is there a way to SSH into the other instances without resetting them?

Your issue is on the VM side. The tasks you're running leave the SSH service unable to accept incoming connections, which is why you were only able to connect after a restart.
You should be able to see the instance's serial console output using gcloud compute instances get-serial-port-output [instance-name], but if for some reason you can't, you may try the GCP console instead: go to the instance's details and click Serial port 1 (console) to see the output.
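For reference, the full invocation looks something like this (the [instance-name] and [zone] values are placeholders, and --port 1 is the default serial port):
gcloud compute instances get-serial-port-output [instance-name] --zone [zone] --port 1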
You may even interact with your VM (log in) via the serial console. This is particularly useful if something stopped the SSH service, but for that you need a login/password, so first you have to access the VM or use a startup script to add a user with a password. But then again - this requires a restart.
In either case, it seems that restarting your VMs is the best option. But you may try to figure out what is causing the SSH service to stop after some time by inspecting the logs. Or you can create your own logs (disk space, memory, CPU, etc.) by using cron with df -Th /mountpoint/path | tail -n1 >> /name_of_the_log_file.log.
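For example, a minimal cron entry (e.g. in root's crontab) that appends a timestamped disk-usage line every 5 minutes - the mount point and log path are just placeholders:
# log free space on / every 5 minutes
*/5 * * * * (date; df -Th / | tail -n1) >> /var/log/disk_usage.log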
You can, for example, use cron to check and restart the SSH service.
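A rough sketch of such an entry, assuming a systemd-based image (the unit may be called ssh or sshd depending on the distribution):
# restart the SSH service every 5 minutes if it is not active
*/5 * * * * systemctl is-active --quiet ssh || systemctl restart ssh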
And if something doesn't work as it should (according to the documentation), go to the IssueTracker and create a new issue to get more help.

Related

Google Cloud not managing users/SSH in VMs

We have upgraded the Debian distribution in a Google Cloud instance, and it seems GCloud cannot manage the users and their SSH keys on the instance anymore.
I have installed the following tools:
google-cloud-packages-archive-keyring/now 1.2-499050965 all
google-cloud-sdk/cloud-sdk-bullseye,now 412.0.0-0 all
google-compute-engine-oslogin/google-compute-engine-bullseye-stable,now 1:20220714.00-g1+deb11 amd64
google-compute-engine/google-compute-engine-bullseye-stable,now 1:20220211.00-g1 all
google-guest-agent/google-compute-engine-bullseye-stable,now 1:20221109.00-g1 amd64
I cannot connect through the UI. It gets stuck on "Transferring SSH keys to the instance". The "troubleshooting" says that everything is fine.
When trying to connect via gcloud compute ssh, it dies with
permission denied (publickey)
I still have access to the instance with some other user, but no new users are created and no SSH keys transferred.
What else am I missing?
EDIT:
Have you added the SSH key to Project metadata or Instance metadata? If it's instance metadata, are project-level SSH keys blocked?
I haven't added any metadata.
Does your user account have the necessary permissions in the project to SSH to the instance (e.g. the Owner, Editor, or Compute Instance Admin IAM role)?
Yes, this worked correctly until the Debian upgrade to bookworm. I could see that all the google-cloud related packages were removed, and I had to reinstall them.
Are you able to SSH to the instance using an SSH client, e.g. PuTTY? If yes, you need to make sure the Google account manager daemon is running on the instance.
I can SSH just fine with accounts that were active on the machine BEFORE the Debian upgrade. These accounts already have their .ssh directory correctly set up and working. New Google users cannot log in.
Try gcloud beta compute ssh --zone ZONE INSTANCE_NAME --project PROJECT
This works only for users active before the Debian upgrade.
If yes, you need to make sure the Google account manager daemon is running on the instance.
I installed the google-compute-engine-oslogin package, which was missing, but it seems to have no effect and new users still cannot log in.
EDIT2:
When connecting to the serial console, it gets stuck on: csearch-dev google_guest_agent[2839775]: ERROR non_windows_accounts.go:158 Error updating SSH keys for gke-495d6b605cf336a7b160: mkdir /home/gke-495d6b605cf336a7b160/.ssh: no such file or directory. - the same issue; SSH keys are never transferred to the instance.
There are a few things you can do to troubleshoot the Permission denied (publickey) error message:
To start, you must ensure that you have properly authenticated yourself with gcloud using an IAM user with the Compute Instance Admin role. You can do that by running gcloud auth login [USER] and then trying gcloud compute ssh again.
You can also verify that the Linux Guest Environment scripts are properly installed and running. Please refer to this page for information about validating, updating, or manually installing the guest environment.
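As a quick check on recent images, you could also inspect the guest agent service directly on the instance (a sketch, assuming a systemd-based image; the unit name may differ on older images):
# run on the VM, e.g. via the serial console
sudo systemctl status google-guest-agent
sudo journalctl -u google-guest-agent --since "1 hour ago"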
Another possibility is that the private key was lost or that there is a mismatched keypair. To force gcloud to generate a new SSH keypair, you must first move ~/.ssh/google_compute_engine and ~/.ssh/google_compute_engine.pub out of the way if present, for example:
mv ~/.ssh/google_compute_engine.pub ~/.ssh/google_compute_engine.pub.old
mv ~/.ssh/google_compute_engine ~/.ssh/google_compute_engine.old
Once that is done, you may try gcloud compute ssh [INSTANCE-NAME] again; a new keypair should be created and the public key added to the SSH keys metadata.
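To confirm the new public key actually landed in the project metadata, you could inspect it with something like the following (just one way to view it):
gcloud compute project-info describe --format="yaml(commonInstanceMetadata)"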
Refer to Sunny-j's answer on reviewing the serial-port logs of the affected instance for possible clues about the issue. Also refer to Resolving getting locked out of a Compute Engine instance for more information.
Edit1:
Refer to this similar SO question and to Troubleshooting using the serial console, which may help resolve your error.
EDIT2:
Maybe you have git-all installed. It can pull in the older SysV init system in place of the default one, which disrupts cloud-init and virtually every step of the boot process. As a result, you are unable to SSH into your instance.
Check out these potential solutions to the above problem:
1. Try using git instead of git-all.
2. If git-all is necessary, use apt install --no-install-recommends -y git-all to prevent the installation of recommended packages.
Finally: if you were previously able to SSH into the instance with a particular SSH key for new users, then either the SSH daemon is not running or is otherwise broken, or you somehow removed that SSH key. It would appear that you damaged this machine during the upgrade.
Why is this particular VM instance required? Does it contain significant data? If so, you can turn it off, mount its disk on a new VM instance, and copy the data off. (I'd recommend building another machine running these services from the latest snapshot or from scratch and starting to use that instead.)
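If you go the rescue-VM route, here is a rough sketch with gcloud; the instance, disk, and zone names, as well as the /dev/sdb1 device, are placeholders you would need to adapt:
# stop the broken VM and move its boot disk to a rescue VM as a secondary disk
gcloud compute instances stop broken-instance --zone us-central1-a
gcloud compute instances detach-disk broken-instance --disk broken-boot-disk --zone us-central1-a
gcloud compute instances attach-disk rescue-instance --disk broken-boot-disk --zone us-central1-a
# then, on the rescue VM, find the disk and mount it to copy the data off
sudo lsblk
sudo mkdir -p /mnt/olddisk && sudo mount /dev/sdb1 /mnt/olddisk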
You should probably move to a new machine if it runs a service: There is no way to tell what still works and what doesn't, even if you are able to access the instance.

Unable to launch Jupyter notebook "Setting up proxy to JupyterLab"

I created a VM instance (n1-standard-8) for a project. I was using AI Platform > Workbench (Jupyter Notebook). I was able to read the data from Cloud Storage and process it. After 2 months, I tried to start the notebook and clicked on 'OPEN JUPYTERLAB'. It just spins, saying "Setting up proxy to JupyterLab".
Environment: Kaggle Python
Machine Type: n1-standard-8 (8 vCPUs, 30 GB RAM)
What is the possible issue?
PS: New to Google Cloud
One possible solution is to create a new VPC without adding the DNS rules for the various notebooks endpoints.
Then use the configured network with a new notebook instance and click the "OPEN JUPYTERLAB" URL.
You can see more information here.
Another thing that could be happening, if you check your logs, is an error showing "Required 'compute.instances.get' permission for project". This happens because you are using a non-default service account that you specified during notebook creation, so the solution is to use the default service account.
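If you do need to keep the custom service account, another option might be to grant it the missing permission instead; a hedged sketch (the project ID, service-account email, and role are placeholders, and a narrower role may be preferable):
gcloud projects add-iam-policy-binding my-project-id \
  --member="serviceAccount:my-notebook-sa@my-project-id.iam.gserviceaccount.com" \
  --role="roles/compute.viewer"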

Cannot connect via SSH to GCP Instance

Friends, good night.
I have a server on Google Compute Engine that I do not have SSH access to, and the old administrator did not leave access credentials for it.
Is there any way to access this server, e.g. through the SDK, the GCP console, etc.?
Thank you very much in advance.
If you or your team have an IAM account on the project with sufficient roles/permissions (e.g. Owner or Compute Admin), you can try the following:
Check this troubleshooting documentation in order to identify and solve your issue
Try to access the VM through the serial port.
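For example, something along these lines should enable and open the interactive serial console (the instance name and zone are placeholders):
gcloud compute instances add-metadata my-instance --zone us-central1-a --metadata serial-port-enable=TRUE
gcloud compute connect-to-serial-port my-instance --zone us-central1-a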
I had mistakenly locked myself out via the files /etc/hosts.allow and /etc/hosts.deny. It took me a day to get back access to the server, and I hope the steps below will help someone locked out of a GCP VM. The idea is simply to create a startup script that runs while your VM is booting, so all the commands needed to fix your issue run without direct access to the server. Below is, for example, how you can reset the root password.
Assuming that you have access to the GCP console via the browser, do the following:
Shut down the server
Click on Edit and scroll down to Custom metadata. Add a new item with the key startup-script and the value below, replacing yournewpassword with the password you want to set for the root user:
#!/bin/sh
echo "yournewpassword:root" | chpasswd
Reboot your server and use the new password set above to SSH into your VM
Remove the metadata entry and save your VM. You can then reboot again.
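If you prefer the command line over the console UI, a rough gcloud equivalent could look like this (the instance name and zone are placeholders):
# set the startup script, reboot, then remove the metadata afterwards
gcloud compute instances add-metadata my-instance --zone us-central1-a \
  --metadata startup-script='#!/bin/sh
echo "root:yournewpassword" | chpasswd'
gcloud compute instances reset my-instance --zone us-central1-a
gcloud compute instances remove-metadata my-instance --zone us-central1-a --keys startup-script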

How to resolve inability to ssh into google cloud instance when the gcloud compute ssh command becomes stuck/hangs without error message?

I'm trying to SSH into a Google Cloud Platform instance from an internal IP. When I run gcloud compute ssh instance-name --internal-ip, gcloud is able to find the zone automatically but hangs indefinitely without an error message. I have run gcloud --verbosity=debug compute ssh instance-name --internal-ip to gain insight, and I can see the program gets stuck at
DEBUG: Executing command: [u'/usr/bin/ssh', u'-T', u'-i', u'/home/edmond/.ssh/google_compute_engine', u'-o', u'CheckHostIP=no'...
Actions I have taken to try to resolve the issue:
Reinstalled the Google Cloud Platform SDK.
Ran gcloud compute config-ssh --remove and then gcloud compute config-ssh to reset SSH keys.
Manually removed all Google-related files from ~/.ssh/ and then ran gcloud compute config-ssh.
Added my public SSH key on the Google Cloud Platform website at Compute > Metadata > SSH keys for the correct project.
Here are pieces of information which may be relevant:
I used to be able to SSH into instances without a problem, and then one day this issue arose.
I am using Windows Subsystem for Linux (WSL) version 2 on Windows 10.
I am connected to a VPN, which makes my IP internal, which is why I use the --internal-ip flag. There are no external IPs; that's not an option.
I get the same issue in the Google Cloud Platform website when I use the web-based console which means the issue is not isolated to only my machine.

why am I able to SSH from command line, but not from `datalab connect`?

I've been playing with Google Datalab and it's hard to get a connection to the notebook.
I can create/launch an instance successfully, but usually the notebook is unavailable:
$ datalab create [instance]
Connecting to [instance].
This will create an SSH tunnel and may prompt you to create an rsa key pair. To manage these keys, see https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys
Waiting for Datalab to be reachable at http://localhost:8081/
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
Connection broken
Attempting to reconnect...
Waiting for Datalab to be reachable at http://localhost:8081/
however, even while the notebook is unavailable, I can always SSH from the console
gcloud compute --project "[project]" ssh --zone "asia-east1-a" "[instance]"
Sometimes I ^C and try again with datalab connect [instance], and it will eventually work.
Am I doing anything wrong, or is it just hit/miss?
It sometimes takes a few minutes for Datalab to connect. If it does not connect, I also do as you describe and open a new Cloud Shell window (or use tmux to start another "tab") to run datalab connect [env], which usually works.
I believe this delay occurs because the web/notebook server takes time to start up after the environment is built.