Google Cloud VM Files Deleted after session disconnect

Some of my GCP instances are behaving in a way similar to what is described in the link below:
Google Cloud VM Files Deleted after Restart
At times, the session gets disconnected after a short period of inactivity. On reconnecting, the machine is as if it were freshly installed (not on restarts, as in the above link). All the files are gone.
As you can see in the attachment, it creates the profile directory fresh when the session is reconnected. Also, none of the installations I have made are there. Everything is lost, including the root installations. Fortunately, I have been logging all my commands and file setups manually on my client, so nothing is lost, but I would like to know what is happening and resolve this for good.
This has now happened a few times.
A point to note: if I get a clean exit, i.e. if I properly log out or exit from the SSH session, I get the machine back as I left it when I reconnect. The issue occurs only when the session disconnects by itself. That said, there have also been cases where the session disconnected and I was still able to connect back to the machine intact.
The issue does not occur on all my VMs.
Regarding the suggestions from the link I posted above:
I am not connecting through Cloud Shell; I am SSHing into the machine using the Chrome extension.
I have not manually mounted any disks (as far as I know).
I have checked the logs from gcloud compute instances get-serial-port-output --zone us-east4-c INSTANCE_NAME. I could not really make much of them. Is there anything I should look for specifically?
Any help is appreciated.
Please find the links to the logs, as suggested by #W_B.
The log below is from the 8th, when the machine was restarted and the files were deleted:
https://pastebin.com/NN5dvQMK
It happened again today. I didn't run the command immediately this time; the file below is from afterwards, though:
https://pastebin.com/m5cgdLF6
The one below is from after I logged out today:
https://pastebin.com/143NPatF
Please note that I have replaced the user ID, system name, and many numeric values in general using regexes, so there is a slight chance that the times and other values have changed. I am not sure if that would be a problem.
I have added a screenshot of the current config from the UI.

Using a locally attached SSD seems to be the cause. It is explained here:
https://cloud.google.com/compute/docs/disks/local-ssd#data_persistence
You need to use a persistent disk; otherwise it will behave just as you describe.
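You can confirm which kind of disk each instance is using from the gcloud CLI; here is a minimal sketch, reusing the zone from the question and a placeholder instance name. In the output, a disk of type SCRATCH is a local SSD (ephemeral), while PERSISTENT disks survive disconnects and restarts:
gcloud compute instances describe INSTANCE_NAME --zone us-east4-c --format="yaml(disks)"   # look at the "type" field of each disk
If you need durable storage on an existing instance, you can create and attach a persistent disk (the disk name here is hypothetical):
gcloud compute disks create data-disk --size=200GB --zone=us-east4-c
gcloud compute instances attach-disk INSTANCE_NAME --disk=data-disk --zone=us-east4-c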

Related

Browser drops connection during model training

I am currently trying to get through a fairly long hyperparameter grid search (4-5 hours), and I keep having issues with Jupyter Lab (or haven't figured something out yet) on a GCP notebook instance. The browser connection to the notebook keeps dropping, whereas the training process continues just fine. When the training process finishes, there is nowhere to write the output, as the browser connection to the notebook has already dropped.
How can I keep that connection alive, or make sure the output gets written into the notebook even if my laptop gets turned off or disconnected?
There are multiple problems that may be affecting your notebook: it could be a GCP issue, a network issue, and so on. Therefore, you need to provide more information in order to diagnose what is happening. I would recommend opening a ticket with GCP or Jupyter support to conduct a more thorough investigation, as this can be difficult to diagnose and they will have more tools to do it. Also, what #Joaquim suggested seems like a good workaround for the moment. Anyhow, I have gathered several troubleshooting steps you can follow to find out whether one of these recurrent issues is the one affecting you:
According to this Jupyter Notebook document, there is a 'shutdown_no_activity_timeout' option. The default value is '0', which disables this automatic shutdown. The option might be overridden in the 'jupyter_notebook_config.py' file. You can follow these steps to confirm it:
On the AI Platform Notebooks page, click on the name of the instance in which your notebook is running.
Access it remotely by clicking "SSH".
Run this in the shell to confirm that the overriding config file exists:
ls /home/*/.jupyter/jupyter_notebook_config.py
Run this command to confirm whether the shutdown_no_activity_timeout option is doing the overriding:
grep shutdown_no_activity_timeout /home/*/.jupyter/jupyter_notebook_config.py
Switch the option to '0' if it is set to a different value, and reset the notebook instance on this page to apply the change.
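As a sketch of that last step, assuming a single user home directory and that the option appears on one uncommented line in the config file found above, you could rewrite it in place and then verify before resetting the instance:
sudo sed -i 's/^c.NotebookApp.shutdown_no_activity_timeout.*/c.NotebookApp.shutdown_no_activity_timeout = 0/' /home/*/.jupyter/jupyter_notebook_config.py   # rewrite the option in place
grep shutdown_no_activity_timeout /home/*/.jupyter/jupyter_notebook_config.py   # verify the change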
According to this other document, Jupyter might fail to connect when behind a proxy. You can try disabling your browser's proxy settings.
You can also try changing the Jupyter port; in this Jupyter issue, one user reports that their disconnection problem went away after changing it. If you are using the Chrome browser, open the Inspect panel (Ctrl+Shift+I) and compare your connection symptoms with this image. If you see similar errors, you may try changing the port (c.NotebookApp.port).
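A minimal sketch of the port change, assuming the same config path as above (8081 is an arbitrary example port, and you still need to restart Jupyter or reset the instance for it to take effect):
echo 'c.NotebookApp.port = 8081' | sudo tee -a /home/*/.jupyter/jupyter_notebook_config.py   # append the new port setting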

How to read/get files from a Google Cloud Compute Engine disk without connecting to it?

I accidentally messed up the permissions of the file system, so attempting to use sudo (e.g. to read protected files) now shows the message sudo: /usr/local/bin/sudo must be owned by uid 0 and have the setuid bit set.
This answer (https://askubuntu.com/a/471503) suggests logging in as root to fix it; however, I didn't set up a root password beforehand, and this answer (https://stackoverflow.com/a/35017164/4343317) suggests using sudo passwd. Obviously, I am stuck in an infinite loop between the two answers above.
How can I read/get the files from the Google Cloud Compute Engine disk without logging in to the VM (I have full control of the VM instance and the disk)? Is there another, "higher" way to log in as root (such as from the gcloud tool or the Google Cloud interface) to access the VM disk externally?
Thanks.
It looks like the following recipe may be of value:
https://cloud.google.com/compute/docs/disks/detach-reattach-boot-disk
What this article says is that you can shut down your VM, detach its boot disk, and then attach it as a data disk to a second VM. In that second VM you will have the ability to make changes. However, if you don't know what changes you need to make to restore the system to sanity, then, as #John Hanley says, you might want to use this mounting technique to copy off your work, destroy the tainted VM, recreate a fresh one, copy your work back in, and start from there.
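A sketch of that recipe with gcloud, using hypothetical instance, disk, and zone names (the boot disk is often named after the instance; consider snapshotting it first if the data matters):
gcloud compute instances stop broken-vm --zone ZONE
gcloud compute instances detach-disk broken-vm --disk broken-vm --zone ZONE
gcloud compute instances attach-disk rescue-vm --disk broken-vm --zone ZONE
Then, inside rescue-vm, find the new device and mount it read-only to copy your files out:
lsblk   # identify the newly attached disk, e.g. /dev/sdb1
sudo mount -o ro /dev/sdb1 /mnt   # mount read-only and copy files from /mnt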

Google Cloud Console Virtual Machine Mysteriously Deleted

This morning I logged into my PC and attempted to access a VM I have remotely. "No connection" was the reported error. I logged into my cloud console to find no projects.
Google Support is not available to me, as I have the bronze package and I do not have $150 available to upgrade it.
Are there any logs that could explain what happened? Did it just get wiped out? The instance is still there, but the machine itself is gone. I can't find any records of it. Please advise with any help you can.
I believe your question also confuses others: what do you mean by "The instance is still there. But the machine itself is gone"?
You also mentioned "I log into my cloud console to find no projects", which means you should see nothing until you choose a valid project.
Please be more specific about your question.
Could you indicate to us, step by step, what you do in the Google Cloud Platform console? Where do you click, and what do you type?
Please also check the Activity tab on the home page of the console. Once in it, on the right-hand side, select Resource type: GCE VM instance to see modifications to VMs.
We need to know exactly what you are seeing at each step, and any error code. Then we could see whether the problem is in your procedure, or whether there is an issue you should report to billing support, which is free, as pointed out by John Hanley in his comment.
Please, when you do this, make sure you don't include any personal information in the data you post here (such as a project ID or password).
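If the project is still accessible, the Admin Activity audit logs are another place to look for a delete operation. A hedged sketch with gcloud, assuming the Cloud Logging API is enabled and the event is still within the retention window:
gcloud logging read 'resource.type="gce_instance" AND protoPayload.methodName:"instances.delete"' --limit=10 --format=json   # shows who deleted which instance, and when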

AWS File misplacement

I have a project deployed on an EC2 instance, and it is up.
But sometimes, when I log in through FTP and transfer the updated build to the EC2 instance, some of my project files go missing.
After a while, that set of files appears again in the same place.
I couldn't work out why this unexpected behavior is happening. Let me know if anyone has faced a similar situation.
Also, can anyone give me a way to know what logins are being made through FTP and SSH on my EC2 instance?
Files don't just randomly go missing on an EC2 instance. I suspect there is something going on, and you'll need to diagnose it. There is not enough information here to help you, but I can try to point you in the right direction.
A few things that come to mind are:
What are you running to execute the FTP command? If the files are appearing after some time, are you sure the transfer isn't simply still in progress when you first check, with the files appearing once it's done? Are you sure nothing is being cached?
Are you sure your FTP client is connected to the right instance?
Are you sure there are no cron tasks or external entities connecting to the instance and cleaning out a certain directory? You said something about a build; is this a build agent you're performing this on?
I highly doubt it's this one, but: what type of volume are you working on? EBS? Instance store? Instance store is ephemeral, so stopping/starting the instance can result in data being lost (you can check this from the CLI; see the sketch after this list).
Have you tried using scp?
If you're still stumped, please provide more info on your EC2 config and how you're transferring the files.
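On the login-auditing question and the volume-type question, a couple of hedged one-liners; the log path varies by distro, and the instance ID is a placeholder:
last -a | head   # recent login sessions on the instance
sudo grep 'Accepted' /var/log/auth.log   # successful SSH logins on Debian/Ubuntu; use /var/log/secure on Amazon Linux
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 --query 'Reservations[].Instances[].[RootDeviceType,BlockDeviceMappings]'   # "ebs" vs "instance-store"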

IPython notebook remote server on AWS

I'm running a remote IPython notebook server on an EC2 instance on AWS. The instance is running Ubuntu.
I followed this tutorial to set it up, and everything seems to work: I can access the notebook via HTTPS with a password and run code.
However, I can't seem to save changes to the notebook. It says "saving notebook" and then nothing happens (i.e., it still says "unsaved changes" at the top).
Any ideas would be greatly appreciated.
Edit: It's not a permissions problem, since running with sudo doesn't help.
When creating a new notebook on the remote server, I am able to save. The problem only occurs for notebooks pulled from my git repository. Also, when I open a problematic notebook and delete all cells until it's absolutely empty, I can sometimes (!) save the empty notebook, and sometimes (!!) I still can't.
I've encountered an issue where notebooks wouldn't save on a notebook server on an AWS EC2 instance that I set up in a similar manner via a different tutorial. It turned out I had to refresh and re-login using the password, because my browser would automatically log out after a certain period. It might help to close the page, go back to the notebook server, and see if it asks you to log in again.
Here's a few other things you can try:
try copying a problematic notebook onto the server directly (scp, as in the sketch after this list) and opening and saving it there, as opposed to going through a repo pull, to see if anything changes
check whether the hanging "saving notebook" message appears for notebooks in certain directories
check the IPython console messages when you save a problematic notebook and see if anything there can help you pinpoint the issue
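For the first item, a minimal sketch of the scp copy; the key path, user, host, and target directory are placeholders for your own setup:
scp -i ~/.ssh/my-key.pem problematic.ipynb ubuntu@your-ec2-host:~/notebooks/   # copy the notebook directly onto the server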