Browser drops connection during model training - google-cloud-platform

I am currently trying to run a fairly long hyperparameter grid search (4-5 hours) and I keep having issues with Jupyter Lab (or there is something I haven't figured out yet) on a GCP notebook instance. The browser connection to the notebook keeps dropping, whereas the training process continues just fine. When training finishes, there is nowhere to write the output, as the browser connection to the notebook has already dropped.
How can I keep that connection alive, or make sure the output gets written into the notebook even if my laptop gets turned off?

Several different problems may be affecting your notebook: it could be a GCP issue, a network issue, and so on, so more information is needed to diagnose what is happening. I would recommend opening a ticket with GCP or Jupyter support for a more thorough investigation, as this can be difficult to diagnose and they have more tools to do it. Also, what @Joaquim suggested seems like a good workaround for the moment. In the meantime, here are several troubleshooting steps you can follow to check whether one of these recurring issues is the one affecting you:
According to this Jupyter Notebook document, there is a ‘shutdown_no_activity_timeout’ option. The default value is ‘0’, which disables this automatic shutdown. The option might be overridden in the ‘jupyter_notebook_config.py’ file. You can follow these steps to confirm it:
On the AI Platform Notebooks page, click the name of the instance your notebook is running on.
Connect to it remotely by clicking “SSH”.
Run this in the shell to confirm that the override file exists:
ls /home/*/.jupyter/jupyter_notebook_config.py
Run this command to check whether the shutdown_no_activity_timeout option is set in that file:
cat /home/*/.jupyter/jupyter_notebook_config.py | grep shutdown_no_activity_timeout
Set the option to ‘0’ if it has a different value, then reset the notebook instance on this page to apply the change.
According to this other document, the notebook might fail to connect when you are behind a proxy. You can try disabling your browser’s proxy settings.
You can also try changing the Jupyter port. In this Jupyter issue, the user reports that the disconnection problem went away after changing it. If you are using Chrome, open the Inspect panel (Ctrl+Shift+I) and compare your connection symptoms with this image. If you get similar errors, you may try changing the port (c.NotebookApp.port).
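As a rough sketch of the check described in the steps above, the following Python snippet scans the text of a config file for an uncommented assignment to an option such as shutdown_no_activity_timeout or port. Note that `find_setting` is a hypothetical helper written for this illustration, not part of Jupyter, and the sample config text is made up.

```python
import re
from typing import Optional


def find_setting(config_text: str, name: str = "shutdown_no_activity_timeout") -> Optional[str]:
    """Return the value of the last uncommented `c.<Class>.<name> = ...` line, or None."""
    pattern = re.compile(r"^\s*c\.\w+\." + re.escape(name) + r"\s*=\s*(.+?)\s*(?:#.*)?$")
    value = None
    for line in config_text.splitlines():
        match = pattern.match(line)
        if match:
            value = match.group(1)  # a later assignment overrides an earlier one
    return value


# Made-up sample standing in for the contents of jupyter_notebook_config.py.
sample = """\
# c.NotebookApp.shutdown_no_activity_timeout = 0
c.NotebookApp.shutdown_no_activity_timeout = 3600  # set by someone else
c.NotebookApp.port = 8080
"""

print(find_setting(sample))          # 3600
print(find_setting(sample, "port"))  # 8080
```

On the instance itself you would pass in the real file contents instead of the sample, for example by reading the path found by the ls command above. A result of None means the file does not override the option, so the default applies.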

Related

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnected every now and then when running a piece of code in Jupyter Notebooks on Sagemaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without losing my progress. At the moment, it shows "No Kernel" in the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections will be caused by inactivity because a job is running for a long time with no user input. If it's pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, you can now run an EMR Spark query directly on the EMR cluster since December 2021:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a blog post here https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/ which is helpful for getting up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me, a quick solution was to open a Terminal instead, save the notebook file as a Python file, and run it from the terminal within Sagemaker.
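A minimal sketch of that workaround, assuming a Linux notebook server and a notebook that has already been exported to a script (for example with `jupyter nbconvert --to script`). The names `run_detached`, `train.py` and `train.log` are hypothetical and chosen only for illustration. The idea is to start the script in its own session with its output going to a log file, so that it does not depend on the browser tab or the terminal staying open.

```python
import subprocess
import sys


def run_detached(script: str, log_path: str) -> subprocess.Popen:
    """Start `script` in its own session, appending stdout and stderr to `log_path`."""
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(
            [sys.executable, script],
            stdout=log_file,
            stderr=subprocess.STDOUT,   # keep errors in the same log
            stdin=subprocess.DEVNULL,   # the job must not wait for keyboard input
            start_new_session=True,     # POSIX only: detach from the launching terminal
        )
    finally:
        # The child process holds its own copy of the file descriptor.
        log_file.close()


# Hypothetical usage:
#   job = run_detached("train.py", "train.log")
#   print("started with PID", job.pid)
```

You can then reconnect later and read the log file to check progress. Having the training script itself write results to disk as it goes is a safer design than relying on any open session.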

Google Cloud VM Files Deleted after session disconnect

I am having some of my GCP instances behave in a way similar to what is described in the below link:
Google Cloud VM Files Deleted after Restart
The session gets disconnected after a small duration of inactivity at times. On reconnecting, the machine is as if it is freshly installed. (Not on restarts as in the above link). All the files are gone.
As you can see in the attachment, it creates the profile directory afresh when the session is reconnected. Also, none of the installations I have made are there. Everything is lost, including the root installations. Fortunately, I have been logging all my commands and file setups manually on my client. So nothing is lost, but I would like to know what is happening and resolve this for good.
This has now happened a few times.
A point to note is that if I get a clean exit, for example if I properly log out or exit from the SSH session, I get the machine back as I left it when I reconnect. The issue occurs only when the session disconnects by itself. There have also been instances where the session disconnected and I was able to connect back without losing anything.
The issue is not there on all my VMs.
From the suggestions from the link I have posted above:
I am not connected to Cloud Shell. I connect to the machine over SSH using the Chrome extension.
Have not manually mounted any disks (afaik)
I have checked the logs from gcloud compute instances get-serial-port-output --zone us-east4-c INSTANCE_NAME. I could not really make much of it. Is there anything I should look for specifically?
Any help is appreciated.
Please find the links to the logs as suggested by @W_B
Below is from 8th when the machine was restarted and files deleted
https://pastebin.com/NN5dvQMK
It happened again today. I didn't run the command immediately then. The below file is from afterwards though
https://pastebin.com/m5cgdLF6
The below one is after logout today.
https://pastebin.com/143NPatF
Please note that I have replaced the user id, system name and a lot of numeric values in general using regexp. So, there is a slight chance that the time and other values have changed. Not sure if that would be a problem.
I have added the screenshot of the current config from the UI
Using a locally attached SSD seems to be the cause. It is explained here:
https://cloud.google.com/compute/docs/disks/local-ssd#data_persistence
You need to use a "persistent disk" - else it will behave just as you describe it.

Not able to run cell on a jupyterlab notebook Google cloud ai platform

I am running 2 instances under Google AI Platform, which basically launches 2 VM instances to run Jupyter Lab. I have been happily making notebooks on both VMs. I shut down both VMs for the day...
What's strange is that the next morning, a notebook from one VM will launch, but when I run any cell containing simple things like "import pandas", it never returns a result and hangs the whole thing (with a * where the cell number would have been generated). I created a whole new notebook with just a simple print("hello"), and it also never returns. I restarted the instance a few times and it still doesn't work. What I noticed is that the "dot" in the top right corner is filled black. I think it should be white when the kernel is restarted. So there could be a problem with the kernel.
Any ideas what could have gone wrong? I don't even know where to start debugging this. The strange thing is that the other VM still works. I don't want to do anything drastic like creating a new VM, since I would like to be able to fix this based on a known cause.
Has anyone out there experienced the same thing?
In case you didn't attempt this, I would try refreshing the notebook window after restarting the machine.

GCP) How to keep jupyter session connected after disconnecting jupyter session from my local laptop?

I keep a Jupyter server running on a GCP VM instance using tmux.
The problem is that I want the model to keep fitting after I disconnect from the Jupyter session on my local laptop
(e.g. I turn off my laptop, but the Jupyter session stays alive and keeps fitting the model, and I am able to reconnect to that session to check the status).
The only way I came up with is to put the code in a .py file and run $ python3 fitting.py, but I want to run and fit the model in a Jupyter notebook so I can monitor it without adding extra code.
If there is a possible way to do so, please kindly teach me.
Thanks!
Have you considered using the Fairing library? It comes pre-installed with GCP's new AI Platform Notebooks.
This library allows you to pack up your notebook and send it off for remote execution. A new notebook with the executed content will be saved to your GCP Storage bucket. No active internet connection is required once you kick off the notebook run.
You can learn how to use it by creating a new GCP AI Platform Notebook and looking at the tutorials folder inside it. You can also find additional tutorials for Fairing here

Ipython notebook remote server on AWS

I'm running a remote IPython notebook server on an EC2 instance on AWS. The instance is running Ubuntu.
I followed this tutorial to set it up, and everything seems to work: I can access the notebook via https with a password and run code.
However, I can't seem to save changes to the notebook. It says "saving notebook" and then nothing happens (i.e. it still says 'unsaved changes' at the top).
Any ideas would be greatly appreciated.
Edit: It's not a permissions problem, since running in sudo doesn't help.
When creating a new notebook in the remote server, I am able to save. Problem only occurs for notebooks pulled from my git repository. Also, when opening a problematic notebook, and deleting all cells until it's absolutely empty, I can sometimes (!) save the empty notebook, and sometimes (!!) I still can't.
I've encountered an issue where notebooks wouldn't save on an nbserver on an AWS EC2 instance that I set up in a similar manner via a different tutorial. It turned out I had to refresh and log in again using the password, because my browser would automatically log out after a certain period. It might help to close the page, go to the nbserver again, and see if it asks you to log in again.
Here are a few other things you can try:
try copying a problematic notebook onto the server directly (scp) and then opening and saving it, as opposed to going through a repo pull, to see if anything changes
check if the hanging "saving notebook" message appears only for notebooks in certain directories
check the ipython console messages when you save a problematic notebook and see if anything there can help you pinpoint the issue