AWS Sagemaker: Jupyter Notebook kernel keeps dying - amazon-web-services

I get disconnected every now and then when running a piece of code in Jupyter Notebooks on SageMaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without losing my progress. At the moment, it shows "No Kernel" in the bottom bar, but my file still appears active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to reduce the number of disconnections, but they still happen every now and then.

Often, disconnections are caused by inactivity when a job runs for a long time with no user input. If it's pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, since December 2021 you can run Spark queries directly on the EMR cluster from SageMaker:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a blog post here https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/ that is helpful in getting you up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)

For me, a quick solution was to open a Terminal instead, save the notebook as a Python file, and run it from the terminal within SageMaker.
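For example, from a SageMaker terminal (assuming the notebook is called my_notebook.ipynb; adjust names and paths to your setup), something like this should work:
# export the notebook to a plain Python script
jupyter nbconvert --to script my_notebook.ipynb
# run it detached so it survives a browser disconnect; output goes to run.log
nohup python my_notebook.py > run.log 2>&1 &
The nohup/& combination keeps the script running and writes its output to run.log even if the browser session drops.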

Related

Run Sagemaker notebook instance and be able to close tab

I'm currently using a SageMaker notebook instance (not SageMaker Studio), and I want to run a notebook that is expected to take around 8 hours to finish. I want to leave it running overnight and see the output from each cell; the output is a combination of print statements and plots.
However, when I start running the notebook and make sure the initial cells run, I close the JupyterLab tab in my browser, and when I open it again a few minutes later to see how it is going, the notebook has stopped.
Is there any way I can still use my notebook as it is, see the output from each cell (prints and plots), and not have to keep the JupyterLab tab open (so I can turn my laptop off, etc.)?
Jupyter will stop your kernel when you close the tab. If you want your jobs to keep running after you close the Jupyter tab, I would recommend looking into SageMaker Processing or Training jobs for your workloads. Alternatively, this link provides some options on how to keep the notebook running with the tab closed.
Answering my own question: I ended up using SageMaker Processing jobs for this, as initially suggested by the other answer. I found this library, developed a few months ago: Sagemaker run notebook. It let me keep my notebook structure and cells as I had them, run the notebook on a bigger instance, and keep editing it on a smaller one.
The output of each cell was saved to S3 as a Jupyter notebook, along with the plots I had.
The library doesn't seem to be actively maintained, but you can fork it, make changes, and use it as per your requirements, for example by creating a Docker container based on your needs.
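If you would rather use plain SageMaker Processing jobs instead of that library, a minimal sketch with the SageMaker Python SDK could look roughly like the following; the image URI, role ARN, script name and bucket are placeholders for your own resources:

from sagemaker.processing import ScriptProcessor, ProcessingOutput

# placeholder values: replace with your own image, role, script and bucket
processor = ScriptProcessor(
    image_uri="<your-container-image-uri>",
    command=["python3"],
    role="<your-sagemaker-execution-role-arn>",
    instance_type="ml.m5.4xlarge",   # a bigger instance than the notebook's
    instance_count=1,
)

processor.run(
    code="my_notebook_as_script.py",  # e.g. the notebook exported with nbconvert
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",           # write results here in the script
            destination="s3://<your-bucket>/notebook-runs/",
        )
    ],
)

The job runs independently of the notebook tab, and anything the script writes to /opt/ml/processing/output is uploaded to S3 when it finishes.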

Not able to run Airflow in Cloud Run, getting disk I/O error

I am trying to run Airflow in Google Cloud Run.
I'm getting a "disk I/O error"; I guess disk write permission is missing.
Can someone please help me with how to give write permission inside Cloud Run?
I also have to write a file and later delete it.
Only the directory /tmp is writable in Cloud Run. So, change the default write location to write into this directory.
However, you have to be aware of 2 things:
Cloud Run is stateless, which means that when a new instance is created, the container starts from scratch with an empty /tmp directory.
The /tmp directory is an in-memory file system. The maximum memory allowed on Cloud Run is 2 GB, your app's memory footprint included. Between your files and Airflow itself, you may not have much space left.
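As an illustration, a minimal Python sketch (not specific to Airflow; the file contents are just placeholders) of writing a file under /tmp and deleting it afterwards, as you describe, could be:

import os
import tempfile

# /tmp is the writable, in-memory location on Cloud Run
fd, path = tempfile.mkstemp(dir="/tmp")
try:
    with os.fdopen(fd, "w") as f:
        f.write("intermediate data")   # placeholder content
    # ... use the file here ...
finally:
    os.remove(path)   # clean up so the in-memory filesystem doesn't fill up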
A final remark: Cloud Run is active only while it processes a request, and a request has a maximum timeout of 15 minutes. When there is no request, the allowed CPU is close to 0%. I'm not sure what you want to achieve with Airflow on Cloud Run, but my feeling is that your design is a strange fit, and I prefer to warn you before you spend too much effort on this.
EDIT 1:
The Cloud Run service has evolved in the right way. In 2022:
/tmp is no longer the only writable directory (you can write anywhere, but it's still in memory)
the timeout is no longer limited to 15 minutes, but to 60 minutes
The 2nd gen runtime (still in preview) allows you to mount an NFS (Filestore) or Cloud Storage (GCS FUSE) volume to make services "more stateful".
You can also execute jobs now. So, a lot of great improvements!
My impression is that you are getting the disk I/O error because you are using SQLite. Is that possible?
If you want to run Airflow in containers, I would recommend using Postgres or MySQL as the backend database.
You can also mount the plugins and dags folders on some external volume.
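For instance (a hedged sketch; the exact setting name depends on your Airflow version, and the connection string is a placeholder), pointing Airflow at a Postgres backend instead of SQLite can be done with an environment variable on the container:
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@<db-host>:5432/airflow"
On older Airflow 2.x versions the same setting lives under the core section, i.e. AIRFLOW__CORE__SQL_ALCHEMY_CONN.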

Browser drops connection during model training

I am currently trying to run a fairly long hyperparameter grid search (4-5 hours) and I keep having issues with JupyterLab (or haven't figured something out yet) on a GCP notebook instance. The browser connection to the notebook keeps dropping, whereas the training process continues just fine. When the training process finishes, there's nowhere to write the output, as the browser connection to the notebook has already dropped.
How can I keep that connection alive, or make sure the output gets written into the notebook even if my laptop gets turned off or disconnects?
There are multiple problems that may be affecting your notebook. It could be a GCP issue, a network issue... Therefore, you need to provide more information in order to diagnose what is happening. I would recommend opening a ticket with GCP or Jupyter support to conduct a more thorough investigation, as it can be difficult to diagnose and they will have more tools to do it. Also, what #Joaquim suggested seems like a good workaround for the moment. In any case, I have gathered several troubleshooting steps that you can follow to see whether one of these recurring issues is affecting you:
According to this Jupyter Notebook document, there is a ‘shutdown_no_activity_timeout’ option. The default value is ‘0’, which disables this automatic shutdown. The option might be overridden in the ‘jupyter_notebook_config.py’ file. You may follow these steps to confirm it:
Click on the name of the instance your Notebook is running on, on the AI Platform Notebooks page.
Access it remotely by clicking “SSH”.
Run this on the shell to confirm the existence of the overriding:
ls /home/*/.jupyter/jupyter_notebook_config.py
Run this command to confirm if the shutdown_no_activity_timeout option is doing the overriding:
cat /home/*/.jupyter/jupyter_notebook_config.py | grep shutdown_no_activity_timeout
Switch the option to ‘0’ if it is set to a different value, and reset the Notebook instances on this page to apply the change.
According to this other document, it might fail to connect when behind a proxy. You can try to disable your browser’s proxy settings.
You can also try to change the Jupyter port. In this Jupyter issue, the customer reports that their disconnection problem went away after changing it. If you are using the Chrome browser, open the Inspect panel (Ctrl+Shift+I) and compare your connection symptoms with this image; if you get similar errors, you may try changing the port (c.NotebookApp.port).
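If you do find an override, a minimal sketch of what the relevant lines in /home/<user>/.jupyter/jupyter_notebook_config.py could look like (the port number is only an example) is:

c = get_config()

# 0 disables the automatic shutdown on inactivity
c.NotebookApp.shutdown_no_activity_timeout = 0

# try a different port if the default one keeps dropping connections
c.NotebookApp.port = 8899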

Not able to run a cell in a JupyterLab notebook on Google Cloud AI Platform

I am running 2 instances under Google AI Platform, which basically launches 2 VM instances to run JupyterLab. I have been happily making notebooks on both VMs. I shut down both VMs for the day...
What's strange is that the next morning, the notebook from one VM will launch, but when I run any cell containing something simple like "import pandas", it never returns a result and hangs the whole thing (with a * where the cell number would have appeared). I created a whole new notebook and just did a simple print("hello"); it also never returns. I restarted the instance a few times and it still doesn't work. What I noticed is that the "dot" in the top right corner is filled black. I think it should be white when the kernel is restarted, so there could be a problem with the kernel.
Any ideas what could have gone wrong? I don't even know where to start debugging this. The strange thing is that the other VM still works. I don't want to do anything drastic like re-creating the VM, since I'd like to be able to fix this for a known cause.
Has anyone out there experienced the same thing?
In case you didn't attempt this, I would try refreshing the notebook window after restarting the machine.

Slow performance after syncing storage bucket with gcsfuse

I've run into a bug that seemingly has no explanation as to why it's happening.
Running scripts that are in my mounted drive (mounted using gcsfuse) takes forever (almost a couple of minutes per command). However, any scripts that I run from outside the mount folder seem to work fine. I can also notice a significant lag in cursor movement. Using GCS is a must for me, as my dataset is already uploaded there and trying to rsync it to the VM takes forever.
I would appreciate any help regarding this issue.
Here is the cloud monitoring dashboard for some metrics I measured.