Keeping Datalab running - google-cloud-platform

So I tried running some long calculations in a Datalab notebook. It should have finished overnight. It did not.
What happened is that Google shut down my Datalab instance a bit after midnight, without stating any reason I can find. Why did they do so? It appears Google shut it down on their end. My computer did fall asleep at one point, when it became unplugged; maybe if the computer falls asleep, it still looks as though Google closed the instance from their end. This is completely unacceptable default behavior. Google shouldn't assume people have their computers running 24/7.
How can I prevent Google from shutting down instances that are still working?
On a semi-related note, how can I keep notebooks running even if my computer doesn't maintain the connection? If my internet goes down, if the computer goes to sleep, etc., I want my cloud notebook to continue running. That's part of the point of the cloud, after all.

See this: https://cloud.google.com/datalab/docs/concepts/auto-shutdown
After your computer shuts down, your notebook's kernel will keep running on the VM (until the VM is auto-stopped, or you shut it down later), so if you were in the middle of a long-running notebook cell it should keep running to the end. After the connection to the notebook is lost, however, the rest of the notebook won't run, and you won't be able to see the results of your commands inside the notebook.
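If the idle auto-shutdown described in that doc is what stopped your instance, it can be disabled per instance. A minimal sketch, assuming the --idle-timeout flag of the datalab CLI covered in the linked page (the instance name is a placeholder; check datalab create --help for the exact syntax of your version):
# Create (or recreate) the instance with idle auto-shutdown disabled;
# per the linked doc, a timeout of 0 means "never shut down when idle"
datalab create my-datalab-instance --idle-timeout 0
# Reconnect to the same instance later, from this or another machine
datalab connect my-datalab-instance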

Related

Unable to connect to runtime & how to avoid disconnecting

I've been running a few ML training sessions on a GCE VM (with Colab). At the start they save me a good deal of time and computing resources, but, like everything Google so far, ultimately the runtime disconnects and I cannot reconnect to my VM despite it still being there. a) How do we reconnect to a runtime if the VM exists, we have been disconnected, and it says it cannot reconnect to the runtime?
b) How do we avoid disconnecting, or this issue at all? I am using Colab Pro+ and paying for VMs, and they always cut out at some point, so it's just another week of time gone out the window. I must be doing something wrong, as there's no way we pay just to lose all of our progress and time over and over and have to restart hoping it doesn't collapse again (it's been about two weeks of lost time, and I'm just wondering why a GCE VM can't run a job for 4 days without collapsing at some point). What am I doing wrong? I just want to pay for an external resource that runs the jobs I pay for, with no connect/disconnect/lose-everything issue every few days. I don't understand why Google does this.

'Server Connection Error' on GCP (AI Platform Notebook)

I am facing some issues with GCP and the AI Platform (JupyterLab).
It seems that I am unable to maintain a stable connection with the server for very long. I keep getting those 'server connection error' messages. From there, two possibilities:
- either nothing happens and my cell keeps running, or
- the cells have stopped running and I can see the status 'No Kernel!' at the top right of the notebook. Whenever I select a kernel (Python 3) again, depending on my luck I can either keep working, or the cell will display the running status (with the * to the left of it) but the kernel status at the bottom left stays on 'connected' (instead of 'busy'). In the latter case, I need to restart the kernel and run all the cells again, which can take very long.
Sometimes this happens as soon as I run the first cell after (re)starting the instance, sometimes a bit later. The longest stable period I was able to work in the notebook without any issue was 20 to 30 minutes, which is quite annoying.
Configuration of my main instance:
- 16 CPUs
- 60 GB of RAM
- An NVIDIA P100 GPU
I have tried different instance types and I keep having the same problem; my network at home is stable.
(screenshot of the error message)
What operating system and browser are you using at work?
I had the same problem as you did on Ubuntu 18 with the Firefox browser.
When I switched to Windows with Chrome the error did not reoccur, even though it was the same network.
I had a similar issue today: according to the Google docs, the cause for this is that the Docker/Jupyter service is not starting.
The reason these services couldn't be started in our specific case was a full disk.
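If you suspect the same cause, a quick way to check from an SSH session on the instance is to look at disk usage and at the services backing the notebook. A rough sketch only; the exact service names can differ between image versions, so treat them as assumptions:
# Check whether any filesystem (especially the boot disk) is full
df -h
# Check the services that back the notebook; names may vary by image
sudo systemctl status docker
sudo journalctl -u jupyter --no-pager | tail -n 50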

Why does Dataproc JupyterHub constantly disconnect?

When creating a Dataproc instance and connecting via JupyterHub, it constantly disconnects, which means any work in a Jupyter notebook in the JupyterHub session is lost. This seems to happen very frequently and appears to affect many users, not just me (it happened to a class of about six people I teach). The errors that come up are centered around 'Failed to fetch'.
This seems uncharacteristically poor for Google. Is there any way to fix it, or is it some fundamental problem with Dataproc and GCP? I don't have premium support, so I don't know how to write to Google directly about it.

Google Cloud Platform Jupyter notebook still running after turning off local PC

I'm new to GCP and I'm trying to keep my process running in a Jupyter notebook after shutting down my local PC. Does anyone know how I can do that? Right now I open a terminal on my VM, run jupyter notebook, and then, after starting the process in Jupyter, I'd like to turn my machine off.
I then keep following the process on my cellphone and shut the VM down from there. Does anyone know how to shut it down automatically when the process finishes?
Sorry to ask two questions at once, but I think one is related to the other. If it isn't, I can edit and ask the other one separately.
This is a technical limitation of Jupyter Notebooks, unfortunately. The browser window contains the code which updates the notebook itself, so if you close the browser window there is no process running to update the notebook.
However, there is one workaround which you may find useful.
There is a library called Fairing that you can use with GCP's new AI Platform Notebooks. It allows you to pack up your notebook and run it remotely, and it will save the results of that execution in a GCP Storage bucket. No active internet connection is required (once you kick off the notebook run).
You can learn how to use it by creating a new GCP AI Platform Notebook and looking at the tutorials folder inside it. You can also find additional tutorials for Fairing here
Typically to keep your remote sessions up in the event of network connectivity loss (which also covers shutting down the local computer) you'd use a terminal multiplexer application. From Known issues:
Intermittent disconnects: At this time, we do not offer a specific SLA for connection lifetimes. Use terminal multiplexers like tmux or screen if you plan to keep the terminal window open for an extended period of time.
But these multiplexers are terminal/text-mode apps, so you'd have to launch the notebook with the --no-browser flag and then connect your local browser to its port.
You can find a recipe based on tmux and a local browser connection to the notebook using an SSH tunnel at Using Jupyter notebooks securely on remote linux machines.
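Put together, the workflow looks roughly like the following. This is a sketch only; the instance name my-notebook-vm and port 8888 are placeholders for your own setup:
# On the VM (via SSH): start a tmux session and launch the notebook headless
gcloud compute ssh my-notebook-vm
tmux new -s notebook
jupyter notebook --no-browser --port=8888
# Detach with Ctrl+b then d; the notebook keeps running even if SSH drops
# On the local machine: open an SSH tunnel to the notebook port,
# then point a browser at http://localhost:8888
gcloud compute ssh my-notebook-vm -- -N -L 8888:localhost:8888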
As for shutting down the session: you'd just have to instruct the multiplexer application to end the session (or terminate the multiplexer app itself), which you could do automatically via a wrapper script that first invokes your process and, immediately after the process ends, invokes the commands to shut down the session.
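For example, a hedged sketch of such a wrapper, run inside the tmux session; my_long_job.ipynb and the session name notebook are placeholders, and nbconvert is used here only as one way to execute a notebook non-interactively:
#!/bin/bash
# Run the notebook from start to finish without a browser attached
jupyter nbconvert --to notebook --execute --inplace my_long_job.ipynb
# Once the run ends, tear down the tmux session
tmux kill-session -t notebook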

What could be causing a seemingly random AWS EC2 server crash? ('Error establishing database connection')

To begin, I am running a Wordpress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with Wordpress/mysql.
Seemingly at random, the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue; however, I'd like to figure out the cause and resolve it so this stops happening (for the past two weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation of t1.micro. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens, I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
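A hedged sketch of those checks from an SSH session on the instance, using standard Linux tools (nothing AWS-specific):
# Interactive overview; press P to sort by CPU, M to sort by memory.
# On a throttled t1.micro, high %st (steal time) is a telltale sign.
top
# Quick look at memory headroom; micro instances have very little
free -m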
More likely, something within your application (how did you come to the conclusion that this is not a WordPress/MySQL issue?) is going out of control. Possibly there is a database connection not being released. To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process with kill -3 <pid> (that's how you get a Java thread dump; if the busy process turns out to be PHP rather than a JVM, check the web server and PHP-FPM logs instead). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to checkout what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
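The same check can be run non-interactively from the shell; this assumes you can authenticate as a MySQL user with the PROCESS privilege (root is used here only as an example):
# List every client connection and what it is currently executing
mysql -u root -p -e "SHOW FULL PROCESSLIST;"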
Hope this helps, let us know what you find!