Unable to connect to runtime & how to avoid disconnecting - google-cloud-platform

I've been running a few ML training sessions on a GCE VM (with Colab). At first they save me a good deal of time and computing resources, but, like everything Google so far, the runtime eventually disconnects and I cannot reconnect to my VM even though it is still there. a) How do we reconnect to a runtime when the VM still exists, we have been disconnected, and Colab says it cannot reconnect to the runtime?
b) How do we avoid disconnecting / this issue altogether? I am using Colab Pro+ and paying for the VMs, and they always cut out at some point, which means another week of time gone out the window. I must be doing something wrong, because there's no way we pay just to lose all of our progress and time over and over and have to restart hoping it doesn't collapse again (it's been about two weeks of lost time, and I'm wondering why a GCE VM can't simply run a job for four days without collapsing at some point). What am I doing wrong? I just want to pay for an external resource that runs the jobs I pay for, without a connect/disconnect/lose-everything issue every few days. I don't understand why Google does this.

Related

Compute Engine shows normal, but all ports are unavailable

Why does the instance in my Compute Engine project show as normal while all of its ports are unavailable, when everything worked fine just a few hours ago?
A few days ago my instance was attacked. Google sent me an email saying that my instance was conducting mining activities and that its resources were suspended. After I appealed, I deleted the instance and recreated it. Now every time I create an instance, it can only be used for a few hours; then all of its ports become unavailable and the IP cannot be pinged.
If someone could tell me what to do, I would really appreciate it.
I would recommend contacting GCP support for this; they will be able to investigate your issue internally and tell you the specific cause and the next steps for resolution.
https://cloud.google.com/support-hub

Why does Dataproc JupyterHub constantly disconnect?

When I create a Dataproc cluster and connect via JupyterHub, the connection constantly drops, so any work in a Jupyter notebook in that JupyterHub session is lost. This happens very frequently and appears to affect many users, not just me (it happened to a class of about six people I teach). Here are the errors that appear (centered around "Failed to fetch"):
This seems uncharacteristically poor for Google. Is there any way to fix it, or is it some fundamental problem with Dataproc and GCP? I don't have premium support, so I don't know how to contact Google directly about it.

Hadoop single node cluster slows down AWS instance

Happy ugly Christmas sweater day :-)
I am running into some strange problems with my AWS Ubuntu 16.04 instance running Hadoop 2.9.2.
I have just successfully installed and configured Hadoop to run in pseudo-distributed mode, and everything seems fine: when I start HDFS and YARN I don't get any errors. But as soon as I try even something as simple as listing the contents of the HDFS root directory, or creating a new directory, the whole instance becomes extremely slow. I wait about 10 minutes and it never produces a directory listing, so I hit Ctrl+C, and it takes another 5 minutes to kill the process. Then I try to stop both HDFS and YARN; that succeeds, but it also takes a long time. Even after HDFS and YARN have been stopped, the instance is still barely responsive. At this point all I can do to make it function normally again is go to the AWS console and restart it.
Does anyone have any idea what I might've screwed up? (I am pretty sure it's something I did. It usually is :-) )
Thank you.
Well, I think I figured out what was wrong, and the answer is trivial: my EC2 instance doesn't have enough RAM. It's a basic free-tier-eligible instance, and by default it comes with only 1 GB of RAM. Hilarious. Totally useless.
But I learned something useful anyway. One other thing I had to do to make my Hadoop installation work (I was getting a "connection refused" error, but I did make it work) was to change the line in core-site.xml that says
<value>hdfs://localhost:9000</value>
to
<value>hdfs://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:9000</value>
(replace the XXXs above with your instance's IP address; the whole value is the instance's public DNS name)
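For reference, that <value> line sits inside the fs.defaultFS property of core-site.xml (that's the property name in Hadoop 2.x; older guides use the deprecated fs.default.name). A minimal sketch of the full property block, with the same placeholder hostname, would look roughly like this:
<configuration>
  <property>
    <!-- URI that HDFS clients and daemons use to reach the NameNode -->
    <name>fs.defaultFS</name>
    <value>hdfs://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:9000</value>
  </property>
</configuration>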

Keeping Datalab running

So I tried running some long calculations in a Datalab notebook. They should have finished overnight. They did not.
What happened is that Google shut down my Datalab instance a bit after midnight, without stating any reason I can find. Why did they do so? It appears Google shut it down on their end. My computer did fall asleep at one point, when it became unplugged; maybe if the computer falls asleep it still looks as though Google closed it from their end. This is completely unacceptable default behavior. Google shouldn't assume people have their computers running 24/7.
How can I prevent Google from shutting down instances that are still working?
On a semi-related note, how can I keep notebooks running even if my computer doesn't maintain the connection? If my internet goes down, if the computer goes to sleep, etc., I want my cloud notebook to continue running. That's part of the point of the cloud, after all.
See this: https://cloud.google.com/datalab/docs/concepts/auto-shutdown
After your computer shuts down, your notebook's kernel will keep running on the VM (until it is auto-stopped, or until you shut it down later), so if you were in the middle of a long-running notebook cell it should keep running to the end. Once the connection to the notebook is lost, though, the rest of the notebook won't run, and you won't be able to see the results of your commands inside the notebook.
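That auto-shutdown page describes Datalab's idle timeout, which stops an idle VM by default. If I remember correctly, the datalab CLI lets you disable that timeout when you create the instance and later reattach to the still-running VM from any machine; roughly (the instance name here is made up, and the flag is from memory, so check datalab create --help):
# create the instance with idle auto-shutdown disabled (0 = never time out)
datalab create my-datalab-vm --idle-timeout 0
# later, from any machine, reconnect to the same (still running) VM and its kernels
datalab connect my-datalab-vm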

What could be causing a seemingly random AWS EC2 server to crash? (Error establishing database connection)

To begin, I am running a WordPress site on an AWS EC2 Ubuntu micro instance. I have already confirmed that this is NOT an error with WordPress/MySQL.
Seemingly at random the site goes down and I get the "Error establishing database connection" message. The server reports that it is running just fine, and rebooting usually fixes the issue, but I'd like to figure out the cause and resolve it so this stops happening (for the past two weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown any spikes (the site averages about 300 visits per day).
What's the cause, and how can this be fixed?
Sounds like you might be running into the CPU throttling that is a limitation of t1.micro instances: if you use too many CPU cycles, you get throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens, check some general stats on the health of the instance. You can get a feel for its high-level health with the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage; you may find a process (pid) that is consuming a lot of resources and starving your app.
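A rough sketch of that kind of spot check (standard Linux tools, nothing WordPress-specific; adjust to taste):
# one batch-mode snapshot of load plus the top CPU consumers
top -b -n 1 | head -20
# free and used memory and swap, in megabytes
free -m
# the ten processes using the most memory
ps aux --sort=-%mem | head -11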
More likely, something within your application is going out of control (how did you come to the conclusion that this is not a WordPress/MySQL issue?). Possibly a database connection is not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process with kill -3 <pid> (the QUIT signal; on a JVM this prints a thread dump). This will help you see where your application's threads are stuck, if they are.
Typically it's good practice to take two thread dumps a few seconds apart and compare them; if there is an issue in the application, you should see a lot of threads stuck at the same point.
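As a sketch of that two-dump routine (where <pid> is the process id found with the ps command above; note that kill -3 only produces a dump for processes that handle SIGQUIT that way, a JVM being the usual example):
kill -3 <pid>
sleep 10
kill -3 <pid>
# then compare the two dumps in the process's log or console output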
You might also want to check what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
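The same check from the shell, if you'd rather not open an interactive session (the user name is a placeholder; use whatever account has the PROCESS privilege):
mysql -u root -p -e "SHOW FULL PROCESSLIST;"
# look for long-running queries or a pile of connections stuck in the Sleep state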
Hope this helps, let us know what you find!