Why does Dataproc Jupyterhub constantly disconnect? - google-cloud-platform

When creating a Dataproc instance and connecting via Jupyterhub, it constantly disconnects. This means any work on a Jupyter notebook in the Jupyterhub connection is lost. This seems to happen very frequently and appears to happen for many users, not just me (it happened to a class of about 6 people I teach). Here are the errors that happen (centered around failed to fetch):
This seems uncharacteristically poor for Google. Is there any way to fix it, or is it some fundamental problem with Dataproc and GCP? I don't have premium support so don't know how to write in to Google directly about it.

Related

Unable to connect to runtime & how to avoid disconnecting

I've been running a few ML training sessions on a GCE VM (with Colab). At the start they save me a good deal of time and computing resources, but, like everything Google so far, the runtime ultimately disconnects and I cannot reconnect to my VM despite it still being there.
a) How do we reconnect to a runtime if the VM exists, we have been disconnected, and it says it cannot reconnect to the runtime?
b) How do we avoid disconnecting, or avoid this issue altogether? I am using Colab Pro+ and paying for VMs. They always cut out at some point, and it's just another week of time gone out the window. I must be doing something wrong, as there's no way we pay just to lose all of our progress and time and have to restart in the hope it doesn't collapse again (it's been about 2 weeks of lost time, and I'm wondering why GCE VMs can't just run a job for 4 days without collapsing at some point). What am I doing wrong? I just want to pay for an external resource that runs the jobs I pay for, without connecting, disconnecting, and losing everything every few days. I don't understand why Google does this.

My SSH session into my VM Cloud is suddenly lagging

Every day I log in to an SSH session on a Google Cloud VM I maintain (Debian).
About a week ago, I noticed the session lagging as I typed into the VM or did anything else. I mostly log in to this VM to check the log files of scheduled scripts I have, and even when I use "cat script.log", loading the log text now takes at least 5 to 7 seconds where it used to take less than 2.
Pinging different websites gives me a reasonable 10 - 15 ms. I'm pretty sure it's not my local connection either, as everything else I do on my local computer works fine.
A warning has now started to appear in my session, saying
"Please consider adding the IAP-secured Tunnel User IAM role to start using Cloud IAP for TCP forwarding for better performance. Learn more Dismiss"
I've already configured the IAP secured tunnel to my account, which is the owner account of GCP project.
A coworker of mine is able to access the VM without any performance issues whatsoever.
In my opinion, your issue is with the ISP. For some reason the SSH sessions are lagging.
That's why other computers using your home ISP see lagging SSH sessions too. If a firewall rule were interfering, you wouldn't be able to connect at all.
You may try resetting all the network hardware in your home. If that doesn't help, run the tracert command in the Windows shell, then contact your ISP and pass on your findings. It's possible the problem is on their end (and if not, maybe on their upstream ISP's end, and so on).
To solve the problem you need to add the "IAP-secured Tunnel User" role at the project level in IAM for that user. See the instructions in a blog I wrote about this. That should solve your problem.
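As a rough sketch of what granting that role could look like from a script, the snippet below only builds the gcloud command line and prints it, so it can be reviewed before being run. It assumes the "IAP-secured Tunnel User" role corresponds to the role ID roles/iap.tunnelResourceAccessor; the project ID and email address shown are placeholders, and the helper name is made up for illustration.

```python
import shlex

# Assumption: this is the role ID behind the "IAP-secured Tunnel User" display name.
IAP_TUNNEL_USER_ROLE = "roles/iap.tunnelResourceAccessor"


def iap_binding_command(project_id, user_email):
    """Build (but do not run) a gcloud command granting the IAP tunnel role.

    Hypothetical helper: it only assembles the argument list so the
    command can be inspected or passed to subprocess.run() later.
    """
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=user:{user_email}",
        f"--role={IAP_TUNNEL_USER_ROLE}",
    ]


if __name__ == "__main__":
    # Placeholder project and user; replace with your own values.
    print(shlex.join(iap_binding_command("my-project", "me@example.com")))
```

Printing the command instead of executing it keeps the sketch free of side effects; the same list can be handed to subprocess.run() once the values are checked.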

Keeping Datalab running

So I tried running some long calculations in a Datalab notebook. It should have finished overnight. It did not.
What happened is that Google shut down my Datalab instance a bit after midnight, without stating any reason I can find. Why did they do so? It appears Google shut it down on their end. The computer did fall asleep at one point, when it became unplugged. Maybe if the computer falls asleep, it still looks like Google closed it from their end. This is completely unacceptable default behavior. Google shouldn't be assuming people have their computers running 24/7.
How can I prevent Google from shutting down instances which are still working?
On a semi-related note, how can I keep notebooks running even if my computer doesn't maintain the connection? If my internet goes down, if the computer goes to sleep, etc., I want my cloud notebook to continue running. That's part of the point of the cloud after all.
See this: https://cloud.google.com/datalab/docs/concepts/auto-shutdown
After your computer is shut down, your notebook's kernel will keep running on the VM (until it's auto-stopped, or you shut it down later), so if you were in the middle of a long-running notebook cell it should keep running to the end. After the connection to the notebook is lost, the rest of the notebook won't run, and you won't be able to see the results of your commands inside the notebook.
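Since output produced after the connection drops does not show up in the notebook, one possible workaround is to have the long-running cell write its results to a file on the VM as it goes. The following is only a minimal sketch of that idea, not something the Datalab documentation prescribes; the function name, the JSON file layout, and the work callback are all assumptions made for illustration.

```python
import json
from pathlib import Path


def run_with_checkpoints(items, work, out_path):
    """Run work(item) for each item, saving results to disk after every step.

    Hypothetical helper: items already present in the output file are
    skipped, so re-running the cell after a reconnect resumes the job.
    """
    out = Path(out_path)
    results = json.loads(out.read_text()) if out.exists() else {}
    for item in items:
        key = str(item)
        if key in results:
            continue  # finished in an earlier run
        results[key] = work(item)
        # Write to a temporary file first, then swap it into place, so an
        # interrupted write does not leave a half-written results file.
        tmp = out.with_name(out.name + ".tmp")
        tmp.write_text(json.dumps(results))
        tmp.replace(out)
    return results
```

Used from a cell as, for example, run_with_checkpoints(range(100), my_slow_function, "results.json"), the results can be read back from the file once you reconnect, even though the cell's own output was never displayed.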

EC2 Database through Laravel Forge has stopped being accessible

I've been running an EC2 instance through Laravel Forge for about 2000 hours, and this morning I got this error while trying to reach it:
SQLSTATE[08006] [7] could not connect to server: Connection refused Is
the server running on host "172...***" and accepting TCP/IP
connections on port 5432?
After SSHing into the server, I'm getting a similar error when trying to run a command. I've dug through AWS but don't see any errors being thrown. I double-checked the IP address for the instance to make sure it hadn't changed for any reason. Of course I'm a little behind on my backups for the application, so I'm hoping someone might have some ideas about what else I can do to try to access this data. I haven't made any changes to the app in about 10 days, but found the error while I was pushing an update. I have six other instances of the same app that weren't affected (thankfully), which makes me even more confused about the cause of the issue.
In case anyone comes across a similar issue, here's what had happened. I had an error running in the background which had filled up the EC2 instance's hard drive with log output. Since the default Laravel/Forge image has a DB running within the EC2 instance, once it ran out of room everything stopped working. I was able to SSH in and delete the log, though, and everything started working again.
To prevent the issue from happening again, I then created an Amazon RDS instance and used that rather than the database on the EC2 instance. It's about three or four times the price of just an EC2 instance, but still not that much, and the confidence I now have in the system is well worth it.
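A full disk like the one described above can also be caught before it takes the database down. The snippet below is a small sketch using only the Python standard library; the 90% threshold and the function names are arbitrary choices for illustration, and how the check is scheduled and how alerts are sent is left open.

```python
import shutil


def disk_usage_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on unusual filesystems
    return usage.used / usage.total


def is_disk_nearly_full(path="/", threshold=0.9):
    """Return True when usage is at or above threshold (assumed default: 90%)."""
    return disk_usage_fraction(path) >= threshold


if __name__ == "__main__":
    if is_disk_nearly_full("/"):
        print("Warning: root filesystem is almost full")
```

Run from cron or a similar scheduler, a check like this gives some warning before a runaway log file fills the drive.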

How to determine that an AWS EC2 instance is still initialising from a script

Is there a way to determine through a command line interface or other trick if an AWS EC2 instance is ready to receive ssh connections?
The running state does not seem to be enough. When trying to connect in the first minutes of the running state, the machine's status checks still show initialising and ssh times out.
(I am using the awscli pip package.)
Running is similar to turning a computer on and finishing a BIOS check. As far as the hypervisor is concerned, your instance is on.
The best way to know when your instance is ready is to run a script at the end of startup (or when certain services are up) that reports its status to some other listener. Using that data, or event, you will know that your instance is ready to be connected to. This is purposely vague, since there are so many different ways this can be accomplished.
You could also measure the expected startup time, try to connect after that, and retry the connection if it fails. You still need a point at which you stop trying, as instances can fail to launch in some cases.
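The retry-with-a-cutoff approach can be sketched as below. This is a minimal example that only checks whether the SSH port accepts TCP connections, which is an approximation of "ready" rather than a guarantee that login will work; the function name and the default timings are assumptions. The AWS CLI also offers "aws ec2 wait instance-status-ok", which blocks until the status checks pass and may be enough on its own.

```python
import socket
import time


def wait_for_port(host, port=22, timeout_s=300.0, interval_s=5.0,
                  connect_timeout_s=3.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper: retries every interval_s seconds and gives up
    after roughly timeout_s seconds, since an instance may never come up.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port),
                                          timeout=connect_timeout_s):
                return True
        except OSError:
            pass  # refused or timed out; the instance is probably still booting
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)
```

For example, wait_for_port("203.0.113.10") (a placeholder address) would poll port 22 for up to five minutes and return False if the instance never starts accepting connections.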