'Server Connection Error' on GCP (AI Platform Notebook)

I am having issues with GCP and the AI Platform (JupyterLab).
It seems that I cannot maintain a stable connection with the server for long. I keep getting 'server connection error' messages. After that, one of two things happens:
- either nothing happens and my cell keeps running,
- or the cells stop running and I see the status 'No Kernel!' at the top right of the notebook. When I select a kernel (Python 3) again, depending on my luck, either I can keep working, or the cell shows the running status (with the * on its left) while the kernel status at the bottom left stays on 'connected' (instead of 'busy'). In the latter case I need to restart the kernel and run all the cells again, which can take very long.
Sometimes this happens as soon as I run the first cell after (re)starting the instance, sometimes a bit later. The longest I have been able to work on the notebook without any issue is 20 to 30 minutes, which is quite annoying.
Configuration of my main instance:
- 16 CPUs
- 60 GB of RAM
- An NVIDIA P100 GPU
I have tried different instance types and keep having the same problem. My network at home is stable.

What operating system and browser are you using at work?
I had the same problem as you did on Ubuntu 18 with the Firefox browser.
When I switched to Windows with Chrome the error did not reoccur, even though it was the same network.

I had a similar issue today. According to the Google docs, the cause is that the Docker/Jupyter service is not starting.
In our specific case, these services could not start because the disk was full.
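To rule out a full disk, free space can be checked from a terminal on the instance. The following is a minimal sketch using only the Python standard library; the function name `disk_report`, the path `/`, and the 5% threshold are illustrative assumptions, not values taken from the Google documentation.

```python
import shutil


def disk_report(path="/", min_free_fraction=0.05):
    """Return (free_fraction, is_low) for the filesystem containing `path`.

    The 5% default threshold is an arbitrary, illustrative choice.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    free_fraction = usage.free / usage.total
    return free_fraction, free_fraction < min_free_fraction


if __name__ == "__main__":
    fraction, is_low = disk_report("/")
    print(f"free space on /: {fraction:.1%}")
    if is_low:
        print("disk is nearly full; services such as Docker or Jupyter may fail to start")
```

If the disk turns out to be nearly full, freeing space or resizing the disk and then restarting the instance would be the next thing to try.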

Related

How do I get past ColdFusion server-specific error code 2?

I installed ColdFusion 2018 recently, and with the installation less than a month old (and my understanding of the technology even younger), my ColdFusion service has stopped working. I have tried a number of things and read a number of articles about similar errors where the service is inaccessible; some of the people in those posts got it resolved. However, whatever more obscure cause is behind my error remains unknown.
Whenever I try to restart the service, I get the error shown below:
“Windows could not start the ColdFusion 8 Application Server on Local Computer. For more information, review the System Event Log. If this is a non-Microsoft service, contact the service vendor, and refer to server-specific code error 2.”
Without much understanding, I started googling. Working through the posts I found, I tried:
- Configuring the JRE and relaunching the service, after checking the "JAVA_HOME" variable and jvm.config
- Running the batch files in every possible combination to see if anything clicks
- Checking whether the installed Java version works and is compatible with the installed ColdFusion version
- Fiddling with the "SessionStorage" var in the neo-runtime.xml file, as some suggested
and a few other tricks, coupled with numerous service restart attempts and a few machine reboots as well.
A service that renders ColdFusion pages should not shut down abruptly. To add to the agony, the CF Admin also depends on the service and hence does not work.
Any pointers to any potential solutions?

ChromeOS errors in GCP Logging

I'm seeing errors in Stackdriver Logging for my Compute Engine instance. The same errors repeat every hour, creating a lot of noise. I have a Spring Boot API deployed in a container on a Compute Engine VM running the latest stable version of Container-Optimized OS.
I'm relatively new to GCP and don't understand what is causing this issue; searches have come up empty so far.
Failed to call method: org.chromium.SessionManagerInterface.RetrieveActiveSessions: object_path= /org/chromium/SessionManager: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.SessionManager was not provided by any .service files
CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name org.chromium.SessionManager was not provided by any .service file
Error calling D-Bus proxy call to interface '/org/chromium/SessionManager': The name org.chromium.SessionManager was not provided by any .service files
The same 3 lines repeat every hour. Is anyone aware of what might be causing this, or how to fix or suppress it?

I looked into this error, and these are my findings:
The error message you are receiving is a symptom of Chrome consistently exiting shortly after starting up.
The UI job (which encompasses Chrome, the session_manager and the window manager) gets shut down by upstart because it is thrashing, and when the test tries to restart the session_manager, the session_manager cannot communicate over D-Bus.
The crash collection software in Container-Optimized OS was originally written for Chromebooks (laptops running Chrome OS), so the code typically expects Chrome and some other related software on the system.
However, Container-Optimized OS is a server OS and does not have Chrome. Because Chrome is missing, the software reports some errors. They are not real failures, just verbose error messages.
Overall, it is safe to ignore these logs and continue using your VM instances.
Hope this helps.

Access to Amazon EC2 takes LONG time

I have a t2.small instance with a weird problem that cripples my site.
Fetching even a single small logo image can take anywhere from under a second to more than a minute. I just press F5 in the browser and it takes a varying amount of time to get a small PNG!
I have more than 100 CPU credits.
There are no errors in the Apache error log.
In my tests, I'm accessing the site by IP address to bypass the ELB, but some browser refreshes still take a random amount of time, from immediate to a minute. Sometimes it returns a 504 error because the request took more than 60 seconds.
It is an Ubuntu machine that used to work OK. Apache 2.4 with KeepAliveTimeout=5.
Any ideas?
Thanks

This assumes that the site really "used to work OK".
1) Check for any changes you have made to the site's configuration. For example, you might have installed a misconfigured module into Apache. If you have an old snapshot, restore it and compare it with what you have now.
2) Look in /var/log/syslog. If there are any mysterious messages that look like a potential hardware fault, do an EC2 stop and start to move the VM to a different physical host.

Is there a way to speed up recovery from a crash

I'm trying to find a way to make Calabash switch to the next Scenario after it notices a crash:
Retrying.. HTTPClient::ReceiveTimeoutError: (execution expired)
Retrying.. HTTPClient::ReceiveTimeoutError: (execution expired)
Failing... HTTPClient::ReceiveTimeoutError
Otherwise it can take up to half an hour before Calabash reestablishes the connection to the Simulator and starts the next Scenario.

half an hour before Calabash reestablish connection to Simulator
This is very unusual and typically indicates a problem with UIAutomation.
Have you seen the Hot Topics page? In particular:
NSLog output can cause apps to become unresponsive during testing.
My best guess is that instruments is hanging for some reason. Below, I provide details about various variables and their defaults that influence launching and connecting the Calabash server.
I don't think that adjusting any of the variables below will make a difference in your case.
Reporting Problems
In the future, please include the details found in the Report Problems section of the Calabash iOS Wiki Home Page.
Environment Variables
You can find documentation about all the Calabash iOS environment variables here.
There are several variables you can use to control how long Calabash will wait for a response.
In Calabash iOS, two things need to happen before tests can begin:
1) The instruments command-line tool must launch the app and respond that it has launched the app.
2) Calabash must establish a connection with the embedded server.
You can control how long run-loop waits for instruments to launch the app and report back using the UIA_TIMEOUT environment variable. The default is 10 seconds. Calabash tells run-loop to try 3 times, for a total of 30 seconds. Unfortunately, there are no public API docs for run-loop.
There are two environment variables that control how long Calabash will try to establish a connection with the embedded server:
CONNECT_TIMEOUT
MAX_CONNECT_RETRY
The default is to try to reconnect once every 3 seconds, up to 10 times, for a total of 30 seconds.
These two variables are also used every time a query or gesture is made, to determine how long Calabash waits for the server to reply.
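The retry arithmetic described above (a fixed interval multiplied by a maximum number of attempts) can be sketched as a generic polling loop. This is only an illustration in Python, not Calabash's actual implementation; the names `wait_for_server`, `ping`, `max_attempts` and `retry_interval_s` are hypothetical, and the defaults simply mirror the 3 seconds and 10 tries mentioned above.

```python
import time


def wait_for_server(ping, max_attempts=10, retry_interval_s=3.0, sleep=time.sleep):
    """Call `ping()` until it returns True or the attempts run out.

    Returns the number of the attempt that succeeded. Raises TimeoutError
    after `max_attempts` failures, i.e. after waiting roughly
    max_attempts * retry_interval_s seconds in total.
    """
    for attempt in range(1, max_attempts + 1):
        if ping():
            return attempt
        sleep(retry_interval_s)  # wait before the next attempt
    raise TimeoutError(f"no response after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting; with the defaults, a server that never answers is given up on after about 30 seconds.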

What could be causing a seemingly random AWS EC2 server crash? (Error: couldn't establish database connection)

To begin, I am running a WordPress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with WordPress/MySQL.
Seemingly at random, the site goes down and I get the "Error establishing database connection" message. The server reports that it is running just fine, and rebooting usually fixes the issue. However, I'd like to figure out the cause and resolve it so this stops happening (for the past 2 weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown any spikes in traffic for the site (it averages about 300 visits per day).
What's the cause, and how can this be fixed?

Sounds like you might be running into CPU throttling, which is a limitation of t1.micro. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes

The next time this happens I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look for CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process: kill -3 <pid> (this produces a Java thread dump, so it only helps if the process is a JVM). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to take two thread dumps a few seconds apart and compare the two. If there is an issue in the application, you should see a lot of threads stuck at the same point in both.
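Comparing two dumps by eye gets tedious, so here is a rough sketch of the comparison in Python. It assumes the usual Java thread dump layout (a header line starting with a double quote for each thread, followed by stack frames starting with "at "); the function names `top_frames` and `stuck_candidates` and the `min_threads` threshold are hypothetical.

```python
from collections import Counter


def top_frames(dump_text):
    """Count the top stack frame of each thread in a Java thread dump."""
    counts = Counter()
    in_thread = False
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith('"'):
            in_thread = True       # header line of a new thread
        elif in_thread and line.startswith("at "):
            counts[line[3:]] += 1  # first frame after the header is the top frame
            in_thread = False      # ignore the deeper frames of this thread
    return counts


def stuck_candidates(dump_a, dump_b, min_threads=2):
    """Return frames where at least `min_threads` threads sit in both dumps."""
    a, b = top_frames(dump_a), top_frames(dump_b)
    return sorted(f for f in a if a[f] >= min_threads and b[f] >= min_threads)
```

A frame that shows up for many threads in both dumps, for example a call that waits for a database connection, is a good place to start looking.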
You might also want to check out what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
Hope this helps, let us know what you find!
