How to debug unexpected instance termination on Google Cloud Computing - google-cloud-platform

I have a mongo database running on a Google Cloud Computing instance. For the second time now (in a few months), the server unexpectedly shut down and ended up in the "TERMINATED" state. How do I find the cause of the shutdown?
The serial console just says, "The resource 'projects/my-project/zones/europe-west1-b/instances/mongo-db' is not ready".
I looked into the database logs; it seems the process received an external signal to shut down ("got signal 15 (Terminated)").
Nothing suspicious in the syslogs or messages logs after spinning up a new instance on the same disk. Also, there was no planned maintenance as far as I'm aware.
Any idea where to look?

Since your mongo database actually received a terminate signal, your instance was probably shut down gracefully somehow. It sounds like something related to automatic migrations, but there are a couple of things to look at to help narrow this down.
In the Google Developers Console go to Compute -> Compute Engine -> VM instances -> mongo-db. There should be a section called "Availability policies." Check "On host maintenance" to make sure "Migrate VM instance" is selected. Otherwise, the VM will shut down instead of migrating during maintenance.
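You can also check and change this setting from the gcloud CLI. A minimal sketch, assuming the instance name and zone from your question:

# print the current on-host-maintenance policy for the instance
gcloud compute instances describe mongo-db --zone europe-west1-b \
  --format="value(scheduling.onHostMaintenance)"
# switch the policy to live migration
gcloud compute instances set-scheduling mongo-db --zone europe-west1-b \
  --maintenance-policy MIGRATE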
You can also look at the operations for an instance under Compute -> Compute Engine -> Operations. This lists all the operations that you and the system performed on your instances, so you may see something around the time that the process terminated. You can also see this from the gcloud CLI with gcloud compute operations list.
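To narrow the output down to the instance in question, the list can be filtered. A sketch, again assuming the zone and instance name from the question:

# list only operations whose target matches the mongo-db instance
gcloud compute operations list --zones europe-west1-b --filter="targetLink~mongo-db"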

Related

Unable to connect to runtime & how to avoid disconnecting

I've been running a few ML training sessions on a GCE VM (with Colab). At the start they were saving me a good deal of time/computing resources, but, like everything Google so far, ultimately the runtime disconnects and I cannot reconnect to my VM despite it still being there. a) How do we reconnect to a runtime if the VM exists, we have been disconnected, and it says it cannot reconnect to the runtime?
b) How do we avoid disconnecting/this issue at all? I am using Colab Pro+ and paying for VMs, and they always cut out at some point, so it's just another week of time gone out the window. I must be doing something wrong, as there's no way we pay just to lose all of our progress/time and have to restart in the hope it doesn't collapse again (it's been about 2 weeks of lost time and I'm just wondering why GCE VMs can't run a job for 4 days without collapsing at some point). What am I doing wrong? I just want to pay for an external resource that runs the jobs I pay for, with no connect/disconnect/lose-everything issue every few days. I don't understand why Google does this.

Is there a way to create an Image (AMI) when a spot instance receives a termination request?

Just before a spot instance gets terminated, I'd like to start creating an image of the instance.
In my testing, AWS waits for image generation to complete before the shutdown completes.
I also saw that this may provide easy access to termination information, but I have yet to see it on my instance:
wget -q -O - http://169.254.169.254/latest/meta-data/spot/termination-time
Yes, you should be able to use the termination notice to trigger creation of an AMI. The AMI creation process will not prevent the instance from being terminated, but instance termination should not impact an AMI creation that has already started.
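For illustration, a minimal polling sketch of that approach, assuming IMDSv1-style metadata access and an instance role that allows ec2:CreateImage (the AMI name below is a placeholder):

#!/bin/bash
# Poll the spot termination notice; when it appears, start an AMI build.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
while true; do
  if curl -sf http://169.254.169.254/latest/meta-data/spot/termination-time > /dev/null; then
    aws ec2 create-image --instance-id "$INSTANCE_ID" \
      --name "spot-backup-$(date +%Y%m%d%H%M%S)" --no-reboot
    break
  fi
  sleep 5
done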
However, I would like to recommend that you do not do it.
Instead, you should create an application architecture that does not care about failure. Rather than trying to save the contents of the instance, you should code the system to work successfully even if the instance is terminated.
The best way to do this is to store all data and state external to the instance, such as in a database, Amazon S3 or in an Amazon SQS queue. Then, if the instance is terminated and later started again, it can resume its state and continue operating from the last "save point".
This is much like the situation where a computer loses power. When next turned on, it should be able to start up again successfully, recover what it was doing and continue working.
So, try and avoid "old world" thinking of saving everything on a disk. Instead, store the data somewhere that will survive the failure or termination of an instance.
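As a minimal illustration of the "save point" idea (the bucket name and local path are placeholders):

# periodically save working state off the instance
aws s3 sync /var/app/state s3://my-app-state-bucket/state
# restore the last save point when a replacement instance starts up
aws s3 sync s3://my-app-state-bucket/state /var/app/state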
Creating an AMI also makes a copy of the entire operating system and application installation. This is somewhat overkill for simply recovering application state and data. By all means launch the instance from an AMI with everything installed, but don't treat AMIs as "backups" -- rather, they are golden images that contain everything necessary to run or install an application.
I think Handling AWS Spot Instance Termination Notices is exactly what I wanted. Haven't tested it.

How can I scale CloudFoundry applications "down" without the risk of restarting all of them?

This is a question regarding the Swisscom Application Cloud.
I have implemented a strategy to restart already deployed CloudFoundry applications without using cf restart APP_NAME. I wanted to be able to:
restart running applications without needing access to the app manifest, and
avoid them suffering any downtime.
The general concept looks like this (a scripted sketch of the sequence follows below):
(1) cf scale APP_NAME -i 2
increase the instance count of the app from 1 to 2, then wait for all app instances to be running
(2) cf restart-app-instance APP_NAME 0
restart the "old" app instance (index 0), then wait for all app instances to be running again
(3) cf scale APP_NAME -i 1
decrease the instance count of the app from 2 back to 1
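A scripted sketch of this sequence (untested; it assumes cf app prints one line containing "running" per healthy instance):

#!/bin/bash
APP=APP_NAME

wait_until_running() {
  # wait until at least $1 instances report the "running" state
  until [ "$(cf app "$APP" | grep -c running)" -ge "$1" ]; do
    sleep 5
  done
}

cf scale "$APP" -i 2               # (1) add a second instance
wait_until_running 2
cf restart-app-instance "$APP" 0   # (2) restart the old instance (index 0)
wait_until_running 2
cf scale "$APP" -i 1               # (3) scale back down to one instance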
This generally works and usually does what I expect it to do. The problem I am having occurs at Stage (3), where sometimes instead of just scaling the instance count back, CloudFoundry will also restart all (remaining) instances.
I do not understand:
Why does this happen only sometimes (all remaining instances restart when scaling down)?
Shouldn't CloudFoundry keep the remaining instances up and running?
If cf scale is not able to keep perfectly fine running app instances alive - when is it useful?
Please Note:
I am well aware of the Bluegreen / Autopilot plugins for zero-downtime deployment of applications in CloudFoundry, and I am actually using them for our deployments from our build server, but they require me to provide a manifest (and additional credentials), which in this case I don't have access to (unless I can somehow extract it from a running app via cf create-app-manifest?).
Update 1:
Looking at the plugins again I found bg-restage, which apparently does approximately what I want, but I have no idea how reliable that one is.
Update 2:
I have concluded that it's probably an obscure issue (or bug) in CloudFoundry and that cf scale gives no guarantee that existing instances remain running. As pointed out above, I have since realised that it is indeed possible to generate the app manifest on the fly (cf create-app-manifest). Even though I couldn't use the bg-restage plugin without errors, I reverted to the blue-green-deploy plugin, which I can now hand a freshly generated manifest, avoiding this whole cf scale exercise.
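A minimal sketch of that workflow (the output path is a placeholder, and the blue-green-deploy plugin's flag for passing a manifest may differ in your version; check its help):

# generate a manifest from the running app, then hand it to the plugin
cf create-app-manifest APP_NAME -p ./generated-manifest.yml
cf blue-green-deploy APP_NAME -f ./generated-manifest.yml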
Comment Questions:
Why do you have the need to restart instances of your application?
We are caching some values from persistent storage on start-up. This restart is triggered when changes to that data are detected.
Can you provide information about the health check?
We are using all types of health checks, depending on which app is to be restarted (http, process and port). I have observed this issue only for apps with the http health check type, where an http-endpoint is also defined for the health check.
Are you trying to alter the memory with cf scale as well?
No, I am trying to keep all app configuration the same during this process.
When you have two running instances, the command
cf scale <APP> -i 1
will kill instance #1 and instance #0 will not be affected.

How to read logs before Deadline Exceeded on Init TPU system

I'm trying to run a model with Python 2.7 on a TPU with my own .tfrecord data file. All my code compiles, but the moment the TPU starts doing its magic I don't have a clue what is going on behind the scenes.
Is there a way to track what is going on behind the scenes with a tf.debugger or something similar?
This is the only error message I get:
tensorflow.python.framework.errors_impl.DeadlineExceededError: Deadline Exceeded on Init TPU system
Thank you!
General Debugging
There are a few ways you can get more information on what the TPU is doing.
The most straightforward is adding tf.logging statements. If you're using TPUEstimator you'll likely want to have this logging inside your model_fn as this is usually where the core TPU-executed logic is. Make sure that you have your verbosity set at the right level to capture anything you're logging. Note however that logging may impact the performance of your TPU more significantly than it would when running on other devices.
You can also get detailed information on what ops are running and taking up resources on the TPU using the Cloud TPU tools. These tools will add extra tabs to your TensorBoard.
These tools are more meant for performance tuning than for debugging, but they still may be of some use to you in seeing what ops are being run before a crash occurs.
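For reference, profiles are typically captured with the capture_tpu_profile tool from the cloud-tpu-profiler package; a sketch, with the TPU name and GCS model directory as placeholders:

pip install --upgrade cloud-tpu-profiler
# capture a trace from the running TPU into the model directory
capture_tpu_profile --tpu=$TPU_NAME --logdir=gs://my-bucket/model-dir
# the extra profiling tabs then show up in TensorBoard
tensorboard --logdir=gs://my-bucket/model-dir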
Troubleshooting DeadlineExceededError
The specific issue you're running into may not be helped by more logging or profiling. The deadline exceeded error can be caused by an issue with the host connecting to the TPU. Normally when there's an error on the TPU, two stack traces will be returned, one from the host and one from the TPU. If you're not getting any trace from the TPU side, the host may have never been able to connect.
As a quick troubleshooting step, you can try stopping and restarting your TPU server:
gcloud compute tpus stop $TPU_SERVER_NAME && gcloud compute tpus start $TPU_SERVER_NAME
This usually resolves any issues that the host has communicating with the TPU. The command is copied from the very helpful TPU troubleshooting page.
The page also gives the most common reason why the connection between the host and the TPU cannot be established in the first place:
If TensorFlow encounters an error during TPU execution, the script sometimes seems to hang rather than exit to the shell. If this happens, hit CTRL+\ on the keyboard to trigger a SIGQUIT, which causes Python to exit immediately.
Similarly, hitting CTRL+C during TPU execution does not shut down TensorFlow immediately, but instead waits until the end of the current iteration loop to exit cleanly. Hitting CTRL+\ causes Python to exit immediately.
If the TPU is still trying to finish the iteration loop from the last run, the host will be unable to connect. Using the suggested CTRL+\ can prevent this in the future.

How to determine that an AWS EC2 instance is still initialising from a script

Is there a way to determine through a command line interface or other trick if an AWS EC2 instance is ready to receive ssh connections?
The running state seems not to be enough. When trying to connect in the first minutes of the running state, the machine's status checks still show "Initializing" and ssh times out while trying to connect.
(I am using the awscli pip package.)
Running is similar to turning a computer on and finishing its BIOS check. As far as the hypervisor is concerned, your instance is on.
The best way to know when your instance is ready, is to run a script at the end of startup (or when certain services are on) that will report its status to some other listener. Using that data, or event, you should know that your instance is ready to be connected to. This is purposely vague since there are so many different ways this can be accomplished.
You could also estimate the expected startup time, try to connect after that, and retry the connection if it fails. You still need a point at which to stop trying, as instances can fail to launch in some cases.
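With the awscli package mentioned in the question, one convenient option is to block until the instance's status checks pass before attempting ssh. A sketch (instance ID, key and host are placeholders):

# wait until both EC2 status checks report "ok" for the instance
aws ec2 wait instance-status-ok --instance-ids i-0123456789abcdef0
# then retry ssh until the daemon accepts connections
until ssh -o ConnectTimeout=5 -i my-key.pem ec2-user@ec2-host true; do
  sleep 10
done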