How to read logs before Deadline Exceeded on Init TPU system - python-2.7

I'm trying to run a model with Python 2.7 on a TPU using my own .tfrecord data file. All my code compiles, but the moment the TPU starts doing its magic I have no clue what is going on behind the scenes.
Is there a way to track what is going on behind the scenes with a tf.debugger or something similar?
This is the only error message I get:
tensorflow.python.framework.errors_impl.DeadlineExceededError: Deadline Exceeded on Init TPU system
Thank you!

General Debugging
There are a few ways you can get more information on what the TPU is doing.
The most straightforward is adding tf.logging statements. If you're using TPUEstimator you'll likely want to have this logging inside your model_fn as this is usually where the core TPU-executed logic is. Make sure that you have your verbosity set at the right level to capture anything you're logging. Note however that logging may impact the performance of your TPU more significantly than it would when running on other devices.
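For example, here is a minimal sketch of what that logging might look like inside a TPUEstimator model_fn; the model itself is hypothetical:

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)  # make sure INFO messages are captured

def model_fn(features, labels, mode, params):
    # These fire while the graph is being built for the TPU, letting you
    # confirm shapes and modes before execution starts.
    tf.logging.info("model_fn called in mode: %s", mode)
    tf.logging.info("features shape: %s", features.get_shape())

    logits = tf.layers.dense(features, 2)  # hypothetical two-class model
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    optimizer = tf.contrib.tpu.CrossShardOptimizer(
        tf.train.GradientDescentOptimizer(learning_rate=0.01))
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)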
You can also get detailed information on what ops are running and taking up resources on the TPU using the Cloud TPU tools. These tools will add extra tabs to your TensorBoard.
These tools are meant more for performance tuning than for debugging, but they may still be of some use in seeing which ops are being run before a crash occurs.
Troubleshooting DeadlineExceededError
The specific issue you're running into may not be helped by more logging or profiling. The deadline exceeded error can be caused by an issue with the host connecting to the TPU. Normally when there's an error on the TPU, two stack traces will be returned, one from the host and one from the TPU. If you're not getting any trace from the TPU side, the host may have never been able to connect.
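If you want to test the host-to-TPU connection in isolation, a minimal init check can surface the error without running your full model. This is only a sketch, assuming the TF 1.x contrib APIs and a hypothetical TPU name:

import tensorflow as tf
from tensorflow.contrib.cluster_resolver import TPUClusterResolver

resolver = TPUClusterResolver(tpu="my-tpu")  # hypothetical TPU name
with tf.Session(resolver.master()) as sess:
    try:
        # This is the call that hits the deadline when the host can't connect.
        sess.run(tf.contrib.tpu.initialize_system())
        print("TPU system initialized")
    except tf.errors.DeadlineExceededError:
        print("host could not reach the TPU; try stopping and restarting it")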
A quick troubleshooting step you can try is stopping and restarting your TPU server:
gcloud compute tpus stop $TPU_SERVER_NAME && gcloud compute tpus start $TPU_SERVER_NAME
This usually resolves any issues that the host has communicating with the TPU. The command is copied from the very helpful TPU troubleshooting page.
The page also gives the most common reason that the connection between the host and the TPU cannot be established in the first place:
If TensorFlow encounters an error during TPU execution, the script sometimes seems to hang rather than exit to the shell. If this happens, hit CTRL+\ on the keyboard to trigger a SIGQUIT, which causes Python to exit immediately.
Similarly, hitting CTRL+C during TPU execution does not shut down TensorFlow immediately, but instead waits until the end of the current iteration loop to exit cleanly. Hitting CTRL+\ causes Python to exit immediately.
If the TPU is still trying to finish the iteration loop from the last run, the host will be unable to connect. Using the suggested CTRL+\ can prevent this in the future.

Related

ChromeOS errors in GCP Logging

I'm seeing errors in StackDriver logging for my Compute instance. The logs show the same issues repeating every hour, creating a lot of noise. I have a Spring Boot API deployed in a container to a VM in Compute Engine using the latest stable version of Container OS.
I'm relatively new to GCP and don't understand what is causing this issue; my searches have come up empty so far.
Failed to call method: org.chromium.SessionManagerInterface.RetrieveActiveSessions: object_path= /org/chromium/SessionManager: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.SessionManager was not provided by any .service files
CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name org.chromium.SessionManager was not provided by any .service file
Error calling D-Bus proxy call to interface '/org/chromium/SessionManager': The name org.chromium.SessionManager was not provided by any .service files
The same 3 lines repeat every hour. Is anyone aware of what might be causing this, or how to fix/suppress these messages?
I looked into this error; here is what I found:
The error message you have been receiving is a manifestation of Chrome reliably exiting shortly after starting up.
The UI’s job (which encompasses Chrome, the session_manager, and the window manager) gets shut down by upstart because of its thrashing, and when the system tries to restart the session_manager, the session_manager cannot communicate over D-Bus.
The crash collection software in Container OS was originally written for Chromebooks (laptops running the Chrome browser), so the code expects Chrome and some other related software to be present on the system. However, Container OS is a server OS and does not have Chrome, so the software reports errors when it cannot find it. These are not real failures, just verbose error messages.
Overall, it is safe to ignore these logs and continue using your VM instances.
Hope this helps.

Shutdown scripts to run upon AWS termination

I am trying to get some scripts to run upon an AWS termination action. I have created /etc/init.d/Script.sh and symlinked it to /etc/rc0.d/K01Script.sh.
However, terminating through the AWS console did not produce the output I was looking for. (It is a script that does a quick API call to a server over HTTPS and should take only a few seconds.)
Then I tried again but specifically changed a kernel parameter:
sudo sysctl -w kernel.poweroff_cmd=/etc/rc0.d/K01Script.sh
and again no output.
I get the message "The system is going down for power off NOW!" when terminating the server, so I'm pretty sure the Ubuntu server is going into runlevel 0. The script is owned by root.
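For context, the script is essentially a quick HTTPS ping. A minimal Python 2.7 sketch of that kind of hook script (the endpoint here is hypothetical):

#!/usr/bin/env python
# Fire one best-effort HTTPS request before the instance powers off.
import urllib2

try:
    # Hypothetical endpoint; substitute the real API URL.
    urllib2.urlopen("https://api.example.com/notify-termination", timeout=5)
except Exception:
    pass  # the instance is going down either way; never block shutdown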
I know I could create a lifecycle hook to do something like this, but my team prefers the quick and dirty way.
Any help very much appreciated!

How to determine that an AWS EC2 instance is still initialising from a script

Is there a way to determine through a command line interface or other trick if an AWS EC2 instance is ready to receive ssh connections?
The running state seems not to be enough. Trying to connect in the first minutes of the running state, the machine's status checks still show "initializing" and ssh times out while trying to connect.
(I am using the awscli pip package.)
Running is similar to turning a computer on and finishing the BIOS check. As far as the hypervisor is concerned, your instance is on.
The best way to know when your instance is ready is to run a script at the end of startup (or when certain services are up) that reports its status to some other listener. Using that data, or event, you will know that your instance is ready to be connected to. This is purposely vague since there are so many different ways this can be accomplished.
You could also time the expected startup time, try to connect after that, and retry the connection if it fails, as sketched below. You still need a point at which you stop trying, since instances can fail to launch in some cases.
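Since the question mentions the awscli pip package, its Python sibling boto3 ships a waiter that does this polling for you. A sketch, with a hypothetical instance ID:

import boto3

ec2 = boto3.client("ec2")  # assumes credentials and a region are configured
instance_id = "i-0123456789abcdef0"  # hypothetical instance ID

# Blocks, polling until the instance passes both EC2 status checks,
# i.e. until the console no longer shows "initializing"; raises if
# the checks never pass.
waiter = ec2.get_waiter("instance_status_ok")
waiter.wait(InstanceIds=[instance_id])
print("instance should be ready for ssh")

The awscli equivalent is aws ec2 wait instance-status-ok --instance-ids <id>. Note that passing the status checks still doesn't strictly guarantee sshd is accepting connections, so a short retry loop around the ssh attempt itself is a sensible complement.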

How to debug unexpected instance termination on Google Cloud Computing

I have a mongo database running on a Google Cloud Computing instance. For the second time now (in a few months), the server unexpectedly shut down into mode "TERMINATED". How do I find the cause of the shutdown?
The serial console just says, "The resource 'projects/my-project/zones/europe-west1-b/instances/mongo-db' is not ready".
I looked into the database logs; it seems it received an external signal to shut down ("got signal 15 (Terminated)").
Nothing suspicious in the syslogs or messages logs after spinning up a new instance on the same disk. Also, there was no planned maintenance as far as I'm aware.
Any idea where to look?
Since your mongo database actually received a terminate signal, your instance was probably shut down gracefully somehow. It sounds like something related to automatic migrations, but there are a couple of things to look at to help narrow this down.
In the Google Developers Console, go to Compute -> Compute Engine -> VM instances -> mongo-db. There should be a section called "Availability policies." Check "On host maintenance" to make sure "Migrate VM instance" is selected. Otherwise, the VM will shut down instead of migrating during maintenance.
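You can also check that setting programmatically. A sketch using the google-api-python-client package, reusing the project, zone, and instance name from the question:

from googleapiclient import discovery

# Assumes Application Default Credentials are available in the environment.
compute = discovery.build("compute", "v1")
instance = compute.instances().get(
    project="my-project",
    zone="europe-west1-b",
    instance="mongo-db",
).execute()

# "MIGRATE" means the VM live-migrates for maintenance instead of stopping.
print(instance["scheduling"]["onHostMaintenance"])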
You can also look at the operations for an instance at Compute -> Compute Engine -> Operations. This has all the operations that you and the system performed on your instances, so you may see something around the time that the process terminated. You can also see this with the gcloud CLI via gcloud compute operations list.

What could be causing seemingly random AWS EC2 server to Crash? (Error couldn't establish database connection)

To begin, I am running a WordPress site on an AWS EC2 Ubuntu micro instance. I have already confirmed that this is NOT an error with WordPress/MySQL.
Seemingly at random, the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue; however, I'd like to figure out the cause and resolve it so this can stop happening (for the past two weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day).
What's the cause, and how can this be fixed?
Sounds like you might be running into the CPU throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens, I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
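If you'd rather capture those numbers from a script (say, on a cron interval while waiting for the next outage), here is a sketch using the psutil package, which would need to be pip-installed first:

import psutil

# One-second sample of overall CPU usage, plus memory headroom.
print("cpu %%: %s" % psutil.cpu_percent(interval=1))
mem = psutil.virtual_memory()
print("mem used: %s of %s bytes" % (mem.used, mem.total))

# Roughly what you'd scan for in top: the biggest memory consumers.
for proc in psutil.process_iter(attrs=["pid", "name", "memory_percent"]):
    if (proc.info["memory_percent"] or 0) > 5.0:
        print(proc.info)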
More likely, something within your application (how did you come to the conclusion that this is not a WordPress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process by sending it SIGQUIT (kill -3 <pid>; note that this produces a thread dump for Java processes). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to checkout what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
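If you'd rather capture the process list from a script the next time the site goes down, here is a sketch using the MySQLdb package; the connection details are placeholders:

import MySQLdb

# Placeholder credentials; use your actual WordPress database settings.
conn = MySQLdb.connect(host="localhost", user="root", passwd="secret")
cursor = conn.cursor()
cursor.execute("SHOW FULL PROCESSLIST")
for row in cursor.fetchall():
    print(row)  # look for long-running queries or stuck connections
conn.close()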
Hope this helps, let us know what you find!