Google Cloud Platform - Stackdriver Enabled - 100% Compute Errors

I developed and support a client's mobile app that uses Firebase services.
Google Cloud Platform logged this event yesterday at 4:17 am:
'<my account email> has executed
google.api.serviceusage.v1.ServiceUsage.EnableService
on stackdriver.googleapis.com'
I was sleeping at the time, and a review of the Google Admin Console login audit log does not show a login event around that same time.
Immediately, 100% errors were being reported for 'compute'.
A look at the Stackdriver API overview page does not give any indication of activity.
My question and concern: how and why did this service get activated, and what activity is driving the compute errors at 100%?
During my efforts to understand, I clicked on Compute Engine API in the API library, which enabled the API (but no VMs, disks, etc. were created).
A short time later, Google Cloud Platform logged several entries:
google.devtools.cloudbuild.v1.CloudBuild.ListBuilds
was executed on builds
Number of returned items 1000
The 'compute' errors stopped.
When I disabled the Compute Engine API, the ListBuilds logs stopped, but the Compute errors returned to 100%.

I have not found a definitive answer to my question.
It's clear that Stackdriver API was enabled, but I don't know why.
When enabled, 100% Compute errors were being reported (orange line on graph) without any details.
While customizing the Google Cloud Platform dashboard for this account, I toggled/enabled the Compute Engine card/graph, hoping that might reveal some clues regarding the 100% errors. That action initialized the Compute Engine API. Almost immediately the Compute errors ended, but there was a surge of activity that has continued. Reviewing many resources, I found information that suggests this is normal behavior.
I would still like to fully understand how Stackdriver was enabled, why it was enabled, what value it provides, and whether I can simply disable it and the Compute Engine API, as this project will never require VM compute services.
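For reference, one way to check what is currently switched on and to turn off services that aren't needed is the gcloud CLI (a sketch, assuming the Cloud SDK is installed and authenticated against this project):
gcloud services list --enabled
gcloud services disable stackdriver.googleapis.com
gcloud services disable compute.googleapis.com
(gcloud may refuse to disable a service that other enabled services still depend on.)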

Related

How to get logs for Compute Engine API errors?

I am a total beginner in cloud service management, so this is a very basic question.
I have inherited a Kubernetes-based project running in Google Cloud. I have discovered recently that there are millions of errors I am unaware of in the APIs & Services > Compute Engine API > Metrics menu:
I have tried searching for these values both on Google and in the docs, to no avail. With no link to the list of logs and hundreds of sub-menu items, I feel completely lost as to where to start.
How can I get more information about these errors?
How can I navigate to the relevant logs?
Your question is rather general so I will make some assumptions and educated guesses about your project and try to explain.
This level of API call errors is of course unusually high and suggests that something isn't working (for example, someone deleted a backend service but left the load balancer without any health checks, so it accepts requests from the outside but there is nothing in the backend to process them).
That is just an example - without more details I can't even speculate further.
If you want to read more about the messages, take the second one from the top as an example - the documentation for compute.v1.BackendServicesService.delete.
You can also explore other Compute Engine API methods to see what they do, which will give you more insight into what is happening with your project.
This should give you a good starting point to explore the API.
Now, regarding logs: just navigate to the Logs Viewer and select as a resource whatever you want to analyse (everything, or a single VM, load balancer, firewall rule, etc.). You can also include (or exclude) certain log levels (warning, error, etc.). The possibilities are endless.
Your query may look something like this:
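For example (illustrative only - the resource type and severity are placeholder choices; swap in whatever you want to analyse), a filter that shows only error-level entries for Compute Engine VMs:
resource.type="gce_instance"
severity>=ERROR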
Here's more documentation on GCP Logs Viewer to help you out.

Finding untraced time in Google Cloud Tracer Agent for Express.js

I'm using Google Cloud's Stackdriver Trace Agent with the Express.js plugin.
I noticed there are a few routes which have substantial "untraced" time. What strategies can I use to find and begin to measure these untraced paths, and why would it not pick up certain code paths?
If the Trace agent isn't capturing certain code paths, there's unfortunately not very much you can do to modify its behavior. I recommend using OpenCensus to instrument your application, which will give you much more control over exactly how traces and spans are created.

Stackdriver Logging Client Libraries - What happens during Google Downtime?

If you embed the Stackdriver client library in your application and the Stackdriver API has downtime (Google documentation indicates 99.95% availability, i.e. roughly 21.92 minutes of downtime per month), my question is: what will happen to my application during that downtime? Will logging info build up in memory? Will it cause application errors, or will it discard the log data and continue on?
Logging API downtimes can have different root causes and consequences. Google's system engineers have mechanisms in place to track them and take mitigation actions so that downtime and its consequences are minimal, but Google cannot guarantee data-loss prevention in every Logging API outage.
Hopefully your application and pipeline can withstand up to roughly 21.56 minutes of expected downtime a month (99.95% SLA), as per the internal SLOs and SLAs of GCP.
The three scenarios you listed are plausible. During such a period, your application sending the logs may receive 500 responses from the network, so it has to be able to handle that kind of failure (see the sketch after the scenarios below).
If the logging data manages to reach Google's platform but an outage prevents the data from being accessible, Google's team will do their best to release backlogs, repopulate data, etc. They will post a general notice on https://status.cloud.google.com/.
If the issue is caused by the logging agent not sending data to the platform, the logging data may not be retrievable. It could still be an infrastructure outage with one of the GCP products, or it could be linked to something other than an outage, such as your application or its underlying host running out of resources, or the logging agent being corrupted - none of which is covered by the GCP Stackdriver SLA [1].
If the pipeline that ingests data from the Logging API is backlogged, it could cause an outage, but the GCP team will do their best to make the data accessible after the outage ends.
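As noted above, the client side has to tolerate transient 5xx responses from the Logging API. A minimal sketch of that idea with the Python client library (the log name, exception choices, and retry counts are illustrative, not something prescribed by this answer):
import time
from google.api_core import exceptions as api_exceptions
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
logger = client.logger("app-log")  # hypothetical log name

def log_with_retry(text, attempts=3):
    # Try to write a log entry; back off and retry on transient API errors.
    for attempt in range(attempts):
        try:
            logger.log_text(text)
            return True
        except (api_exceptions.ServerError, api_exceptions.TooManyRequests):
            time.sleep(2 ** attempt)  # simple exponential backoff: 1s, 2s, 4s
    return False  # caller decides whether to buffer or drop the entry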
If you suspect the Logging API is malfunctioning, please contact support, file an issue in the issue tracker, or inspect the open issues, where Google's product team provides live updates. Links below:
[1] SLA exclusions: https://cloud.google.com/stackdriver/sla#sla_exclusions
[2] Create a new incident: https://issuetracker.google.com/issues/new?component=187203&template=0
[3] Open issues: https://issuetracker.google.com/savedsearches/559764

Is there a way to tell who started an instance in Google Cloud Platform?

We run only a small handful of instances on Google Cloud Platform and we don't run them all the time. Generally we just fire one up, do what we need to do then shut it down... which is great, except when "we" forget to shut them down.
I've been able to track down the relevant REST APIs and the gcloud SDK, but I don't see anything that says who started the instance. It also doesn't have a timestamp for when it was started.
I did find this Python App Engine script that I might be able to rewrite to stop instances after X amount of time, but I'd rather find a way to notify the user who started the instance and let them know it is still running.
Has anyone tried to do something similar or seen a way to get the "starter" of the instance in GCP?
You can look into the Audit Logs to determine who did what, where, and when. Further, you can use the Stackdriver Logging API method entries.list to retrieve audit log entries for your use case.
You can also use the Activity Logs to see details such as the authorized user who made the API request.
With the new API you have to filter on the following:
resource.type="gce_instance"
resource.labels.instance_id="ID"
protoPayload.methodName="v1.compute.instances.start"
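A hedged sketch of pulling those audit entries with the Python Cloud Logging client (the project ID and instance ID are placeholders, and the payload layout is assumed to follow the Cloud Audit Log JSON format):
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")  # placeholder project ID
log_filter = (
    'resource.type="gce_instance" '
    'resource.labels.instance_id="ID" '  # placeholder instance ID
    'protoPayload.methodName="v1.compute.instances.start"'
)

# Each matching audit entry records who issued the start call and when.
for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
    auth_info = entry.payload.get("authenticationInfo", {})  # audit-log protoPayload as a dict (assumption)
    print(entry.timestamp, auth_info.get("principalEmail"))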

Google Cloud Pub/Sub on Google App Engine hits QPS limit too soon

Around 90 or 100 calls per second to
pubsub_client.projects().topics().publish(topic='projects/xxxx',body=body).execute(num_retries=0)
from a Google App Engine app to Google Cloud Pub/Sub results in
HttpError: <HttpError 429 when requesting https://pubsub.googleapis.com/v1/projects/xxxx:publish?alt=json returned "Request throttled due to user QPS limit being reached.">
I know there is a limit of 100 QPS on administrative operations, but surely publishing to a topic is not an administrative operation? I know Pub/Sub should support millions of operations per second, so I know there's something wrong.
Any help or insight would be appreciated. I need to get up to at least 300 publishes per second, trying to streamline an existing implementation using pubsub. I think this may be a bug with the implementation.
I am running this code on Google App Engine (Python 2.7), using the standard App Engine runtime, not the flexible one, as that's not approved for production code yet.
Note that publisher quota is not in terms of QPS, but in terms of throughput. The default limit is 100MB/s. See the Quotas documentation for more details. Depending on the message size you are sending, you may be running into these limits.
The "user QPS limit being reached" message on a publish usually means one of three things:
You are publishing at a throughput that is higher than the default 100MB/s quota. If that is the case, then you can apply for more quota by clicking "Apply for higher quota" on the Pub/Sub Quota page.
You are not authenticated against the correct Cloud project. If you are authenticated in, or running your Google App Engine instances in, a Cloud project that differs from the one your topic is defined in, the quota you run into may not be defined in the project you expect. More information can be found on the Google Application Default Credentials page.
You have manually set a quota on the Quota page, and that is the limit you are running into.
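If it does turn out to be plain throttling of the request rate, one mitigation to consider (a sketch built on the asker's apiary-style client; the batching and backoff numbers are illustrative, not a confirmed fix) is to pack several messages into each publish call and back off on 429s:
import base64
import time
from googleapiclient.errors import HttpError

def publish_batch(pubsub_client, topic, messages, max_attempts=5):
    # Publish a list of raw strings in one request, retrying on 429 with backoff.
    body = {'messages': [{'data': base64.b64encode(m)} for m in messages]}
    for attempt in range(max_attempts):
        try:
            return pubsub_client.projects().topics().publish(
                topic=topic, body=body).execute(num_retries=0)
        except HttpError as e:
            if e.resp.status == 429 and attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
            else:
                raise
Packing, say, ten messages per call keeps the same message throughput at a tenth of the request rate, which is often enough to stay under a per-request limit.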