GCP log-based metrics error: REQUEST_CONTEXT_READ - google-cloud-platform

Loading log-based metrics suddenly stopped working with this error:
An error occurred requesting data. Can not find metric metric_types: "logging.googleapis.com/user/<redacted>" resource_container_ids: <redacted> request_context: REQUEST_CONTEXT_READ
Alerts defined on this metric are still triggering properly.
Any idea what could be causing this error? And since it was working before, could this be a GCP bug?
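One thing that can help narrow this down (a sketch, not from the original post; "my-project" is a placeholder) is listing the project's user-defined metric descriptors through the Monitoring API to confirm the log-based metric still exists:

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
# "my-project" is a placeholder project ID; the filter keeps only
# log-based (user) metrics, i.e. types under logging.googleapis.com/user/.
request = monitoring_v3.ListMetricDescriptorsRequest(
    name="projects/my-project",
    filter='metric.type = starts_with("logging.googleapis.com/user/")',
)
for descriptor in client.list_metric_descriptors(request=request):
    print(descriptor.type)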

Related

Dataflow job throws error when publishing to Pub/Sub topic

I have a streaming Dataflow job that sinks its output data to a Pub/Sub topic, but the job randomly logs this error:
There were errors attempting to publish to topic projects/my_project/topics/my_topics. Recent publish calls completed successfully: 574, recent publish calls completed with error: 1.
There is no stack trace provided by Dataflow, and according to the job metrics the error type is "unavailable". After some time the errors stop and the pipeline keeps running as usual. Does this error occur because of an internal error in the GCP service, or because of a quota issue? The output requests peaked at 10 req/s.
I found a similar issue; it was resolved by adding the "Pub/Sub Admin" role to the Compute Engine default service account under IAM permissions.
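As a side note for debugging the question itself: a minimal standalone sketch with the Python Pub/Sub client (reusing the project and topic names from the error message as placeholders; this is not the Dataflow sink itself) can surface which publishes fail once the client's built-in retries are exhausted:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Project and topic are taken from the error message above; adjust as needed.
topic_path = publisher.topic_path("my_project", "my_topics")

def on_publish_done(future):
    try:
        # result() returns the message ID, or raises if the publish
        # ultimately failed after the client's built-in retries.
        print("published", future.result())
    except Exception as exc:  # e.g. a transient UNAVAILABLE that exhausted retries
        print("publish failed:", exc)

future = publisher.publish(topic_path, b"payload")
future.add_done_callback(on_publish_done)
try:
    future.result(timeout=60)  # block so the sample script doesn't exit early
except Exception:
    pass  # already reported by the callback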

How do I send a notification to Slack from AWS CloudWatch on a specific error?

I'm trying to set up notifications to be sent from our AWS Lambda function to a Slack channel. I'm following along with this guide:
https://medium.com/analytics-vidhya/generate-slack-notifications-for-aws-cloudwatch-alarms-e46b68540133
I get stuck on step 4, however, because the type of alarm I want to set up does not involve thresholds or anomalies. It involves a specific error in our code. We want to be notified when users encounter errors while attempting to log in or sign up. We have try/catch blocks in our Node.js backend that log errors to CloudWatch at various points in the login/signup flow where we think the errors are most likely to happen. We would like to identify when those SPECIFIC errors occur and send a notification to a Slack channel built for this purpose.
So in step 4 of the article, what would I have to do to set this up? Or is the approach in this article simply the wrong one for my purposes?
Thanks.
Step 4, titled "Create a CloudWatch Alarm", uses the CPUUtilization metric to trigger an alarm.
In your case, since you want to use CloudWatch Logs, you would create CloudWatch Metric Filters based on the log entries of interest. This produces custom metrics based on your error string. You would then create a CloudWatch Alarm on that metric, just as the linked tutorial does for CPUUtilization.
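A rough sketch of those two steps with boto3 (every name, the filter pattern, and the SNS topic ARN below are made-up placeholders, not values from the article):

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# 1. Turn matching log entries into a custom metric.
#    Log group name, filter pattern, and namespace are hypothetical examples.
logs.put_metric_filter(
    logGroupName="/aws/lambda/my-backend",
    filterName="LoginSignupErrors",
    filterPattern='"LOGIN_ERROR"',  # the literal string your catch blocks log
    metricTransformations=[
        {
            "metricName": "LoginSignupErrorCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",
        }
    ],
)

# 2. Alarm on that metric and notify the SNS topic wired to Slack.
cloudwatch.put_metric_alarm(
    AlarmName="login-signup-errors",
    Namespace="MyApp",
    MetricName="LoginSignupErrorCount",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:slack-alerts"],
)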

We are investigating increased API error rates in the us-east-1 Region

I am new to using OpsCenter. I got this alarm saying:
"Issue detected for EC2 infrastructure"
Description : "We are investigating increased API error rates in the us-east-1 Region"
I don't have any idea what it is and am not able to get a clear description of what this issue is about.
Can anyone help me out?
These are just AWS Health events alerting you that there is an ongoing incident in that region.
You can view more information in your Personal Health Dashboard.

Stackdriver API metrics show many 429 errors, while quota is not exceeded

Earlier this month we enabled Stackdriver Monitoring in 3 of our projects on GCP.
Recently we found that the Stackdriver API metrics show an error rate of around 85%:
On the graphs, these error codes are 429:
I've checked the Quotas page, and everything seems fine:
The next metrics graph shows which method is causing the errors:
Using the other graph, "Errors by credential", I found out that the API requests are made by our GKE service account. We have a custom service account for the GKE instances, and as far as we know it has all the required permissions for monitoring:
roles/logging.logWriter
roles/monitoring.metricWriter
roles/stackdriver.resourceMetadata.writer (as noted in this issue)
Also, the stackdriver-metadata-agent pods in the GKE cluster log a related error every minute:
stackdriver-metadata-agent-cluster-level-d6556b55-2bkbc metadata-agent I0203 15:03:16.911940 1 binarylog.go:265] rpc: flushed binary log to ""
stackdriver-metadata-agent-cluster-level-d6556b55-2bkbc metadata-agent W0203 15:03:56.495034 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
stackdriver-metadata-agent-cluster-level-d6556b55-2bkbc metadata-agent I0203 15:04:16.912272 1 binarylog.go:265] rpc: flushed binary log to ""
stackdriver-metadata-agent-cluster-level-d6556b55-2bkbc metadata-agent W0203 15:04:56.657831 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
Aside from that I haven't found any logs related to the issue yet, and I cannot figure out what is making 2 requests per second to the Stackdriver API and receiving 429 errors.
I should add that everything above is true for all 3 projects.
Can someone suggest how we can solve this issue?
Is this still a quota being exceeded? If so, why do the quota request metrics look fine and the "Quota exceeded errors count" contain no data?
Are we missing any permissions on our GKE service account?
What else can be related?
Thanks in advance.
This is a known behavior: containers and pods tend to publish metadata updates very frequently, and that hits the rate limits. There are no performance or functionality issues caused by this behavior other than the noisy logs.
It's also possible to apply a logs exclusion to avoid having these entries posted to Stackdriver Logging.
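For reference, a sketch of such an exclusion with the Python Logging client (the project ID and the filter are assumptions based on the agent log lines quoted above; check that the filter really matches only the noisy entries before applying it):

from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
from google.cloud.logging_v2.types import LogExclusion

client = ConfigServiceV2Client()

# "my-project" is a placeholder; the filter is a guess at the noisy
# metadata-agent warnings quoted above -- adjust it to your real log entries.
exclusion = LogExclusion(
    name="metadata-agent-noise",
    description="Drop ResourceExhausted warnings from stackdriver-metadata-agent",
    filter=(
        'resource.type="k8s_container" '
        'AND resource.labels.container_name="metadata-agent" '
        'AND textPayload:"Failed to publish resource metadata"'
    ),
)

client.create_exclusion(parent="projects/my-project", exclusion=exclusion)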

AWS Lambda throttled, but no evidence in the metrics

When running Lambda at high concurrency, I see the error below in the CloudWatch logs (which I haven't seen anywhere else on the web!).
Execution failed due to configuration error: Lambda was throttled while using the Lambda Execution Role to set up for the Lambda function
When I check the "Throttled invocations" metric, it doesn't show these throttles.
Why doesn't the metric show these throttles? Has anyone seen this throttle error before? It is not the usual throttle error.
For me it was API Gateway throttling that caused this issue, even though I wasn't anywhere near the throttling limits. After removing the limit on API Gateway (going back to the default settings), I haven't faced this issue anymore.
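If you want to check whether any stage-level throttling is configured on your API, a small boto3 sketch (the REST API id and stage name are placeholders) can list the method settings:

import boto3

apigw = boto3.client("apigateway")

# "abc123" and "prod" are placeholders for your REST API id and stage name.
stage = apigw.get_stage(restApiId="abc123", stageName="prod")

for path, settings in stage.get("methodSettings", {}).items():
    print(
        path,
        "rate:", settings.get("throttlingRateLimit"),
        "burst:", settings.get("throttlingBurstLimit"),
    )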