GCP alerting policies based on percentage

I am trying to create some alerting policies in GCP for my application hosted in a Kubernetes cluster.
We have a Cloud Load Balancer serving the traffic, and I can see the HTTP status codes like 2XX, 5XX, etc.
I need to create some alerting policies based on the error percentage, i.e. ((NumberOfFailures/Total) * 100), rather than an absolute value, so that an alert is triggered if my error percentage goes above, say, 50%.
I couldn't find anything in the Google documentation. It just tells you to use a counter, which is an absolute value. I am looking for something like: if the failure rate goes beyond 50% in a rolling window of 15 minutes, then trigger the alert.
Is it even possible to do that natively in GCP?

Yes, I think this is possible with MQL. I have recently created something similar to your use case.
fetch api
| metric 'serviceruntime.googleapis.com/api/request_count'
| filter (resource.service == 'my-service.com')
| group_by 10m, [value_request_count_aggregate: aggregate(value.request_count)]
| every 10m
| { group_by [metric.response_code_class],
      [response_code_count_aggregate: aggregate(value_request_count_aggregate)]
    | filter (metric.response_code_class == '5xx')
  ; group_by [],
      [value_request_count_aggregate_aggregate:
        aggregate(value_request_count_aggregate)] }
| join
| value [response_code_ratio: val(0) / val(1)]
| condition gt(val(), 0.1)
In this example, I am using the request count for a service my-service.com. I am aggregating the request count over the last 10 minutes for responses with a 5xx response code. Additionally, I am aggregating the request count over the same time period for all response codes. In the last two lines, I compute the ratio of the number of 5xx responses to the number of all responses, and finally I create a boolean value that is true when the ratio is above 0.1, which can be used to trigger an alert.
I hope this gives you a rough idea of how you can create your own alerting policy based on percentages.
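For completeness, here is a rough sketch (the project id, display names, and threshold are placeholders, and notification channels are omitted) of how an MQL condition like the one above could be turned into an alerting policy with the Cloud Monitoring Python client instead of the console:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Paste the MQL query from above, including the final `condition` line.
mql_query = "..."

client = monitoring_v3.AlertPolicyServiceClient()
policy = monitoring_v3.AlertPolicy(
    display_name="5xx error ratio above 10%",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="5xx / total requests > 0.1",
            condition_monitoring_query_language=(
                monitoring_v3.AlertPolicy.Condition.MonitoringQueryLanguageCondition(
                    query=mql_query,
                    # How long the condition must evaluate to true before the alert fires.
                    duration=duration_pb2.Duration(seconds=0),
                )
            ),
        )
    ],
)
# "my-project" is a placeholder project id.
client.create_alert_policy(name="projects/my-project", alert_policy=policy)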

Related

Google Cloud managed service for Prometheus consistently ingests wrong values for certain metrics

We set up Google Cloud Managed Service for Prometheus this week. While creating Grafana dashboards, I noticed that most metrics were ingested correctly, but some values were consistently incorrect.
The output of the metrics endpoint looks like this (truncated):
# HELP channel_socket_bytes_received_total Number of bytes received from clients
# TYPE channel_socket_bytes_received_total counter
# HELP event_collection_size Number of elements
# TYPE event_collection_size gauge
event_collection_size{name="interest"} 18
event_collection_size{name="rfq"} 362
event_collection_size{name="negotiation"} 12
# TYPE sq_firestore_read_total counter
sq_firestore_read_total{collection="negotiation"} 12
sq_firestore_read_total{collection="rfq_interest"} 18
sq_firestore_read_total{collection="rfq"} 362
The output on this endpoint is generated by "prom-client": "14.1.0".
Google Managed Service for Prometheus ingests these metrics. Almost all of them work as expected, but the sq_firestore_read_total metric is consistently wrong.
The Google Cloud Metrics Explorer, however, shows different values.
Services were restarted a number of times. Once, the value of one label reached 3, but more commonly the values of all three labels of the metric stay stuck at 0.
It seems to me that something goes wrong during the ingestion stage. Is this a bug in Google Cloud Managed Service for Prometheus?
Important to reiterate: the values I expect are 12, 18, and 362. The values that are ingested are either 0 or, occasionally, 3.

Is there a way to easily get only the log entries for a specific AWS Lambda execution?

Lambda obviously tracks executions, since you can see data points in the Lambda Monitoring tab.
Lambda also saves the logs in log groups; however, I get the impression that Lambda execution environments are reused if invocations happen within a short interval (say, 5 minutes between launches), so the output from multiple executions gets written to the same log stream.
This makes logs a lot harder to follow, especially due to other limitations (the CloudWatch web console is super slow and cumbersome to navigate, and aws logs get-log-events has a 1 MB / 10k message limit that makes it awkward to use).
Is there some way to only get Lambda log entries for a specific Lambda execution?
You can filter by the RequestId. Most loggers will include this in the log, and it is automatically included in the START, END, and REPORT entries.
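As a hedged illustration (the log group name is a placeholder; the request id is the same one used in the query below), filtering a Lambda log group by RequestId can also be done programmatically with boto3's filter_log_events, which scans every stream in the group:

import boto3

LOG_GROUP = "/aws/lambda/my-function"  # placeholder log group name
REQUEST_ID = "5a89df1a-bd71-43dd-b8dd-a2989ab615b1"

logs = boto3.client("logs")

# Quoting the request id matches the START/END/REPORT entries as well as any
# application log line that embeds it.
paginator = logs.get_paginator("filter_log_events")
for page in paginator.paginate(logGroupName=LOG_GROUP, filterPattern=f'"{REQUEST_ID}"'):
    for event in page["events"]:
        print(event["timestamp"], event["message"], end="")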
My current approach is to use CloudWatch Logs Insights to query for the specific logs that I'm looking for. Here is the sample query:
fields @timestamp, @message
| filter @requestId = '5a89df1a-bd71-43dd-b8dd-a2989ab615b1'
| sort @timestamp
| limit 10000

Is it possible to set up CloudWatch Alarm for 3 or 4 mins period?

I need to receive a notification each time a certain message does not appear in logs for 3-4 minutes. It is a clear sign that the system is not working properly.
But it is only possible to choose 1 min or 5 mins. Is there any workaround?
"does not appear in logs for 3-4 minutes. It is a clear sign that the system is not working properly."
-- I know what you mean; a CloudWatch alarm on a metric which is not continuously pushed might behave a bit differently.
You should consider using the alarm's "M out of N" option, for example 3 out of 4.
https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-cloudwatch-alarms-now-alerts-you-when-any-m-out-of-n-metric-datapoints-in-an-interval-are-above-your-threshold/
Also, if the metric you are referring to was created using a metric filter on a CloudWatch log group, you should edit the metric filter to include a default value, so that each time a log is pushed and the filter expression does not match, it still emits a default value (of, say, 0), which gives the metric more continuous datapoints.
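A minimal boto3 sketch of such a metric filter (the log group, filter pattern, and metric names are hypothetical; the relevant piece is defaultValue):

import boto3

logs = boto3.client("logs")

# defaultValue makes the filter emit 0 for every batch of ingested log events
# that does not match the pattern, so the metric still gets datapoints from
# unrelated log traffic even when the expected message never arrives.
logs.put_metric_filter(
    logGroupName="/my/app/log-group",
    filterName="heartbeat-message",
    filterPattern='"system heartbeat"',
    metricTransformations=[
        {
            "metricName": "HeartbeatCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",
            "defaultValue": 0.0,
        }
    ],
)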
If you work with a CloudWatch alarm through the AWS CLI, it is possible to specify the period in seconds. Only the web interface limits the period to a fixed set of values.
https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/describe-alarms.html
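The same is true of the SDKs. A hedged sketch with boto3 (alarm, metric, and SNS topic names are placeholders), combining a 3-minute period with the "3 out of 4" datapoint setting and treating missing data as breaching:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="heartbeat-missing",
    Namespace="MyApp",
    MetricName="HeartbeatCount",
    Statistic="Sum",
    Period=180,                       # 3 minutes; the API accepts multiples of 60 seconds
    EvaluationPeriods=4,              # N
    DatapointsToAlarm=3,              # M: alarm when 3 of the last 4 periods breach
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",     # no data at all also counts as a breach
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts-topic"],  # placeholder ARN
)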

ResourceExhausted: 429 Quota exceeded for quota metric Natural Language API through Dataflow using Python SDK

I am building a Dataflow pipeline to read from CSV, perform sentiment analysis through the Google Cloud NLP API, and send the result to BigQuery.
When the function that performs the sentiment analysis gets the PCollection, it gives me the above-mentioned error.
What I am thinking about is splitting the PCollection into smaller PCollections in order to handle the quota limitation in the NLP API.
(p
| 'ReadData' >> beam.io.textio.ReadFromText(src_path)
| 'ParseCSV' >> beam.ParDo(Analysis())
| 'WriteToBigQuery' >> ...
)
I assume you have autoscaling turned on, as it is on by default. Try turning it off and then setting the worker count to something small, like 5. This will cap the number of underlying workers processing the bundles. From there you can play around with the instance type (number of cores) in order to maximize your throughput.
The default limit is 600 requests per minute, which is pretty low. You can also request a quota increase for the NLP API. My advice is to do both: use a fixed worker pool to throttle, and then ramp up the quota to dial in your wall-clock time goal.
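As a rough sketch of that first step (project, region, and bucket names are placeholders), the fixed worker pool can be configured through the Beam pipeline options:

from apache_beam.options.pipeline_options import PipelineOptions

# Disabling autoscaling and pinning the worker pool to a small, fixed size caps
# the number of parallel NLP API calls made by the pipeline.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    autoscaling_algorithm="NONE",        # turn autoscaling off
    num_workers=5,                       # fixed worker count
    max_num_workers=5,                   # hard cap in case autoscaling is re-enabled
)

# The options are then passed to the existing pipeline, e.g.
# with beam.Pipeline(options=options) as p: ...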

"Android Device Verification" Service quota usage

I'm using the Android Device Verification service (SafetyNet's Attestation API) to verify whether a request is sent from the same app which I built.
We have a quota limit of 10,000 (which can be increased) on the number of requests we can make using SafetyNet's Attestation API.
Now, I want to know if my limit is breached so that I can stop using that API.
For that, I was looking into Stackdriver alerting, but I couldn't find the Android Device Verification service in it (even though I was able to find it in Quotas).
You can monitor Safetynet Attestations in Stackdriver by specifying these filters and settings:
Resource Type: Consumed API
Metric: Request count (type search term "serviceruntime.googleapis.com/api/request_count" to find correct count quickly)
Add Filter service = androidcheck.googleapis.com
Use aggregator "sum" to get the total count of requests.
You can set advanced aggregation options to aggregate at the daily level to compare with your quota. This can be done by setting Aligner: "sum" and Alignment period: "1440m". This gives daily sums of requests for the chart (1440m = 24h * 60m = the number of minutes per day).
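For reference, here is a hedged sketch of the same chart as an API query through the Cloud Monitoring Python client ("my-project" is a placeholder, and the filter uses the standard time-series filter syntax rather than MQL):

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())

results = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": (
            'metric.type="serviceruntime.googleapis.com/api/request_count" '
            'AND resource.type="consumed_api" '
            'AND resource.labels.service="androidcheck.googleapis.com"'
        ),
        "interval": monitoring_v3.TimeInterval(
            {"start_time": {"seconds": now - 7 * 86400}, "end_time": {"seconds": now}}
        ),
        "aggregation": monitoring_v3.Aggregation(
            {
                "alignment_period": {"seconds": 86400},  # 1440m, i.e. one day
                "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
            }
        ),
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Each point is a daily total that can be compared against the 10,000 quota.
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)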