I have a custom metric of type Count, which measures the count of a particular operation. It has a label called "success", which can be either "Success" or "Failure". I'd like to create an alert condition if the Failure % is above a certain threshold, perhaps 20%. Is that possible? If so, how would I do that? Or, do I need to change the metric itself to support this, and if so, how?
You can customize your Stackdriver alerting by targeting these labels with condition triggers, where you can set the "percent of time series violates" option to the value you want (20%). You can follow this guide to accomplish what you want.
I think what I may need is to create a "metric ratio":
https://cloud.google.com/monitoring/alerts/policies-in-json#json-ratio
With the API, you can create a policy that computes the ratio of two
related metrics and fires when that ratio crosses a threshold.
But somewhat unfortunately:
Note: You can't create ratio-based policies through the UI.
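As a rough sketch of what such a ratio policy body could look like for the metric in the question, the snippet below builds the JSON in Python. The field names follow the `conditionThreshold` structure used by the Monitoring API's alert policies. The metric type `custom.googleapis.com/my_operation_count`, the display names, the `ALIGN_DELTA` aligner, and the 5-minute window are assumptions for illustration. A real policy may also need a `resource.type` clause in both filters.

```python
import json


def build_failure_ratio_policy(metric_type, threshold=0.2, window_seconds=300):
    """Build an alert-policy body whose condition is failures / all operations."""
    # Denominator: every time series of the metric, regardless of label value.
    base_filter = f'metric.type="{metric_type}"'
    # Same aggregation for numerator and denominator so the ratio is comparable.
    aggregation = {
        "alignmentPeriod": f"{window_seconds}s",
        "perSeriesAligner": "ALIGN_DELTA",      # assumption: suits a count metric
        "crossSeriesReducer": "REDUCE_SUM",
    }
    condition = {
        "displayName": "Failure ratio above threshold",  # hypothetical name
        "conditionThreshold": {
            # Numerator: only the series labelled success="Failure".
            "filter": f'{base_filter} AND metric.labels.success="Failure"',
            "aggregations": [aggregation],
            "denominatorFilter": base_filter,
            "denominatorAggregations": [aggregation],
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            "duration": "0s",
        },
    }
    return {
        "displayName": "Operation failure ratio",  # hypothetical name
        "combiner": "OR",
        "conditions": [condition],
    }


policy = build_failure_ratio_policy("custom.googleapis.com/my_operation_count")
print(json.dumps(policy, indent=2))
```

The printed JSON could then be saved to a file and submitted through the API or the `gcloud` tooling described on the linked page.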
I have been having some difficulty working out the ideal threshold for a few of our CloudWatch alarms. I am looking at metrics for error rate, fault rate and failure rate. I am tentatively planning on an evaluation period of around 15 minutes. My metrics are currently recorded at one-minute granularity. I have the following ideas:
To look at the avg of minute level data over a few days, and set it slightly higher than that.
To try different thresholds (t1,t2 ..) and for a given day, see how many times the datapoints are crossing it in 15 min bins.
Not sure if this is the right way of going about it, do share if there is a better way of going about the problem.
PS 1: I know that thresholds should be based on Service Level Agreements(SLA), but let's say we do not have an SLA yet.
PS 2: Also, can I export data from CloudWatch to Excel for easier manipulation? I am currently looking at running a few queries in Logs Insights to calculate error rates.
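The second idea (trying several thresholds and counting how often 15-minute bins cross them) could be prototyped offline along these lines. This is only a sketch: it assumes the minute-level values have already been exported into a plain list, and the function names and sample numbers are made up for illustration.

```python
from statistics import mean


def bin_means(values, bin_size=15):
    """Average consecutive minute-level values into fixed-size bins."""
    return [mean(values[i:i + bin_size]) for i in range(0, len(values), bin_size)]


def crossings_per_threshold(values, thresholds, bin_size=15):
    """For each candidate threshold, count the bins whose mean exceeds it."""
    bins = bin_means(values, bin_size)
    return {t: sum(1 for b in bins if b > t) for t in thresholds}


# Made-up data: one hour of per-minute error rates (four 15-minute bins).
error_rates = [1.0] * 15 + [4.0] * 15 + [2.0] * 15 + [9.0] * 15
result = crossings_per_threshold(error_rates, [1.5, 3.0, 5.0])
print(result)
```

A threshold that fires in most bins is probably too low, and one that never fires on known bad days is probably too high.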
In your case, maybe you could also try Amazon CloudWatch Anomaly Detection instead of static thresholds:
You can create an alarm based on CloudWatch anomaly detection, which mines past metric data and creates a model of expected values. The expected values take into account the typical hourly, daily, and weekly patterns in the metric.
Is it possible to create a rule via a SQL rule query statement that takes an action if a device's temperature is more than 15% higher than the average of all other devices? Or do we need to use another AWS service, such as AWS Lambda, to achieve this? I am just looking for advice on how to approach this problem.
The IoT SQL rule would only be applied to the current event payload. Think of it like a way to filter events instead of a way to query a database. It would not have access to the "average of all other devices" value. You would need to build that yourself, possibly by storing all the device values in DynamoDB and calculating the average in a Lambda function.
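The comparison such a Lambda function would perform might look roughly like the following. This is a sketch under assumptions: a plain dictionary stands in for the latest per-device values that would be read from DynamoDB, and the function name and device IDs are hypothetical.

```python
def is_outlier(device_id, latest_temps, margin=0.15):
    """Return True if the device reads more than `margin` above the mean of the others."""
    # Exclude the reporting device so it is compared against "all other devices".
    others = [temp for dev, temp in latest_temps.items() if dev != device_id]
    if not others:
        return False  # nothing to compare against
    avg_others = sum(others) / len(others)
    return latest_temps[device_id] > avg_others * (1 + margin)


# Stand-in for the latest reading per device, as it might be stored in a table.
readings = {"sensor-a": 20.0, "sensor-b": 22.0, "sensor-c": 30.0}
print(is_outlier("sensor-c", readings))
print(is_outlier("sensor-a", readings))
```

In a real handler, the incoming IoT event would supply the device ID and its new reading, and the table would be updated before or after the check.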
My goal is to base my metrics directly from log values. The problem is when I display them as graph it looks like they are distributed. How can I change it so that it displays the values from the logs?
Unfortunately Stackdriver doesn't work in that way, you shouldn't expect that Stackdriver shows you "52" in this case. Have a look at the official documentation where "logs-based metrics can be one of two metric types: counter or distribution" and "counter metrics count the number of log entries matching" and "distribution metrics is to track latencies". You have to choose another tool for this task.
Assuming you created this as a distribution metric, I would expect this to work. Please take a look at this blog post to make sure you're using aligners and aggregators correctly.
I have been trying to register an alert on spike of some metrics using Stackdriver. Here's the query and details:
[1] If there is a sudden spike and the 500s cross 20
[2] If the total number of requests (200s or others) crosses 3000 over 5 mins
To achieve [1], I set the aggregation to mean and the aligner to mean (using sum as the aligner doesn't seem to work, and I don't understand why). This query fires if the average number of requests over 5 mins is over 20 (which is the expected behavior for a mean). But I am not able to register any single spike, which is the requirement.
Again, for [2] the average over a certain duration works but the summation of requests doesn't seem to work.
Is there a way of achieving either or both of the requirements?
PS: Please let me know if you need more data or snippets of the dashboard to understand what I have done till now. I will go ahead and add some accordingly.
I do not believe there is aggregation when trying to set up an alert. As an example for [1], please go to
Stackdriver Monitoring
Alerting
Create a policy and add your conditions
Select your Resource Type
Select your metric, condition and threshold = 20
Response_code_class = 500
Save condition
The alerting UI has changed since the previous answer was written. You can now specify aggregations when creating alerting policies. That said, I don't think you want mean; that's going to smooth out your curve which will defeat your intended use case. A simple threshold alert with a short duration (even zero) ought to do it, I think.
For your second case, you ought to be able to compute a five-minute sum and alert on that. If you still can't get it to work, respond here or file a support ticket and we'll see how we can help you.
Aaron Sher, Stackdriver engineer
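To see why a mean hides a single spike while a sum or max does not, here is a small offline sketch. It does not call Stackdriver. It only mimics aligning per-minute counts into 5-minute windows with three different functions, and the sample data is invented.

```python
def window_stats(per_minute_counts, window=5):
    """Summarize consecutive windows of per-minute counts with mean, max and sum."""
    stats = []
    for i in range(0, len(per_minute_counts), window):
        chunk = per_minute_counts[i:i + window]
        stats.append({
            "mean": sum(chunk) / len(chunk),
            "max": max(chunk),
            "sum": sum(chunk),
        })
    return stats


# Ten minutes of 500-class responses with one spike of 45 in the third minute.
errors_500 = [0, 0, 45, 0, 0, 1, 0, 2, 0, 1]
stats = window_stats(errors_500)
for s in stats:
    print(s)
```

With a threshold of 20, the first window's mean (9.0) stays below it even though one minute reached 45, while the max and the sum of that window both exceed it.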
Is it possible to count number of occurrences of a specific log message over a specific period of time from GCP Stackdriver logging? To answer the question "How many times did this event occur during this time period." Basically I would like the integral of the curve in the chart below.
It doesn't have to be a moving window, this time it's more of a one-time-task. A count-aggregator or similar on the advanced log query would also work if that would be available.
The query looks like this:
(resource.type="container"
logName="projects/xyz-142842/logs/drs"
"Publish Message for updated entity"
) AND (timestamp>="2018-04-25T06:20:53Z" timestamp<="2018-04-26T06:20:53Z")
My log based metric for the graph above looks like this:
My Dashboard is setup like this:
I ended up building stacked bars.
With correct zoom level I can sum up the number of occurrences easy enough. It would have been a nice feature to get the count directly from a graph (the integral), but this works for now.
There are multiple ways to do so, the two that I saw actually working and that can apply to your situation are the following:
Making use of Logs-based Metrics. They can, for example, record the number of log entries containing particular error messages, or they can extract latency information reported in log entries.
Stackdriver Logging logs-based metrics can be one of two metric types: counter or distribution. [...] Counter metrics count the number of log entries matching an advanced logs filter. [...] Distribution metrics accumulate numeric data from log entries matching a filter.
I would advise you to go through the documentation to check whether this feature completely covers your use case.
You can export your logs to BigQuery. Once you have them there, you can make use of classical tools like GROUP BY and SELECT, and everything else that BigQuery offers.
Here you can find a minimal step-by-step guide on how to export the logs, and another on Analyzing Audit Logs Using BigQuery. I am sure you can find many more resources online.
The products and the approaches are really different. I would say that BigQuery is more flexible, but also more complex to configure and to use properly. If you find a third, better way, please update your question with that information.
First, you have to create a metric:
Go to Log explorer.
Type your query
Go to Actions >> Create Metric.
In the monitoring dashboard
Create a chart.
Select the resource and metric.
Go to "Advanced" and provide the details as given below :
Preprocessing step : Rate
Alignment function : count
Alignment period : 1
Alignment unit : minutes
Group by : log
Group by function : count
This will give you the visualisation in a bar chart with count of the desired events.
There is one more option.
You can read your custom metric using Stackdriver Monitoring API ( https://cloud.google.com/monitoring/api/v3/ ) and process it in script with whatever aggregation you need.
If you are working with python - you may look into gcloud python library https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/monitoring
It will be a very simple script, and you can stream the results of the calculation into a BigQuery table and use it in your dashboard.
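Once the points of the counter metric have been fetched, the "integral" over a time range is just the sum of the per-interval counts. The sketch below shows only that aggregation step. It assumes the points have already been read from the API into `(timestamp, count)` pairs, which is a made-up shape for illustration, and the sample values are invented.

```python
from datetime import datetime, timezone


def total_in_range(points, start, end):
    """Sum the counts of all points whose timestamp falls within [start, end]."""
    return sum(count for ts, count in points if start <= ts <= end)


def utc(*args):
    """Small helper to build timezone-aware UTC datetimes."""
    return datetime(*args, tzinfo=timezone.utc)


# Invented sample: (interval end time, matching log entries in that interval).
points = [
    (utc(2018, 4, 25, 6, 0), 3),    # before the range
    (utc(2018, 4, 25, 7, 0), 5),
    (utc(2018, 4, 25, 23, 0), 2),
    (utc(2018, 4, 26, 7, 0), 4),    # after the range
]

start = utc(2018, 4, 25, 6, 20, 53)
end = utc(2018, 4, 26, 6, 20, 53)
print(total_in_range(points, start, end))
```

The start and end values mirror the timestamps used in the log query from the question.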
With PacketAI, you can send logs of arbitrary formats, including logs from GCP. The logs dashboard will then automatically parse them and group them into patterns, as shown in this video: https://streamable.com/n50kr8
Counts and trends of different log patterns are also displayed
Disclaimer: I work for PacketAI