I have been trying to register an alert on a spike in some metrics using Stackdriver. Here are the conditions and details:
[1] If there is a sudden spike and the 500s cross 20.
[2] If the total number of requests (200s or others) crosses 3000 over 5 minutes.
To achieve [1], I set the aggregation to mean and the aligner to mean (the sum aligner doesn't seem to work, and I don't understand why). This query fires if the average of requests over 5 minutes is over 20 (which is the expected behavior), but I am not able to register any single spike, which is the requirement.
Again, for [2], the average over a certain duration works, but the summation of requests doesn't seem to work.
Is there a way of achieving either or both of these requirements?
PS: Please let me know if you need more data or snippets of the dashboard to understand what I have done so far; I will add them accordingly.
I do not believe there is aggregation when setting up an alert. As an example for [1], go through the following steps (a scripted sketch follows the list):
Stackdriver Monitoring
Alerting
Create a policy and add your conditions
Select your Resource Type
Select your metric, condition and threshold = 20
Response_code_class = 500
Save condition
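If you would rather script these steps than click through the UI, here is a minimal sketch using the google-cloud-monitoring Python client. The project ID and the metric filter are placeholders (I am assuming an HTTPS load balancer metric whose response_code_class label is an integer), so adjust them to match your actual metric:

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

# Condition [1]: fire as soon as the 500 count crosses 20.
policy = monitoring_v3.AlertPolicy(
    display_name="500s spike",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="500 responses > 20",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                # Placeholder filter -- substitute your own metric and labels.
                filter='metric.type="loadbalancing.googleapis.com/https/request_count"'
                       ' AND metric.labels.response_code_class=500',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=20,
                duration={"seconds": 0},  # zero duration catches a single spike
            ),
        )
    ],
)

client.create_alert_policy(name="projects/my-project-id", alert_policy=policy)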
The alerting UI has changed since the previous answer was written. You can now specify aggregations when creating alerting policies. That said, I don't think you want mean; that's going to smooth out your curve, which will defeat your intended use case. A simple threshold alert with a short duration (even zero) ought to do it, I think.
For your second case, you ought to be able to compute a five-minute sum and alert on that. If you still can't get it to work, respond here or file a support ticket and we'll see how we can help you.
Aaron Sher, Stackdriver engineer
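To sketch that suggestion for [2] in code: the same create_alert_policy call works, but with an aggregation that sums the request count over a five-minute window before comparing it to 3000. The filter and project ID are again placeholders; treat this as a sketch rather than a drop-in policy:

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

# Condition [2]: total requests (all response classes) > 3000 over 5 minutes.
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="total requests > 3000 over 5 min",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter='metric.type="loadbalancing.googleapis.com/https/request_count"',  # placeholder
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period={"seconds": 300},  # the 5-minute window
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=3000,
        duration={"seconds": 0},
    ),
)

client.create_alert_policy(
    name="projects/my-project-id",
    alert_policy=monitoring_v3.AlertPolicy(
        display_name="request volume",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[condition],
    ),
)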
I have been having some difficulty understanding how to choose the ideal threshold for a few of our CloudWatch alarms. I am looking at metrics for error rate, fault rate, and failure rate. I am vaguely aiming for an evaluation period of around 15 minutes. My metrics are currently recorded at a one-minute granularity. I have the following ideas:
Look at the average of the minute-level data over a few days, and set the threshold slightly higher than that.
Try different thresholds (t1, t2, ...) and, for a given day, see how many times the data points cross each one in 15-minute bins.
Not sure if this is the right way of going about it; do share if there is a better way of approaching the problem.
PS 1: I know that thresholds should be based on Service Level Agreements (SLAs), but let's say we do not have an SLA yet.
PS 2: Also, can I import data from CloudWatch into Excel for some easier manipulation? Currently I'm running a few queries in Logs Insights to calculate error rates.
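On PS 2: one way to get CloudWatch data into Excel is to pull the datapoints with boto3 and write them to a CSV file. This is a minimal sketch; the namespace and metric name are placeholders for whatever you are actually recording:

import csv
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Minute-level datapoints for the last day (placeholder namespace/metric).
resp = cloudwatch.get_metric_statistics(
    Namespace="MyApp",
    MetricName="ErrorRate",
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

with open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Timestamp", "Average"])
    for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
        writer.writerow([dp["Timestamp"].isoformat(), dp["Average"]])

Open the resulting metrics.csv directly in Excel.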
In your case, maybe you could also try Amazon CloudWatch Anomaly Detection instead of static thresholds:
You can create an alarm based on CloudWatch anomaly detection, which mines past metric data and creates a model of expected values. The expected values take into account the typical hourly, daily, and weekly patterns in the metric.
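For illustration, here is a rough boto3 sketch of such an anomaly detection alarm. The alarm name, namespace, metric, and the band width of 2 standard deviations are all placeholder choices:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the metric rises above the anomaly detection band.
cloudwatch.put_metric_alarm(
    AlarmName="error-rate-anomaly",  # placeholder name
    EvaluationPeriods=15,  # fifteen 1-minute periods, matching the ~15 min idea
    ComparisonOperator="GreaterThanUpperThreshold",
    ThresholdMetricId="ad1",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {"Namespace": "MyApp", "MetricName": "ErrorRate"},  # placeholders
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": True,
        },
        {
            "Id": "ad1",
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",  # 2 = band width
            "ReturnData": True,
        },
    ],
)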
Is there a way to see up-to-date DynamoDB total read and write capacity usage, maybe by day, and ideally by table? I know basically how to calculate it, and I can pull the values from responses, but I'm just starting to try the service out to see if it's feasible to use, and I'd like to throw a bunch of data and a bunch of queries at it, then see overall how much that's going to cost, without waiting for the monthly bill.
Is this possible, or would I just need to keep track of all of my individual requests and add the results up?
CloudWatch can show you the ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits metrics per table, at a granularity of up to 1 hour.
Using the console, go to the CloudWatch service and choose 'Metrics' in the left-hand menu.
From there you can choose 'DynamoDB' -> 'Table Metrics' -> and then choose the table whose capacity usage you want.
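If you would rather script this than click through the console, here is a minimal boto3 sketch that sums one table's consumed read capacity per hour over the past week (the table name is a placeholder):

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hourly consumed read capacity for one table over the past week.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "my-table"}],  # placeholder table
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Sum"],
)

for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], dp["Sum"])

Repeat with ConsumedWriteCapacityUnits, and per table, to build a rough cost picture.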
I deployed a Google Cloud Run service running in a Docker container. Out of the box, I get insight into some metrics on the Metrics tab of the service page, such as Request count, Request latencies, and more. Although it sounds like request count would answer my question, what I am really looking for is insight into adoption, so that I can answer questions like "How many visits to my application were there in the past week?" Is there a way to get insight like that out of the box?
Currently, the Request count metric reports responses per second, so I can see blips that look like "0.05/s", which gives me some insight, but it's hard to aggregate.
I've tried using Monitoring > Metrics Explorer as well, but I'm not seeing any data for the metrics I select. I'm considering hooking into Google Analytics from within my application if that is the suggested solution. Thank you!
I've realized it's quite difficult to have Metrics Explorer give you a straight answer to "how many requests did I receive this month". However, it's possible:
Go to Metrics Explorer as you said, choose resource type "Cloud Run Revision" (cloud_run_revision) and you'll see "Request Count" (run.googleapis.com/request_count) metric:
Description: Number of requests reaching the revision. Excludes requests that are not reaching your container instances (e.g. unauthorized requests or when maximum number of instances is reached).
Resource type: cloud_run_revision
Unit: number; Kind: Delta; Value type: Int64
Then, choose Aggregator: None and click Show Advanced Options. In the form, choose Aligner: sum (instead of the default "Rate"). You should now be able to see the total request count per minute:
Now if you change "Alignment Period" to "10 minutes", you'll see one data point for every 10 minutes (sadly, there seems to be a bug where the label says X req/s, but it's really X requests per 10 minutes in this case):
If you collect enough data, you can change "Alignment Period" to "Custom", set it to 30 days, then update your timeframe at the top to 1 year and see a monthly request count.
This does not show a sum across all alignment periods (I think that part is up to you to do manually, maybe via the API), but it lets you see requests per month. For example, here's a service I've been running for some months: with the alignment period set to 7 days and the timeframe set to the past 6 weeks, I get 6 data points of weekly request count. Hope this helps.
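If Metrics Explorer keeps fighting you, the same numbers are available programmatically. Here is a sketch using the google-cloud-monitoring Python client that sums run.googleapis.com/request_count over a single 30-day bucket; the project ID is a placeholder:

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())

# One 30-day alignment bucket, summed across all revisions.
results = client.list_time_series(
    request={
        "name": "projects/my-project-id",  # placeholder project
        "filter": 'metric.type="run.googleapis.com/request_count"',
        "interval": monitoring_v3.TimeInterval(
            start_time={"seconds": now - 30 * 24 * 3600},
            end_time={"seconds": now},
        ),
        "aggregation": monitoring_v3.Aggregation(
            alignment_period={"seconds": 30 * 24 * 3600},
            per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
            cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        ),
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)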
I have a custom metric of type Count, which measures the count of a particular operation. It has a label called "success", which can be either "Success" or "Failure". I'd like to create an alert condition if the Failure % is above a certain threshold, perhaps 20%. Is that possible? If so, how would I do that? Or, do I need to change the metric itself to support this, and if so, how?
You can customize your Stackdriver alerting by targeting these labels with condition triggers, where you can set the percent of time series that must violate the condition to your desired 20%. You can follow this guide to accomplish what you want.
I think what I may need is to create a "metric ratio":
https://cloud.google.com/monitoring/alerts/policies-in-json#json-ratio
With the API, you can create a policy that computes the ratio of two related metrics and fires when that ratio crosses a threshold.
But somewhat unfortunately:
Note: You can't create ratio-based policies through the UI.
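For reference, a ratio condition is just a threshold condition with a denominator filter. Here is a hedged Python sketch; the metric type custom.googleapis.com/operation_count and its success label are assumptions standing in for your real custom metric:

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

# Ratio: failed operations / all operations, alerting above 20%.
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="failure ratio > 20%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Numerator: failures only (assumed metric type and label).
        filter='metric.type="custom.googleapis.com/operation_count"'
               ' AND metric.labels.success="Failure"',
        # Denominator: all operations.
        denominator_filter='metric.type="custom.googleapis.com/operation_count"',
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period={"seconds": 300},
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
            )
        ],
        denominator_aggregations=[
            monitoring_v3.Aggregation(
                alignment_period={"seconds": 300},
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.2,
        duration={"seconds": 0},
    ),
)

client.create_alert_policy(
    name="projects/my-project-id",  # placeholder project
    alert_policy=monitoring_v3.AlertPolicy(
        display_name="operation failure ratio",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[condition],
    ),
)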
My Stream Analytics job is getting data for the last 24 hours.
There is a lot of data to look at here, and while this worked for a while, it has now stopped generating output events.
This prevents data from being sent to Power BI.
I only want the last 24 hours of data to be shown in Power BI
How can I do this?
I have tried reducing the time window, but I don't want to do that as a fix.
SELECT [Timestamp], Broker, Price, COUNT(*) AS EventCount
INTO [powerbi2]
FROM [eventhubinput] TIMESTAMP BY [TimeStamp]
GROUP BY [TimeStamp], Broker, Price, TUMBLINGWINDOW(hh, 3)
The query looks correct. There are a couple of things that could be happening here:
Your Power BI account is being throttled (see here for the limits on data ingress). If this is occurring, there should be warnings in your job's Activity Log, and you may have to decrease the rate of your job's egress and/or upgrade your Power BI account.
Your job is falling behind your Event Hub due to the high rate of ingress. You can check this by looking at the Input Events Backlogged metric in the Portal. If this is the case, scaling your job may help.
If neither of these suggestions help, I'd recommend reaching out to Azure support so the team can take a closer look.
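If you want to watch that backlog programmatically rather than in the Portal, here is a rough sketch using the azure-mgmt-monitor Python package. The subscription, resource group, and job names are placeholders, and I am assuming the backlog metric is exposed as InputEventsSourcesBacklogged:

from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

client = MonitorManagementClient(DefaultAzureCredential(), "my-subscription-id")

# Placeholder resource ID of the Stream Analytics job.
resource_id = (
    "/subscriptions/my-subscription-id/resourceGroups/my-rg"
    "/providers/Microsoft.StreamAnalytics/streamingjobs/my-job"
)

end = datetime.utcnow()
start = end - timedelta(hours=24)

# Hourly maximum of the backlogged input events over the past day.
metrics = client.metrics.list(
    resource_id,
    timespan=f"{start:%Y-%m-%dT%H:%M:%SZ}/{end:%Y-%m-%dT%H:%M:%SZ}",
    interval="PT1H",
    metricnames="InputEventsSourcesBacklogged",  # assumed metric name
    aggregation="Maximum",
)

for metric in metrics.value:
    for ts in metric.timeseries:
        for point in ts.data:
            print(point.time_stamp, point.maximum)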