Currently I am trying to set up an alert policy in GCP where I want to compare the elapsed time of my current Dataflow job with the elapsed time of the previous Dataflow job, and fire an alert if the current job's elapsed time is 50% longer.
Is it possible to do?
Thank you
You can create alerting policies based on metric-absence and metric-threshold conditions. You can take a look at this documentation for the types of alerting policies you can create. It seems the feature you are looking for is not currently supported; however, you can file a feature request using this documentation.
One option would be to use a rate of change condition.
https://cloud.google.com/monitoring/alerts/types-of-conditions#metric-threshold
I don't know if it's exactly what you're looking for, but it should let you get alerts on big changes between runs.
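For illustration, here is a minimal sketch (not an official recipe) using the Cloud Monitoring Python client: a threshold condition on the dataflow.googleapis.com/job/elapsed_time metric with the ALIGN_PERCENT_CHANGE aligner, which fires when elapsed time grows by more than 50% versus the previous alignment window. The project ID and alignment period are placeholders, and note this compares windows of the same time series rather than literally comparing two separate job runs.

```python
# Sketch: alert when Dataflow elapsed time grows by more than 50%
# compared to the previous alignment window. Values are placeholders.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project-id"  # placeholder project

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Dataflow elapsed time grew > 50%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="dataflow.googleapis.com/job/elapsed_time" '
            'AND resource.type="dataflow_job"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=50.0,  # percent change
        duration=duration_pb2.Duration(seconds=0),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=3600),  # placeholder window
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_PERCENT_CHANGE,
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow job running 50% longer",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)
client.create_alert_policy(name=project_name, alert_policy=policy)
```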
Related
I'm hoping to configure some form of alerting for AWS Glue jobs when they run longer than a configurable amount of time. These Glue jobs can be triggered at any time of day and usually take less than 2 hours to complete. However, if a job exceeds the 2-hour threshold, I want to get a notification (via SNS).
Usually I can configure run-time alerting in CloudWatch Metrics, but I am struggling to do this for a Glue job. The only metric I can see that could be useful is glue.driver.aggregate.elapsedTime, but it doesn't appear to help. Any advice would be appreciated.
You could use the AWS SDK for that. You just need the job run ID and then call GetJobRun to get the execution time. Based on that you can then notify someone / some other service, for example as sketched below.
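A rough sketch with boto3 (the job name, run ID, threshold, and SNS topic ARN are placeholders; it assumes you check the run once it has finished or on a schedule):

```python
# Sketch: look up a Glue job run and publish to SNS if it ran too long.
import boto3

glue = boto3.client("glue")
sns = boto3.client("sns")

THRESHOLD_SECONDS = 2 * 60 * 60  # 2 hours
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:glue-job-alerts"  # placeholder topic

def check_job_run(job_name: str, run_id: str) -> None:
    run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
    # ExecutionTime is reported in seconds once the run has completed.
    if run.get("ExecutionTime", 0) > THRESHOLD_SECONDS:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject=f"Glue job {job_name} exceeded 2 hours",
            Message=f"Run {run_id} took {run['ExecutionTime']} seconds.",
        )
```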
I want to create an alarm for a particular time window. The use case: if we see a customer/traffic drop between 6:00 AM and 10:00 PM, we should get an alarm so we know why customers are not using our service and can take some action. Is this scenario possible through a CloudWatch alarm? We already have a number-of-requests metric in place.
Amazon CloudWatch alarms cannot be limited to specific time ranges, but since you want to know whether something "unusual" is happening, I would recommend you look at Using CloudWatch Anomaly Detection - Amazon CloudWatch:
When you enable anomaly detection for a metric, CloudWatch applies statistical and machine learning algorithms. These algorithms continuously analyze metrics of systems and applications, determine normal baselines, and surface anomalies with minimal user intervention.
See: New – Amazon CloudWatch Anomaly Detection | AWS News Blog
It should be able to notice if a metric goes outside of its "normal" range, and trigger an alarm.
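As a hedged sketch, such an anomaly-detection alarm could be created with boto3 roughly like this (the namespace, metric name, and SNS topic are placeholders; tune the band width, period, and evaluation periods to your traffic):

```python
# Sketch: alarm when the request count drops below the lower bound of the
# anomaly detection band. Namespace, metric, and topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="request-count-anomaly",
    ComparisonOperator="LessThanLowerThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="ad1",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:traffic-alerts"],  # placeholder
    Metrics=[
        {
            "Id": "m1",
            "ReturnData": True,
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyApp",          # placeholder namespace
                    "MetricName": "RequestCount",  # placeholder metric
                },
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "ad1",
            # Band width of 2 standard deviations; adjust for your traffic pattern.
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
        },
    ],
)
```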
Is there a way in Stackdriver to create an alert based on the absence of a specific line in the logs for a specific timeframe (say 1 hour)?
I am trying to monitor (and be notified) when a GKE CronJob did not run in the last hour. (I was not able to come up with any other way of achieving this.)
You can create a log-based metric for a specific log entry following the steps here. Once that is created, you can create an alert based on that log-based metric following the instructions here.
You could configure the alert to trigger when the metric is below a certain threshold for a certain amount of time; however, you cannot define a specific time frame for the alert policy to run. The alert policy runs until it is deleted.
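A related option for the "CronJob did not run in the last hour" case is a metric-absence condition on the same log-based metric. A minimal sketch with the Cloud Monitoring Python client, assuming a user-defined log-based metric named cronjob_success_count (the project ID and resource type are placeholders):

```python
# Sketch: open an incident when no data points arrive for one hour
# on an assumed log-based metric named "cronjob_success_count".
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project-id"  # placeholder project

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="CronJob log entry missing for 1 hour",
    condition_absent=monitoring_v3.AlertPolicy.Condition.MetricAbsence(
        filter=(
            'metric.type="logging.googleapis.com/user/cronjob_success_count" '
            'AND resource.type="k8s_container"'  # placeholder resource type
        ),
        duration=duration_pb2.Duration(seconds=3600),
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="GKE CronJob did not run",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)
client.create_alert_policy(name=project_name, alert_policy=policy)
```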
I run a Google Cloud Dataflow job. I know how to monitor the elementCount metric coming from it, but that metric shows the total number of events processed by the job since it started. How do I monitor the rate, i.e. events per time span (say, per minute) in Stackdriver?
Ideally, I would like to apply a simple transformation to the elementCount metric inside Stackdriver. But I'm afraid I would need to send a separate metric computed in the Dataflow job...
You can access all the Stackdriver metrics via the API (although elementCount is a gauge, you can fetch the time series). Here are all the Dataflow metrics in Stackdriver:
https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataflow
You probably need to do some calculations on the time series if you want the correct rate per time window.
The API timeseries documentation is here:
https://cloud.google.com/monitoring/api/ref_v3/rpc/google.monitoring.v3
You can even access the APIs from within your Dataflow jobs. Note that, given the way this metric is used, I think it should have been a counter.
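For example, a rough sketch that pulls the last hour of the Dataflow element-count time series and differences consecutive points to approximate events per minute (the project ID is a placeholder, and the exact metric and value types should be checked against the metric list linked above):

```python
# Sketch: approximate events/minute by differencing consecutive gauge samples.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project-id"  # placeholder project

now = time.time()
interval = monitoring_v3.TimeInterval(
    end_time={"seconds": int(now)},
    start_time={"seconds": int(now - 3600)},
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type="dataflow.googleapis.com/job/element_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    # Points come back newest-first; sort them chronologically before differencing.
    points = sorted(series.points, key=lambda p: p.interval.end_time.timestamp())
    for prev, curr in zip(points, points[1:]):
        dt = curr.interval.end_time.timestamp() - prev.interval.end_time.timestamp()
        delta = curr.value.int64_value - prev.value.int64_value
        if dt > 0:
            print(f"{curr.interval.end_time}: {60 * delta / dt:.1f} events/min")
```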
Having created log-based metrics in the Cloud Console, I then want to create alerts so that every time there is a new matching log entry, the alert triggers.
In trying to create a suitable condition, the most likely-looking options seem to be threshold or rate of change, but I don't think either will work for a policy of 1 log message => 1 alert.
Help appreciated.
Yes, today the only way to alert on a log message is to create a threshold condition on the log-based metric with a very small threshold (0.001) and a duration of 1 minute.
Thanks for using Stackdriver.
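A compact sketch of that workaround with the Cloud Monitoring Python client (the project ID, log-based metric name, and resource type are placeholders; the structure mirrors the threshold example earlier on this page):

```python
# Sketch: a tiny threshold (0.001) over 1 minute so any matching entry alerts.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project-id"  # placeholder project

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Any matching log entry",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="logging.googleapis.com/user/my_log_metric" '  # placeholder metric
            'AND resource.type="k8s_container"'                          # placeholder resource
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.001,
        duration=duration_pb2.Duration(seconds=60),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Log entry alert",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)
client.create_alert_policy(name=project_name, alert_policy=policy)
```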
You can use another alert-triggering service (like PagerDuty), which is driven by the emails sent by Stackdriver. PagerDuty can filter out all the emails that have the word RESOLVE in the subject; in this case they can simply be thrown away if you'd like to avoid auto-resolving. Of course, Stackdriver and PagerDuty alerts will then diverge from each other (their states will be inconsistent), but you should consider PagerDuty the single source of truth in this case. It could be a possible workaround.
With log-based alerts, you can create alerts from the logs; an incident will be created for each matching entry.
https://cloud.google.com/blog/products/operations/create-logs-alerts-preview
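A hedged sketch of such a log-based alerting policy via the Cloud Monitoring Python client (the project ID and log filter are placeholders; log-based alert policies require a notification rate limit):

```python
# Sketch: a "log match" condition that opens an incident per matching entry,
# rate-limited to one notification every 5 minutes.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project-id"  # placeholder project

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Matching log entry seen",
    condition_matched_log=monitoring_v3.AlertPolicy.Condition.LogMatch(
        filter='severity>=ERROR AND resource.type="k8s_container"',  # placeholder filter
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Log-based alert",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    alert_strategy=monitoring_v3.AlertPolicy.AlertStrategy(
        notification_rate_limit=monitoring_v3.AlertPolicy.AlertStrategy.NotificationRateLimit(
            period=duration_pb2.Duration(seconds=300),
        ),
    ),
)
client.create_alert_policy(name=project_name, alert_policy=policy)
```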