Is it possible to have a moving average or a trend line in the metrics of aws cloudwatch?
The idea is to show for example the cpu utilization of a server over time and not just the average of the last x minutes, so we can see if the trend over long period of time is going up or down.
CloudWatch does not have trend lines built into standard metrics.
If you're looking for this you can enable this you would need to setup anomaly detection for the metric.
By enabling this you will be able to build up an overview of the trends for that metric whilst also configuring the normal/abnormal ranges for the metric. If your data goes outside this line you can have a CloudWatch alarm notify you.
Related
I'm trying to combine certain number of similar metrics into a single alarm in aws cloud watch. For example lets say for data quality monitoring in sagemaker, one among the metrics that are emitted from data quality monitoring job is feature baseline drift distance for each column so let say I've 600 columns so each column will have this metric. Is there a possible way to compress these metrics into a single cloud watch alarm ?
If not, Is there anyway to send the violation report as message via AWS SNS?
While I am not sure exactly on what out come you want when you refer to "compress the metrics into a single alarm." You can look at using metric math
I am monitoring a Fargate service on ECS and want to know when containers are bouncing a lot (come up, fail healthcheck, get killed by ECS and a new one gets scheduled and does the same)
I replicated the scenario I'm interested in and using the "Sample count" aggregation for CPUUtilization from ECS I can see this graph:
The value in an ideal world would be 1 but as we can see here ECS schedules a new container to replace the unhealthy one and that gets killed eventually and we see this bouncing behavior
I would like to set up a Cloudwatch alarm for this. When the value fluctuates a lot from the ideal value in a short period of time but I can't quite figure out if this is possible. Maybe with some metric math but I can't quite get it. I also looked into Anomaly Detection and I think that would work but it incurs extra cost that I don't think is warranted
Just looking to set off an alarm if value bounces around multiple y axis points in let's say a 5 minute time frame
You can do something like this with metric math:
RUNNING_SUM(ABS(DIFF(myMetric)))
This will return a rolling sum of all the absolute changes in the metric for your period. You can then create an alarm based on this, adjusting the desired period and threshold.
I want to create an alarm for a particular time window. So, the use case is if we see customer/traffic drop from 6:00 AM to 10 PM then we should get an alarm to know why customers are not using our service and to take some action. is this scenario possible through cloudwatch alarm? we have the number of request metric in place.
Amazon CloudWatch cannot specify time ranges, but since you want to know whether something "unusual" is happening, I would recommend you look at Using CloudWatch Anomaly Detection - Amazon CloudWatch:
When you enable anomaly detection for a metric, CloudWatch applies statistical and machine learning algorithms. These algorithms continuously analyze metrics of systems and applications, determine normal baselines, and surface anomalies with minimal user intervention.
See: New – Amazon CloudWatch Anomaly Detection | AWS News Blog
It should be able to notice if a metric goes outside of its "normal" range, and trigger an alarm.
I have a complex REST API deployed in AWS ECS. The autoscaling policy for the same is based on RequestCount of 2000.
The scale out will happen when RequestCount is consistently higher than 2000 with standard resolution per 60 seconds. This takes at least 2 minutes before scaling happens. This is becoming a problem with short-time request surge when request count increases to 10k and above. The containers start rejecting requests(throttling).
I need to at least make the scaling happen more quickly within a minute if not within seconds. AWS CloudWatch seems to offer High-Resolution metrics, but there's very less information about:
Can I enable specific metrics with high-resolution. Is it possible that I can have request counts resolved at high granularity of 5 seconds and CPUUtilization at standard granularity of 1 minute?
How can I enable high resolution on AWS metrics?
The AWS CloudWatch Documentation seems to be insufficient to understand this process.
There's two different things that can be 'high resolution', the alarm and the metric.
A High Resolution metric just means the source is pushing values more frequently. You can't control this if your using an AWS metric, and most of them don't push more often than once a minute.
A High Resolution alarm is one where the period is less than 60 seconds and will be billed at a higher rate than standard alarms. However, this isn't very useful in most cases if the metric your basing it on only gets pushed once per minute
EDIT:
To directly answer your questions
No, I don't think any of the AWS RequestCount metrics for things like ELB have a 'high resolution on/off' toggle (although ELB might push more frequently than 1 minute by default, I'm not sure)
its based on how often the source pushes data points to cloudwatch. If the AWS metrics don't work for what you need, you would need to add something like the CloudWatch agent (or just a script in your instance) pushing metric more frequently. Be careful about the CloudWatch API call charges if you do this from a lot of sources at a high frequency though
I run a Google Cloud dataflow job. I know how to monitor elementCount metric coming from it. But that metric shows me the total number of events processed by the job from its start. But how to monitor the rate? Like events per timespan, per minute in Stackdriver?
Ideally, I would like to apply a simple transformation on the elementCount metric inside the Stackdriver. But I'm afraid I would need to send a separate metric computed in the Dataflow job...
You can access all the stackdriver metrics via the API (although the elementCount is a gauge, you can fetch the time series). Here are all the dataflow metric in StackDriver:
https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataflow
Probably you need todo some calculations on the timeseries if you want to have the correct rate per time windows.
The API timeseries documentation is here:
https://cloud.google.com/monitoring/api/ref_v3/rpc/google.monitoring.v3
You can even access the API's in your dataflows. Note, that I think the way the metrics is used it should have been a counter.