Define a CloudWatch alert on replication lag - amazon-web-services

I have a Master-Slave configuration on AWS RDS MySQL.
I want to set an alert when the replication lag goes above a certain threshold (e.g. 10 seconds).
How can it be done?
If it is not possible, is there another way to achieve a similar result (without using 3rd party tools / custom scripting)?

You can track replica lag using the ReplicaLag metric on your slave instance. Note that this metric is measured in seconds. It is reported automatically by RDS every minute.
You can create a CloudWatch alarm to monitor the ReplicaLag metric. Set the alarm to breach when the sum of ReplicaLag over an evaluation period of 1 minute is greater than your threshold: 10 for the 10-second example in the question, or 0 to alarm on any lag at all.
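For reference, here is a minimal sketch of what such an alarm could look like when created through boto3 rather than the console. The helper name, the alarm name and the instance identifier "my-replica" are made up for illustration; the dictionary keys are the parameters of CloudWatch's PutMetricAlarm call.

```python
def replica_lag_alarm_params(db_instance_id, threshold_seconds=10, sns_topic_arn=None):
    """Build keyword arguments for cloudwatch.put_metric_alarm (hypothetical helper)."""
    params = {
        "AlarmName": f"{db_instance_id}-replica-lag",  # illustrative name
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Sum",          # as suggested in the answer
        "Period": 60,                # RDS reports ReplicaLag once a minute
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        # Optional: notify an SNS topic when the alarm fires.
        params["AlarmActions"] = [sns_topic_arn]
    return params


params = replica_lag_alarm_params("my-replica", threshold_seconds=10)

# To actually create the alarm (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

This stays within CloudWatch itself, so it should fit the "no 3rd party tools" constraint from the question.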

Related

To find underutilized instances list based on CPU and network utilization

I would like to get a list of all the AWS EC2 instances that are idle and underutilized. I need to filter them on CPU utilization of less than 2% and network I/O of less than 5Mb over the last 30 days.
Can you please provide me with the commands or scripts to get the list, or guide me on how to achieve that?
I need this list so I can terminate those instances for cost management.
You can create a CloudWatch alarm that is triggered when the average
CPU utilization percentage has been lower than 2 percent for 24 hours,
signaling that it is idle and no longer in use. You can adjust the
threshold, duration, and period to suit your needs, plus you can add
an SNS notification so that you will receive an email when the alarm
is triggered.
Kindly refer to this documentation to create an alarm to terminate an idle instance using the Amazon CloudWatch console.
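If you prefer to script it instead of using the console, a rough sketch of the same alarm could look like this. The helper name, alarm name, region and instance ID are placeholders; the 2 percent / 24 hour values come from the description above, expressed as 24 one-hour periods.

```python
def idle_instance_alarm_params(instance_id, region, cpu_threshold=2.0, hours=24):
    """Build keyword arguments for cloudwatch.put_metric_alarm (hypothetical helper)."""
    return {
        "AlarmName": f"idle-{instance_id}",  # illustrative name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 3600,                      # one-hour periods
        "EvaluationPeriods": hours,          # 24 periods = 24 hours
        "Threshold": float(cpu_threshold),
        "ComparisonOperator": "LessThanThreshold",
        # Built-in EC2 action that terminates the instance when the alarm fires.
        # Add an SNS topic ARN to this list if you also want an email.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:terminate"],
    }


params = idle_instance_alarm_params("i-0123456789abcdef0", "us-east-1")

# To create the alarm (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Note that this only covers the CPU condition; the network I/O condition from the question would need a second alarm or a separate check.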

How to increase resolution of CpuUtilization metric of ECS cluster past 1 min mark?

I'm trying to create a robust autoscaling process for my ECS cluster but am facing problems with resolution of CpuUtilization metric. I have turned on 'Detailed metrics' for 1-min resolution, but am not able to achieve good scaling results. I am deploying an ML model which takes roughly 1.5s to infer. I am not facing any memory bottleneck and hence, am using CpuUtilization for scaling.
I need fast scaling because when requests start piling up, the response time easily shoots up to 3-5s. Currently, with 'Detailed Metrics' enabled, the scale-out takes around 3-5 minutes to start, as 3 datapoints are checked for 1-min res metrics. If I had a 5-10s res metric, I could look at 6 data points within 30s and start the scale-out job faster.
I tried using Lambda, StepFunctions and EventBridge from this blog. But, I am not able to get CpuUtilization or MemoryUtilization, only the task, service and container counts.
Is there a way to get Cpu and Memory metrics directly from ECS? I know we can use cloudwatch.get_metric_statistics(). But, we can only get datapoints that are reported to CloudWatch. So, not useful.
You can't change that. The 1-minute resolution is set by AWS. The only way to get better resolution is to publish your own custom metrics, which can have a resolution as fine as 1 second.
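As a rough sketch of the custom metric route: the namespace "Custom/ECS", the metric name and the service name below are assumptions, and how you measure the CPU value inside the task is up to you. The part that matters is StorageResolution=1, which marks the data point as high resolution.

```python
from datetime import datetime, timezone


def high_res_datapoint(service_name, cpu_percent):
    """Build one MetricData entry for cloudwatch.put_metric_data (hypothetical helper)."""
    return {
        "MetricName": "CpuUtilizationHighRes",  # illustrative name
        "Dimensions": [{"Name": "ServiceName", "Value": service_name}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(cpu_percent),
        "Unit": "Percent",
        "StorageResolution": 1,  # 1 = high resolution, 60 = standard
    }


datapoint = high_res_datapoint("ml-inference", 73.5)

# To publish it (needs boto3 and AWS credentials), e.g. every 5 seconds:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Custom/ECS", MetricData=[datapoint])
```

You would then base the scaling alarm on this custom metric instead of the built-in CpuUtilization.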

Auto scale rule based on custom Cloudwatch alarm

I have an auto-scaling group of EC2 servers that run a number of processes.
This number of processes changes with the load and I'd like to trigger a scaling (up/down) based on the number of processes.
I've successfully set up a script that sends the number of processes on every server to CloudWatch every minute, and I can see these values in CloudWatch. (I haven't set a dimension, so that I can get the value for all the servers.)
Then I created an alarm that uses the average of the values sent. If it reaches a certain limit, it triggers the "Add a new server" action on the auto scaling group, and when it stops being in alarm, it triggers a "Remove a server" action.
My issue is that when I add the new server, the average drops, since there is one more server now. This moves the alarm to the OK state, which removes the server, which increases the average again, which triggers the alarm again, etc.
For instance, the limit is set to 10 processes on average. With 3 servers, if the average becomes 11, I trigger the alarm state, adding a server. Now with the new server, I'm at 33 processes (3 x 11) for 4 servers: 8.25 processes on average, thus triggering the "OK" state.
My question is: Is it possible to set up an alarm based on the number of processes without the new trigger causing an up-down-up-down issue?
Instead of average, I can use something else to trigger the alarm, such as min/max/I-don't-know.
Thank you for your help. Happy to provide any other details if needed.
You should not create an alarm that adds instances when True and removes instances when False. This will cause a continual 'flip-flop' situation rather than trying to find a steady-state.
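To make the flip-flop concrete, here is a tiny simulation using the numbers from the question (33 processes, 3 servers, limit of 10). It is only a toy model of the add-on-ALARM / remove-on-OK rule, not of how CloudWatch actually evaluates alarms.

```python
def simulate(total_processes, servers, limit, steps):
    """Toy model: add a server while in ALARM, remove one while OK."""
    history = []
    for _ in range(steps):
        average = total_processes / servers
        in_alarm = average > limit
        history.append((servers, round(average, 2), "ALARM" if in_alarm else "OK"))
        servers += 1 if in_alarm else -1
        servers = max(servers, 1)  # never go below one server
    return history


for servers, average, state in simulate(33, 3, 10, 4):
    print(servers, average, state)
# 3 11.0 ALARM
# 4 8.25 OK
# 3 11.0 ALARM
# 4 8.25 OK
```

The group never settles, because both 3 and 4 servers push the alarm into the opposite state.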
You could have each server regularly send a custom metric to Amazon CloudWatch. You could then use this with target tracking scaling policies for Amazon EC2 Auto Scaling, which will calculate the average value of the metric and automatically launch/terminate instances to keep the value around the target of 10.
This would work well with long-running processes (perhaps 5+ minutes with several processes running concurrently), but would not be good with short sub-minute processes because it takes time to launch new instances.
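A minimal sketch of such a policy via boto3 could look like the following. The group name, policy name, namespace "Custom/App" and metric name "ProcessCount" are assumptions; use whatever your script already publishes. The metric is specified without dimensions here, matching the setup described in the question.

```python
def process_count_policy_params(asg_name, target_per_instance=10.0):
    """Build keyword arguments for autoscaling.put_scaling_policy (hypothetical helper)."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "keep-process-count-near-target",  # illustrative name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "CustomizedMetricSpecification": {
                "Namespace": "Custom/App",      # assumed custom namespace
                "MetricName": "ProcessCount",   # assumed custom metric name
                "Statistic": "Average",
            },
            "TargetValue": float(target_per_instance),
        },
    }


params = process_count_policy_params("my-asg", 10)

# To attach the policy (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("autoscaling").put_scaling_policy(**params)
```

With this in place you would remove the hand-made add/remove alarm, since the policy manages its own alarms.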
I think you could look at metric math. Instead of triggering your alarm directly on your process-count metric alone, you could perhaps calculate the average count yourself using metric math. You could use the GroupTotalInstances metric from your ASG, or just publish a second custom metric holding the number of instances.
In both cases, your metric for the alarm would use metric math to divide number of processes by size of ASG for each evaluation period.
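Something along these lines, as a sketch only. The custom namespace and metric name are assumptions, and GroupTotalInstances is only available if group metrics collection is enabled on the ASG. The alarm is defined through the Metrics list of PutMetricAlarm, where only the expression returns data.

```python
def processes_per_instance_alarm_params(asg_name, limit=10.0):
    """Build keyword arguments for cloudwatch.put_metric_alarm (hypothetical helper)."""
    return {
        "AlarmName": f"{asg_name}-processes-per-instance",  # illustrative name
        "EvaluationPeriods": 1,
        "Threshold": float(limit),
        "ComparisonOperator": "GreaterThanThreshold",
        "Metrics": [
            {   # total number of processes across all servers
                "Id": "procs",
                "MetricStat": {
                    "Metric": {"Namespace": "Custom/App", "MetricName": "ProcessCount"},
                    "Period": 60,
                    "Stat": "Sum",
                },
                "ReturnData": False,
            },
            {   # current size of the auto scaling group
                "Id": "size",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/AutoScaling",
                        "MetricName": "GroupTotalInstances",
                        "Dimensions": [
                            {"Name": "AutoScalingGroupName", "Value": asg_name}
                        ],
                    },
                    "Period": 60,
                    "Stat": "Average",
                },
                "ReturnData": False,
            },
            {   # the value the alarm actually evaluates
                "Id": "per_instance",
                "Expression": "procs / size",
                "Label": "Processes per instance",
                "ReturnData": True,
            },
        ],
    }


params = processes_per_instance_alarm_params("my-asg")
# import boto3; boto3.client("cloudwatch").put_metric_alarm(**params)
```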

AWS High Resolution Metrics for faster ECS scaling

I have a complex REST API deployed in AWS ECS. The autoscaling policy for the same is based on RequestCount of 2000.
The scale-out happens when RequestCount is consistently higher than 2000 at the standard resolution of 60 seconds. This takes at least 2 minutes before scaling happens. This becomes a problem during short request surges, when the request count increases to 10k and above. The containers start rejecting requests (throttling).
I need the scaling to happen more quickly, at least within a minute if not within seconds. AWS CloudWatch seems to offer high-resolution metrics, but there's very little information about:
Can I enable high resolution for specific metrics only? Is it possible to have request counts resolved at a high granularity of 5 seconds and CPUUtilization at the standard granularity of 1 minute?
How can I enable high resolution on AWS metrics?
The AWS CloudWatch Documentation seems to be insufficient to understand this process.
There are two different things that can be 'high resolution': the alarm and the metric.
A high-resolution metric just means the source is pushing values more frequently. You can't control this if you're using an AWS metric, and most of them don't push more often than once a minute.
A high-resolution alarm is one where the period is less than 60 seconds, and it will be billed at a higher rate than standard alarms. However, this isn't very useful in most cases if the metric you're basing it on only gets pushed once per minute.
EDIT:
To directly answer your questions
No, I don't think any of the AWS RequestCount metrics for things like ELB have a 'high resolution on/off' toggle (although ELB might push more frequently than once a minute by default, I'm not sure).
It's based on how often the source pushes data points to CloudWatch. If the AWS metrics don't work for what you need, you would need to add something like the CloudWatch agent (or just a script on your instance) pushing the metric more frequently. Be careful about the CloudWatch API call charges if you do this from a lot of sources at a high frequency, though.
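To illustrate the alarm side, here is a sketch of a high-resolution alarm on a self-published request count. The namespace and metric name are assumptions, and it only makes sense if your own code pushes that metric at sub-minute intervals. Since the alarm uses Sum, the 2000-per-minute threshold from the question has to be scaled down to the shorter period.

```python
def scale_threshold(per_minute_threshold, period_seconds):
    """Convert a per-60-second Sum threshold to a shorter period."""
    return per_minute_threshold * period_seconds / 60


def high_res_alarm_params(namespace, metric_name, per_minute_threshold,
                          period=10, evaluation_periods=3):
    """Build keyword arguments for cloudwatch.put_metric_alarm (hypothetical helper)."""
    return {
        "AlarmName": f"{metric_name}-high-res",  # illustrative name
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": period,                        # below 60 = high-resolution alarm
        "EvaluationPeriods": evaluation_periods,
        "Threshold": scale_threshold(per_minute_threshold, period),
        "ComparisonOperator": "GreaterThanThreshold",
    }


params = high_res_alarm_params("Custom/App", "RequestCountHighRes", 2000)

# Length of the window the alarm looks at before it can fire:
window_seconds = params["Period"] * params["EvaluationPeriods"]  # 30 seconds
# import boto3; boto3.client("cloudwatch").put_metric_alarm(**params)
```

Three 10-second periods give a 30-second evaluation window, compared with the 2 minutes or more described in the question.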

Amazon RDS: plot number of instances in CloudWatch

How do I plot the number of instances in my AWS Aurora RDS cluster over time in CloudWatch?
There doesn't seem to be a metric for that.
Indeed there is no metric for that.
UPDATE: the below trick is not 100% foolproof: when the dashboard range is set to 1d or more, the display period automatically changes to 5 Minutes, which leads to values being off by a factor of 5.
The trick is to pick any RDS aggregated metric (for example CPUUtilization, aggregated per DB role), then select Statistic: Sample Count and Period: 1 Minute.
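The factor-of-5 problem follows from how Sample Count works: each instance contributes one sample per minute, so a 5-minute period counts each instance five times. A small sketch of the correction, assuming one data point per instance per minute (the function name is made up):

```python
def instance_count(sample_count, period_seconds, reporting_interval_seconds=60):
    """Estimate the number of instances from a Sample Count value.

    Assumes every instance reports exactly one data point per reporting interval.
    """
    samples_per_instance = period_seconds / reporting_interval_seconds
    return sample_count / samples_per_instance


print(instance_count(3, 60))    # 1-minute period: 3 samples  -> 3.0 instances
print(instance_count(15, 300))  # 5-minute period: 15 samples -> 3.0 instances
```

On a dashboard, the same correction can probably be done with a metric math expression such as m1 / (PERIOD(m1) / 60), where m1 is the Sample Count metric, so the graph stays correct when the period changes.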