How to monitor CPU usage of all EC2 instances via CloudWatch

I am trying to set up monitoring for a large number of EC2 instances, and their number is constantly changing. I would like the owner of an instance to receive a notification when its CPU usage has been low for a long time.
I could write a function that gets a list of all EC2 instances, fetches their CPU utilization, and then sends messages to the owners. This option does not suit me, since I want to watch the state over a period of time, not just the CPU utilization values at the second the function runs. And in general, this method looks bad.
I could set up an alarm in CloudWatch, but only for one specific instance. This option is not suitable either, since there are a lot of EC2 instances and their number varies.
I could create a dashboard with EC2 names and their CPU utilization, populated dynamically. But I haven't figured out how to send notifications from it.
How can I solve this problem without third-party solutions?

Please see this AWS blog post: https://aws.amazon.com/blogs/mt/use-tags-to-create-and-maintain-amazon-cloudwatch-alarms-for-amazon-ec2-instances-part-1/
It includes sample Lambda functions that automatically create a CloudWatch alarm whenever an EC2 instance is created.
It looks a little tricky, but it is worth a look if you really want to make this automatic. And yes, a single CloudWatch alarm can't monitor multiple EC2 instances.
The same sample Lambda function is also available as an existing template, which creates the function directly so you can test it.
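To illustrate the idea (a minimal sketch, not the code from the blog post; the trigger wiring, alarm name, thresholds, and SNS topic are assumptions), a Lambda function invoked when an instance is launched could create a low-CPU alarm for it with boto3's put_metric_alarm:

import boto3

cloudwatch = boto3.client('cloudwatch')

def create_low_cpu_alarm(instance_id, topic_arn):
    # Alarm fires when average CPU stays below 5% for 12 consecutive hours
    cloudwatch.put_metric_alarm(
        AlarmName=f'low-cpu-{instance_id}',      # one alarm per instance (hypothetical naming)
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        Statistic='Average',
        Period=3600,                             # 1-hour datapoints
        EvaluationPeriods=12,
        Threshold=5.0,
        ComparisonOperator='LessThanThreshold',
        AlarmActions=[topic_arn],                # SNS topic the owner subscribes to
    )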

I have solved my problem, and it seems to me that this is one of the simplest options.
Using the get_metric_data method from the AWS SDK for Python (Boto3), I wrote the following:
import boto3
from statistics import mean
from datetime import timedelta, datetime

cloudwatch_client = boto3.client('cloudwatch')

# Hourly average CPUUtilization for one instance over the last 24 hours
response = cloudwatch_client.get_metric_data(
    MetricDataQueries=[
        {
            'Id': 'myrequest',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/EC2',
                    'MetricName': 'CPUUtilization',
                    'Dimensions': [
                        {
                            'Name': 'InstanceId',
                            'Value': 'i-123abc456def'
                        }
                    ]
                },
                'Period': 3600,
                'Stat': 'Average',
                'Unit': 'Percent'
            }
        },
    ],
    StartTime=datetime.now() - timedelta(days=1),
    EndTime=datetime.now()
)

# Average the returned hourly datapoints into a single value
for metric_data_result in response['MetricDataResults']:
    list_avg = mean(metric_data_result['Values'])
    print(list_avg)
The output is the average CPU usage, as a percentage, over the specified time window.
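To go from this average to the notification described in the question, one option (a sketch only; it assumes an SNS topic the instance owner is subscribed to, and the threshold value is made up) is to publish a message when the average falls below the threshold:

import boto3

sns_client = boto3.client('sns')
CPU_THRESHOLD = 5.0  # percent; assumed value

def notify_owner_if_idle(instance_id, avg_cpu, topic_arn):
    # Publish to an SNS topic that the instance owner is subscribed to
    if avg_cpu < CPU_THRESHOLD:
        sns_client.publish(
            TopicArn=topic_arn,
            Subject=f'Low CPU usage on {instance_id}',
            Message=f'Average CPU over the last 24 hours was {avg_cpu:.1f}%.'
        )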
I'm still learning, but I'll try to answer your questions if there are any. Thank you all!

Related

Getting single time series from AWS CloudWatch metric maths SEARCH function

I'm attempting to create a CloudWatch alarm for if any instances in a group go over x% of memory used, and have built the following metric maths query to do so:
SEARCH('{CWAgent,InstanceId} MetricName="mem_used_percent"', 'Maximum', 300)
This graphs fine; however, the CloudWatch console complains "The expression for an alarm must create exactly one time series.". I believe that requirement is met: the query above should (and does) return a single line graph result that is not multi-dimensional.
How can I get this data returned in the format CloudWatch requires to create an alarm? My alternative is to generate a new alarm per instance creation, but that seems more complex, since I would have to manage the creation and destruction of alarms.
CloudWatch agent config on the instance for collecting the metric:
"metrics":{
"append_dimensions": {
"InstanceId": "${aws:InstanceId}"
},
"metrics_collected":{
"mem": {
"measurement": [
"used_percent"
]
},
"disk": {
"measurement": [ "used_percent" ],
"metrics_collection_interval": 60,
"resources": [ "/" ]
}
}
Unfortunately it's not possible to create an alarm based on a search expression, so I don't think there's (currently) a way to do what you're after.
Per https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create-alarm-on-metric-math-expression.html:
You can't create an alarm based on the SEARCH expression. This is because search expressions return multiple time series, and an alarm based on a math expression can watch only one time series.
This appears to be the case even when you only get one result from a SEARCH expression.
I tried to combine this down into one time series using AVG, but this then appeared to lose the context of the metric and instead gave the error 'The expression for an alarm must include at least one metric'.
I'm currently handling a similar case with a pair of Lambda functions tied to CloudTrail events for RunInstances and TerminateInstances, which parse the event data for the instance ID and (among other things) create and delete individual CloudWatch alarms.
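A rough sketch of that pattern (not my actual code; the alarm-naming convention, the EventBridge wiring, and the exact CloudTrail event field layout are assumptions), showing the delete side for TerminateInstances:

import boto3

cloudwatch = boto3.client('cloudwatch')

def handler(event, context):
    # An EventBridge rule forwards the CloudTrail "TerminateInstances" event here
    items = event['detail']['requestParameters']['instancesSet']['items']
    for item in items:
        instance_id = item['instanceId']
        # Remove the per-instance alarm that was created at launch time
        cloudwatch.delete_alarms(AlarmNames=[f'mem-used-{instance_id}'])

The RunInstances counterpart would call put_metric_alarm using the same naming convention.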
This example displays one line for each instance in the Region, showing the CPUUtilization metric from the AWS/EC2 namespace.
SEARCH(' {AWS/EC2,InstanceId} MetricName="CPUUtilization" ', 'Average', 300)
Changing InstanceId to InstanceType changes the graph to show one line for each instance type used in the Region. Data from all instances of each type is aggregated into one line for that instance type.
SEARCH(' {AWS/EC2,InstanceType} MetricName="CPUUtilization" ', 'Average', 300)
Removing the dimension name but keeping the namespace in the schema, as in the following example, results in a single line showing the aggregation of CPUUtilization metrics for all instances in the Region.
SEARCH(' {AWS/EC2} MetricName="CPUUtilization" ', 'Average', 300)
Refer to the CloudWatch documentation on search expression syntax for a detailed explanation of search queries.
To select metrics, refer to the CloudWatch documentation for a step-by-step explanation.

Report matching time stamps from automatically triggered lambda functions

I am using Amazon CloudWatch to trigger 4 different Lambda functions every twelve hours. The Lambda functions pull some data from an API and save it to my database. I want to make sure that the timestamp matches for the data from all my Lambda functions. Initially I used the PostgreSQL default timestamp; however, this records time to the millisecond, which introduces small discrepancies in time.
It seems like the CloudWatch rule which invokes my Lambda functions might be able to pass along an identical timestamp, but I haven't been able to figure out how to do this, or even verify whether it is possible.
I really don't need the time stamp to go to the minute. Mostly I am concerned with the date and whether it was the AM or PM batch so knowing time to the nearest hour is good enough.
If any AWS experts could lend me some advice it would be appreciated.
The scheduled CloudWatch (CW) Event rule passes the following event object to the lambda function, e.g.:
{
    "version": "0",
    "id": "a75ba59d-81d6-8363-8e68-593f7de30b09",
    "detail-type": "Scheduled Event",
    "source": "aws.events",
    "account": "32323232",
    "time": "2021-02-21T06:29:27Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:events:us-east-1:32323232:rule/test"
    ],
    "detail": {}
}
As you can see, time is measured to the second. Also, CloudWatch does not guarantee exact execution of its events; they can be off by up to 1 minute:
Your scheduled rule is triggered within that minute, but not on the precise 0th second
So your four functions will see slightly different times. Thus, you have to manage that in your code, for example by rounding to the nearest hour (as in the sketch below).
The alternative is to use your Lambda environment's built-in tools for getting a timestamp, instead of using the time from the event. This can be easier, since you can get a timestamp with 1-hour precision directly, rather than parsing the time from the event and then post-processing it to the desired precision.
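A minimal sketch of the first approach (assuming the scheduled-event object shown above; the rounding helper is illustrative, not part of the original answer):

from datetime import datetime, timedelta

def round_to_hour(ts):
    # Round to the nearest hour so small scheduling jitter between the
    # four functions does not produce different timestamps
    return (ts + timedelta(minutes=30)).replace(minute=0, second=0, microsecond=0)

def handler(event, context):
    # 'time' comes from the scheduled event, e.g. "2021-02-21T06:29:27Z"
    event_time = datetime.strptime(event['time'], '%Y-%m-%dT%H:%M:%SZ')
    batch_time = round_to_hour(event_time)
    return batch_time.isoformat()  # e.g. '2021-02-21T06:00:00'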

Listing Notebook instance tags can take ages

I am currently using the boto3 SDK from a Lambda function in order to retrieve various information about the SageMaker Notebook Instances deployed in my account (almost 70, so not that many...).
One of the operations I am trying to perform is listing the tags for each instance.
However, from time to time it takes ages to return the tags: my Lambda either gets stopped (I could increase the timeout, but still...) or a ThrottlingException is raised from the sagemaker.list_tags call (which could be avoided by increasing the number of retries when creating the sagemaker boto3 client):
import time

import boto3
from botocore.config import Config

sagemaker = boto3.client("sagemaker", config=Config(retries=dict(max_attempts=10)))

def handler(event, context):
    instances_dict = sagemaker.list_notebook_instances()
    if not instances_dict['NotebookInstances']:
        return "No Notebook Instances"
    while instances_dict:
        for instance in instances_dict['NotebookInstances']:
            print(instance['NotebookInstanceArn'])
            start = time.time()
            # This call is the one that intermittently gets throttled
            tags_notebook_instance = sagemaker.list_tags(ResourceArn=instance['NotebookInstanceArn'])['Tags']
            print(time.time() - start)
        instances_dict = sagemaker.list_notebook_instances(NextToken=instances_dict['NextToken']) if 'NextToken' in instances_dict else None
If you guys have any idea to avoid such delays :)
TY
As you've noted you're getting throttled. Rather than increasing the number of retries you might try to change the delay (i.e. increase the growth_factor). Seems to be configurable looking at https://github.com/boto/botocore/blob/develop/botocore/data/_retry.json#L83
Note that buckets (and refill rates) are usually at the second granularity. So with 70 ARNs you're looking at some number of seconds; double digits does not surprise me.
You might want to consider breaking up the work differently since adding retries/larger growth_factor will just increase the length of time the function will run.
I've had pretty good success at breaking things up so that the Lambda function only processes a single ARN per invocation. The Lambda is processing work (I'll typically use an SQS queue to manage what needs to be processed), and the rate of work is configurable via a combination of the Lambda configuration and the SQS message visibility; a rough sketch follows.
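Something along these lines (illustrative only, not my actual code; the queue wiring and names are assumptions). Each SQS message body carries a single notebook-instance ARN, and the Lambda looks up tags for just that one resource:

import boto3

sagemaker = boto3.client("sagemaker")

def handler(event, context):
    # The Lambda is subscribed to an SQS queue; each record body is one ARN
    for record in event['Records']:
        notebook_arn = record['body']
        tags = sagemaker.list_tags(ResourceArn=notebook_arn)['Tags']
        # ...process the tags for this single notebook instance...
        print(notebook_arn, tags)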
Not knowing what you're trying to accomplish outside of your original Lambda, I realize that breaking up the work this way might (or will) add challenges to what you're doing overall.
It's also worth noting that if you have CloudTrail enabled the tags will be part of the event data (request data) for the "EventName" (which matches the method called, i.e. CreateTrainingJob, AddTags, etc.).
A third option: if you are trying to find all of the notebook instances with a specific tag, you can use Resource Groups to create a query and find the ARNs with those tags fairly quickly.
CloudTrail: https://docs.aws.amazon.com/awscloudtrail/latest/APIReference/Welcome.html
Resource Groups: https://docs.aws.amazon.com/ARG/latest/APIReference/Welcome.html
Lambda with SQS: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html

AWS DMS Ongoing Replication Falling Behind?

We are using AWS DMS for on-going replication of specific tables from one Oracle RDS database instance to another Oracle RDS database (both 11g).
Intermittently, the replication seems to fall behind or get out of sync. There are no errors in the log and everything is reported as successful, but data is missing.
We can kick off a full refresh and the data will show up, but this isn't a viable option on a regular basis. This is a production system, and a full refresh takes upwards of 14 hours.
We would like to monitor whether the destination database is [at least mostly] up to date, meaning no more than 2-3 hours behind.
I've found that you can get the current SCN from the source database using "SELECT current_scn FROM V$DATABASE" and from the target in the "awsdms_txn_state" table.
However, that table doesn't exist and I don't see any option to enable TaskRecoveryTableEnabled when creating or modifying a task.
Is there an existing feature that will automatically monitor these values? Can it be done through Lambda?
If DMS is reporting success, then we have no way of knowing that our data is hours or days behind until someone calls us complaining.
I do see an option in the DMS task to "Enable validation", but intuition tells me that's going to create a significant amount of unwanted overhead.
Thanks in advance.
There are a few questions here:
Task Monitoring of CDC Latency
How to set TaskRecoveryTableEnabled
For the first, Task Monitoring provides a number of CloudWatch metrics (see all the CDC* metrics).
It is possible to see on these metrics when the target is out of sync with the source, and where in the replication instance's process these changes are. The detailed blog from AWS explaining these Task Monitoring metrics is worth reading.
One option is to put a CloudWatch Alarm on the CDCLatencySource.
Alternatively, you can create your own Lambda on a CloudWatch schedule to run your SCN queries on source and target and publish a custom CloudWatch metric using PutMetricData; you can then create a CloudWatch Alarm on this metric if they are out of sync. A sketch of the metric-publishing side is below.
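A minimal sketch of that Lambda (the namespace and metric name are assumptions, and the SCN queries themselves are omitted; only the CloudWatch side is shown):

import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_scn_lag(source_scn, target_scn):
    # Difference between the source SCN (from V$DATABASE) and the SCN
    # recorded on the target, published as a custom metric to alarm on
    cloudwatch.put_metric_data(
        Namespace='Custom/DMS',
        MetricData=[{
            'MetricName': 'ScnLag',
            'Value': float(source_scn - target_scn),
            'Unit': 'None'
        }]
    )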
For the second question, to set TaskRecoveryTableEnabled via the console, tick the option "Create recovery table on target DB".
After ticking this you can confirm that the TaskRecoveryTableEnabled is set to Yes by looking at the Overview tab of the task. At the bottom there is the Task Settings json which will have something like:
"TargetMetadata": {
"TargetSchema": "",
"SupportLobs": true,
"FullLobMode": false,
"LobChunkSize": 0,
"LimitedSizeLobMode": true,
"LobMaxSize": 32,
"InlineLobMaxSize": 0,
"LoadMaxFileSize": 0,
"ParallelLoadThreads": 0,
"ParallelLoadBufferSize": 0,
"BatchApplyEnabled": false,
"TaskRecoveryTableEnabled": true
}

CloudWatch does not aggregate across dimensions for your custom metrics

Reading the docs I saw this statement:
CloudWatch does not aggregate across dimensions for your custom metrics
That seems like a HUGE limitation, right? It would make custom metrics all but useless in my estimation, so I want to confirm I'm understanding this.
For example, say I had a custom metric I shipped from multiple servers. I want to see it per server, but I also want to see all the servers together. Would I have no way of aggregating that across all the servers? Or would I be forced to create two custom metrics, one per server and one for all servers, and double-post from each server to both the per-server metric AND the one aggregating all of them?
The docs are correct, CloudWatch won't aggregate across dimensions for your custom metrics (it will do so for some metrics published by other services, like EC2).
This feature may seem useful and clear for your use-case but it's not clear how such aggregation would behave in a general case. CloudWatch allows for up to 10 dimensions so aggregating for all combinations of those may result in a lot of useless metrics, for all of which you would be billed. People may use dimensions to split their metrics between Test and Prod stacks for example, which are completely separate and aggregating those would not make sense.
CloudWatch treats a metric name plus a full set of dimensions as a unique metric identifier. In your case, this means that you need to publish each observation separately to every metric you want it to contribute to.
Let's say you have a metric named Latency, and you're putting a hostname in a dimension called Server. If you have three servers this will create three metrics:
Latency, Server=server1
Latency, Server=server2
Latency, Server=server3
So the approach you mentioned in your question will work. If you also want a metric showing the data across all servers, each server would need to publish to a separate metric, which is best done by using a new common value for the Server dimension, something like AllServers (see the sketch after the list below). This will result in you having 4 metrics, like this:
Latency, Server=server1 <- only server1 data
Latency, Server=server2 <- only server2 data
Latency, Server=server3 <- only server3 data
Latency, Server=AllServers <- data from all 3 servers
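A sketch of that double-publish approach (the namespace and function name are illustrative): each observation is sent twice, once with the real server name and once with the common AllServers value:

import boto3

cloudwatch = boto3.client('cloudwatch')

def put_latency(server_name, latency_ms):
    # Publish the same observation under the per-server dimension value
    # and under the common "AllServers" dimension value
    cloudwatch.put_metric_data(
        Namespace='MyApp',
        MetricData=[
            {
                'MetricName': 'Latency',
                'Dimensions': [{'Name': 'Server', 'Value': server_name}],
                'Value': latency_ms,
                'Unit': 'Milliseconds'
            },
            {
                'MetricName': 'Latency',
                'Dimensions': [{'Name': 'Server', 'Value': 'AllServers'}],
                'Value': latency_ms,
                'Unit': 'Milliseconds'
            }
        ]
    )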
Update 2019-12-17
Using metric math SEARCH function: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
This will give you per server latency and latency across all servers, without publishing a separate AllServers metric and if a new server shows up, it will be automatically picked up by the expression:
Graph source:
{
    "metrics": [
        [ { "expression": "SEARCH('{SomeNamespace,Server} MetricName=\"Latency\"', 'Average', 60)", "id": "e1", "region": "eu-west-1" } ],
        [ { "expression": "AVG(e1)", "id": "e2", "region": "eu-west-1", "label": "All servers", "yAxis": "right" } ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "region": "eu-west-1"
}
The result is a graph with one line per server plus an "All servers" line on the right axis.
Downsides of this approach:
Expressions are limited to 100 metrics.
Overall aggregation is limited to available metric math functions, which means percentiles are not available as of 2019-12-17.
Using Contributor Insights (open preview as of 2019-12-17): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html
If you publish your logs to CloudWatch Logs in JSON or Common Log Format (CLF), you can create rules that keep track of top contributors. For example, a rule that keeps track of servers with latencies over 400 ms would look something like this:
{
    "Schema": {
        "Name": "CloudWatchLogRule",
        "Version": 1
    },
    "AggregateOn": "Count",
    "Contribution": {
        "Filters": [
            {
                "Match": "$.Latency",
                "GreaterThan": 400
            }
        ],
        "Keys": [
            "$.Server"
        ],
        "ValueOf": "$.Latency"
    },
    "LogFormat": "JSON",
    "LogGroupNames": [
        "/aws/lambda/emf-test"
    ]
}
The result is a list of the servers with the most datapoints over 400 ms.
Bringing it all together with the CloudWatch Embedded Metric Format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html
If you publish your data in the CloudWatch Embedded Metric Format (a minimal example follows this list), you can:
Easily configure dimensions, so you can have per server metrics and overall metric if you want.
Use CloudWatch Logs Insights to query and visualise your logs.
Use Contributor Insights to get top contributors.
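For reference, a minimal sketch of emitting an Embedded Metric Format log line from Python (the namespace, dimension, and values are made up). Printing this JSON from a Lambda sends it to CloudWatch Logs, where the embedded metric definition turns it into a Latency metric with a Server dimension:

import json
import time

def emit_emf(server_name, latency_ms):
    # One structured log line that doubles as a metric datapoint
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [["Server"]],
                "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}]
            }]
        },
        "Server": server_name,
        "Latency": latency_ms
    }))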