CloudWatch does not aggregate across dimensions for your custom metrics - amazon-web-services

Reading the docs I saw this statement:
CloudWatch does not aggregate across dimensions for your custom
metrics
That seems like a HUGE limitation, right? It would make custom metrics all but useless in my estimation, so I want to confirm I'm understanding this.
For example, say I had a custom metric I shipped from multiple servers. I want to see it per server, but I also want to see all the servers together. Would I have no way of aggregating that across all the servers? Or would I be forced to create two custom metrics, one per-server and one for all servers, and double-post from each server, once to the per-server metric AND once to the aggregate one?

The docs are correct: CloudWatch won't aggregate across dimensions for your custom metrics (it will do so for some metrics published by other services, like EC2).
This feature may seem useful and obvious for your use case, but it's not clear how such aggregation should behave in the general case. CloudWatch allows up to 10 dimensions per metric, so aggregating across all combinations of those could produce a lot of useless metrics, all of which you would be billed for. People may use dimensions to split their metrics between Test and Prod stacks, for example, which are completely separate, and aggregating across those would not make sense.
CloudWatch treats a metric name plus a full set of dimensions as a unique metric identifier. In your case, this means that you need to publish your observations separately to each metric you want them to contribute to.
Let's say you have a metric named Latency, and you're putting a hostname in a dimension called Server. If you have three servers this will create three metrics:
Latency, Server=server1
Latency, Server=server2
Latency, Server=server3
So the approach you mentioned in your question will work. If you also want a metric showing the data across all servers, each server needs to publish to a separate, shared metric; the cleanest way is to use a common value for the Server dimension, something like AllServers. You then end up with 4 metrics, like this:
Latency, Server=server1 <- only server1 data
Latency, Server=server2 <- only server2 data
Latency, Server=server3 <- only server3 data
Latency, Server=AllServers <- data from all 3 servers
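If it helps, here's a minimal boto3 sketch of that double-publishing approach (the SomeNamespace namespace and the sample value are placeholders, not from the question):
import boto3

cloudwatch = boto3.client('cloudwatch')

def put_latency(server, latency_ms):
    # One observation is published twice: once under the per-server
    # dimension value and once under the shared AllServers value.
    cloudwatch.put_metric_data(
        Namespace='SomeNamespace',
        MetricData=[
            {
                'MetricName': 'Latency',
                'Dimensions': [{'Name': 'Server', 'Value': value}],
                'Value': latency_ms,
                'Unit': 'Milliseconds',
            }
            for value in (server, 'AllServers')
        ],
    )

put_latency('server1', 123.0)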
Update 2019-12-17
Using metric math SEARCH function: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
This will give you per-server latency and latency across all servers without publishing a separate AllServers metric, and if a new server shows up, it will be picked up automatically by the expression:
Graph source:
{
    "metrics": [
        [ { "expression": "SEARCH('{SomeNamespace,Server} MetricName=\"Latency\"', 'Average', 60)", "id": "e1", "region": "eu-west-1" } ],
        [ { "expression": "AVG(e1)", "id": "e2", "region": "eu-west-1", "label": "All servers", "yAxis": "right" } ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "region": "eu-west-1"
}
The result is a graph with one latency line per server, plus an "All servers" average line on the right axis.
Downsides of this approach:
Expressions are limited to 100 metrics.
Overall aggregation is limited to available metric math functions, which means percentiles are not available as of 2019-12-17.
Using Contributor Insights (open preview as of 2019-12-17): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html
If you publish your logs to CloudWatch Logs in JSON or Common Log Format (CLF), you can create rules that keep track of top contributors. For example, a rule that keeps track of servers with latencies over 400 ms would look something like this:
{
    "Schema": {
        "Name": "CloudWatchLogRule",
        "Version": 1
    },
    "AggregateOn": "Count",
    "Contribution": {
        "Filters": [
            {
                "Match": "$.Latency",
                "GreaterThan": 400
            }
        ],
        "Keys": [
            "$.Server"
        ],
        "ValueOf": "$.Latency"
    },
    "LogFormat": "JSON",
    "LogGroupNames": [
        "/aws/lambda/emf-test"
    ]
}
The result is a list of the servers with the most data points over 400 ms.
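For reference, a log event that this rule would match might look like this (values invented for the example):
{ "Server": "server1", "Latency": 512 }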
Bringing it all together with CloudWatch Embedded Metric Format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html
If you publish your data in CloudWatch Embedded Metric Format you can:
Easily configure dimensions, so you can have per-server metrics and an overall metric if you want.
Use CloudWatch Logs Insights to query and visualise your logs.
Use Contributor Insights to get top contributors.
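For illustration, an Embedded Metric Format log line carrying the Latency metric with a Server dimension might look like this (namespace and values are placeholders):
{
    "_aws": {
        "Timestamp": 1576540800000,
        "CloudWatchMetrics": [
            {
                "Namespace": "SomeNamespace",
                "Dimensions": [ [ "Server" ] ],
                "Metrics": [ { "Name": "Latency", "Unit": "Milliseconds" } ]
            }
        ]
    },
    "Server": "server1",
    "Latency": 123
}
CloudWatch extracts the Latency metric from the log line automatically, so the same event can feed Logs Insights queries, Contributor Insights rules, and regular metrics.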

Related

google cloud platform -- creating alert policy -- how to specify message variable in alerting documentation markdown?

So I've created a logging alert policy on Google Cloud that monitors the project's logs and sends an alert if it finds a log that matches a certain query. This is all good and fine, but whenever it does send an email alert, it's barebones. I am unable to include anything useful in the email, such as the actual message; the user must instead click "View incident" and go to the specified timeframe of when the alert happened.
Is there no way to include the message? As far as I can tell from the GCP "Using Markdown and variables in documentation templates" doc, there isn't.
I'm only really able to use ${resource.label.x}, which isn't all that useful because most of that is already included in the alert by default.
Could I have something like ${jsonPayload.message}? It didn't work when I tried it.
Probably (!) not.
To be clear, the alerting policies track metrics (not logs) and you've created a log-based metric that you're using as the basis for an alert.
There's information loss between the underlying log (which contains e.g. jsonPayload) and the metric produced from it (which probably does not). You can create log-based metric labels using expressions that include the underlying log entry fields.
However, per the example in Google's docs, you'd want to use a limited (enum) type for these values (e.g. HTTP status, although even that may be too broad) rather than a potentially unbounded jsonPayload.
It is possible. Suppose you need to pass jsonPayload.message from your GCP log into the documentation section of your policy. You need to use the label_extractors feature to extract the log message.
I will share a policy-creation template (a Python dict, as used with the Monitoring API) in which jsonPayload.message is passed into the documentation section of the policy:
policy_json = {
    "display_name": "<policy_name>",
    "documentation": {
        "content": "I have extracted the log message: ${log.extracted_label.msg}",
        "mime_type": "text/markdown"
    },
    "user_labels": {},
    "conditions": [
        {
            "display_name": "<condition_name>",
            "condition_matched_log": {
                "filter": "<filter_condition>",
                "label_extractors": {
                    "msg": "EXTRACT(jsonPayload.message)"
                }
            }
        }
    ],
    "alert_strategy": {
        "notification_rate_limit": {
            "period": "300s"
        },
        "auto_close": "604800s"
    },
    "combiner": "OR",
    "enabled": True,
    "notification_channels": [
        "<notification_channel>"
    ]
}
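For completeness, a rough (untested) sketch of creating that policy with the google-cloud-monitoring Python client; the project ID is a placeholder, and the client should coerce the dict into an AlertPolicy proto:
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

# "my-project" is a placeholder project ID.
policy = client.create_alert_policy(
    name="projects/my-project",
    alert_policy=policy_json,
)
print(policy.name)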

How to monitor all EC2 instances by CPU usage via CloudWatch

I am trying to set up monitoring of a large number of EC2 instances, and their number is constantly changing. I would like the owner of an instance to receive a notification when its CPU usage is low for a long time.
I could create a function that gets a list of all EC2 instances, fetches their CPU utilization, and sends messages to the owners. This option does not suit me, since I want to monitor state over a period of time, not just grab CPU utilization values at the moment the function runs. And in general, this method looks bad.
I can set up an alarm in CloudWatch, but only for one specific instance. This option is not suitable, since there are a lot of EC2 instances and their number varies.
I could create a dashboard with EC2 names and their CPU utilization, dynamically replenished. But I haven't figured out how to send notifications from it.
How can I solve my problem without third-party solutions?
Please see this AWS document https://aws.amazon.com/blogs/mt/use-tags-to-create-and-maintain-amazon-cloudwatch-alarms-for-amazon-ec2-instances-part-1/
You will find some existing Lambda functions there which will create a CloudWatch alarm automatically after an EC2 instance is created.
It looks a little bit tricky, but it's worth a look if you really want to make it automatic. But yes, a single CloudWatch alarm can't monitor multiple EC2 instances.
--
Another thing: the same sample Lambda function is available as an existing template, which will create the Lambda function directly so you can test it.
I have solved my problem, and it seems to me that this is one of the simplest options.
Using the get_metric_data method from the AWS SDK for Python (Boto3), I wrote a function:
import boto3
from statistics import mean
from datetime import timedelta, datetime

cloudwatch_client = boto3.client('cloudwatch')

# Average CPU utilization of one instance over the last 24 hours,
# in hourly (3600 s) data points.
response = cloudwatch_client.get_metric_data(
    MetricDataQueries=[
        {
            'Id': 'myrequest',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/EC2',
                    'MetricName': 'CPUUtilization',
                    'Dimensions': [
                        {
                            'Name': 'InstanceId',
                            'Value': 'i-123abc456def'
                        }
                    ]
                },
                'Period': 3600,
                'Stat': 'Average',
                'Unit': 'Percent'
            }
        },
    ],
    StartTime=datetime.now() - timedelta(days=1),
    EndTime=datetime.now()
)

for metric_data_result in response['MetricDataResults']:
    # Average the hourly averages into a single figure.
    list_avg = mean(metric_data_result['Values'])
    print(list_avg)
The output is the average CPU usage, as a percentage, over the specified time range.
I'm still learning, but I'll try to answer your questions if there are any. Thank you all!
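Not part of the original answer, but to cover a fleet whose size keeps changing, one option is to list the instance IDs first and build one query per instance (GetMetricData accepts up to 500 queries per call, so very large fleets would need chunking):
import boto3
from datetime import datetime, timedelta
from statistics import mean

ec2 = boto3.client('ec2')
cloudwatch_client = boto3.client('cloudwatch')

# Collect the IDs of all running instances.
instance_ids = []
for page in ec2.get_paginator('describe_instances').paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]):
    for reservation in page['Reservations']:
        instance_ids.extend(i['InstanceId'] for i in reservation['Instances'])

# One query per instance; Ids must be unique and start with a lowercase letter.
queries = [
    {
        'Id': f'q{n}',
        'Label': instance_id,
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/EC2',
                'MetricName': 'CPUUtilization',
                'Dimensions': [{'Name': 'InstanceId', 'Value': instance_id}],
            },
            'Period': 3600,
            'Stat': 'Average',
        },
    }
    for n, instance_id in enumerate(instance_ids)
]

response = cloudwatch_client.get_metric_data(
    MetricDataQueries=queries,
    StartTime=datetime.now() - timedelta(days=1),
    EndTime=datetime.now(),
)
for result in response['MetricDataResults']:
    if result['Values']:  # skip instances with no datapoints yet
        print(result['Label'], mean(result['Values']))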

Getting single time series from AWS CloudWatch metric maths SEARCH function

I'm attempting to create a CloudWatch alarm for if any instance in a group goes over x% of memory used, and have built the following metric maths query to do so:
SEARCH('{CWAgent,InstanceId} MetricName="mem_used_percent"', 'Maximum', 300)
This graphs fine; however, the CloudWatch console complains "The expression for an alarm must create exactly one time series.". I believe it does: the query above returns a single line graph result that is not multi-dimensional.
How can I get this data returned in the format CloudWatch requires to create an alarm? My alternative is to generate a new alarm per instance creation, however managing the creation and destruction of alarms that way seems more complex.
CloudWatch agent config on the instance for collecting the metric:
"metrics": {
    "append_dimensions": {
        "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
        "mem": {
            "measurement": [
                "used_percent"
            ]
        },
        "disk": {
            "measurement": [ "used_percent" ],
            "metrics_collection_interval": 60,
            "resources": [ "/" ]
        }
    }
}
Unfortunately it's not possible to create an alarm based on a search expression, so I don't think there's (currently) a way to do what you're after.
Per https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create-alarm-on-metric-math-expression.html:
You can't create an alarm based on the SEARCH expression. This is because search expressions return multiple time series, and an alarm based on a math expression can watch only one time series.
This appears to be the case even when you only get one result from a SEARCH expression.
I tried to combine this down into one time series using AVG, but this then appeared to lose the context of the metric and instead gave the error 'The expression for an alarm must include at least one metric'.
I'm currently handling a similar case with a pair of Lambda functions tied to CloudTrail events for RunInstances and TerminateInstances, that parse the event data for the instance ID and (among other things) create and delete individual CloudWatch alarms.
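A rough sketch of those Lambdas, condensed into one handler (the event shapes are abbreviated, and the alarm threshold and naming scheme are invented for the example):
import boto3

cloudwatch = boto3.client('cloudwatch')

def handler(event, context):
    # CloudTrail events delivered via EventBridge carry the API call
    # details under event['detail'].
    detail = event['detail']
    if detail['eventName'] == 'RunInstances':
        items = detail['responseElements']['instancesSet']['items']
    else:  # TerminateInstances
        items = detail['requestParameters']['instancesSet']['items']

    for item in items:
        instance_id = item['instanceId']
        alarm_name = f'mem-used-{instance_id}'
        if detail['eventName'] == 'RunInstances':
            cloudwatch.put_metric_alarm(
                AlarmName=alarm_name,
                Namespace='CWAgent',
                MetricName='mem_used_percent',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                Statistic='Maximum',
                Period=300,
                EvaluationPeriods=2,
                Threshold=90.0,
                ComparisonOperator='GreaterThanThreshold',
            )
        else:
            cloudwatch.delete_alarms(AlarmNames=[alarm_name])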
This example displays one line for each instance in the Region, showing the CPUUtilization metric from the AWS/EC2 namespace.
SEARCH(' {AWS/EC2,InstanceId} MetricName="CPUUtilization" ', 'Average', 300)
Changing InstanceId to InstanceType changes the graph to show one line for each instance type used in the Region. Data from all instances of each type is aggregated into one line for that instance type.
SEARCH(' {AWS/EC2,InstanceType} MetricName="CPUUtilization" ', 'Average', 300)
Removing the dimension name but keeping the namespace in the schema, as in the following example, results in a single line showing the aggregation of CPUUtilization metrics for all instances in the Region.
SEARCH(' {AWS/EC2} MetricName="CPUUtilization" ', 'Average', 300)
Refer to the CloudWatch search expression syntax documentation for a detailed explanation of search queries.
To select metrics, see the step-by-step instructions in the CloudWatch graphing documentation.

AWS CloudWatch metrics with ASG name changes

On AWS cloud watch we have one dashboard per environment.
Each dashboard has N plots.
Some plots use the Auto Scaling Group (ASG) name to find the data to plot.
Example of such a plot (from the widget's edit view, Source tab):
{
    "metrics": [
        [ "production", "mem_used_percent", "AutoScalingGroupName", "awseb-e-rv8y2igice-stack-AWSEBAutoScalingGroup-3T5YOK67T3FD" ]
    ],
    ... other params removed for brevity ...
    "title": "Used Memory (%)",
}
Every time we deploy, the ASG name changes (we deploy using CodeDeploy with Elastic Beanstalk configuration files from source).
I need to manually find the new name and update the N plots one by one.
The strange thing is that this happens for the production and staging environments, but not for integration.
All 3 should be copies of one another, apart from different settings in the Elastic Beanstalk configuration files, so I don't know what is going on.
In any case, what (I think) I need is one of:
option 1: prevent the ASG name change upon deploy
option 2: dynamically update the plots with the new name
option 3: plot the same data without using the ASG name (but the alternatives I can find are the EC2 instance ID, which also changes, and ImageId and InstanceType, which are shared by more than one EC2 instance, so won't work either)
My online-search-foo has turned up empty.
More Info:
I'm publishing these metrics with the CloudWatch agent, by adjusting the conf file, as per the docs here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html
Have a look at CloudWatch Search Expression Syntax. It allows you to use tokens for searching, e.g.:
SEARCH(' {AWS/CWAgent, AutoScalingGroupName} MetricName="mem_used_percent" rv8y2igice', 'Average', 300)
which would replace the metrics entry like so:
"metrics": [
[ { "expression": "SEARCH(' {AWS/CWAgent, AutoScalingGroupName} MetricName=\"mem_used_percent\" rv8y2igice', 'Average', 300)", "label": "Expression1", "id": "e1" } ]
]
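And to stop updating the N plots by hand (option 2 in the question), the same widget can be pushed programmatically; here's a sketch using boto3's put_dashboard, with the dashboard name, title, and region being placeholders:
import boto3
import json

cloudwatch = boto3.client('cloudwatch')

search = ("SEARCH(' {AWS/CWAgent, AutoScalingGroupName} "
          "MetricName=\"mem_used_percent\" rv8y2igice', 'Average', 300)")

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "properties": {
                "metrics": [
                    [ { "expression": search, "label": "Expression1", "id": "e1" } ]
                ],
                "region": "eu-west-1",
                "title": "Used Memory (%)"
            }
        }
    ]
}

# Creates the dashboard if it doesn't exist, overwrites it otherwise.
cloudwatch.put_dashboard(
    DashboardName='production-dashboard',
    DashboardBody=json.dumps(dashboard_body),
)
Because the expression matches on a stable token instead of the full ASG name, the dashboard survives deploys without manual edits.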
Alternatively, simply search for the desired result in the console; results that match the search appear.
To graph all of the metrics that match your search, choose Graph search,
then find the exact search expression that you want in the Details column on the Graphed metrics tab. For example:
SEARCH('{CWAgent,AutoScalingGroupName,ImageId,InstanceId,InstanceType} mem_used_percent', 'Average', 300)

AWS Glue Crawler updating existing catalog tables is (painfully) slow

I am continuously receiving and storing multiple feeds of uncompressed JSON objects, partitioned by day, in different locations of an Amazon S3 bucket (Hive-style: s3://bucket/object=<object>/year=<year>/month=<month>/day=<day>/object_001.json), and was planning to incrementally batch and load this data into a Parquet data lake using AWS Glue:
Crawlers would update manually created Glue tables, one per object feed, for schema and partition (new files) updates;
Glue ETL Jobs + Job bookmarking would then batch and map all new partitions per object feed to a Parquet location now and then.
This design pattern & architecture seemed to be quite a safe approach, as it is backed up by many AWS blog posts.
I have a crawler configured like so:
{
    "Name": "my-json-crawler",
    "Targets": {
        "CatalogTargets": [
            {
                "DatabaseName": "my-json-db",
                "Tables": [
                    "some-partitionned-json-in-s3-1",
                    "some-partitionned-json-in-s3-2",
                    ...
                ]
            }
        ]
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG"
    },
    "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
}
And each table was "manually" initialized like so:
{
    "Name": "some-partitionned-json-in-s3-1",
    "DatabaseName": "my-json-db",
    "StorageDescriptor": {
        "Columns": [],  # I'd like the crawler to figure these out on its first crawl
        "Location": "s3://bucket/object=some-partitionned-json-in-s3-1/"
    },
    "PartitionKeys": [
        {
            "Name": "year",
            "Type": "string"
        },
        {
            "Name": "month",
            "Type": "string"
        },
        {
            "Name": "day",
            "Type": "string"
        }
    ],
    "TableType": "EXTERNAL_TABLE"
}
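For reference, a sketch of that manual initialization with boto3 (same names as above):
import boto3

glue = boto3.client('glue')

glue.create_table(
    DatabaseName='my-json-db',
    TableInput={
        'Name': 'some-partitionned-json-in-s3-1',
        'StorageDescriptor': {
            'Columns': [],  # left empty for the crawler to fill in
            'Location': 's3://bucket/object=some-partitionned-json-in-s3-1/',
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'string'},
            {'Name': 'month', 'Type': 'string'},
            {'Name': 'day', 'Type': 'string'},
        ],
        'TableType': 'EXTERNAL_TABLE',
    },
)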
The first run of the crawler is, as expected, an hour-ish long, but it successfully figures out the table schema and existing partitions. Yet from that point onward, re-running the crawler takes exactly as long as the first crawl, if not longer; which led me to believe that the crawler is not only crawling for new files / partitions, but re-crawling the entire S3 locations each time.
Note that the delta of new files between two crawls is very small (only a few new files are to be expected each time).
The AWS documentation suggests running multiple crawlers, but I am not convinced that this would solve my problem in the long run. I also considered updating the crawler's exclude patterns after each run, but then I would see too few advantages to using crawlers over manually updating table partitions through some Lambda boto3 magic.
Am I missing something there? Maybe an option I have misunderstood regarding crawlers updating existing data catalogs rather than crawling data stores directly?
Any suggestions to improve my data cataloging? Indexing these JSON files in Glue tables is only necessary to me because I want my Glue job to use bookmarking.
Thanks!
AWS Glue crawlers now support Amazon S3 event notifications natively, which solves this exact problem.
See the blog post.
Still getting some hits on this unanswered question of mine, so I wanted to share a solution I found adequate at the time: I ended up not using crawlers at all to incrementally update my Glue tables.
Using S3 Events, S3 API calls via CloudTrail, or S3 EventBridge notifications (pick one), I ended up writing a Lambda which runs an ALTER TABLE ADD PARTITION DDL query on Athena, updating an already existing Glue table with the newly created partition, based on the S3 key prefix. This is a pretty straightforward and low-code approach to maintaining Glue tables, in my opinion; the only downside is handling service throttling (both Lambda and Athena) and failing queries, to avoid any loss of data in the process.
This solution scales up pretty well though, as the number of parallel DDL queries per account is a soft-limit quota that can be increased as your need to update more and more tables grows; and it works well for non-time-critical workflows.
It works even better if you limit S3 writes to your Glue tables' S3 partitions (one file per Glue table partition is ideal in this particular implementation) by batching your data, for example with a Kinesis delivery stream.
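A minimal sketch of such a Lambda, assuming an S3 event notification trigger and the bucket layout from the question (the database name and query output location are placeholders):
import urllib.parse

import boto3

athena = boto3.client('athena')

def handler(event, context):
    # S3 event notifications carry the key of the newly created object,
    # e.g. "object=some-feed/year=2021/month=01/day=15/object_001.json".
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    table, year, month, day = (part.split('=')[1] for part in key.split('/')[:4])

    # Idempotent DDL: re-adding an existing partition is a no-op.
    athena.start_query_execution(
        QueryString=(
            f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
            f"PARTITION (year='{year}', month='{month}', day='{day}')"
        ),
        QueryExecutionContext={'Database': 'my-json-db'},
        ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'},
    )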