AWS CloudWatch metrics with ASG name changes - amazon-web-services

In AWS CloudWatch we have one dashboard per environment.
Each dashboard has N plots.
Some plots use the Auto Scaling Group (ASG) name to find the data to plot.
Example of such a plot (Edit, Source tab):
{
    "metrics": [
        [ "production", "mem_used_percent", "AutoScalingGroupName", "awseb-e-rv8y2igice-stack-AWSEBAutoScalingGroup-3T5YOK67T3FD" ]
    ],
    ... other params removed for brevity ...
    "title": "Used Memory (%)",
}
Every time we deploy, the ASG name changes (we deploy using CodeDeploy with Elastic Beanstalk (EB) configuration files from source).
I need to manually find the new name and update the N plots one by one.
The strange thing is that this happens for the production and staging environments, but not for integration.
All 3 should be copies of one another, with different settings from the EB configuration files, so I don't know what is going on.
In any case, what (I think) I need is one of:
option 1: prevent the ASG name change upon deploy
option 2: dynamically update the plots with the new name
option 3: plot the same data without using the ASG name (but the alternatives I find are the EC2 instance ID, which also changes, and ImageId and InstanceType, which are shared by more than one EC2 instance, so they won't work either)
My online-search-fu has turned up empty.
More Info:
I'm publishing these metrics with the CloudWatch agent, by adjusting the config file, as per the docs here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html

Have a look at CloudWatch Search Expression Syntax. It allows you to use tokens for searching, e.g.:
SEARCH(' {AWS/CWAgent, AutoScalingGroupName} MetricName="mem_used_percent" rv8y2igice', 'Average', 300)
which would replace the entry for metrics like so:
"metrics": [
[ { "expression": "SEARCH(' {AWS/CWAgent, AutoScalingGroupName} MetricName=\"mem_used_percent\" rv8y2igice', 'Average', 300)", "label": "Expression1", "id": "e1" } ]
]

Simply search for the desired metric in the console; results that match the search appear.
To graph all of the metrics that match your search, choose Graph search,
and find the exact search expression that you want under Details on the Graphed metrics tab.
SEARCH('{CWAgent,AutoScalingGroupName,ImageId,InstanceId,InstanceType} mem_used_percent', 'Average', 300)

Related

Getting single time series from AWS CloudWatch metric maths SEARCH function

I'm attempting to create a CloudWatch alarm that fires if any instance in a group goes over x% of memory used, and have built the following metric math query to do so:
SEARCH('{CWAgent,InstanceId} MetricName="mem_used_percent"', 'Maximum', 300)
This graphs fine, however the CloudWatch console complains "The expression for an alarm must create exactly one time series." I believe that is already the case; the query above should (and does) return a single time series that is not multi-dimensional.
How can I get this data returned in the format CloudWatch requires to create an alarm? My alternative is to generate a new alarm per instance on creation, however that seems more complex to manage, given the creation and destruction of alarms.
CloudWatch agent config on the instance for collecting the metric:
"metrics":{
"append_dimensions": {
"InstanceId": "${aws:InstanceId}"
},
"metrics_collected":{
"mem": {
"measurement": [
"used_percent"
]
},
"disk": {
"measurement": [ "used_percent" ],
"metrics_collection_interval": 60,
"resources": [ "/" ]
}
}
Unfortunately it's not possible to create an alarm based on a search expression, so I don't think there's (currently) a way to do what you're after.
Per https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Create-alarm-on-metric-math-expression.html:
You can't create an alarm based on the SEARCH expression. This is because search expressions return multiple time series, and an alarm based on a math expression can watch only one time series.
This appears to be the case even when you only get one result from a SEARCH expression.
I tried to combine this down into one time series using AVG, but this then appeared to lose the context of the metric and instead gave the error 'The expression for an alarm must include at least one metric'.
I'm currently handling a similar case with a pair of Lambda functions tied to CloudTrail events for RunInstances and TerminateInstances, that parse the event data for the instance ID and (among other things) create and delete individual CloudWatch alarms.
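For illustration, a minimal sketch of what that pair of handlers could look like, assuming the Lambda is triggered by an EventBridge (CloudWatch Events) rule matching the RunInstances and TerminateInstances API calls, and that the agent publishes mem_used_percent under the CWAgent namespace; the event field paths, the alarm naming scheme, and the 80% threshold are assumptions, not part of the setup described above:

import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    detail = event["detail"]
    if detail["eventName"] == "RunInstances":
        # Newly launched instances are listed in responseElements.
        for item in detail["responseElements"]["instancesSet"]["items"]:
            instance_id = item["instanceId"]
            cloudwatch.put_metric_alarm(
                AlarmName=f"mem-used-{instance_id}",   # hypothetical naming scheme
                Namespace="CWAgent",
                MetricName="mem_used_percent",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                Statistic="Maximum",
                Period=300,
                EvaluationPeriods=1,
                Threshold=80.0,                        # assumed x% threshold
                ComparisonOperator="GreaterThanThreshold",
                TreatMissingData="notBreaching",
            )
    elif detail["eventName"] == "TerminateInstances":
        # Terminated instances are listed in requestParameters.
        items = detail["requestParameters"]["instancesSet"]["items"]
        cloudwatch.delete_alarms(
            AlarmNames=[f"mem-used-{i['instanceId']}" for i in items]
        )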
This example displays one line for each instance in the Region, showing the CPUUtilization metric from the AWS/EC2 namespace.
SEARCH(' {AWS/EC2,InstanceId} MetricName="CPUUtilization" ', 'Average', 300)
Changing InstanceId to InstanceType changes the graph to show one line for each instance type used in the Region. Data from all instances of each type is aggregated into one line for that instance type.
SEARCH(' {AWS/EC2,InstanceType} MetricName="CPUUtilization" ', 'Average', 300)
Removing the dimension name but keeping the namespace in the schema, as in the following example, results in a single line showing the aggregation of CPUUtilization metrics for all instances in the Region.
SEARCH(' {AWS/EC2} MetricName="CPUUtilization" ', 'Average', 300)
Refer to this for a detailed explanation of the search query syntax.
To select metrics, refer to this link for a step-by-step explanation.

Does AWS Sagemaker PySparkProcessor manage autoscaling?

I'm using SageMaker to do preprocessing and generate training data, and I'm following the SageMaker API documentation here, but I don't currently see any way to specify autoscaling within the EMR cluster. What should I include in the configuration argument that I pass to my spark_processor's run() method? What shouldn't I include?
I'm aware of this resource, but it doesn't seem comprehensive.
Below is my code; it is very much a work in progress, but I would like to know if someone could provide me with, or point me to, a resource that shows:
Whether this PySparkProcessor object will manage autoscaling automatically, or whether I should put autoscaling config within the configuration passed to run().
An example of the full config that I can pass to the configuration variable.
Here's what I have so far for the configuration.
SPARK_CONFIG = {
    "Configurations": [
        {
            "Classification": "spark-env",
            "Configurations": [ {"Classification": "export"} ],
        }
    ]
}
spark_processor = PySparkProcessor(
tags=TAGS,
role=IAM_ROLE,
instance_count=2,
py_version="py37",
volume_size_in_gb=30,
container_version="1",
framework_version="3.0",
network_config=sm_network,
max_runtime_in_seconds=1800,
instance_type="ml.m5.2xlarge",
base_job_name=EMR_CLUSTER_NAME,
sagemaker_session=sagemaker_session,
)
spark_processor.run(
configuration=SPARK_CONFIG,
submit_app=LOCAL_PYSPARK_SCRIPT_DIR,
spark_event_logs_s3_uri=f"s3://{BUCKET_NAME}/{S3_PYSPARK_LOG_PREFIX}",
)
I'm used to interacting with EMR more directly via Python for these types of tasks. Doing that allows me to specify the entire EMR cluster config at once (including applications, autoscaling, and the EMR default and autoscaling roles) and then add the steps to the cluster once it's created; however, much of this config seems to be abstracted away, and I don't know what remains or needs to be specified, specifically regarding the following config variables: AutoScalingRole, Applications, VisibleToAllUsers, JobFlowRole/ServiceRole, etc.
I found the answer in the SageMaker Python SDK source on GitHub.
_valid_configuration_keys = ["Classification", "Properties", "Configurations"]
_valid_configuration_classifications = [
"core-site",
"hadoop-env",
"hadoop-log4j",
"hive-env",
"hive-log4j",
"hive-exec-log4j",
"hive-site",
"spark-defaults",
"spark-env",
"spark-log4j",
"spark-hive-site",
"spark-metrics",
"yarn-env",
"yarn-site",
"export",
]
Thus, specifying autoscaling, visibility, and some other cluster-level configurations does not appear to be supported. However, the applications installed on cluster start-up seem to depend on the classifications in the above list.
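For example, a configuration that stays within the supported keys and classifications, tuning Spark defaults rather than cluster-level settings, might look like the following sketch (the property values are illustrative placeholders, not recommendations):

SPARK_CONFIG = {
    "Configurations": [
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.executor.memory": "8g",   # placeholder values
                "spark.executor.cores": "4",
            },
        }
    ]
}

This can then be passed as the configuration argument to run(), as in the question above.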

How, in Terraform with the Google platform provider, do I get instance information from an instance created with google_compute_region_instance_group?

I am creating a Terraform file so I can set up some VMs in GCP to build my own Kubernetes platform (yes, Google has its own engine, but I want to use some custom items). I have been able to create the .tf file to build the whole stack, just like the other setups in the Kubespray project, something like what you do to terraform VMs on AWS.
The last part I need to automate is the creation of the host file for Ansible.
I create the masters and workers using a resource called google_compute_region_instance_group, which places each instance in a different AZ within GCP. Now I need to get the hostname and IP given to these instances. The problem I have is that they are dynamically created resources, so to pull this information out I use a data source to grab the info.
Here is what I have now.
data.google_compute_region_instance_group.data_masters.instances
[
{
"instance" = "https://www.googleapis.com/compute/v1/projects/appportablityphase2/zones/us-east1-c/instances/k8-masters-4r2f"
"named_ports" = []
"status" = "RUNNING"
},
{
"instance" = "https://www.googleapis.com/compute/v1/projects/appportablityphase2/zones/us-east1-d/instances/k8-masters-qh64"
"named_ports" = []
"status" = "RUNNING"
},
{
"instance" = "https://www.googleapis.com/compute/v1/projects/appportablityphase2/zones/us-east1-b/instances/k8-masters-w9c8"
"named_ports" = []
"status" = "RUNNING"
},
]
As you can see, the output is a mix of a list and maps. I am able to get just the instance self-link URL with this line:
lookup(data.google_compute_region_instance_group.data_masters.instances[0], "instance")
https://www.googleapis.com/compute/v1/projects/appportablityphase2/zones/us-east1-c/instances/k8-masters-4r2f
I can then split that and get the instance name. This is the hard part that I cannot figure out with Terraform: in the line above I have to use [0] to access the instance information, and I then need to iterate through all of the instances, of which there may be 3 or more.
I cannot find a way to do this with this data source type. I have tried count.index, but it is only supported in a resource, not a data source. I have also tried splat syntax, and that has not worked.
I don't think generating the inventory manually is the right approach, although it is possible.
You could give GCP Dynamic Inventory a try; it generates an inventory from running instances based on their network tags.
For instance, if instance A has the tag foo, and instance B has the tags foo and bar, the generated inventory will be:
[tag_foo]
A
B
[tag_bar]
B
Script is available at this address: https://github.com/ansible/ansible/blob/devel/contrib/inventory/gce.py
Configuration file here: https://github.com/ansible/ansible/blob/devel/contrib/inventory/gce.ini
And usage is ansible-playbook -i gce.py site.yml

Set 'maxActiveInstances' error

I am using AWS Data Pipeline to export a DynamoDB table, but when I activate it I get an error:
Web service limit exceeded: Exceeded number of concurrent executions. Please set the field 'maxActiveInstances' to a higher value in your pipeline or wait for the currenly running executions to complete before trying again (Service: DataPipeline; Status Code: 400; Error Code: InvalidRequestException; Request ID: efbf9847-49fb-11e8-abef-1da37c3550b5)
How do I set this maxActiveInstances property using the AWS UI?
You can set it as a property on your Ec2Resource[1] (or EmrActivity[2]) object. Using the UI, click Edit Pipeline, then click Resources on the right-hand side of the screen (it's a collapsible menu). There should be an Ec2Resource object with a drop-down called "Add an additional field", where you should see max active instances.
[1]https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html
[2] https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emractivity.html
We ran into this too. For an on-demand pipeline, it looks like after a certain number of retries, you have to give it time to finish terminating the provisioned resources before you will be allowed to try again.
Solution: Patience.
With an on-demand pipeline you can specify it in the 'Default object', like this:
{
    "objects": [
        {
            "failureAndRerunMode": "CASCADE",
            "scheduleType": "ONDEMAND",
            "name": "Default",
            "id": "Default",
            "maxActiveInstances": "5"
        },
        ...
I couldn't add it in Architect; I had to create another pipeline from the JSON. But once that was done I could edit it in Architect (under the 'Others' section).

CloudWatch does not aggregate across dimensions for your custom metrics

Reading the docs I saw this statement:
CloudWatch does not aggregate across dimensions for your custom metrics
That seems like a HUGE limitation, right? It would make custom metrics all but useless, in my estimation, so I want to confirm I'm understanding this.
For example, say I had a custom metric I shipped from multiple servers. I want to see it per server, but I also want to see it across all servers together. Would I have no way of aggregating that across all the servers? Or would I be forced to create two custom metrics, one per server and one for all servers, and double-post metrics from each server to both the per-server metric AND the one aggregating all of them?
The docs are correct, CloudWatch won't aggregate across dimensions for your custom metrics (it will do so for some metrics published by other services, like EC2).
This feature may seem useful and clear for your use-case but it's not clear how such aggregation would behave in a general case. CloudWatch allows for up to 10 dimensions so aggregating for all combinations of those may result in a lot of useless metrics, for all of which you would be billed. People may use dimensions to split their metrics between Test and Prod stacks for example, which are completely separate and aggregating those would not make sense.
CloudWatch treats a metric name plus a full set of dimensions as a unique metric identifier. In your case, this means that you need to publish your observations separately to each metric you want them to contribute to.
Let's say you have a metric named Latency, and you're putting a hostname in a dimension called Server. If you have three servers this will create three metrics:
Latency, Server=server1
Latency, Server=server2
Latency, Server=server3
So the approach you mentioned in your question will work. If you also want a metric showing the data across all servers, each server would need to publish to a separate metric, which is best done by using a new common value for the Server dimension, something like AllServers. This will result in you having 4 metrics, like this:
Latency, Server=server1 <- only server1 data
Latency, Server=server2 <- only server2 data
Latency, Server=server3 <- only server3 data
Latency, Server=AllServers <- data from all 3 servers
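As a rough sketch of that double-publish from each server (assuming boto3 and an illustrative custom namespace called MyApp):

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_latency(server_name, latency_ms):
    # Each observation is published twice: once under the per-server
    # dimension value and once under the shared "AllServers" value.
    cloudwatch.put_metric_data(
        Namespace="MyApp",  # assumed custom namespace
        MetricData=[
            {
                "MetricName": "Latency",
                "Dimensions": [{"Name": "Server", "Value": server_name}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            },
            {
                "MetricName": "Latency",
                "Dimensions": [{"Name": "Server", "Value": "AllServers"}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            },
        ],
    )

publish_latency("server1", 123.0)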
Update 2019-12-17
Using metric math SEARCH function: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
This will give you per-server latency and latency across all servers without publishing a separate AllServers metric, and if a new server shows up, it will automatically be picked up by the expression:
Graph source:
{
"metrics": [
[ { "expression": "SEARCH('{SomeNamespace,Server} MetricName=\"Latency\"', 'Average', 60)", "id": "e1", "region": "eu-west-1" } ],
[ { "expression": "AVG(e1)", "id": "e2", "region": "eu-west-1", "label": "All servers", "yAxis": "right" } ]
],
"view": "timeSeries",
"stacked": false,
"region": "eu-west-1"
}
The result is a graph with one line per server plus the "All servers" average on the right axis.
Downsides of this approach:
Expressions are limited to 100 metrics.
Overall aggregation is limited to available metric math functions, which means percentiles are not available as of 2019-12-17.
Using Contributor Insights (open preview as of 2019-12-17): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html
If you publish your logs to CloudWatch Logs in JSON or Common Log Format (CLF), you can create rules that keep track of top contributors. For example, a rule that keeps track of servers with latencies over 400 ms would look something like this:
{
"Schema": {
"Name": "CloudWatchLogRule",
"Version": 1
},
"AggregateOn": "Count",
"Contribution": {
"Filters": [
{
"Match": "$.Latency",
"GreaterThan": 400
}
],
"Keys": [
"$.Server"
],
"ValueOf": "$.Latency"
},
"LogFormat": "JSON",
"LogGroupNames": [
"/aws/lambda/emf-test"
]
}
The result is a list of the servers with the most data points over 400 ms.
Bringing it all together with CloudWatch Embedded Metric Format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html
If you publish your data in CloudWatch Embedded Format you can:
Easily configure dimensions, so you can have per-server metrics and an overall metric if you want.
Use CloudWatch Logs Insights to query and visualise your logs.
Use Contributor Insights to get top contributors.
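As a rough sketch, an Embedded Metric Format record emitted from Python could look like this (structure per the EMF docs linked above; the MyApp namespace, Server dimension, and values are illustrative):

import json
import time

def emf_record(server, latency_ms):
    # One log line: CloudWatch extracts a Latency metric with a Server
    # dimension, while the raw JSON stays queryable in Logs Insights and
    # usable by Contributor Insights rules like the one above.
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "MyApp",
                    "Dimensions": [["Server"]],
                    "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
                }
            ],
        },
        "Server": server,
        "Latency": latency_ms,
    })

print(emf_record("server1", 123.0))  # e.g. written to stdout from a Lambda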