Best way to implement AWS ECS healthcheck

I'm implementing ECS health-check functionality and I'm thinking about the best way to do it.
So far I have found several solutions:
Using AWS ECS metrics and dimensions, and checking whether some metric has an insufficient value.
Using a CloudWatch alarm:
ECSHealthAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Alarm for ECS StatusCheckFailed Metric
    ComparisonOperator: GreaterThanOrEqualToThreshold
    EvaluationPeriods: 2
    Statistic: Maximum
    MetricName: StatusCheckFailed
    Namespace: AWS/ECS
    Period: 30
    Threshold: 1.0
    AlarmActions:
      - !Ref AlarmTopic
    InsufficientDataActions:
      - !Ref AlarmTopic
    Dimensions:
      - Name: ClusterName
        Value: !Ref ClusterName
      - Name: ServiceName
        Value: !GetAtt service.Name
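One thing worth double-checking in this first option: as far as I know, StatusCheckFailed is published per instance in the AWS/EC2 namespace (dimension InstanceId), not in AWS/ECS, so an AWS/ECS alarm with ClusterName/ServiceName dimensions may never receive data and only ever fire the InsufficientDataActions. A minimal sketch of a service-level alternative, assuming Container Insights is enabled on the cluster (it publishes RunningTaskCount under the ECS/ContainerInsights namespace) and reusing the ClusterName, service and AlarmTopic names from the snippet above:

```yaml
# Sketch only: alarm when the ECS service has fewer than one running task.
# Assumes Container Insights is enabled on the cluster, e.g. with
#   ClusterSettings: [{ Name: containerInsights, Value: enabled }]
# on the AWS::ECS::Cluster resource.
ServiceRunningTaskAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Alarm when the ECS service has no running tasks
    Namespace: ECS/ContainerInsights
    MetricName: RunningTaskCount
    Dimensions:
      - Name: ClusterName
        Value: !Ref ClusterName
      - Name: ServiceName
        Value: !GetAtt service.Name
    Statistic: Minimum
    Period: 60
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: LessThanThreshold
    # Judgment call: treat "no datapoints" as a failure rather than as OK.
    TreatMissingData: breaching
    AlarmActions:
      - !Ref AlarmTopic
```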
Using a CloudWatch Events rule:
EventRule:
  Type: "AWS::Events::Rule"
  Properties:
    Name: CloudWatchRMExtensionECSStoppedRule
    Description: "Notify when ECS container stopped"
    EventPattern:
      source: ["aws.ecs"]
      detail-type: ["ECS Task State Change", "ECS Container Instance State Change"]
      detail:
        clusterArn: [ 'clusterArn' ]
        lastStatus: [ "STOPPED" ]
        stoppedReason: [ "Essential container in task exited" ]
        group: [ 'service-group' ]
    State: "ENABLED"
    Targets:
      - Arn: !Ref ECSAlarmSNSTopic
        Id: "PublishAlarmTopic"
        InputTransformer:
          InputPathsMap:
            stopped-reason: "$.detail.stoppedReason"
          InputTemplate: '"This micro-service has been stopped with the following reason: <stopped-reason>"'
Could you please advise whether these approaches are correct, or whether there is a more efficient way to do this? Thanks for any help!

I can't comment yet, so here are some thoughts as an answer. I'm a bit unclear on your requirement: whether you want alerts from EC2 server-level status checks or at the level of each ECS service's tasks. I'm listing all the options I can think of.
I would run the ECS cluster's EC2 instances in an Auto Scaling group and set up an SNS notification, based on the ASG's CloudWatch metrics, for when instances are added or removed.
https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html
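A minimal CloudFormation sketch of that setup. Instead of a separate alarm, it uses the group's own NotificationConfigurations, which sends SNS messages on launch and terminate events. The resource names, Ec2LaunchTemplate, and SubnetIds (assumed to be a list-of-subnets parameter) are hypothetical:

```yaml
# Sketch: SNS notifications when the Auto Scaling group behind the ECS cluster
# launches or terminates instances. All names here are hypothetical.
EcsInstancesAsg:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 1
    MaxSize: 3
    VPCZoneIdentifier: !Ref SubnetIds   # assumed List<AWS::EC2::Subnet::Id> parameter
    LaunchTemplate:
      LaunchTemplateId: !Ref Ec2LaunchTemplate
      Version: !GetAtt Ec2LaunchTemplate.LatestVersionNumber
    # Use EC2 status checks to decide whether an instance is healthy.
    HealthCheckType: EC2
    HealthCheckGracePeriod: 300
    NotificationConfigurations:
      - TopicARN: !Ref AsgNotificationTopic
        NotificationTypes:
          - autoscaling:EC2_INSTANCE_LAUNCH
          - autoscaling:EC2_INSTANCE_LAUNCH_ERROR
          - autoscaling:EC2_INSTANCE_TERMINATE
          - autoscaling:EC2_INSTANCE_TERMINATE_ERROR
AsgNotificationTopic:
  Type: AWS::SNS::Topic
```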
You can also send the ECS agent (ecs-agent) Docker container logs to CloudWatch Logs and get SNS notifications based on errors or filtered events.
You can also subscribe to the ECS event stream through CloudWatch Events to get notified when each service's tasks are started or stopped. References:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch_event_stream.html
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_cwet.html
Example events are shown here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_cwe_events.html
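When the rule targets an SNS topic, as in the question's EventRule, the topic's access policy also has to allow CloudWatch Events (EventBridge) to publish to it; otherwise the rule matches but delivery fails. A minimal sketch, reusing the question's ECSAlarmSNSTopic name:

```yaml
# Sketch: let CloudWatch Events / EventBridge publish to the SNS topic that
# the ECS event rule uses as its target.
ECSAlarmSNSTopicPolicy:
  Type: AWS::SNS::TopicPolicy
  Properties:
    Topics:
      - !Ref ECSAlarmSNSTopic
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Sid: AllowEventsToPublish
          Effect: Allow
          Principal:
            Service: events.amazonaws.com
          Action: sns:Publish
          Resource: !Ref ECSAlarmSNSTopic
```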
Reference for setting an alarm based on log events:
https://medium.com/@martatatiana/insufficient-data-cloudwatch-alarm-based-on-custom-metric-filter-4e41c1f82050
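A minimal sketch of the metric-filter-plus-alarm approach. The log group name, filter pattern, metric names and AlarmTopic are hypothetical. DefaultValue makes the filter publish 0 when log events arrive without a match, and TreatMissingData covers periods with no log events at all, which helps avoid the INSUFFICIENT_DATA state named in that article's title:

```yaml
# Sketch: count ERROR lines in the ECS agent logs and alarm on them.
# /ecs/ecs-agent, Custom/ECSAgent and AgentErrorCount are hypothetical names.
EcsAgentErrorFilter:
  Type: AWS::Logs::MetricFilter
  Properties:
    LogGroupName: /ecs/ecs-agent
    FilterPattern: ERROR
    MetricTransformations:
      - MetricNamespace: Custom/ECSAgent
        MetricName: AgentErrorCount
        MetricValue: "1"
        DefaultValue: 0
EcsAgentErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Errors found in the ECS agent logs
    Namespace: Custom/ECSAgent
    MetricName: AgentErrorCount
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
    # No data means no log lines arrived, so do not treat it as an error.
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref AlarmTopic
```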
Add a health check to the containers of each ECS service so that unhealthy containers get replaced.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_healthcheck
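A minimal sketch of a container health check in CloudFormation, assuming the EC2 launch type (hence container-level Memory), an image that contains curl, and a hypothetical /health endpoint on port 8080. For tasks run by a service, an essential container that becomes UNHEALTHY causes the scheduler to replace the whole task rather than restart the container in place:

```yaml
# Sketch: container health check in a task definition.
# Family, image, port and /health path are hypothetical.
AppTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: my-app
    ContainerDefinitions:
      - Name: app
        Image: my-registry/my-app:latest
        Essential: true
        Memory: 512
        HealthCheck:
          # Exit 0 = healthy, exit 1 = unhealthy (requires curl in the image).
          Command: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
          Interval: 30     # seconds between checks
          Timeout: 5       # seconds before a single check counts as failed
          Retries: 3       # consecutive failures before the container is UNHEALTHY
          StartPeriod: 60  # grace period after start during which failures are not counted
```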
Please do let me know your thoughts as well :).

Related

CloudWatch Alarm in EC2 template

I am setting up an AWS EC2 template based on a custom image for launching instances for a certain purpose. These instances also need CloudWatch alarms that monitor their activity and perform some action based on it (e.g. stop the instance if it has been inactive for 30 minutes).
Is there any way I can include such alarms in the EC2 template? I would like to avoid having to add the alarms manually to each instance after creation. I couldn't find this as an option anywhere in the template creation dialog.
From the management console: I could not find a straightforward option.
Using EC2 tags, Lambda and other services: this might be possible (check the link).
CloudFormation: you can write a CloudFormation template that creates the EC2 instance and adds an alarm to it, and keep extending it from there.
Once the template exists, this makes things easier, as you won't need to select various console options every time you launch a new instance and add an alarm.
This template asks for the instance type, creates a CPU alarm for the instance and publishes alarm notifications to an SNS topic.
Verify the AMI ID and Availability Zone if you are working in a different region.
Parameters:
  InstanceType:
    Description: EC2 instance type
    Type: String
    Default: t2.small
    AllowedValues:
      - t1.micro
      - t2.nano
      - t2.micro
      - t2.small
    ConstraintDescription: It must be a valid EC2 instance type.
Resources:
  MyInstance1:
    Type: AWS::EC2::Instance
    Properties:
      AvailabilityZone: us-east-1a
      ImageId: ami-05912b6333beaa478
      InstanceType: !Ref InstanceType
      KeyName: KP-EC2-Lambda
      SecurityGroups:
        - launch-wizard-2
  CPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: CPU alarm for my instance
      AlarmActions:
        - Ref: "MyTopic1"
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '60'
      EvaluationPeriods: '3'
      Threshold: '90'
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: InstanceId
          Value:
            Ref: "MyInstance1"
  MyTopic1:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: MyTopic1
      Subscription:
        - Endpoint: "xyz@xyz.com"
          Protocol: "email"
      TopicName: MyTopic1
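That template only notifies on high CPU. For the original goal of stopping an idle instance, a minimal sketch of one more resource for the Resources section, assuming low average CPU is an acceptable proxy for "inactive" (the 5% threshold is an assumption to tune). It uses CloudWatch's built-in EC2 stop action alongside the existing topic, and 6 periods of 5 minutes cover the 30 minutes from the question:

```yaml
# Sketch: goes under Resources, next to CPUAlarm.
# Stops MyInstance1 after 30 minutes of average CPU below 5% (assumed threshold).
IdleStopAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Stop the instance when it has been idle for 30 minutes
    Namespace: AWS/EC2
    MetricName: CPUUtilization
    Statistic: Average
    Period: 300
    EvaluationPeriods: 6
    Threshold: 5
    ComparisonOperator: LessThanThreshold
    Dimensions:
      - Name: InstanceId
        Value: !Ref MyInstance1
    AlarmActions:
      # Built-in EC2 alarm action that stops the instance.
      - !Sub arn:aws:automate:${AWS::Region}:ec2:stop
      - !Ref MyTopic1
```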

AWS Capacity Provider for ECS cluster not triggering scale-in event

I have added a Capacity Provider to an ECS cluster. While scale-out events work as expected in response to changes in the CapacityProviderReservation metric, scale-in events do not.
In my case, the TargetCapacity property is set to 90, but CloudWatch shows the average of the CapacityProviderReservation metric currently sitting at 50%. This has been the case for the last 16 hours.
According to AWS's own documentation, scale-in events occur:
When using dynamic scaling policies and the size of the group decreases as a result of changes in a metric's value
So it seems the Capacity Provider is not changing the desired size of the ASG as expected.
Am I missing something here, or do capacity providers tied to ASGs simply not work both ways?
ASG and Capacity Provider resources in CloudFormation:
Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: !Sub ${ResourceNamePrefix}-asg
      VPCZoneIdentifier:
        - !Ref PrivateSubnetAId
      LaunchTemplate:
        LaunchTemplateId: !Ref Ec2LaunchTemplate
        Version: !GetAtt Ec2LaunchTemplate.LatestVersionNumber
      MinSize: 0
      MaxSize: 3
      DesiredCapacity: 1
  EcsCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      Name: !Sub ${ResourceNamePrefix}-ecs-capacity-provider
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 90
        ManagedTerminationProtection: DISABLED
[Screenshot: dynamic scaling policy for the ASG]
[Screenshot: current status of the CapacityProviderReservation metric]
The CapacityProviderReservation metric has been at 50% for well over 12 hours.
[Screenshot: current status of the Capacity Provider]
As you can see, the desired size is still 2, although I expected it to drop back to 1.
Update
After deleting and recreating the cluster, I noticed that the Capacity Provider changes the DesiredCapacity to 2 instantly, even though there are no tasks running.
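A possible reading of the numbers, offered as a hedged sketch of the arithmetic rather than a confirmed diagnosis. AWS documents the metric as CapacityProviderReservation = M / N × 100, where M is the number of instances ECS calculates are needed to run all tasks and N is the number of instances currently running in the ASG. Assuming the running task needs exactly one instance (M = 1):

$$
\begin{aligned}
\text{CapacityProviderReservation} &= \frac{M}{N} \times 100 \\
M = 1,\; N = 2 &\;\Rightarrow\; \frac{1}{2} \times 100 = 50 \\
M = 1,\; N = 1 &\;\Rightarrow\; \frac{1}{1} \times 100 = 100 > 90 = \text{TargetCapacity}
\end{aligned}
$$

So while one instance is needed, no capacity below 2 keeps the metric at or under a TargetCapacity of 90; with TargetCapacity: 100, N = 1 would sit exactly on target.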

ECS deployment changes target group - how to maintain alarms that depend on target group?

I have a workload running as an ECS service attached to a target group. I also have an alarm monitoring that target group's healthy host count (HealthyHostCount). I'd like to implement blue/green deployments using two target groups, but because the alarm monitors a specific target group, it seems it would need to be updated on every deployment, separately from the deployment itself.
This seems fragile (e.g. a script that updates the alarm's target group after each deployment could fail), and I suspect there is a better way, but I can't see it. Is there an obviously easier solution?
Instead of monitoring that you have the desired number of healthy targets, monitor that you have no unhealthy ones.
Your ECS service takes care of maintaining the desired count, and you might want to scale the service anyway, so I think UnHealthyHostCount is the better metric to alarm on.
Create one alarm for each target group as below.
These won't trigger during normal ECS blue/green deployments, only when a registered target is failing health checks. You need to tune the health check settings on the target groups and the HealthCheckGracePeriodSeconds setting of the ECS service accordingly.
# Topic, AlbFullName and Blue/GreenTargetGroup are placeholders for the SNS topic ARN,
# the ALB's LoadBalancerFullName and each target group's TargetGroupFullName.
BlueUnHealthyHostCountAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Alarms when there is any unhealthy target'
    Namespace: 'AWS/ApplicationELB'
    MetricName: UnHealthyHostCount
    Statistic: Maximum
    Period: 60
    EvaluationPeriods: 2
    ComparisonOperator: GreaterThanThreshold
    # Threshold 0 so that a single unhealthy target triggers the alarm
    Threshold: 0
    AlarmActions:
      - Topic
    Dimensions:
      - Name: LoadBalancer
        Value: AlbFullName
      - Name: TargetGroup
        Value: BlueTargetGroup
GreenUnHealthyHostCountAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Alarms when there is any unhealthy target'
    Namespace: 'AWS/ApplicationELB'
    MetricName: UnHealthyHostCount
    Statistic: Maximum
    Period: 60
    EvaluationPeriods: 2
    ComparisonOperator: GreaterThanThreshold
    # Threshold 0 so that a single unhealthy target triggers the alarm
    Threshold: 0
    AlarmActions:
      - Topic
    Dimensions:
      - Name: LoadBalancer
        Value: AlbFullName
      - Name: TargetGroup
        Value: GreenTargetGroup
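If you would rather have a single alarm, CloudWatch metric math can add both target groups' UnHealthyHostCount together, so nothing needs repointing when traffic moves between blue and green. A minimal sketch, assuming Alb, BlueTargetGroup, GreenTargetGroup and AlarmTopic are hypothetical logical IDs in the same template; FILL treats a target group with no datapoints as 0:

```yaml
# Sketch: one alarm covering both target groups via metric math.
# Alb, BlueTargetGroup, GreenTargetGroup and AlarmTopic are hypothetical logical IDs.
AnyUnHealthyHostAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Alarms when either target group has an unhealthy target
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
    EvaluationPeriods: 2
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref AlarmTopic
    Metrics:
      - Id: blue
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: UnHealthyHostCount
            Dimensions:
              - Name: LoadBalancer
                Value: !GetAtt Alb.LoadBalancerFullName
              - Name: TargetGroup
                Value: !GetAtt BlueTargetGroup.TargetGroupFullName
          Period: 60
          Stat: Maximum
      - Id: green
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: UnHealthyHostCount
            Dimensions:
              - Name: LoadBalancer
                Value: !GetAtt Alb.LoadBalancerFullName
              - Name: TargetGroup
                Value: !GetAtt GreenTargetGroup.TargetGroupFullName
          Period: 60
          Stat: Maximum
      # Only this expression is evaluated by the alarm (ReturnData: true).
      - Id: total
        Expression: FILL(blue, 0) + FILL(green, 0)
        Label: UnHealthyHostCount (blue + green)
        ReturnData: true
```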

Looking for a way to add new CloudWatch events to an existing Redshift cluster via CloudFormation

I already have a Redshift cluster running that I created with CloudFormation. Now I need to add a new CloudWatch alarm for this cluster, like the code below. How do I map the new alarm to the existing cluster?
This is for the existing AWS Redshift cluster:
Type: AWS::CloudWatch::Alarm
Properties:
  AlarmDescription: !Join [ " ", [ "Health status alarm for", !Ref RedshiftCluster, "Redshift Cluster"]]
  AlarmActions:
    - !Ref redshiftClusterSNSTopic
  MetricName: HealthStatus
  Namespace: AWS/Redshift
  Statistic: Average
  Period: 300
  EvaluationPeriods: 3
  Threshold: 1
  ComparisonOperator: LessThanThreshold
  Dimensions:
    - Name: ClusterIdentifier
      Value: !Ref CARedshiftCluster
I'm not sure how to do this; any help is appreciated.
You could give cloudkast a try. It is an online CloudFormation template generator that is regularly updated and, as of now, supports CloudWatch.
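If the cluster is defined in the same stack, the alarm can simply be added under that stack's Resources with a logical ID of its own, using !Ref on the cluster's logical ID (which returns the cluster identifier); note that the snippet refers to both RedshiftCluster and CARedshiftCluster, and these should point at the same resource. If the alarm should live in a separate stack, a minimal sketch is to pass the identifier in as a parameter (the parameter names are hypothetical):

```yaml
# Sketch: separate stack adding a HealthStatus alarm for a Redshift cluster
# created elsewhere. RedshiftClusterIdentifier and AlarmTopicArn are hypothetical.
Parameters:
  RedshiftClusterIdentifier:
    Type: String
    Description: Identifier (name) of the existing Redshift cluster
  AlarmTopicArn:
    Type: String
    Description: ARN of the SNS topic to notify
Resources:
  RedshiftHealthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: !Sub "Health status alarm for ${RedshiftClusterIdentifier} Redshift Cluster"
      AlarmActions:
        - !Ref AlarmTopicArn
      Namespace: AWS/Redshift
      MetricName: HealthStatus
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      # HealthStatus is 1 when the cluster is healthy and 0 when it is not.
      Threshold: 1
      ComparisonOperator: LessThanThreshold
      Dimensions:
        - Name: ClusterIdentifier
          Value: !Ref RedshiftClusterIdentifier
```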

Cloudwatch Get InstanceId

How do I get the InstanceId of all instances for a CloudWatch alarm? I am trying to create a CloudWatch alarm that sends an email when disk usage reaches 90%.
Resources:
  EC2DiskHealth:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: { "Fn::Join" : ["", [{ "Ref" : "AWSEBEnvironmentName" }, ": Disk Usage" ]]}
      Namespace: System/Linux
      MetricName: DiskSpaceAvailable
      Dimensions:
        - Name: InstanceId
          Value: { "Ref" : "instance-id" }
        - Name: Filesystem
          Value: /dev/xvda1
        - Name: MountPath
          Value: /
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5
      Threshold:
        Fn::GetOptionSetting:
          OptionName: ELBHealth
          DefaultValue: "90"
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - arn:aws:sns:awsregion:sns
      InsufficientDataActions:
        - arn:aws:sns:awsregion:sns
      OKActions:
        - arn:aws:sns:awsregion:sns
Expected result:
I should be able to get the instance ID so that the alarm works.
Dimensions:
  - Name: InstanceId
    Value: { "Ref" : "instance-id" }
Error:
Service:AmazonCloudFormation, Message:Template format error: Unresolved resource dependencies [instance-id] in the Resources block of the template
It appears that your situation is:
You have some existing Amazon EC2 instances
You are running some script/code on the instances that send a metric called DiskSpaceAvailable at regular intervals to Amazon CloudWatch
You wish to create a CloudFormation template
The template should create an alarm for every EC2 instance that fires when DiskSpaceAvailable crosses a certain threshold
This is not possible.
A CloudFormation template can create resources and refer to resources, but it cannot go out and discover resources, nor loop over discovered resources.
A template could, for example, create an instance and then add an alarm specifically for that instance. However, it won't auto-discover resources.
You can write an AWS Lambda-backed custom resource that does whatever you wish (you'd have to write the code), but then your code, not CloudFormation, would be creating the alarms.
Bottom line: your use case is best handled by your own code (Lambda or just plain scripts) rather than by CloudFormation.
Your goal is to send an email when CloudWatch detects that the instance's disk is over 90% used.
That is basic CloudWatch functionality: create the email notification in the CloudWatch alarm itself, set the email addresses and save.
More details here (the example is about CPU, but the principle is the same):
https://docs.aws.amazon.com/fr_fr/AmazonCloudWatch/latest/monitoring/US_AlarmAtThresholdEC2.html
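Since a template cannot discover existing instances, it can still take one instance ID as a parameter. A minimal sketch, assuming the instance runs the CloudWatch monitoring scripts that publish to System/Linux; those scripts emit DiskSpaceUtilization as a percentage alongside DiskSpaceAvailable (an amount of free space), so the percentage metric is the one to compare with 90. The NotificationEmail parameter is hypothetical, and the dimension values are copied from the question:

```yaml
# Sketch: disk usage alarm with e-mail notification for one given instance.
# Assumes DiskSpaceUtilization (percent) is published to System/Linux with
# these exact dimensions. NotificationEmail is a hypothetical parameter.
Parameters:
  InstanceId:
    Type: AWS::EC2::Instance::Id
  NotificationEmail:
    Type: String
Resources:
  DiskAlarmTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Protocol: email
          Endpoint: !Ref NotificationEmail
  DiskUsageAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Root filesystem is more than 90% used
      Namespace: System/Linux
      MetricName: DiskSpaceUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
        - Name: Filesystem
          Value: /dev/xvda1
        - Name: MountPath
          Value: /
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5
      Threshold: 90
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref DiskAlarmTopic
```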
If you want the instance ID from the instance itself, it is available through the instance metadata:
curl http://169.254.169.254/latest/meta-data/instance-id