Identifying AWS unhealthy hosts in target group - amazon-web-services

I have multiple target groups running behind an AWS ALB, and I have set up CloudWatch alarms to monitor the health of those target groups. Whenever a target in a group becomes unhealthy, the UnHealthyHostCount alarm fires and AWS sends an email alerting me that a target in that group is unhealthy. However, neither the alarm nor the email tells me which instance in the target group is unhealthy.
Is there a way to find out which target in the group caused the issue?
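One way to surface the specific target would be to query the target group's health directly when the alarm fires, e.g. via DescribeTargetHealth. A minimal sketch with the AWS CLI, filtering for targets that are not healthy (the target group ARN below is a placeholder):
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:<REGION>:<ACCOUNT>:targetgroup/my-targets/0123456789abcdef \
  --query 'TargetHealthDescriptions[?TargetHealth.State!=`healthy`].[Target.Id, TargetHealth.Reason, TargetHealth.Description]' \
  --output text
The same call could be made from a small Lambda subscribed to the alarm's SNS topic, so the notification itself includes the offending instance IDs.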

# Trigger the lambda function when an EC2 instance is terminated
resource "aws_cloudwatch_event_rule" "instance_terminated" {
name = "ChefCleanup-InstanceTerminated-rule"
description = "Capture each EC2 instance termination"
event_pattern = <<PATTERN
{
"source": [ "aws.ec2" ],
"detail-type": [ "EC2 Instance State-change Notification" ],
"detail": {
"state": [ "terminated" ]
}
}
PATTERN
}
# Set event target to call the lambda
resource "aws_cloudwatch_event_target" "instance_terminated" {
rule = "${aws_cloudwatch_event_rule.instance_terminated.name}"
target_id = "chef-cleanup-lambda"
arn = "${aws_lambda_function.chef_cleanup.arn}"
}

Related

AWS Eventbridge Filter ECS Cluster Deployment with Terraform

I am trying to build a simple EventBridge -> SNS -> AWS Chatbot pipeline to notify a Slack channel of any ECS deployment events. Below is my code:
resource "aws_cloudwatch_event_rule" "ecs_deployment" {
name = "${var.namespace}-${var.environment}-infra-ecs-deployment"
description = "This rule sends notification on the all app ECS Fargate deployments with respect to the environment."
event_pattern = <<EOF
{
"source": ["aws.ecs"],
"detail-type": ["ECS Deployment State Change"],
"detail": {
"clusterArn": [
{
"prefix": "arn:aws:ecs:<REGION>:<ACCOUNT>:cluster/${var.namespace}-${var.environment}-"
}
]
}
}
EOF
tags = {
Environment = "${var.environment}"
Origin = "terraform"
}
}
resource "aws_cloudwatch_event_target" "ecs_deployment" {
rule = aws_cloudwatch_event_rule.ecs_deployment.name
target_id = "${var.namespace}-${var.environment}-infra-ecs-deployment"
arn = aws_sns_topic.ecs_deployment.arn
}
resource "aws_sns_topic" "ecs_deployment" {
name = "${var.namespace}-${var.environment}-infra-ecs-deployment"
display_name = "${var.namespace} ${var.environment}"
}
resource "aws_sns_topic_policy" "default" {
arn = aws_sns_topic.ecs_deployment.arn
policy = data.aws_iam_policy_document.sns_topic_policy.json
}
data "aws_iam_policy_document" "sns_topic_policy" {
statement {
effect = "Allow"
actions = ["SNS:Publish"]
principals {
type = "Service"
identifiers = ["events.amazonaws.com"]
}
resources = [aws_sns_topic.ecs_deployment.arn]
}
}
Based on the above code, Terraform creates the EventBridge rule with an SNS target. From there, I create the AWS Chatbot in the console and subscribe it to the SNS topic.
The problem is that the rule only works when I remove the detail block; with the prefix filter in place, no notifications come through. What I want is to filter the events to only those coming from clusters with the prefix above.
Is this possible? Or did I do it the wrong way?
Any help is appreciated.
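One way to narrow this down, before changing the Terraform, might be to capture a real ECS Deployment State Change event (for example, by temporarily pointing a catch-all aws.ecs rule at a CloudWatch Logs group) and then test the pattern against it; if clusterArn does not actually appear under detail in the real event, the prefix filter can never match. A sketch, with the pattern and the captured event saved to local files (file names are placeholders):
aws events test-event-pattern \
  --event-pattern file://ecs-deployment-pattern.json \
  --event file://captured-ecs-event.json
This prints {"Result": true} only when the pattern would match the event.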

AWS Auto Scaling Group does not detect instance is unhealthy from ELB

I’m trying to get an AWS Auto Scaling Group to replace ‘unhealthy’ instances, but I can’t get it to work.
From the console, I've created a Launch Configuration and, from there, an Auto Scaling Group with an Application Load Balancer. I've kept all settings regarding the target group and listeners at their defaults. I've selected 'ELB' as an additional health check type for the Auto Scaling Group. I've consciously misconfigured the Launch Configuration to produce 'broken' instances -- there is no web server listening on the port configured in the listener.
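For reference, the ELB health check setting on the group corresponds to roughly this CLI call (group name as in the output below):
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name MyAutoScalingGroup \
  --health-check-type ELB \
  --health-check-grace-period 300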
The Auto Scaling Group seems to be configured correctly and is definitely aware of the load balancer. However, it thinks the instance it has spun up is healthy.
// output of aws autoscaling describe-auto-scaling-groups:
{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "MyAutoScalingGroup",
            "AutoScalingGroupARN": "arn:aws:autoscaling:eu-west-1:<accountId>:autoScalingGroup:3edc728f-0831-46b9-bbcc-16691adc8f44:autoScalingGroupName/MyAutoScalingGroup",
            "LaunchConfigurationName": "MyLaunchConfiguration",
            "MinSize": 1,
            "MaxSize": 3,
            "DesiredCapacity": 1,
            "DefaultCooldown": 300,
            "AvailabilityZones": [
                "eu-west-1b",
                "eu-west-1c",
                "eu-west-1a"
            ],
            "LoadBalancerNames": [],
            "TargetGroupARNs": [
                "arn:aws:elasticloadbalancing:eu-west-1:<accountId>:targetgroup/MyAutoScalingGroup-1/1e36c863abaeb6ff"
            ],
            "HealthCheckType": "ELB",
            "HealthCheckGracePeriod": 300,
            "Instances": [
                {
                    "InstanceId": "i-0b589d33100e4e515",
                    // ...
                    "LifecycleState": "InService",
                    "HealthStatus": "Healthy",
                    // ...
                }
            ],
            // ...
        }
    ]
}
The load balancer, however, is very much aware that the instance is unhealthy:
// output of aws elbv2 describe-target-health:
{
    "TargetHealthDescriptions": [
        {
            "Target": {
                "Id": "i-0b589d33100e4e515",
                "Port": 80
            },
            "HealthCheckPort": "80",
            "TargetHealth": {
                "State": "unhealthy",
                "Reason": "Target.Timeout",
                "Description": "Request timed out"
            }
        }
    ]
}
Did I just misunderstand the documentation? If not, what else needs to be done to get the Auto Scaling Group to understand that this instance is not healthy and replace it?
To be clear, when instances are marked unhealthy manually (i.e. using aws autoscaling set-instance-health), they are replaced as expected.
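For example:
aws autoscaling set-instance-health \
  --instance-id i-0b589d33100e4e515 \
  --health-status Unhealthy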
Explanation
If you have consciously misconfigured the instance from the start and the ELB health check has never passed, then the Auto Scaling Group does not yet acknowledge that your ELB/target group is up and running. See this page of the documentation:
After at least one registered instance passes the health checks, it enters the InService state.
And
If no registered instances pass the health checks (for example, due to a misconfigured health check), ... Amazon EC2 Auto Scaling doesn't terminate and replace the instances.
I configured this from scratch and arrived at the same behavior you describe. To verify that this is indeed the root cause, check the target group's state in the ASG: it is probably Added instead of InService.
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-load-balancer-target-groups --auto-scaling-group-name test-asg
{
    "LoadBalancerTargetGroups": [
        {
            "LoadBalancerTargetGroupARN": "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/asg-test-1/abc",
            "State": "Added"
        }
    ]
}
Resolution
To achieve the desired behavior, what I did was:
1. Run a simple web service on port 80, and ensure the security group allows the ELB to reach the EC2 instance (a minimal setup sketch follows this list).
2. Wait until the ELB reports the target as healthy, i.e. the server returns 200. You may need to create an empty index.html just to pass the health check.
3. Wait until the target group state becomes InService in the ASG.
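For Steps 1 and 2, a minimal sketch on an Amazon Linux 2 instance (assuming Apache; package and service names may differ on other AMIs):
# install and start a web server, and serve a trivial page for the health check
sudo yum install -y httpd
echo "ok" | sudo tee /var/www/html/index.html
sudo systemctl enable --now httpd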
For example, for Step 3:
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-load-balancer-target-groups --auto-scaling-group-name test-asg
{
    "LoadBalancerTargetGroups": [
        {
            "LoadBalancerTargetGroupARN": "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/test-asg-1-alb/abcdef",
            "State": "InService"
        }
    ]
}
Now that it is in service, turn off the web server and wait. Check often, though, because once the ASG detects the instance is unhealthy it will terminate it.
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-auto-scaling-groups
{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "test-asg",
            "AutoScalingGroupARN": "arn:aws:autoscaling:us-east-1:xxx:autoScalingGroup:abc-def-ghi:autoScalingGroupName/test-asg",
            ...
            "LoadBalancerNames": [],
            "TargetGroupARNs": [
                "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/test-asg-1-alb/abc"
            ],
            "HealthCheckType": "ELB",
            "HealthCheckGracePeriod": 300,
            "Instances": [
                {
                    "InstanceId": "i-04bed6ef3b2000326",
                    "InstanceType": "t2.micro",
                    "AvailabilityZone": "us-east-1b",
                    "LifecycleState": "Terminating",
                    "HealthStatus": "Unhealthy",
                    "LaunchTemplate": {
                        "LaunchTemplateId": "lt-0452c90319362cbc5",
                        "LaunchTemplateName": "test-template",
                        "Version": "1"
                    },
                    ...
                },
                ...
            ],
            ...
        }
    ]
}

AWS Cloudtrail Event for S3 Bucket in Terraform

I had quite a hard time setting up an automation with Elastic Beanstalk and CodePipeline...
I finally got it running; the main issue was the S3 CloudWatch event that triggers the start of the CodePipeline. I had missed the CloudTrail part, which is necessary, and I couldn't find it in any documentation.
So the current setup is:
S3 file gets uploaded -> a CloudWatch Event triggers the CodePipeline -> CodePipeline deploys to the Elastic Beanstalk env.
As I said, to get the CloudWatch Event trigger you need a CloudTrail trail like:
resource "aws_cloudtrail" "example" {
# ... other configuration ...
name = "codepipeline-source-trail" #"codepipeline-${var.project_name}-trail"
is_multi_region_trail = true
s3_bucket_name = "codepipeline-cloudtrail-placeholder-bucket-eu-west-1"
event_selector {
read_write_type = "WriteOnly"
include_management_events = true
data_resource {
type = "AWS::S3::Object"
values = ["${data.aws_s3_bucket.bamboo-deploy-bucket.arn}/${var.project_name}/file.zip"]
}
}
}
But this only creates a new trail. The problem is that AWS only allows a maximum of 5 trails. In the AWS console you can add multiple data events to one trail, but I couldn't manage to do this in Terraform. I tried to use the same name, but that just raises an error:
"Error creating CloudTrail: TrailAlreadyExistsException: Trail codepipeline-source-trail already exists for customer: XXXX"
I tried my best to explain my problem. Not sure if it is understandable.
In a nutshell: I want to add an S3 data event to an existing CloudTrail trail with Terraform.
Thx for help,
Daniel
As I said, to get the CloudWatch Event trigger you need a CloudTrail trail like:
You do not need multiple CloudTrail trails to invoke a CloudWatch Event. You can create service-specific rules as well:
Create a CloudWatch Events rule for an Amazon S3 source (console)
Then a CloudWatch Events rule invokes CodePipeline as its target. Let's say you created this event rule:
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"]
  }
}
You add CodePipeline as a target for this rule and, eventually, CodePipeline deploys to the Elastic Beanstalk env.
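In Terraform that target would be an aws_cloudwatch_event_target with a role_arn; via the CLI it would look roughly like the following, where the rule name, pipeline ARN, and role are placeholders and the role must allow codepipeline:StartPipelineExecution:
aws events put-targets \
  --rule codepipeline-s3-source-rule \
  --targets 'Id=codepipeline,Arn=arn:aws:codepipeline:eu-west-1:<ACCOUNT>:my-pipeline,RoleArn=arn:aws:iam::<ACCOUNT>:role/events-invoke-codepipeline'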
Have you tried adding multiple data_resource blocks to your current trail instead of adding a new trail with the same name?
resource "aws_cloudtrail" "example" {
# ... other configuration ...
name = "codepipeline-source-trail" #"codepipeline-${var.project_name}-trail"
is_multi_region_trail = true
s3_bucket_name = "codepipeline-cloudtrail-placeholder-bucket-eu-west-1"
event_selector {
read_write_type = "WriteOnly"
include_management_events = true
data_resource {
type = "AWS::S3::Object"
values = ["${data.aws_s3_bucket.bamboo-deploy-bucket.arn}/${var.project_A}/file.zip"]
}
data_resource {
type = "AWS::S3::Object"
values = ["${data.aws_s3_bucket.bamboo-deploy-bucket.arn}/${var.project_B}/fileB.zip"]
}
}
}
You should be able to add up to 250 data resources (across all event selectors in a trail) and up to 5 event selectors to your current trail (CloudTrail quota limits).
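Outside of Terraform, the same data events can also be attached to an existing trail with put-event-selectors. Note that this call replaces the trail's current event selectors, so the full desired set has to be passed each time (bucket name and object keys below are placeholders):
aws cloudtrail put-event-selectors \
  --trail-name codepipeline-source-trail \
  --event-selectors '[{
    "ReadWriteType": "WriteOnly",
    "IncludeManagementEvents": true,
    "DataResources": [{
      "Type": "AWS::S3::Object",
      "Values": [
        "arn:aws:s3:::bamboo-deploy-bucket/projectA/file.zip",
        "arn:aws:s3:::bamboo-deploy-bucket/projectB/fileB.zip"
      ]
    }]
  }]'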

Is there a way to find out if a EC2 instance is associated with Auto scaling group

Is there a way to find out if an EC2 instance is associated with Auto Scaling Group?
You can use the describe-auto-scaling-instances command to check which Auto Scaling group an instance is attached to.
For example, for instance ID i-4ba0837f you could run the following command:
aws autoscaling describe-auto-scaling-instances --instance-ids i-4ba0837f
An example response when the instance is attached to an Auto Scaling group is below:
{
    "AutoScalingInstances": [
        {
            "ProtectedFromScaleIn": false,
            "AvailabilityZone": "us-west-2c",
            "InstanceId": "i-4ba0837f",
            "AutoScalingGroupName": "my-auto-scaling-group",
            "HealthStatus": "HEALTHY",
            "LifecycleState": "InService",
            "LaunchConfigurationName": "my-launch-config"
        }
    ]
}
However, if it is not attached to any Auto Scaling group, this will be an empty list:
{
"AutoScalingInstances": []
}
If this returns no results, then the instance is not part of an Auto Scaling group.
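If you only need the group name (or a quick yes/no), a --query keeps it to one line; this prints the group name, or None when the instance is not in a group:
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-4ba0837f \
  --query 'AutoScalingInstances[0].AutoScalingGroupName' \
  --output text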
This will also be available in the SDKs:
Boto3
NodeJS
Java
C#
PHP
You can use the following AWS CLI command:
aws autoscaling describe-auto-scaling-instances --instance-ids i-exampleid
If the instance is part of an auto scaling group, the result will give you the details.
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/describe-auto-scaling-instances.html
You can also look at the instance tags to find out whether it belongs to an ASG. An EC2 instance in an ASG will always have the aws:autoscaling:groupName tag.
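For example, a quick way to read that tag with the CLI (instance ID is a placeholder); the output is the ASG name, or empty if the instance is not in one:
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].Tags[?Key==`aws:autoscaling:groupName`].Value' \
  --output text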
You may use describe-auto-scaling-instances
aws autoscaling describe-auto-scaling-instances --instance-ids your-instance-id
It will print something like this if the instance belongs to an Auto Scaling group:
{
    "AutoScalingInstances": [
        {
            "InstanceId": "some-instance-id",
            "InstanceType": "m4.large",
            "AutoScalingGroupName": "awseb-some-name",
            "AvailabilityZone": "eu-west-1c",
            "LifecycleState": "InService",
            "HealthStatus": "HEALTHY",
            "LaunchTemplate": {
                "LaunchTemplateId": "lt-04a2fffdesa",
                "LaunchTemplateName": "AWSEBEC2LaunchTemplate_foobar",
                "Version": "2"
            },
            "ProtectedFromScaleIn": false
        }
    ]
}
If not:
{
"AutoScalingInstances": []
}

AWS ECS capacity provider using terraform

I'm trying to add a capacity provider for an ECS cluster to my existing Terraform-managed infrastructure. terraform apply returns no errors and the new resource is added to the state file, but, surprise surprise, it doesn't appear in the AWS console (ECS cluster -> Capacity providers -> No results).
If I use the AWS CLI to list this resource, it shows up fine; rebuilding everything doesn't help either.
Has anyone succeeded in adding a capacity provider for ECS using Terraform?
(I'm using provider version 2.45.0.)
Thank you!
Please beware of [ECS] Add the ability to delete an ASG capacity provider #632: once created, a capacity provider cannot be deleted, only updated.
resource "aws_ecs_cluster" "this" {
name = "${var.PROJECT}_${var.ENV}_${local.ecs_cluster_name}"
# List of short names of one or more capacity providers
capacity_providers = local.enable_ecs_cluster_auto_scaling == true ? aws_ecs_capacity_provider.asg[*].name : []
}
resource "aws_ecs_capacity_provider" "asg" {
count = local.enable_ecs_cluster_auto_scaling ? 1 : 0
name = "${var.PROJECT}-${var.ENV}-ecs-cluster-capacity-provider"
auto_scaling_group_provider {
auto_scaling_group_arn = local.asg_ecs_cluster_arn
#--------------------------------------------------------------------------------
# When using managed termination protection, managed scaling must also be used otherwise managed termination protection will not work.
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-capacity-providers.html#capacity-providers-considerations
# Otherwise Error:
# error creating capacity provider: ClientException: The managed termination protection setting for the capacity provider is invalid.
# To enable managed termination protection for a capacity provider, the Auto Scaling group must have instance protection from scale in enabled.
#--------------------------------------------------------------------------------
managed_termination_protection = "ENABLED"
managed_scaling {
#--------------------------------------------------------------------------------
# Whether auto scaling is managed by ECS. Valid values are ENABLED and DISABLED.
# When creating a capacity provider, you can optionally enable managed scaling.
# When managed scaling is enabled, ECS manages the scale-in/out of the ASG.
#--------------------------------------------------------------------------------
status = "ENABLED"
minimum_scaling_step_size = local.ecs_cluster_autoscaling_min_step_size
maximum_scaling_step_size = local.ecs_cluster_autoscaling_max_step_size
target_capacity = local.ecs_cluster_autoscaling_target_capacity
}
}
}
This worked; I confirmed that auto scaling decreased the number of EC2 instances due to low resource usage and that the service tasks (Docker containers) were relocated to the remaining running EC2 instances.
AWS bug (or design)
However, after terraform destroy, running terraform apply again fails with:
ClientException: The specified capacity provider already exists.
Once you fall into this situation, you probably need to disable the capacity provider in the Terraform scripts (this appears to delete the capacity provider resource, but it actually still exists due to the AWS bug).
Hence, a way to get around this is to add the immutable capacity provider back to the cluster using the CLI, provided the Auto Scaling group that the capacity provider points to still exists.
$ CAPACITY_PROVIDER=$(aws ecs describe-capacity-providers | jq -r '.capacityProviders[] | select(.status=="ACTIVE" and .name!="FARGATE" and .name!="FARGATE_SPOT") | .name')
$ aws ecs put-cluster-capacity-providers --cluster YOUR_ECS_CLUSTER --capacity-providers ${CAPACITY_PROVIDER} --default-capacity-provider-strategy capacityProvider=${CAPACITY_PROVIDER},base=1,weight=1
{
    "cluster": {
        "clusterArn": "arn:aws:ecs:us-east-2:200506027189:cluster/YOUR_ECS_CLUSTER",
        "clusterName": "YOUR_ECS_CLUSTER",
        "status": "ACTIVE",
        "registeredContainerInstancesCount": 0,
        "runningTasksCount": 0,
        "pendingTasksCount": 0,
        "activeServicesCount": 0,
        "statistics": [],
        "tags": [],
        "settings": [
            {
                "name": "containerInsights",
                "value": "disabled"
            }
        ],
        "capacityProviders": [
            "YOUR_CAPACITY_PROVIDER"
        ],
        "defaultCapacityProviderStrategy": [
            {
                "capacityProvider": "YOUR_CAPACITY_PROVIDER",
                "weight": 1,
                "base": 1
            }
        ],
        "attachments": [
            {
                "id": "628ee192-4d0f-44be-85c0-049d796ed65c",
                "type": "asp",
                "status": "PRECREATED",
                "details": [
                    {
                        "name": "capacityProviderName",
                        "value": "YOUR_CAPACITY_PROVIDER"
                    },
                    {
                        "name": "scalingPlanName",
                        "value": "ECSManagedAutoScalingPlan-89682dcf-bb53-492f-8329-25d75458ea11"
                    }
                ]
            }
        ],
        "attachmentsStatus": "UPDATE_IN_PROGRESS"   <----- Takes time for the capacity provider to show up in the ECS cluster console
    }
}
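Since attachmentsStatus can stay in UPDATE_IN_PROGRESS for a while, one way to poll until the provider shows up (cluster name is a placeholder):
aws ecs describe-clusters \
  --clusters YOUR_ECS_CLUSTER \
  --include ATTACHMENTS \
  --query 'clusters[0].{capacityProviders: capacityProviders, attachmentsStatus: attachmentsStatus}'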
Besides creating the new aws_ecs_capacity_provider resource, the capacity_providers argument also needs to be added to the aws_ecs_cluster resource, as shown in the code above.