I'm trying to add an ECS cluster capacity provider to my existing infrastructure managed by Terraform. terraform apply returns no errors and the new resource is added to the state file, but, surprise surprise, it doesn't appear in the AWS console (ECS cluster -> Capacity provider -> No results).
Listing the resource with the AWS CLI works fine, and rebuilding everything from scratch doesn't help either.
Has anyone succeeded in adding a capacity provider for ECS using Terraform?
(I'm using provider version "2.45.0")
Thank you!
Please beware of "[ECS] Add the ability to delete an ASG capacity provider" #632. Once created, a capacity provider cannot be deleted, only updated.
resource "aws_ecs_cluster" "this" {
name = "${var.PROJECT}_${var.ENV}_${local.ecs_cluster_name}"
# List of short names of one or more capacity providers
capacity_providers = local.enable_ecs_cluster_auto_scaling == true ? aws_ecs_capacity_provider.asg[*].name : []
}
resource "aws_ecs_capacity_provider" "asg" {
count = local.enable_ecs_cluster_auto_scaling ? 1 : 0
name = "${var.PROJECT}-${var.ENV}-ecs-cluster-capacity-provider"
auto_scaling_group_provider {
auto_scaling_group_arn = local.asg_ecs_cluster_arn
#--------------------------------------------------------------------------------
# When using managed termination protection, managed scaling must also be used otherwise managed termination protection will not work.
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-capacity-providers.html#capacity-providers-considerations
# Otherwise Error:
# error creating capacity provider: ClientException: The managed termination protection setting for the capacity provider is invalid.
# To enable managed termination protection for a capacity provider, the Auto Scaling group must have instance protection from scale in enabled.
#--------------------------------------------------------------------------------
managed_termination_protection = "ENABLED"
managed_scaling {
#--------------------------------------------------------------------------------
# Whether auto scaling is managed by ECS. Valid values are ENABLED and DISABLED.
# When creating a capacity provider, you can optionally enable managed scaling.
# When managed scaling is enabled, ECS manages the scale-in/out of the ASG.
#--------------------------------------------------------------------------------
status = "ENABLED"
minimum_scaling_step_size = local.ecs_cluster_autoscaling_min_step_size
maximum_scaling_step_size = local.ecs_cluster_autoscaling_max_step_size
target_capacity = local.ecs_cluster_autoscaling_target_capacity
}
}
}
This worked; I confirmed that auto scaling scaled in EC2 instances due to low resource usage, and that the service tasks (Docker containers) were relocated to the remaining running EC2 instances.
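Note that the CLI workaround further below also sets a default capacity provider strategy. If you want Terraform to own that part too, the aws_ecs_cluster resource in this provider version should, as far as I recall, also accept a default_capacity_provider_strategy block. A sketch extending the cluster resource above (the weight/base values are illustrative, and it assumes the auto scaling flag is enabled so that asg[0] exists):

resource "aws_ecs_cluster" "this" {
  name               = "${var.PROJECT}_${var.ENV}_${local.ecs_cluster_name}"
  capacity_providers = aws_ecs_capacity_provider.asg[*].name

  # Sketch: route tasks/services onto the capacity provider by default
  # (mirrors the --default-capacity-provider-strategy CLI flag used later).
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.asg[0].name
    weight            = 1
    base              = 1
  }
}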
AWS bug (or design)
However, after terraform destroy, when trying to run terraform apply again:
ClientException: The specified capacity provider already exists.
Once you have fallen into this situation, you probably need to disable the capacity provider in the Terraform scripts (it will appear to delete the capacity provider resource, but the provider actually still exists due to the AWS bug).
Hence, the way to work around it is probably to re-attach the immutable capacity provider to the cluster using the CLI, provided the Auto Scaling group that the capacity provider points to still exists.
$ CAPACITY_PROVIDER=$(aws ecs describe-capacity-providers | jq -r '.capacityProviders[] | select(.status=="ACTIVE" and .name!="FARGATE" and .name!="FARGATE_SPOT") | .name')
$ aws ecs put-cluster-capacity-providers --cluster YOUR_ECS_CLUSTER --capacity-providers ${CAPACITY_PROVIDER} --default-capacity-provider-strategy capacityProvider=${CAPACITY_PROVIDER},base=1,weight=1
{
"cluster": {
"clusterArn": "arn:aws:ecs:us-east-2:200506027189:cluster/YOUR_ECS_CLUSTER",
"clusterName": "YOUR_ECS_CLUSTER",
"status": "ACTIVE",
"registeredContainerInstancesCount": 0,
"runningTasksCount": 0,
"pendingTasksCount": 0,
"activeServicesCount": 0,
"statistics": [],
"tags": [],
"settings": [
{
"name": "containerInsights",
"value": "disabled"
}
],
"capacityProviders": [
"YOUR_CAPACITY_PROVIDER"
],
"defaultCapacityProviderStrategy": [
{
"capacityProvider": "YOUR_CAPACITY_PROVIDER",
"weight": 1,
"base": 1
}
],
"attachments": [
{
"id": "628ee192-4d0f-44be-85c0-049d796ed65c",
"type": "asp",
"status": "PRECREATED",
"details": [
{
"name": "capacityProviderName",
"value": "YOUR_CAPACITY_PROVIDER"
},
{
"name": "scalingPlanName",
"value": "ECSManagedAutoScalingPlan-89682dcf-bb53-492f-8329-25d75458ea11"
}
]
}
],
"attachmentsStatus": "UPDATE_IN_PROGRESS" <----- Takes time for the capacity provider to show up in ECS clsuter console
}
}
In addition to creating the new resource, a new argument also needs to be added to the ecs_cluster module: "capacity_providers".
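A minimal sketch of that wiring, assuming a hypothetical ecs_cluster module that wraps the aws_ecs_cluster resource (the module path and variable names here are illustrative, not from the original code):

# modules/ecs_cluster/variables.tf (hypothetical module)
variable "cluster_name" {
  type = string
}

variable "capacity_providers" {
  description = "Short names of the capacity providers to attach to the cluster"
  type        = list(string)
  default     = []
}

# modules/ecs_cluster/main.tf
resource "aws_ecs_cluster" "this" {
  name               = var.cluster_name
  capacity_providers = var.capacity_providers
}

# Caller: forward the capacity provider names into the module
module "ecs_cluster" {
  source             = "./modules/ecs_cluster"
  cluster_name       = "${var.PROJECT}_${var.ENV}_${local.ecs_cluster_name}"
  capacity_providers = aws_ecs_capacity_provider.asg[*].name
}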
Related
I'm using the code below to launch Fargate tasks. I have a doubt: if Fargate Spot capacity is not available, does it launch a normal (on-demand) Fargate task by default, or raise an exception?
I went through the boto3 docs, but I'm not able to find the answer.
import boto3

ecs_client = boto3.client('ecs')

# Set the parameters for the task
cluster_name = 'my-cluster'
task_definition = 'my-def'
capacity_provider = 'FARGATE_SPOT'

# Launch the task
response = ecs_client.run_task(
    cluster=cluster_name,
    taskDefinition=task_definition,
    capacityProviderStrategy=[
        {
            'capacityProvider': capacity_provider
        }
    ],
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': [
                'subnet-001',
            ],
            'securityGroups': [
                'sg-001',
            ],
            'assignPublicIp': 'ENABLED'
        }
    },
)
No, it will not fall back to regular FARGATE (on-demand) capacity, and you will not get an exception. You can only see in the service metrics that Spot capacity is not available.
The solution is to combine FARGATE and FARGATE_SPOT together with weights.
capacityProviderStrategy=[
    {
        'capacityProvider': 'FARGATE_SPOT',
        'weight': 4
    },
    {
        'capacityProvider': 'FARGATE',
        'weight': 1
    },
]
That puts roughly 20% of tasks on FARGATE and 80% on FARGATE_SPOT.
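If the tasks are run as an ECS service managed in Terraform rather than launched ad hoc with run_task, the same weighting can be expressed on aws_ecs_service. A sketch with placeholder names (the base value is an addition here, to keep at least one task on regular Fargate):

resource "aws_ecs_service" "app" {
  name            = "my-service"
  cluster         = "my-cluster"
  task_definition = "my-def"
  desired_count   = 5

  # Roughly 80% of tasks on FARGATE_SPOT and 20% on FARGATE;
  # base keeps a minimum of one task on on-demand Fargate.
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 1
  }

  network_configuration {
    subnets          = ["subnet-001"]
    security_groups  = ["sg-001"]
    assign_public_ip = true
  }
}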
I have an AWS SageMaker domain in my account created via Terraform. The resource was modified outside of Terraform. The modification was the equivalent of the following:
aws sagemaker update-domain --domain-id d-domainid123 --default-user-settings '{"KernelGatewayAppSettings": { "CustomImages": [ { ... } ] } }'
Ever since, all terraform plan operations want to replace the AWS SageMaker domain:
  # module.main.aws_sagemaker_domain.default must be replaced
-/+ resource "aws_sagemaker_domain" "default" {
      ~ arn = "arn:aws:sagemaker:eu-central-1:000111222333:domain/d-domainid123" -> (known after apply)
        ...
        # (6 unchanged attributes hidden)
      ~ default_user_settings {
            # (2 unchanged attributes hidden)
          - kernel_gateway_app_settings { # forces replacement
              - custom_images = [ ... ]
            }
        }
    }
My goal is to reconcile the situation without Terraform or me needing to create a new domain. I can't modify the Terraform sources to match the state of the SageMaker domain because that would force the recreation of domains in other accounts provisioned from the same Terraform source code.
I want to issue an aws CLI command that updates the domain and removes the "KernelGatewayAppSettings": { ... } key completely from the "DefaultUserSettings" of the SageMaker domain. Is there a way to do this?
I tried the following, but the empty object is still there, so they did not work.
aws sagemaker update-domain --domain-id d-domainid123 --default-user-settings '{"KernelGatewayAppSettings": {} }'
aws sagemaker update-domain --domain-id d-domainid123 --default-user-settings '{"KernelGatewayAppSettings": null }'
# Still:
aws sagemaker describe-domain --domain-id d-domainid123
{
"DomainArn": ...,
"DomainId": ...,
...
"DefaultUserSettings": {
"ExecutionRole": "arn:aws:iam::0001112233444:role/SageMakerStudioExecutionRole",
"SecurityGroups": [
"..."
],
"KernelGatewayAppSettings": {
"CustomImages": []
}
},
...
}
One option you have is to use the lifecycle meta-argument to ignore out-of-band changes to the resource.
lifecycle {
  ignore_changes = [
    default_user_settings
  ]
}
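In context, that block sits inside the domain resource. A sketch with placeholder values, plus (commented out) a narrower variant that may work on newer Terraform versions:

resource "aws_sagemaker_domain" "default" {
  domain_name = "my-domain"   # placeholder values, not the real configuration
  auth_mode   = "IAM"
  vpc_id      = var.vpc_id
  subnet_ids  = var.subnet_ids

  default_user_settings {
    execution_role = var.execution_role_arn
  }

  lifecycle {
    # Ignore out-of-band edits to the whole default_user_settings block.
    ignore_changes = [
      default_user_settings
    ]
    # On newer Terraform versions you may be able to narrow this to just the
    # drifted nested block instead (untested assumption):
    # ignore_changes = [
    #   default_user_settings[0].kernel_gateway_app_settings
    # ]
  }
}

The trade-off is that intentional changes to default_user_settings in the Terraform sources will also be ignored until the lifecycle block is removed again.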
AWS ECS cluster services do not start new tasks.
Already checked:
ECS EC2 instances are registered, active, full CPU and memory available, ECS agent is connected.
There are no events in the ECS service "Events" tab: nothing about registering, starting or stopping, no errors, it's just empty.
Registered EC2 instances are set up correctly; in another cluster the same AMI works perfectly.
The task definition is correct; it worked a day before and no changes have happened since.
Checked that the service role contains all relevant policies.
Querying ECS with the AWS CLI (aws ecs describe-services --services my-service --cluster my-cluster) shows that the deployment rollout is constantly IN_PROGRESS and stays like this.
Full response with configuration is here (I've substituted real names and IDs):
{
"serviceArn": "arn:aws:ecs:eu-central-1:my-account-id:service/my-cluster/my-service",
"serviceName": "my-service",
"clusterArn": "arn:aws:ecs:eu-central-1:my-account-id:cluster/my-cluster",
"loadBalancers": [
{
"targetGroupArn": "arn:aws:elasticloadbalancing:eu-central-1:my-account-id:targetgroup/my-service-lb/load-balancer-id",
"containerName": "my-service",
"containerPort": 8065
}
],
"serviceRegistries": [
{
"registryArn": "arn:aws:servicediscovery:eu-central-1:my-account-id:service/srv-srv_id",
"containerName": "my-service",
"containerPort": 8065
}
],
"status": "ACTIVE",
"desiredCount": 1,
"runningCount": 0,
"pendingCount": 0,
"launchType": "EC2",
"taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
"deploymentConfiguration": {
"deploymentCircuitBreaker": {
"enable": false,
"rollback": false
},
"maximumPercent": 200,
"minimumHealthyPercent": 100
},
"deployments": [
{
"id": "ecs-svc/deployment_id",
"status": "PRIMARY",
"taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
"desiredCount": 1,
"pendingCount": 0,
"runningCount": 0,
"failedTasks": 0,
"createdAt": "2022-06-28T09:15:08.241000+02:00",
"updatedAt": "2022-06-28T09:15:08.241000+02:00",
"launchType": "EC2",
"rolloutState": "IN_PROGRESS",
"rolloutStateReason": "ECS deployment ecs-svc/deployment_id in progress."
}
],
"roleArn": "arn:aws:iam::my-account-id:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS",
"events": [],
"createdAt": "2022-06-28T09:15:08.241000+02:00",
"placementConstraints": [],
"placementStrategy": [
{
"type": "spread",
"field": "attribute:ecs.availability-zone"
}
],
"healthCheckGracePeriodSeconds": 120,
"schedulingStrategy": "REPLICA",
"createdBy": "arn:aws:iam::my-account-id:role/my-role",
"enableECSManagedTags": false,
"propagateTags": "NONE",
"enableExecuteCommand": false
}
The ECS service and the service discovery entry are created using Terraform, and the service definition is:
resource "aws_service_discovery_service" "ecs_discovery_service" {
name = var.service_name
dns_config {
namespace_id = var.service_discovery_hosted_zone_id
dns_records {
ttl = 10
type = "SRV"
}
}
health_check_custom_config {
failure_threshold = 1
}
}
resource "aws_ecs_service" "ecs_service" {
name = var.service_name
cluster = var.ecs_cluster_id
task_definition = var.task_definition_arn
desired_count = var.desired_count
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
health_check_grace_period_seconds = var.health_check_grace_period_seconds
target_group_arn = aws_lb_target_group.target_group.arn
container_name = var.service_name
container_port = var.service_container_port
ordered_placement_strategy {
type = "spread"
field = "attribute:ecs.availability-zone"
}
service_registries {
registry_arn = aws_service_discovery_service.ecs_discovery_service.arn
container_name = var.service_name
container_port = var.service_container_port
}
}
This code used to work fine, but without any changes to the infrastructure, after destroying and re-applying the infrastructure code, ECS does not start any new tasks.
I could narrow the problem down to service discovery: if I remove the service_registries section, the tasks start as normal.
Removing service discovery works around the issue, however it's not a proper solution and I don't understand the root cause of the problem.
Again, the service role has the required permissions for service discovery:
"servicediscovery:DeregisterInstance",
"servicediscovery:Get*",
"servicediscovery:List*",
"servicediscovery:RegisterInstance",
"servicediscovery:UpdateInstanceCustomHealthStatus"
I can't find any way to trace this strange behaviour and want to ask you for help:
Could you give me any hints on what/where I could check? I've gone through multiple troubleshooting guides, however all of them rely on events in the ECS service and I don't have any there; anything else I had in mind has been checked.
Maybe you know why service discovery could block ECS from starting new tasks? I thought ECS adds an SRV record to the registry when it starts the container and the container is healthy, however I could not see that any containers had been started at all.
I would be very thankful for any hints and let me know if you need any details.
Have a nice day and best regards.
I’m trying to get an AWS Auto Scaling Group to replace ‘unhealthy’ instances, but I can’t get it to work.
From the console, I’ve created a Launch Configuration and, from there, an Auto Scaling Group with an Application Load Balancer. I've kept all settings regarding the target group and listeners the same as the default settings. I’ve selected ‘ELB’ as an additional health check type for the Auto Scaling Group. I’ve consciously misconfigured the Launch Configuration to result in ‘broken’ instances -- there is no web server to listen to the port configured in the listener.
The Auto Scaling Group seems to be configured correctly and is definitely aware of the load balancer. However, it thinks the instance it has spun up is healthy.
// output of aws autoscaling describe-auto-scaling-groups:
{
"AutoScalingGroups": [
{
"AutoScalingGroupName": "MyAutoScalingGroup",
"AutoScalingGroupARN": "arn:aws:autoscaling:eu-west-1:<accountId>:autoScalingGroup:3edc728f-0831-46b9-bbcc-16691adc8f44:autoScalingGroupName/MyAutoScalingGroup",
"LaunchConfigurationName": "MyLaunchConfiguration",
"MinSize": 1,
"MaxSize": 3,
"DesiredCapacity": 1,
"DefaultCooldown": 300,
"AvailabilityZones": [
"eu-west-1b",
"eu-west-1c",
"eu-west-1a"
],
"LoadBalancerNames": [],
"TargetGroupARNs": [
"arn:aws:elasticloadbalancing:eu-west-1:<accountId>:targetgroup/MyAutoScalingGroup-1/1e36c863abaeb6ff"
],
"HealthCheckType": "ELB",
"HealthCheckGracePeriod": 300,
"Instances": [
{
"InstanceId": "i-0b589d33100e4e515",
// ...
"LifecycleState": "InService",
"HealthStatus": "Healthy",
// ...
}
],
// ...
}
]
}
The load balancer, however, is very much aware that the instance is unhealthy:
// output of aws elbv2 describe-target-health:
{
"TargetHealthDescriptions": [
{
"Target": {
"Id": "i-0b589d33100e4e515",
"Port": 80
},
"HealthCheckPort": "80",
"TargetHealth": {
"State": "unhealthy",
"Reason": "Target.Timeout",
"Description": "Request timed out"
}
}
]
}
Did I just misunderstand the documentation? If not, what else is needed to be done to get the Auto Scaling Group to understand that this instance is not healthy and refresh it?
To be clear, when instances are marked unhealthy manually (i.e. using aws autoscaling set-instance-health), they are refreshed as is expected.
Explanation
If you have consciously misconfigured the instance from the start and the ELB Health Check has never passed, then the Auto Scaling Group does not acknowledge yet that your ELB/Target Group is up and running. See this page of the documentation.
After at least one registered instance passes the health checks, it enters the InService state.
And
If no registered instances pass the health checks (for example, due to a misconfigured health check), ... Amazon EC2 Auto Scaling doesn't terminate and replace the instances.
I configured this from scratch and arrived at the same behavior you described. To verify that this is indeed the root cause, check the target group status in the ASG. It is probably in the Added state instead of InService.
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-load-balancer-target-groups --auto-scaling-group-name test-asg
{
"LoadBalancerTargetGroups": [
{
"LoadBalancerTargetGroupARN": "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/asg-test-1/abc",
"State": "Added"
}
]
}
Resolution
To achieve the desired behavior, what I did was:
1. Run a simple web service on port 80. Ensure the Security Group is open for the ELB to talk to EC2.
2. Wait until the ELB status is healthy. Ensure the server is returning 200. You may need to create an empty index.html just to pass the health check.
3. Wait until the target group status has become InService in the ASG.
For example, for Step 3:
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-load-balancer-target-groups --auto-scaling-group-name test-asg
{
"LoadBalancerTargetGroups": [
{
"LoadBalancerTargetGroupARN": "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/test-asg-1-alb/abcdef",
"State": "InService"
}
]
}
Now that it is in service, turn off the web server and wait. Check often, though, because once the ASG detects the instance is unhealthy it will terminate it.
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-auto-scaling-groups
{
"AutoScalingGroups": [
{
"AutoScalingGroupName": "test-asg",
"AutoScalingGroupARN": "arn:aws:autoscaling:us-east-1:xxx:autoScalingGroup:abc-def-ghi:autoScalingGroupName/test-asg",
...
"LoadBalancerNames": [],
"TargetGroupARNs": [
"arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/test-asg-1-alb/abc"
],
"HealthCheckType": "ELB",
"HealthCheckGracePeriod": 300,
"Instances": [
{
"InstanceId": "i-04bed6ef3b2000326",
"InstanceType": "t2.micro",
"AvailabilityZone": "us-east-1b",
"LifecycleState": "Terminating",
"HealthStatus": "Unhealthy",
"LaunchTemplate": {
"LaunchTemplateId": "lt-0452c90319362cbc5",
"LaunchTemplateName": "test-template",
"Version": "1"
},
...
},
...
]
}
Context
I am running an application (Apache Airflow) on EKS that spins up new workers to fulfill new tasks. Every worker requires a new pod. I am afraid of running out of memory and/or CPU when several workers are being spawned. My objective is to trigger auto-scaling.
What I have tried
I am using Terraform for provisioning (I'm also happy with answers that are not in Terraform, which I can conceptually transform into Terraform code).
I have set up a Fargate profile like:
# Create EKS Fargate profile
resource "aws_eks_fargate_profile" "airflow" {
  cluster_name           = module.eks_cluster.cluster_id
  fargate_profile_name   = "${var.project_name}-fargate-${var.env_name}"
  pod_execution_role_arn = aws_iam_role.fargate_iam_role.arn
  subnet_ids             = var.private_subnet_ids

  selector {
    namespace = "fargate"
  }

  tags = {
    Terraform   = "true"
    Project     = var.project_name
    Environment = var.env_name
  }
}
My policy for auto scaling the nodes:
# Create IAM Policy for node autoscaling
resource "aws_iam_policy" "node_autoscaling_pol" {
  name   = "${var.project_name}-node-autoscaling-${var.env_name}"
  policy = data.aws_iam_policy_document.node_autoscaling_pol_doc.json
}

# Create autoscaling policy
data "aws_iam_policy_document" "node_autoscaling_pol_doc" {
  statement {
    actions = [
      "autoscaling:DescribeAutoScalingGroups",
      "autoscaling:DescribeAutoScalingInstances",
      "autoscaling:DescribeLaunchConfigurations",
      "autoscaling:DescribeTags",
      "autoscaling:SetDesiredCapacity",
      "autoscaling:TerminateInstanceInAutoScalingGroup",
      "ec2:DescribeLaunchTemplateVersions"
    ]
    effect    = "Allow"
    resources = ["*"]
  }
}
And finally the EKS cluster module (just a snippet for brevity):
# Create EKS Cluster
module "eks_cluster" {
  cluster_name = "${var.project_name}-${var.env_name}"

  # Assigning worker groups
  worker_groups = [
    {
      instance_type = var.nodes_instance_type_1
      asg_max_size  = 1
      name          = "${var.project_name}-${var.env_name}"
    }
  ]
}
Question
Is increasing asg_max_size sufficient for auto scaling? I have a feeling that I need to set something along the lines of "When memory exceeds X, do Y", but I am not sure.
I don't have much experience with advanced monitoring/metrics tools, so a somewhat simple solution that does basic auto-scaling would be the best fit for my needs =)
This is handled by a tool called cluster-autoscaler. You can find the EKS guide for it at https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html or the project itself at https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
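Since the rest of your stack is in Terraform, one common way to install cluster-autoscaler is through the Terraform Helm provider. A sketch, assuming the Helm provider is already configured against the EKS cluster and using the commonly documented chart values (var.aws_region is an assumed variable; IAM/IRSA wiring is omitted):

resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  namespace  = "kube-system"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"

  # Let the autoscaler discover the ASGs that belong to this cluster.
  set {
    name  = "autoDiscovery.clusterName"
    value = module.eks_cluster.cluster_id
  }

  set {
    name  = "awsRegion"
    value = var.aws_region
  }
}

For auto-discovery to work, the worker group ASGs typically also need the k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster-name> tags, and the nodes (or an IRSA role) need the autoscaling permissions you already defined above. With that in place, asg_max_size becomes the ceiling the autoscaler can scale up to, and the scale-up trigger is pods that cannot be scheduled due to insufficient CPU/memory rather than a raw utilization threshold.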