How to autoscale Servers in ECS? - amazon-web-services

I recently started using ECS. I was able to deploy a container image to ECR and create a task definition for my container with CPU/memory limits. My use case is that each container will be a long-running app (no web server, no port mapping needed). The containers will be spawned on demand one at a time and deleted on demand one at a time.
I am able to create a cluster with N server instances. But I'd like the server instances to scale up and down automatically. For example, if there isn't enough CPU/memory in the cluster, I'd like a new instance to be created.
And if there is an instance with no containers running on it, I'd like that specific instance to be scaled down/deleted. This is to avoid scale-down termination of a server instance that has running tasks on it.
What steps are needed to achieve this?

Considering that you already have an ECS cluster created, AWS provides instructions on scaling cluster instances with CloudWatch alarms.
Assuming that you want to scale the cluster based on memory reservation, at a high level you would need to do the following:
Create a Launch Configuration for your Auto Scaling Group. This defines how each new cluster instance is configured (AMI, instance type, user data).
Create an Auto Scaling Group, so that the size of the cluster can be scaled up and down.
Create a CloudWatch alarm to scale the cluster up if the memory reservation is over 70%.
Create a CloudWatch alarm to scale the cluster down if the memory reservation is under 30%.
Because it's more my specialty, I wrote up an example CloudFormation template that should get you started with most of this:
Parameters:
  MinInstances:
    Type: Number
  MaxInstances:
    Type: Number
  InstanceType:
    Type: String
    AllowedValues:
      - t2.nano
      - t2.micro
      - t2.small
      - t2.medium
      - t2.large
  VpcSubnetIds:
    Type: String
Mappings:
  EcsInstanceAmis:
    us-east-2:
      Ami: ami-1c002379
    us-east-1:
      Ami: ami-9eb4b1e5
    us-west-2:
      Ami: ami-1d668865
    us-west-1:
      Ami: ami-4a2c192a
    eu-west-2:
      Ami: ami-cb1101af
    eu-west-1:
      Ami: ami-8fcc32f6
    eu-central-1:
      Ami: ami-0460cb6b
    ap-northeast-1:
      Ami: ami-b743bed1
    ap-southeast-2:
      Ami: ami-c1a6bda2
    ap-southeast-1:
      Ami: ami-9d1f7efe
    ca-central-1:
      Ami: ami-b677c9d2
Resources:
  Cluster:
    Type: AWS::ECS::Cluster
  Role:
    Type: AWS::IAM::Role
    Properties:
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - sts:AssumeRole
            Principal:
              Service:
                - ec2.amazonaws.com
  InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: /
      Roles:
        - !Ref Role
  LaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !FindInMap [EcsInstanceAmis, !Ref "AWS::Region", Ami]
      InstanceType: !Ref InstanceType
      IamInstanceProfile: !Ref InstanceProfile
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          # Register the instance with the ECS cluster on boot
          echo ECS_CLUSTER=${Cluster} >> /etc/ecs/ecs.config
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: !Ref MinInstances
      MaxSize: !Ref MaxInstances
      LaunchConfigurationName: !Ref LaunchConfiguration
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2
      VPCZoneIdentifier: !Split [",", !Ref VpcSubnetIds]
  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName: !Ref AutoScalingGroup
      Cooldown: '1'
      ScalingAdjustment: '1'
  MemoryReservationAlarmHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      EvaluationPeriods: '2'
      Statistic: Average
      Threshold: '70'
      AlarmDescription: Alarm if cluster memory reservation is too high
      Period: '60'
      AlarmActions:
        - Ref: ScaleUpPolicy
      Namespace: AWS/ECS
      Dimensions:
        - Name: ClusterName
          Value: !Ref Cluster
      ComparisonOperator: GreaterThanThreshold
      MetricName: MemoryReservation
  ScaleDownPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName: !Ref AutoScalingGroup
      Cooldown: '1'
      ScalingAdjustment: '-1'
  MemoryReservationAlarmLow:
    Type: AWS::CloudWatch::Alarm
    Properties:
      EvaluationPeriods: '2'
      Statistic: Average
      Threshold: '30'
      AlarmDescription: Alarm if cluster memory reservation is too low
      Period: '60'
      AlarmActions:
        - Ref: ScaleDownPolicy
      Namespace: AWS/ECS
      Dimensions:
        - Name: ClusterName
          Value: !Ref Cluster
      ComparisonOperator: LessThanThreshold
      MetricName: MemoryReservation
This creates an ECS cluster, a Launch Configuration, and an Auto Scaling Group, as well as the alarms based on the ECS MemoryReservation metric.
Now we can get to the interesting discussions.
Why can't we scale based on both CPU utilization and memory reservation?
The short answer is that you totally can, but you're likely to pay a lot for it. EC2 has a known property that when you create an instance, you pay for a minimum of one hour, because partial instance hours are charged as full hours. Why that's relevant: imagine you have multiple alarms. Say you have a bunch of services that are currently running idle, and you fill the cluster. Either the CPU alarm scales down the cluster, or the memory alarm scales up the cluster. One of these will likely scale the cluster to the point that its alarm is no longer triggered. After the cooldown period, the other alarm will undo its last action, and after the next cooldown, the action will likely be redone. Thus instances are created and then destroyed repeatedly, on every other cooldown.
After giving this a bunch of thought, the strategy I came up with was to use Application Auto Scaling for the ECS services based on CPU utilization, and cluster scaling based on memory reservation. So if one service is running hot, an extra task will be added to share the load. This slowly fills the cluster's memory reservation capacity. When the memory gets full, the cluster scales up. When a service is cooling down, it starts shutting down tasks. As the memory reservation on the cluster drops, the cluster is scaled down.
The thresholds for the CloudWatch alarms will likely need some experimentation, based on your task definitions. The reason is that if you set the scale-up threshold too high, the cluster may not scale up as memory gets consumed, and then when Application Auto Scaling goes to place another task, it will find that there isn't enough memory available on any instance in the cluster and be unable to place it.
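The service-level half of that strategy can also be sketched with Application Auto Scaling's target-tracking policy type, which manages the CloudWatch alarms for you instead of you wiring them up by hand. This is only a sketch: `Cluster`, `Service`, and `AutoScalingRole` are assumed to exist elsewhere in the template, and the 60% target is a placeholder.

```yaml
# Sketch only: Cluster, Service and AutoScalingRole are assumed resources.
ServiceScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MinCapacity: 1
    MaxCapacity: 10
    ResourceId: !Sub service/${Cluster}/${Service.Name}
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs
    RoleARN: !GetAtt AutoScalingRole.Arn
ServiceCpuScaling:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: service-cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ServiceScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 60.0   # assumed target; tune per workload
```

Target tracking adds and removes tasks to hold average service CPU near the target, which feeds the cluster-level memory reservation alarms exactly as described above.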

As part of the re:Invent 2019 conference, AWS announced cluster auto scaling for Amazon ECS. Clusters configured with auto scaling can now add more capacity when needed and remove capacity that is no longer necessary. You can find more information about this in the documentation.
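In CloudFormation terms, the new model replaces the hand-rolled CloudWatch alarm plumbing with a capacity provider attached to the cluster. A minimal sketch, assuming an `AutoScalingGroup` resource already exists in the template:

```yaml
# Sketch only: AutoScalingGroup is assumed to be defined elsewhere.
EcsCapacityProvider:
  Type: AWS::ECS::CapacityProvider
  Properties:
    AutoScalingGroupProvider:
      AutoScalingGroupArn: !Ref AutoScalingGroup
      ManagedScaling:
        Status: ENABLED
        TargetCapacity: 100   # keep the ASG sized to exactly fit the running tasks
Cluster:
  Type: AWS::ECS::Cluster
  Properties:
    CapacityProviders:
      - !Ref EcsCapacityProvider
```

With managed scaling enabled, ECS adjusts the Auto Scaling Group's desired count itself, based on the CapacityProviderReservation metric, rather than you reacting to memory reservation alarms.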
However, depending on what you're trying to run, AWS Fargate could be a better option. Fargate allows you to run containers without provisioning and managing the underlying infrastructure; i.e., you don't have to deal with any EC2 instances. With Fargate, you can make an API call to run your container, the container can run, and then there's nothing to clean up once the container stops running. Fargate is billed per-second (with a 1-minute minimum) and is priced based on the amount of CPU and memory allocated (see here for details).
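For the "long-running container, spawned on demand, no port mapping" use case, a Fargate-compatible task definition is small. This is a sketch only: the resource names, execution role, and image URI are placeholders.

```yaml
# Hypothetical sketch: WorkerExecutionRole and the image URI are placeholders.
WorkerTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    RequiresCompatibilities:
      - FARGATE
    NetworkMode: awsvpc        # required for Fargate tasks
    Cpu: '256'                 # 0.25 vCPU
    Memory: '512'              # 512 MB
    ExecutionRoleArn: !GetAtt WorkerExecutionRole.Arn
    ContainerDefinitions:
      - Name: worker
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest
```

Each on-demand run is then a single `run-task` call with `--launch-type FARGATE`; when the container exits, there is no instance left to clean up.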

Related

ECS EC2 Autoscaling gets stuck

I'm trying to learn autoscaling for ECS with the EC2 launch type.
Without the autoscaling part, everything works well.
When I add the autoscaling part (the Scalable Target, plus the Alarm and Policy for both scaling in and out), the service gets stuck with the event:
service ecs-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance XXX has insufficient CPU units available.
If I look at the service, the desired count is stuck at 4, pending is 0, and running is 1.
As for the alarms, the high CPU usage alarm is OK and the low CPU usage alarm is In alarm.
The Task Definition has 1024 units assigned to CPU and 1024 MB to Memory.
The Container has 1024 units assigned to CPU and 1024 MB to Memory.
And I have been waiting for more than 40 minutes.
What would I expect?
I'm setting a low threshold for high CPU usage (20%) so the alarm triggers easily.
Then the desired count should increase up to 4 as the used CPU percentage rises.
This should work both ways, when adding and when removing: it should scale up to 4 when the high alarm fires and back down to 1 when the low alarm fires.
Here's the entire chain of events, with task IDs, dates, and event IDs removed to simplify reading.
service ecs-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance XXX has insufficient CPU units available. For more
service ecs-service registered 1 targets in target-group ecs-target
service ecs-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance XXX has insufficient CPU units available. For more information, see the Troubleshooting section.
Message: Successfully set desired count to 4. Waiting for change to be fulfilled by ecs. Cause: monitor alarm high-cpu-usage in state ALARM triggered policy ecs-high-policy
service ecs-service has started 1 tasks: task
service ecs-service has stopped 1 running tasks: task
service ecs-service deregistered 1 targets in target-group ecs-target
service ecs-service (instance XXX) (port 8080) is unhealthy in target-group ecs-target due to (reason Health checks failed)
service ecs-service has started 1 tasks: task
service ecs-service was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster. For more information, see the Troubleshooting section.
Message: Successfully set desired count to 4. Found it was later changed to 0. Cause: monitor alarm high-cpu-usage in state ALARM triggered policy ecs-high-policy
Message: Successfully set desired count to 4. Found it was later changed to 0. Cause: monitor alarm high-cpu-usage in state ALARM triggered policy ecs-high-policy
Message: Successfully set desired count to 3. Change successfully fulfilled by ecs. Cause: monitor alarm high-cpu-usage in state ALARM triggered policy ecs-high-policy
Message: Successfully set desired count to 2. Change successfully fulfilled by ecs. Cause: monitor alarm high-cpu-usage in state ALARM triggered policy ecs-high-policy
This is my Scalable Target, Alarms and Policies:
The service uses a Load Balancer.
ServiceScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  DependsOn: Service
  Properties:
    MaxCapacity: !Ref MaxSize
    MinCapacity: !Ref MinSize
    ResourceId:
      Fn::Join:
        - '/'
        - - 'service'
          - Ref: Cluster
          - Fn::GetAtt:
              - Service
              - 'Name'
    RoleARN:
      Fn::ImportValue: !Ref ECSAutoScalingRole
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs
HighCpuUsageAlarm:
  Type: AWS::CloudWatch::Alarm
  DependsOn: ScalingPolicyHigh
  Properties:
    AlarmName: high-cpu
    MetricName: CPUUtilization
    Namespace: AWS/ECS
    Dimensions:
      - Name: ServiceName
        Value: !Ref ServiceName
      - Name: ClusterName
        Value: !Ref Cluster
    Statistic: Average
    Period: 300
    EvaluationPeriods: 1
    Threshold: 20
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref ScalingPolicyHigh
ScalingPolicyHigh:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: policy-high
    PolicyType: StepScaling
    ScalingTargetId:
      Ref: ServiceScalableTarget
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      Cooldown: 600
      MetricAggregationType: Average
      StepAdjustments:
        - MetricIntervalLowerBound: 0
          MetricIntervalUpperBound: 15
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 15
          MetricIntervalUpperBound: 25
          ScalingAdjustment: 2
        - MetricIntervalLowerBound: 25
          ScalingAdjustment: 3
LowCpuUsageAlarm:
  Type: AWS::CloudWatch::Alarm
  DependsOn: ScalingPolicyLow
  Properties:
    AlarmName: low-cpu
    MetricName: CPUUtilization
    Namespace: AWS/ECS
    Dimensions:
      - Name: ServiceName
        Value: !Ref ServiceName
      - Name: ClusterName
        Value: !Ref Cluster
    Statistic: Average
    Period: 300
    EvaluationPeriods: 2
    Threshold: 15
    ComparisonOperator: LessThanOrEqualToThreshold
    AlarmActions:
      - !Ref ScalingPolicyLow
ScalingPolicyLow:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: policy-low
    PolicyType: StepScaling
    ScalingTargetId:
      Ref: ServiceScalableTarget
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      Cooldown: 600
      MetricAggregationType: Average
      StepAdjustments:
        - MetricIntervalLowerBound: -15
          MetricIntervalUpperBound: 0
          ScalingAdjustment: -1
        - MetricIntervalLowerBound: -25
          MetricIntervalUpperBound: -15
          ScalingAdjustment: -2
        - MetricIntervalUpperBound: -25
          ScalingAdjustment: -3
I'd appreciate help. I cannot make it work properly.

CloudWatch Alarm in EC2 template

I am setting up an AWS EC2 template based on a custom image for launching instances for a certain purpose. These instances then also need CloudWatch alarms monitoring their activity and perform some action based on them (e.g. stop instance if inactive for 30 min.).
Is there any way I can include such alarms into the EC2 template? I would like to avoid having to manually add the alarms to the instance after creation. I couldn't find this as an option anywhere in the template creation dialogue.
From the management console: there is no straightforward option.
Using EC2 tags, Lambda, and other services: might be possible (check the link).
CloudFormation: you can write a CloudFormation template to create the EC2 instance and add an alarm to it, and keep enhancing it from there.
This option makes things easier once the template is created, as you will not need to click through various UI options whenever you launch a new EC2 instance and add an alarm.
This template asks for the instance type, creates an alarm for the EC2 instance, and publishes to an SNS topic.
Verify the AMI and AZ if you are logged into a different region.
Parameters:
  InstanceType:
    Description: EC2 instance type
    Type: String
    Default: t2.small
    AllowedValues:
      - t1.micro
      - t2.nano
      - t2.micro
      - t2.small
    ConstraintDescription: It must be a valid EC2 instance type.
Resources:
  MyInstance1:
    Type: AWS::EC2::Instance
    Properties:
      AvailabilityZone: us-east-1a
      ImageId: ami-05912b6333beaa478
      InstanceType: !Ref InstanceType
      KeyName: KP-EC2-Lambda
      SecurityGroups:
        - launch-wizard-2
  CPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: CPU alarm for my instance
      AlarmActions:
        - Ref: "MyTopic1"
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '60'
      EvaluationPeriods: '3'
      Threshold: '90'
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: InstanceId
          Value:
            Ref: "MyInstance1"
  MyTopic1:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: MyTopic1
      Subscription:
        - Endpoint: "xyz@xyz.com"
          Protocol: "email"
      TopicName: MyTopic1
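For the original goal of stopping an instance that has been inactive for 30 minutes, the alarm can also act on the instance directly instead of (or in addition to) notifying a topic, using the built-in EC2 stop alarm action. A sketch; the 5% threshold is an assumption, so tune it to whatever "inactive" means for your workload:

```yaml
# Sketch: stop MyInstance1 after 30 minutes of low CPU.
StopOnIdleAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Stop the instance after 30 minutes of low CPU
    Namespace: AWS/EC2
    MetricName: CPUUtilization
    Statistic: Average
    Period: 300
    EvaluationPeriods: 6           # 6 x 5 min = 30 min
    Threshold: 5                   # assumed idle threshold (%)
    ComparisonOperator: LessThanThreshold
    AlarmActions:
      - !Sub arn:aws:automate:${AWS::Region}:ec2:stop
    Dimensions:
      - Name: InstanceId
        Value: !Ref MyInstance1
```

The `arn:aws:automate:<region>:ec2:stop` action is CloudWatch's built-in EC2 action, so no Lambda or SNS hop is needed for a simple stop.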

AWS Capacity Provider for ECS cluster not triggering scale-in event

I have added a Capacity Provider to an ECS cluster. While scale-out events work as expected due to changes in CapacityProviderReservation metric, scale-in events do not work.
In my case, the TargetCapacity property is set to 90, but looking at CloudWatch the average for the CapacityProviderReservation metric currently sits at 50%. This has been the case for the last 16 hours.
According to AWS's own documentation, scale-in events occur -
When using dynamic scaling policies and the size of the group decreases as a result of changes in a metric's value
So it seems like the Capacity Provider is not changing the desired size of the ASG as expected.
Am I missing something here, or do capacity providers tied to ASGs simply not work both ways?
ASG and Capacity Provider resources in CloudFormation
Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: !Sub ${ResourceNamePrefix}-asg
      VPCZoneIdentifier:
        - !Ref PrivateSubnetAId
      LaunchTemplate:
        LaunchTemplateId: !Ref Ec2LaunchTemplate
        Version: !GetAtt Ec2LaunchTemplate.LatestVersionNumber
      MinSize: 0
      MaxSize: 3
      DesiredCapacity: 1
  EcsCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      Name: !Sub ${ResourceNamePrefix}-ecs-capacity-provider
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 90
        ManagedTerminationProtection: DISABLED
Dynamic scaling policy for ASG
Current status of the CapacityProviderReservation metric
The CapacityProviderReservation metric has been at 50% for well over 12 hours.
Current status of the Capacity Provider
As you can see, the desired size is still 2, while it is expected that this should have dropped back to 1.
Update
After deleting and recreating the cluster, I notice that the Capacity Provider changes the DesiredCapacity to 2 instantly, even though there are no tasks running.

CloudFormation to automate creation of a new EC2 instance & volume

I need to do the following actions in sequence and wondering if I should use CloudFormation to achieve this:
Launch a new EC2 instance (currently I'm doing this manually by selecting "Launch more like these" on a specific instance).
Stop the new instance.
Detach the volume from the new instance.
Create a new volume from a previously created snapshot.
Attach that newly created volume to the new EC2 instance created in step 1.
Restart the EC2 instance.
If this can't be done via CloudFormation would it be possible to automate it somehow?
It sounds like you want to launch an Amazon EC2 instance whose boot disk comes from an Amazon EBS snapshot.
Might I suggest a simpler process?
Rather than creating a Snapshot of the Amazon EBS volume, instead create an Amazon Machine Image (AMI) of the original instance. Then, when launching the new Amazon EC2 instance, simply select the AMI. This will result in a new instance starting up with the desired boot disk.
Alternatively, you can create an AMI from an existing Amazon EBS Snapshot by selecting the Snapshot and choosing the Create Image command. (But I think this only works for Linux, not Windows.) Then, launch new EC2 instances from the AMI.
Behind-the-scenes, an AMI is actually just an Amazon EBS Snapshot with some additional information.
Take John's advice and use an AMI. This sample will get you started: it launches a single EC2 instance from an AMI (the latest patched one) in an Auto Scaling Group of min 1/max 1, so one EC2 instance will always be running regardless of a power failure, an AZ going down, etc.
Replace XYZ with your product's name:
Parameters:
  KeyPairName:
    Description: >-
      Mandatory. Enter a Public/private key pair. If you do not have one in this region,
      please create it before continuing
    Type: 'AWS::EC2::KeyPair::KeyName'
  EnvType:
    Description: Environment Name
    Default: dev
    Type: String
    AllowedValues: [dev, test, prod]
  Subnet1ID:
    Description: 'ID of the subnet 1 for auto scaling group'
    Type: 'AWS::EC2::Subnet::Id'
  Subnet2ID:
    Description: 'ID of the subnet 2 for auto scaling group'
    Type: 'AWS::EC2::Subnet::Id'
  Subnet3ID:
    Description: 'ID of the subnet 3 for auto scaling group'
    Type: 'AWS::EC2::Subnet::Id'
Resources:
  XYZMainLogGroup:
    Type: 'AWS::Logs::LogGroup'
  SSHMetricFilter:
    Type: 'AWS::Logs::MetricFilter'
    Properties:
      LogGroupName: !Ref XYZMainLogGroup
      FilterPattern: ON FROM USER PWD
      MetricTransformations:
        - MetricName: SSHCommandCount
          MetricValue: 1
          MetricNamespace: !Join
            - /
            - - AWSQuickStart
              - !Ref 'AWS::StackName'
  XYZAutoScalingGroup:
    Type: 'AWS::AutoScaling::AutoScalingGroup'
    Properties:
      LaunchConfigurationName: !Ref XYZLaunchConfiguration
      AutoScalingGroupName: !Join
        - '.'
        - - !Ref 'AWS::StackName'
          - 'ASG'
      VPCZoneIdentifier:
        - !Ref Subnet1ID
        - !Ref Subnet2ID
        - !Ref Subnet3ID
      MinSize: 1
      MaxSize: 1
      Cooldown: '300'
      DesiredCapacity: 1
      Tags:
        - Key: Name
          Value: 'The Name'
          PropagateAtLaunch: 'true'
  XYZLaunchConfiguration:
    Type: 'AWS::AutoScaling::LaunchConfiguration'
    Properties:
      AssociatePublicIpAddress: 'false'
      PlacementTenancy: default
      KeyName: !Ref KeyPairName
      ImageId: ami-123432164a1b23da1
      IamInstanceProfile: "BaseInstanceProfile"
      InstanceType: t2.small
      SecurityGroups:
        - Fn::If: [CreateDevResources, !Ref DevSecurityGroup, !Ref "AWS::NoValue"]
Yes, you can automate all these tasks using SSM Automation.
Specifically, your SSM Automation can consist of the following documents/actions:
AWS-AttachEBSVolume
AWS-DetachEBSVolume
AWS-StopEC2Instance
AWS-StartEC2Instance
AWS-RestartEC2Instance
Your SSM Automation can be triggered by CloudWatch Events. Also the SSM Automation can be constructed using CloudFormation.
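The actions above can be chained into a single Automation document defined in CloudFormation. This is a sketch only: the Availability Zone and device name are placeholders, and the document assumes the instance is already stopped-and-started safely by the surrounding steps.

```yaml
# Sketch: an SSM Automation document that stops an instance, creates a
# volume from a snapshot, attaches it, and starts the instance again.
VolumeSwapAutomation:
  Type: AWS::SSM::Document
  Properties:
    DocumentType: Automation
    Content:
      schemaVersion: '0.3'
      parameters:
        InstanceId:
          type: String
        SnapshotId:
          type: String
      mainSteps:
        - name: StopInstance
          action: aws:changeInstanceState
          inputs:
            InstanceIds:
              - '{{ InstanceId }}'
            DesiredState: stopped
        - name: CreateVolume
          action: aws:executeAwsApi
          inputs:
            Service: ec2
            Api: CreateVolume
            SnapshotId: '{{ SnapshotId }}'
            AvailabilityZone: us-east-1a   # placeholder; must match the instance's AZ
          outputs:
            - Name: VolumeId
              Selector: $.VolumeId
              Type: String
        - name: AttachVolume
          action: aws:executeAwsApi
          inputs:
            Service: ec2
            Api: AttachVolume
            VolumeId: '{{ CreateVolume.VolumeId }}'
            InstanceId: '{{ InstanceId }}'
            Device: /dev/xvdb              # placeholder device name
        - name: StartInstance
          action: aws:changeInstanceState
          inputs:
            InstanceIds:
              - '{{ InstanceId }}'
            DesiredState: running
```

A custom document like this replaces the sequence of separate AWS-* documents with one runbook that passes the new volume ID between steps.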

Cloudformation - How to reference the instance-id of an EC2 instance if the instance has been created using an Auto Scaling Group and a Launch Config

I am trying to create an EBS Volume and attach it to my EC2 instance. The instance has its own Auto Scaling Group and Launch Configuration. I want it such that if this instance becomes unhealthy and terminates, the EBS volume should automatically get attached to the new instance that is spun up by the Auto Scaling Group. The mount commands are in the Launch Configuration so that's not a problem.
Here is my code:
Influxdbdata1Asg:
  Type: 'AWS::AutoScaling::AutoScalingGroup'
  Properties:
    TargetGroupARNs:
      - !Ref xxxx
    VPCZoneIdentifier:
      - !GetAtt 'NetworkInfo.PrivateSubnet1Id'
    LaunchConfigurationName: !Ref yyyy
    MinSize: 1
    MaxSize: 1
    DesiredCapacity: 1
Data1:
  Type: AWS::EC2::Volume
  DeletionPolicy: Retain
  Properties:
    Size: !Ref 'DataEbsVolumeSize'
    AvailabilityZone: !GetAtt 'NetworkInfo.PrivateSubnet1Id'
    Tags:
      - Key: Name
        Value: !Join
          - '-'
          - - !Ref 'AWS::StackName'
            - data1
Attachdata1:
  Type: AWS::EC2::VolumeAttachment
  Properties:
    InstanceId: !Ref ????
    VolumeId: !Ref Data1
    Device: /dev/xvdb
Unfortunately you can't do this using:
Attachdata1:
  Type: AWS::EC2::VolumeAttachment
  Properties:
    InstanceId: !Ref ????
    VolumeId: !Ref Data1
    Device: /dev/xvdb
The reason is that the instances are launched by the ASG, so you will not know their instance IDs in advance.
Attachment must be done outside of CloudFormation, as you can't know upfront what the instance ID will be. As the other answer mentions, you can use Lifecycle Hooks for this.
Or, even better, use storage that is independent of the ASG, such as EFS, which persists across instance launches and terminations and can be mounted by multiple instances.
For this problem you would specifically want to make use of Lifecycle Hooks which trigger whenever an instance terminates or is launched.
To do this your lifecycle hook would notify your SNS notification, which would then invoke a Lambda function. This Lambda function would perform the change, before acknowledging the lifecycle action is complete.
There is a blog post written about this here.
Your question mentions CloudFormation; however, this would still involve lifecycle hooks to trigger the action. You would need a CloudFormation stack with an AWS::EC2::VolumeAttachment resource, and the Lambda would need to update the InstanceId property in the stack to perform the change.
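A sketch of the launch-side hook in CloudFormation, assuming an SNS topic (`AttachVolumeTopic`) that invokes the Lambda and a role (`LifecycleHookRole`) allowing Auto Scaling to publish to it (both names are placeholders):

```yaml
# Sketch only: AttachVolumeTopic and LifecycleHookRole are assumed resources.
LaunchLifecycleHook:
  Type: AWS::AutoScaling::LifecycleHook
  Properties:
    AutoScalingGroupName: !Ref Influxdbdata1Asg
    LifecycleTransition: autoscaling:EC2_INSTANCE_LAUNCHING
    NotificationTargetARN: !Ref AttachVolumeTopic
    RoleARN: !GetAtt LifecycleHookRole.Arn
    HeartbeatTimeout: 300      # seconds the Lambda has to attach the volume
    DefaultResult: ABANDON     # fail the launch if the volume is never attached
```

The Lambda receives the instance ID in the hook notification, calls AttachVolume, then completes the lifecycle action so the instance proceeds to InService.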