Auto Scale Fargate Service Based On SQS ApproximateNumberOfMessagesVisible

I would like to scale out my aws fargate containers based on the size of the SQS queue. It appears that I can only scale based on the container's CPU or Memory usage. Is there a way to create a policy that would scale out and in based on queue size? Has anyone been able to scale based on other cloudwatch metrics?

Yes, you can do this. You have to use a step scaling policy, and you need to have an alarm already created for your SQS queue depth (ApproximateNumberOfMessagesVisible).
Go to CloudWatch and create a new alarm. We'll call this alarm sqs-queue-depth-high, and have it trigger when the approximate number of messages visible reaches 1000.
With that done, go to ECS, to the service you want to autoscale, and click Update for the service. Add a scaling policy and choose the step scaling type. You'll see there's an option to create a new alarm (which only lets you choose between CPU and memory utilization) or to use an existing alarm.
Type sqs-queue-depth-high in the "Use existing alarm" field and press Enter; you should see a green checkmark that lets you know the name is valid (i.e. the alarm exists). New dropdowns will appear where you can adjust the step policy.
This works for any metric alarm and ECS service. If you're going to scale this setup out (across multiple environments, for example), or make it any more sophisticated than two steps, do yourself a favor and manage it with CloudFormation or Terraform. Nothing is worse than having to adjust a 5-step alarm across 10 services.
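If you do reach for infrastructure-as-code, here is a minimal sketch of the same step scaling idea in CDK (TypeScript). The construct names, capacities, and thresholds are illustrative assumptions, not part of the original answer:

import * as appscaling from 'aws-cdk-lib/aws-applicationautoscaling';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as sqs from 'aws-cdk-lib/aws-sqs';

declare const service: ecs.FargateService; // your existing Fargate service
declare const queue: sqs.Queue;            // your existing queue

const scaling = service.autoScaleTaskCount({ minCapacity: 1, maxCapacity: 10 });

// Step scaling on queue depth: scale in when the queue is nearly empty,
// add tasks as ApproximateNumberOfMessagesVisible climbs.
scaling.scaleOnMetric('QueueDepthStepScaling', {
  metric: queue.metricApproximateNumberOfMessagesVisible(),
  adjustmentType: appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
  scalingSteps: [
    { upper: 100, change: -1 },  // queue almost drained: remove a task
    { lower: 1000, change: +1 }, // matches the sqs-queue-depth-high alarm
    { lower: 5000, change: +3 }, // queue growing fast: add more tasks
  ],
});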

AWS provides a solution for scaling based on an SQS queue: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html
Main idea
Create a CloudWatch custom metric sqs-backlog-per-task using the formula:
sqs-backlog-per-task = sqs-messages-number / running-task-number.
Create a Target Tracking Scaling Policy based on the sqs-backlog-per-task metric. For example, with 6000 visible messages and 3 running tasks, the backlog is 2000 per task.
Implementation details
Custom Metric
In my case all the infrastructure (Fargate, SQS, and other resources) is described in a CloudFormation stack. So for calculating and logging the custom metric I decided to use an AWS Lambda function, which is also described in the CloudFormation stack and deployed together with the entire infrastructure.
Below you can find code snippets for the AWS Lambda function for logging the following custom metrics:
sqs-backlog-per-task - used for scaling
running-task-number - used for scaling optimization and debugging
The AWS Lambda function, described in AWS SAM syntax in the CloudFormation stack (infrastructure.yml):
CustomMetricLoggerFunction:
  Type: AWS::Serverless::Function
  Properties:
    FunctionName: custom-metric-logger
    Handler: custom-metric-logger.handler
    Runtime: nodejs8.10
    MemorySize: 128
    Timeout: 3
    Role: !GetAtt CustomMetricLoggerFunctionRole.Arn
    Environment:
      Variables:
        ECS_CLUSTER_NAME: !Ref Cluster
        ECS_SERVICE_NAME: !GetAtt Service.Name
        SQS_URL: !Ref Queue
    Events:
      Schedule:
        Type: Schedule
        Properties:
          Schedule: 'cron(0/1 * * * ? *)' # every minute
The AWS Lambda JavaScript code for calculating and logging the metrics (custom-metric-logger.js):
var AWS = require('aws-sdk');

exports.handler = async () => {
  try {
    var sqsMessagesNumber = await getSqsMessagesNumber();
    var runningContainersNumber = await getRunningContainersNumber();
    var backlogPerInstance = sqsMessagesNumber;
    if (runningContainersNumber > 0) {
      backlogPerInstance = Math.floor(sqsMessagesNumber / runningContainersNumber);
    }
    await putRunningTaskNumberMetricData(runningContainersNumber);
    await putSqsBacklogPerTaskMetricData(backlogPerInstance);
    return {
      statusCode: 200
    };
  } catch (err) {
    console.log(err);
    return {
      statusCode: 500
    };
  }
};

function getSqsMessagesNumber() {
  return new Promise((resolve, reject) => {
    var params = {
      QueueUrl: process.env.SQS_URL,
      AttributeNames: ['ApproximateNumberOfMessages']
    };
    var sqs = new AWS.SQS();
    sqs.getQueueAttributes(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(parseInt(data.Attributes.ApproximateNumberOfMessages, 10));
      }
    });
  });
}

function getRunningContainersNumber() {
  return new Promise((resolve, reject) => {
    var params = {
      services: [
        process.env.ECS_SERVICE_NAME
      ],
      cluster: process.env.ECS_CLUSTER_NAME
    };
    var ecs = new AWS.ECS();
    ecs.describeServices(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data.services[0].runningCount);
      }
    });
  });
}

function putRunningTaskNumberMetricData(value) {
  return new Promise((resolve, reject) => {
    var params = {
      MetricData: [{
        MetricName: 'running-task-number',
        Value: value,
        Unit: 'Count',
        Timestamp: new Date()
      }],
      Namespace: 'fargate-sqs-service'
    };
    var cloudwatch = new AWS.CloudWatch();
    cloudwatch.putMetricData(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
}

function putSqsBacklogPerTaskMetricData(value) {
  return new Promise((resolve, reject) => {
    var params = {
      MetricData: [{
        MetricName: 'sqs-backlog-per-task',
        Value: value,
        Unit: 'Count',
        Timestamp: new Date()
      }],
      Namespace: 'fargate-sqs-service'
    };
    var cloudwatch = new AWS.CloudWatch();
    cloudwatch.putMetricData(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
}
Target Tracking Scaling Policy
Then, based on the sqs-backlog-per-task metric, I created a Target Tracking Scaling Policy in my CloudFormation template.
Target Tracking Scaling Policy based on the sqs-backlog-per-task metric (infrastructure.yml):
ServiceScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: service-scaling-policy
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ServiceScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      ScaleInCooldown: 60
      ScaleOutCooldown: 60
      CustomizedMetricSpecification:
        Namespace: fargate-sqs-service
        MetricName: sqs-backlog-per-task
        Statistic: Average
        Unit: Count
      TargetValue: 2000
As a result, AWS Application Auto Scaling creates and manages the CloudWatch alarms that trigger the scaling policy and calculates the scaling adjustment based on the metric and the target value. The scaling policy adds or removes capacity as required to keep the metric at, or close to, the specified target value. In addition to keeping the metric close to the target value, a target tracking scaling policy also adjusts to changes in the metric due to a changing load pattern.
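For reference, an equivalent target tracking policy can be sketched in CDK (TypeScript). This is a minimal illustration assuming an existing Fargate service; the construct names and capacities are assumptions, not part of the original stack:

import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as ecs from 'aws-cdk-lib/aws-ecs';

declare const service: ecs.FargateService; // your existing Fargate service

const scaling = service.autoScaleTaskCount({ minCapacity: 1, maxCapacity: 10 });

// Track the custom metric published by the Lambda function above.
scaling.scaleToTrackCustomMetric('SqsBacklogTracking', {
  metric: new cloudwatch.Metric({
    namespace: 'fargate-sqs-service',
    metricName: 'sqs-backlog-per-task',
    statistic: 'Average',
  }),
  targetValue: 2000,
  scaleInCooldown: Duration.seconds(60),
  scaleOutCooldown: Duration.seconds(60),
});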

Update for 2021 (and possibly earlier)
For those who need the same thing in CDK:
An example use case:
// Create the vpc and cluster used by the queue processing service
const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 2 });
const cluster = new ecs.Cluster(stack, 'FargateCluster', { vpc });
const queue = new sqs.Queue(stack, 'ProcessingQueue', {
  queueName: 'FargateEventQueue'
});

// Create the queue processing service
new QueueProcessingFargateService(stack, 'QueueProcessingFargateService', {
  cluster,
  image: ecs.ContainerImage.fromRegistry('amazon/amazon-ecs-sample'),
  desiredTaskCount: 2,
  maxScalingCapacity: 5,
  queue
});
From: https://github.com/aws/aws-cdk/blob/master/design/aws-ecs/aws-ecs-autoscaling-queue-worker.md

I wrote a blog article about exactly this topic, including a Docker container to run it.
The article can be found at:
https://allaboutaws.com/how-to-auto-scale-aws-ecs-containers-sqs-queue-metrics
The prebuilt container is available on Docker Hub:
https://hub.docker.com/r/sh39sxn/ecs-autoscaling-sqs-metrics
The files are available at GitHub:
https://github.com/sh39sxn/ecs-autoscaling-sqs-metrics
I hope it helps you.

Related

ECS task unable to pull secrets or registry auth

I have a CDK project that creates a CodePipeline which deploys an application on ECS. I had it all previously working, but the VPC was using a NAT gateway, which ended up being too expensive. So now I am trying to recreate the project without requiring a NAT gateway. I am almost there, but I have now run into issues when the ECS service is trying to start tasks. All tasks fail to start with the following error:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 5 time(s): failed to fetch secret
At this point I've kind of lost track of the different things I have tried, but I will post the relevant bits here as well as some of my attempts.
const repository = ECR.Repository.fromRepositoryAttributes(
  this,
  "ecr-repository",
  {
    repositoryArn: props.repository.arn,
    repositoryName: props.repository.name,
  }
);

// vpc
const vpc = new EC2.Vpc(this, this.resourceName(props, "vpc"), {
  maxAzs: 2,
  natGateways: 0,
  enableDnsSupport: true,
});
const vpcSecurityGroup = new SecurityGroup(this, "vpc-security-group", {
  vpc: vpc,
  allowAllOutbound: true,
});

// tried this to allow the task to access secrets manager
const vpcEndpoint = new EC2.InterfaceVpcEndpoint(this, "secrets-manager-task-vpc-endpoint", {
  vpc: vpc,
  service: EC2.InterfaceVpcEndpointAwsService.SSM,
});

const secrets = SecretsManager.Secret.fromSecretCompleteArn(
  this,
  "secrets",
  props.secrets.arn
);

const cluster = new ECS.Cluster(this, this.resourceName(props, "cluster"), {
  vpc: vpc,
  clusterName: `api-cluster`,
});

const ecsService = new EcsPatterns.ApplicationLoadBalancedFargateService(
  this,
  "ecs-service",
  {
    taskSubnets: {
      subnetType: SubnetType.PUBLIC,
    },
    securityGroups: [vpcSecurityGroup],
    serviceName: "api-service",
    cluster: cluster,
    cpu: 256,
    desiredCount: props.scaling.desiredCount,
    taskImageOptions: {
      image: ECS.ContainerImage.fromEcrRepository(
        repository,
        this.ecrTagNameParameter.stringValue
      ),
      secrets: getApplicationSecrets(secrets), // returns
      logDriver: LogDriver.awsLogs({
        streamPrefix: "api",
        logGroup: new LogGroup(this, "ecs-task-log-group", {
          logGroupName: `${props.environment}-api`,
        }),
        logRetention: RetentionDays.TWO_MONTHS,
      }),
    },
    memoryLimitMiB: 512,
    publicLoadBalancer: true,
    domainZone: this.hostedZone,
    certificate: this.certificate,
    redirectHTTP: true,
  }
);

const scalableTarget = ecsService.service.autoScaleTaskCount({
  minCapacity: props.scaling.desiredCount,
  maxCapacity: props.scaling.maxCount,
});
scalableTarget.scaleOnCpuUtilization("cpu-scaling", {
  targetUtilizationPercent: props.scaling.cpuPercentage,
});
scalableTarget.scaleOnMemoryUtilization("memory-scaling", {
  targetUtilizationPercent: props.scaling.memoryPercentage,
});

secrets.grantRead(ecsService.taskDefinition.taskRole);
repository.grantPull(ecsService.taskDefinition.taskRole);
I read somewhere that it probably has something to do with Fargate platform version 1.4.0 vs 1.3.0, but I'm not sure what I need to change to allow the tasks to access what they need to run.
You need to create interface endpoints for Secrets Manager, ECR (two types of endpoints), and CloudWatch Logs, as well as a gateway endpoint for S3.
Refer to the documentation on the topic.
Here's an example in Python; it'd work the same in TS:
vpc.add_interface_endpoint(
    "secretsmanager_endpoint",
    service=ec2.InterfaceVpcEndpointAwsService.SECRETS_MANAGER,
)
vpc.add_interface_endpoint(
    "ecr_docker_endpoint",
    service=ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER,
)
vpc.add_interface_endpoint(
    "ecr_endpoint",
    service=ec2.InterfaceVpcEndpointAwsService.ECR,
)
vpc.add_interface_endpoint(
    "cloudwatch_logs_endpoint",
    service=ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_LOGS,
)
vpc.add_gateway_endpoint(
    "s3_endpoint",
    service=ec2.GatewayVpcEndpointAwsService.S3,
)
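For a TypeScript CDK version of the same endpoints (a sketch assuming vpc is an ec2.Vpc; the endpoint IDs are arbitrary):

import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const vpc: ec2.Vpc; // the VPC from your stack

// Interface endpoints for the services the tasks need to reach privately
vpc.addInterfaceEndpoint('SecretsManagerEndpoint', {
  service: ec2.InterfaceVpcEndpointAwsService.SECRETS_MANAGER,
});
vpc.addInterfaceEndpoint('EcrDockerEndpoint', {
  service: ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER,
});
vpc.addInterfaceEndpoint('EcrEndpoint', {
  service: ec2.InterfaceVpcEndpointAwsService.ECR,
});
vpc.addInterfaceEndpoint('CloudWatchLogsEndpoint', {
  service: ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_LOGS,
});

// Gateway endpoint for S3 (ECR image layers are pulled from S3)
vpc.addGatewayEndpoint('S3Endpoint', {
  service: ec2.GatewayVpcEndpointAwsService.S3,
});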
Keep in mind that interface endpoints cost money as well, and may not be cheaper than a NAT.

Creating API key for Usage Plan from AWS Lambda

I would like to create a new API key from a Lambda. I have a usage plan with my API Gateway API, created with CloudFormation like:
MyApi:
  Type: AWS::Serverless::Api
  Properties:
    Auth:
      UsagePlan:
        UsagePlanName: MyUsagePlan
        CreateUsagePlan: PER_API
      ...
  ...
Using this as a reference https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/APIGateway.html
I guess the process in the lambda should be like this:
- createApiKey
- getUsagePlan
- createUsagePlanKey
In the Lambda, I have the MyApi id and I'm trying to fetch the API:
var apiGateway = new AWS.APIGateway({region: region});
const restApi = await new Promise((resolve, reject) => {
  apiGateway.getRestApi({restApiId: MYAPI_ID}, function(err, data) {
    if (err) {
      console.log('getRestApi err', err, err.stack);
      reject(err);
    } else {
      console.log('getRestApi', data);
      resolve(data);
    }
  });
});
But this times out in my Lambda.
If I try to input values manually, it times out as well:
const keyParams = {
  keyId: 'xxxxxxxx',
  keyType: 'API_KEY',
  usagePlanId: 'yyyyyyyy'
};
const apiKey = await new Promise((resolve, reject) => {
  apiGateway.createUsagePlanKey(keyParams, function (err, data) {
    if (err) {
      console.log('createUsagePlanKey err', err, err.stack);
      reject(err);
    } else {
      console.log('createUsagePlanKey', data);
      resolve(data);
    }
  });
});
Why does every call to the API time out with nothing printed to console.log? Is my approach OK, or how should I create the new API key for a user?
Edit: the Lambda timeout is 10 seconds, and the functions run in a VPC.
It sounds like you probably haven't configured your VPC to allow your Lambda function to access resources (like the AWS API) that exist outside the VPC. First, is it really necessary to run the function inside a VPC? If not, then removing it from the VPC should fix the issue.
If it is necessary to run the function in a VPC, then you will need to place your Lambda function inside a private subnet with a route to a NAT Gateway, or configure a VPC endpoint for the AWS services it needs to access.
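A minimal CDK (TypeScript) sketch of the NAT Gateway option; the construct names and runtime are illustrative assumptions:

import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// A VPC with one NAT gateway; PRIVATE_WITH_EGRESS subnets route
// outbound traffic through it.
const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2, natGateways: 1 });

// Placing the function in the private subnets lets it reach AWS APIs
// (such as the API Gateway control plane) through the NAT gateway.
const fn = new lambda.Function(this, 'ApiKeyFunction', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
});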

SQS Interface Endpoint in CDK

I'm working with AWS CDK. I had to move my Lambda that writes to SQS inside a VPC. I added an interface endpoint to allow a direct connection from the VPC to SQS with:
props.vpc.addInterfaceEndpoint('sqs-gateway', {
  service: InterfaceVpcEndpointAwsService.SQS,
  subnets: {
    subnetType: SubnetType.PRIVATE,
  },
})
The Lambda is deployed to that same VPC (to the same private subnets by default), and I pass the QUEUE_URL as an environment variable, as I did without the VPC:
const ingestLambda = new lambda.Function(this, 'TTPIngestFunction', {
  ...
  environment: {
    QUEUE_URL: queue.queueUrl,
  },
  vpc: props.vpc,
})
and the Lambda code sends messages simply with:
const sqs = new AWS.SQS({ region: process.env.AWS_REGION })
return sqs
  .sendMessageBatch({
    QueueUrl: process.env.QUEUE_URL as string,
    Entries: entries,
  })
  .promise()
Without the VPC this sending works, but now the Lambda just times out when sending the SQS messages. What am I missing here?
By default, interface VPC endpoints create a new security group, and traffic is not automatically allowed from the VPC CIDR.
You can do as follows if you want to allow traffic from your Lambda:
const sqsEndpoint = props.vpc.addInterfaceEndpoint('sqs-gateway', {
  service: InterfaceVpcEndpointAwsService.SQS,
});
sqsEndpoint.connections.allowDefaultPortFrom(ingestLambda);
Alternatively, you can allow all traffic:
sqsEndpoint.connections.allowDefaultPortFromAnyIpv4();
This default behavior is currently under discussion in https://github.com/aws/aws-cdk/pull/4938.

AWS SDK runInstances and IAM roles

The code below works when I attach the AWS managed "AdministratorAccess" policy, but that is risky and a bit of overkill. How do I find only the necessary permission(s)? It is very confusing and hard to know when I look at all the possible policies in the console.
try {
  // Load the AWS SDK for Node.js
  var AWS = require('aws-sdk');
  // Set the region
  AWS.config.update({region: 'us-east-2'});

  var instanceParams = {
    ImageId: 'ami-xxxxxxxxxxxx',
    InstanceType: 't2.micro',
    KeyName: 'xxxxxxxxxx',
    SecurityGroups: ['xxxxxxxxxxxxxxx'],
    MinCount: 1,
    MaxCount: 1
  };

  // Create a promise on an EC2 service object
  var instancePromise = new AWS.EC2({apiVersion: '2016-11-15'}).runInstances(instanceParams).promise();

  // Handle promise's fulfilled/rejected states
  instancePromise.then(
    function (data) {
      console.log(data);
      var instanceId = data.Instances[0].InstanceId;
      console.log("Created instance", instanceId);
      // Add tags to the instance
      var tagParams = {
        Resources: [instanceId],
        Tags: [
          {
            Key: 'Name',
            Value: 'SDK Sample'
          }
        ]
      };
      // Create a promise on an EC2 service object
      var tagPromise = new AWS.EC2({apiVersion: '2016-11-15'}).createTags(tagParams).promise();
      // Handle promise's fulfilled/rejected states
      tagPromise.then(
        function (data) {
          console.log("Instance tagged");
        }).catch(
        function (err) {
          console.error(err, err.stack);
        });
    }).catch(
    function (err) {
      console.error(err, err.stack);
    });
}
catch (e) {
  wl.info('Error: ' + e);
}
Firstly, you can look at the APIs you are calling via the SDK as a hint to which permissions you need, i.e. ec2:RunInstances and ec2:CreateTags.
You first create a policy, selecting the service (EC2) and attaching the permissions (RunInstances and CreateTags).
You then create a role with that policy attached.
Then you can attach the role to your Lambda, as sketched below.
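As a sketch of that policy in CDK (TypeScript), assuming an existing role construct; the broad resource scope is for brevity and can be narrowed:

import * as iam from 'aws-cdk-lib/aws-iam';

declare const role: iam.Role; // the role your code runs under

// Grant only the two actions the snippet actually calls.
role.addToPolicy(new iam.PolicyStatement({
  actions: ['ec2:RunInstances', 'ec2:CreateTags'],
  resources: ['*'], // could be narrowed to specific image/subnet/instance ARNs
}));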

Access all EC2 cross-region via Lambda

I have a Lambda function for automatic AMI backups. Is it possible to execute the Lambda across regions to take automatic backups of all the EC2 instances in my account?
In other words, one Lambda execution for all EC2 instances across regions:
var aws = require('aws-sdk');
aws.config.region = 'us-east-1','ap-south-1','eu-central-1';
var ec2 = new aws.EC2();
var now = new Date();
date = now.toISOString().substring(0, 10)
hours = now.getHours()
minutes = now.getMinutes()

exports.handler = function(event, context) {
  var instanceparams = {
    Filters: [{
      Name: 'tag:Backup',
      Values: [
        'yes'
      ]
    }]
  }
  ec2.describeInstances(instanceparams, function(err, data) {
    if (err) console.log(err, err.stack);
    else {
      for (var i in data.Reservations) {
        for (var j in data.Reservations[i].Instances) {
          instanceid = data.Reservations[i].Instances[j].InstanceId;
          nametag = data.Reservations[i].Instances[j].Tags
          for (var k in data.Reservations[i].Instances[j].Tags) {
            if (data.Reservations[i].Instances[j].Tags[k].Key == 'Name') {
              name = data.Reservations[i].Instances[j].Tags[k].Value;
            }
          }
          console.log("Creating AMIs of the Instance: ", name);
          var imageparams = {
            InstanceId: instanceid,
            Name: name + "_" + date + "_" + hours + "-" + minutes,
            NoReboot: true
          }
          ec2.createImage(imageparams, function(err, data) {
            if (err) console.log(err, err.stack);
            else {
              image = data.ImageId;
              console.log(image);
              var tagparams = {
                Resources: [image],
                Tags: [{
                  Key: 'DeleteOn',
                  Value: 'yes'
                }]
              };
              ec2.createTags(tagparams, function(err, data) {
                if (err) console.log(err, err.stack);
                else console.log("Tags added to the created AMIs");
              });
            }
          });
        }
      }
    }
  });
}
where aws.config.region is the region configuration. It only works for the current region (the one the Lambda is deployed in).
This line:
var ec2 = new aws.EC2();
connects to the Amazon EC2 service in the region where the Lambda function is running.
You can modify it to connect to another region:
var ec2 = new AWS.EC2({apiVersion: '2016-11-15', region: 'us-west-2'});
Thus, your program could loop through a list of regions (from ec2.describeRegions), creating a new EC2 client for the given region, then running the code you already have.
See: Setting the AWS Region - AWS SDK for JavaScript
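A minimal sketch of that region loop (TypeScript, AWS SDK v2); the tag filter mirrors the code above, and the rest of the backup logic would slot into the loop body:

import * as AWS from 'aws-sdk';

async function backupAllRegions(): Promise<void> {
  // Any one region can be asked for the list of enabled regions.
  const home = new AWS.EC2({ region: 'us-east-1' });
  const regions = (await home.describeRegions().promise()).Regions || [];

  for (const { RegionName } of regions) {
    // One client per region, then run the existing describeInstances /
    // createImage / createTags logic against it.
    const ec2 = new AWS.EC2({ region: RegionName });
    const result = await ec2.describeInstances({
      Filters: [{ Name: 'tag:Backup', Values: ['yes'] }],
    }).promise();
    console.log(RegionName, (result.Reservations || []).length, 'reservations');
  }
}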
In your Lambda role, you need to add a policy that gives the Lambda function the necessary permissions to access EC2 in the different accounts; typically you add the ARNs of the EC2 instances you want access to, or you specify "*", which grants permissions to all instances.
Also, in the other accounts where EC2 instances are running, you need to add an IAM policy that gives access to your Lambda role; note that you need to provide the Lambda role ARN.
In this way your Lambda role will have a policy to access EC2, and the cross-account EC2 side will have a policy which grants access to the Lambda role.
Without this in place you might need to do the heavy lifting of configuring IPs of each EC2 instance in each account.
Yes, and you also need to point the EC2 client at the region where the instance is running.
Any code (including a Lambda function) can create a client that connects to a different region.