Cancel AWS CDK deployment after X failed tasks

I am deploying a service on AWS using an ApplicationLoadBalancedEc2Service.
Sometimes, while testing, I deploy a configuration that results in errors. The problem is that instead of canceling the deployment, the CDK just hangs for hours. The reason is that AWS keeps trying to spin up a task, which fails every time because of my wrong configuration.
Right now I have to set the desired task count to 0 through the AWS console. This lets the deployment complete successfully and allows me to deploy a new version.
Is there a way to cancel the deployment and just roll back after X failed tasks?

One way is to configure CodeDeploy to roll back the service to its previous version if the new deployment fails. This won't "cancel the CDK deployment", but will stabilize the service.
Another way is to add a Custom Resource with an asynchronous provider to poll the ECS service status, signaling CloudFormation if your success condition is not met. This will revert the CDK deployment itself.
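As a rough illustration of the polling approach, the decision made on each poll could look like the sketch below. It assumes the handler has already fetched the service's deployments (for example via DescribeServices) and passes in a plain array. The DeploymentSnapshot type, the evaluateDeployment helper, and the maxFailedTasks parameter are hypothetical names introduced here; the { IsComplete } return shape follows the convention of the CDK provider framework's isComplete handler, where throwing an error marks the resource as failed.

```typescript
// Hypothetical subset of the per-deployment fields reported by ECS DescribeServices.
interface DeploymentSnapshot {
  status: string; // "PRIMARY" marks the most recent deployment
  desiredCount: number;
  runningCount: number;
  failedTasks: number;
}

type PollResult = { IsComplete: boolean };

// Decide, for one poll, whether the deployment is done, still in progress, or failed.
// maxFailedTasks is an assumed, user-chosen threshold (the "X" from the question).
function evaluateDeployment(
  deployments: DeploymentSnapshot[],
  maxFailedTasks: number
): PollResult {
  const primary = deployments.find((d) => d.status === "PRIMARY");
  if (!primary) {
    throw new Error("No PRIMARY deployment found for the service");
  }
  if (primary.failedTasks >= maxFailedTasks) {
    // Throwing from the handler reports a failure, which lets CloudFormation roll back.
    throw new Error(`Deployment failed: ${primary.failedTasks} tasks failed to start`);
  }
  // Treat the service as stable once only the primary deployment remains
  // and it runs at least as many tasks as desired.
  const stable =
    deployments.length === 1 && primary.runningCount >= primary.desiredCount;
  return { IsComplete: stable };
}
```

Returning { IsComplete: false } asks the framework to poll again later, so the threshold check is what bounds how long a broken configuration can keep the deployment waiting.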

You're looking for the Circuit Breaker feature:
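// CDK v2 import paths (assumed; adjust if you are on CDK v1)
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

declare const cluster: ecs.Cluster;
const loadBalancedEcsService = new ecsPatterns.ApplicationLoadBalancedEc2Service(this, 'Service', {
  cluster,
  memoryLimitMiB: 1024,
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry('test'),
  },
  desiredCount: 2,
  // Stop a failing deployment and roll back to the last working task definition
  circuitBreaker: { rollback: true },
});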
The circuit breaker gives your deployment between 10 and 200 tries before canceling it: the threshold is 0.5 times your desired task count, clamped to those minimum and maximum values. The rollback argument makes ECS relaunch tasks with the previous task definition.
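To make the threshold arithmetic concrete, here is a small sketch of the calculation described above. The function name circuitBreakerThreshold is hypothetical, the default bounds of 10 and 200 are the ones quoted in this answer (they are parameters because AWS may change them), and rounding up to a whole number of tasks is an assumption.

```typescript
// Sketch of the failure threshold: 0.5 x desired count, clamped to [min, max].
// Defaults come from the answer above; rounding up is an assumption.
function circuitBreakerThreshold(
  desiredCount: number,
  min: number = 10,
  max: number = 200
): number {
  const raw = Math.ceil(desiredCount * 0.5);
  return Math.min(max, Math.max(min, raw));
}

// With the desiredCount of 2 from the example, the lower bound applies.
console.log(circuitBreakerThreshold(2)); // 10
console.log(circuitBreakerThreshold(30)); // 15
console.log(circuitBreakerThreshold(1000)); // 200
```

For small services the lower bound dominates, so a service with a desired count of 2 still gets the full minimum number of tries before the deployment is canceled.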

Related

Debugging Pulumi "ResourceNotReady: exceeded wait attempts"

I am trying to deploy a Fargate service on AWS ECS with Pulumi as IaC.
Everything works as expected when deploying my Fargate service with:
deploymentController: {
  type: "ECS"
},
But changing it to:
deploymentController: {
  type: "CODE_DEPLOY"
},
ends with the error message: "ResourceNotReady: exceeded wait attempts"
Is there any way to debug this that would help me find out what resource Pulumi is waiting for?
Are there any hidden dependencies for Blue/Green deployment on ECS that are not obvious when deploying with Pulumi?
Are you deploying to an ECS cluster that lives in a different stack than your Fargate service stack?
If so, that is the reason behind the timeout error: the stack cannot reach the service to confirm that it has reached a steady state, because the two live in different stacks.

How to run an AWS ECS task inside a service of an AWS ECS cluster, and not outside the service, from Circle CI's "aws-ecs/run-task"

I am using Circle CI to build and push the image to AWS ECR, then use this image to deploy a container (with FARGATE as the launch type) in a service inside a cluster in AWS ECS. The problem is that the tasks are being run outside of this service, but in the same cluster.
Here is the task sitting alongside the task that was started automatically by AWS:
The one whose group is service:adp-ecs-service is running inside the service, and the one whose group is adp-ecs-service is running outside of it. If I stop the task in group service:adp-ecs-service, it is restarted automatically with the image in ECR tagged 'latest_ci', but the other one is not restarted. And this service can only have one task at a time.
Looking at the group in this image, I tried to specify the name of the service in the group parameter in the 'config.yml' file in multiple ways, but to no avail. Here is everything I have tried (you can see these in my commits here):
1. 'service:adp-ecs-service'
2. service:adp-ecs-service
Both of them (with and without quotes) generated the following error:
An error occurred (InvalidParameterException) when calling the RunTask operation: Invalid namespace for group.
3. adp-ecs-service
4. service
Both of them ran the task outside the service (you can see the output of 3 in the image shown above). I have even tried environment variables for 1 and 2.
Here is the code for "aws-ecs/run-task" in "/.circleci/config.yml" (complete file can be found here):
- aws-ecs/run-task:
    requires:
      - aws-ecr/build-and-push-image
    task-definition: 'adp-ecs-family'
    group: adp-ecs-service
    cluster: 'adp-cluster'
    aws-access-key-id: AWS_ACCESS_KEY_ID
    aws-secret-access-key: AWS_SECRET_ACCESS_KEY
    aws-region: 'ap-south-1'
    awsvpc: true
    launch-type: FARGATE
    subnet-ids: $AWS_SUBNETS
    security-group-ids: $AWS_ADP_SG
    started-by: 'circle-ci'
    assign-public-ip: ENABLED
And here is the service:
As you can see in the above picture, the running task count and the desired task count are both 1, but there are 2 tasks running in this cluster, one of them outside of the service.
What I want is this: when I run a new task inside the service, it should stop the previous container and start a new one by pulling the latest image from ECR. So how do I properly specify the name of the service to accomplish this?
Simply put, aws-ecs/run-task is not what you want. You need to deploy the new task definition to the service, not run it as a standalone task.
You are looking for aws-ecs/deploy-service-update, together with update-task-definition if you are not using it already.

Amazon ECS Service configuration return exactly 1 result, but got '0'

I am trying to update an ECS service with Bamboo and get the following error:
Failed to fetch resource from AWS!
java.lang.RuntimeException: Expected DescribeServiceRequest for service 'my-service' to return exactly 1 result, but got '0'
    at net.utoolity.atlassian.bamboo.taws.aws.ECS.getSingleService(ECS.java:674)
    at net.utoolity.atlassian.bamboo.taws.ECSServiceTask.executeUpdate(ECSServiceTask.java:311)
    at net.utoolity.atlassian.bamboo.taws.ECSServiceTask.execute(ECSServiceTask.java:133)
    at net.utoolity.atlassian.bamboo.taws.AWSTask.execute(AWSTask.java:164)
    at com.atlassian.bamboo.task.TaskExecutorImpl.lambda$executeTasks$3(TaskExecutorImpl.java:319)
    at com.atlassian.bamboo.task.TaskExecutorImpl.executeTaskWithPrePostActions(TaskExecutorImpl.java:252)
    at com.atlassian.bamboo.task.TaskExecutorImpl.executeTasks(TaskExecutorImpl.java:319)
    at com.atlassian.bamboo.task.TaskExecutorImpl.execute(TaskExecutorImpl.java:112)
    at com.atlassian.bamboo.build.pipeline.tasks.ExecuteBuildTask.call(ExecuteBuildTask.java:73)
    at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent.executeBuildPhase(DefaultBuildAgent.java:203)
    at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent.build(DefaultBuildAgent.java:175)
    at com.atlassian.bamboo.v2.build.agent.BuildAgentControllerImpl.lambda$waitAndPerformBuild$0(BuildAgentControllerImpl.java:129)
    at com.atlassian.bamboo.variable.CustomVariableContextImpl.withVariableSubstitutor(CustomVariableContextImpl.java:185)
    at com.atlassian.bamboo.v2.build.agent.BuildAgentControllerImpl.waitAndPerformBuild(BuildAgentControllerImpl.java:123)
    at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent$1.run(DefaultBuildAgent.java:126)
    at com.atlassian.bamboo.utils.BambooRunnables$1.run(BambooRunnables.java:48)
    at com.atlassian.bamboo.security.ImpersonationHelper.runWith(ImpersonationHelper.java:26)
    at com.atlassian.bamboo.security.ImpersonationHelper.runWithSystemAuthority(ImpersonationHelper.java:17)
    at com.atlassian.bamboo.security.ImpersonationHelper$1.run(ImpersonationHelper.java:41)
    at java.lang.Thread.run(Thread.java:745)
I am using the Force new deployment setting.
Any ideas what the issue is?
We have not been able to identify a bug in our code base right away, but here is what seems to be happening:
In order to append progress messages to the Bamboo build log, we need to call the DescribeServices API action before the call to the actual UpdateService API action, and the exception is thrown if and only if the targeted service cannot be found.
So at first glance there may be a subtle configuration issue. This happens to me every now and then when using Bamboo variables to reference resources from a preceding task, where it is easy to accidentally copy and paste the wrong variable name, for example.
An incorrect reference in any of the following parameters of the Amazon ECS Service task's Update Service action would cause the respective task action to fail with the error message at hand, because the DescribeServices API call itself would succeed yet fail to identify the target service:
Connector
Region
Service Name
For example, I've just reproduced the problem by using a non-existent service name:
24-Oct-2019 17:37:05 Starting task 'Update sample ECS service (w/ ELB) - 2 instances' of type 'net.utoolity.atlassian.bamboo.tasks-for-aws:aws.ecs.service'
24-Oct-2019 17:37:05 Setting maxErrorRetry=7 and awaitTransitionInterval=15000
24-Oct-2019 17:37:05 Using session credentials provided by Identity Federation for AWS app (connector variable: 6f6fc85d-4ea5-43ce-8e70-25aba33a5fda).
24-Oct-2019 17:37:05 Selecting region eu-west-1
24-Oct-2019 17:37:05 Updating service 'NOT-A-SERVICE' on cluster 'TAWS-IT270-100-ubot':
24-Oct-2019 17:37:06 Failed to fetch resource from AWS!
24-Oct-2019 17:37:06 java.lang.RuntimeException: Expected DescribeServiceRequest for service 'NOT-A-SERVICE' to return exactly 1 result, but got '0'
...
Granted, the error message is not exactly helpful here, and we need to think about how to better handle this log pattern across our various tasks - the actual UpdateService API action would yield the much more appropriate ServiceNotFoundException exception in this scenario.
So assuming 'my-service' has been up and running before calling the 'Update Service' task action, can you please check whether the log from your failing Bamboo build may indicate this particular problem, for example by targeting another region by chance?
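The "exactly 1 result" check described in this answer can be sketched roughly as follows. This is not the add-on's actual code: ServiceLookup and requireSingleService are hypothetical names, and the function takes the already-fetched list of matching services rather than calling AWS. It only illustrates how such a check could name the three parameters worth verifying when the lookup comes back empty.

```typescript
// Hypothetical description of what was looked up.
interface ServiceLookup {
  region: string;
  cluster: string;
  serviceName: string;
}

// Return the single matching service, or fail with a message that names
// the parameters most likely to be misconfigured.
function requireSingleService<T>(found: readonly T[], lookup: ServiceLookup): T {
  if (found.length !== 1) {
    throw new Error(
      `Expected exactly 1 service named '${lookup.serviceName}' on cluster ` +
        `'${lookup.cluster}' in region '${lookup.region}', but got ${found.length}. ` +
        `Check the connector, region, and service name.`
    );
  }
  return found[0] as T;
}
```

An empty result from a lookup that otherwise succeeds usually means the credentials and API call are fine and one of the identifying parameters points somewhere else, which is why the message lists them.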
I was able to work around the issue by using a Shell Script Task that runs an AWS CLI command after exporting the keys:
aws ecs update-service --cluster my-cluster --service my-service --task-definition my-task-definition
So AWS ECS itself is working fine, and the cause must be a bug or misconfiguration in the Bamboo module.
But as mentioned in the other answer, the best approach would be to check if the configuration is correct.

serverless deployment fails on checking stack progress AWS

Problem:
I have two Lambda functions on AWS representing two different environments (staging and production). The production environment has a data import function which runs every 10 minutes. The problem I am facing is that when I try to deploy the staging environment, an error occurs during the stack update, as shown:
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.........................
Serverless: Operation failed!
Serverless Error ---------------------------------------
An error occurred: MyimportfunctionEventsRuleSchedule1 - schedule-full-import already exists in stack Cloudformation_StackId_of_production_lambda_function.
EDIT:
schedule-full-import function is for only production environment, not the staging environment. My understanding is that when I try to deploy, it just tries to find the trigger for the staging environment. In this case, it does not find it and then goes towards the production environment.
serverless.yml
schedule_full_import:
  handler: my_handler
  timeout: 6
  events:
    - schedule:
        enabled: true
        name: full-data-import
        rate: rate(10 minutes)
        stageParams:
          stage: prod
I don't want to trigger this function for the staging environment since it is not needed. Any help is appreciated.
You can remove your existing CloudFormation stack manually if $ sls remove didn't work.
Then redeploy your stack from scratch.
Of course, make sure you delete the .serverless directory before the new deployment.
I believe the issue is that stageParams doesn't do what you think it does. It does not attach the Lambda to the CloudWatch trigger only in the prod stage. The Serverless docs (https://serverless.com/framework/docs/providers/aws/events/schedule/) have a confusing example that lists stageParams as an input value to the trigger. All that means is that CloudWatch will invoke the Lambda with the value of input as the event data.
There isn't a way to selectively not deploy resources listed in serverless.yml depending on the stage. What you could do is set enabled to false when stage isn't prod by using some custom configuration parameters. This would deploy the trigger to your staging environment, but it would not be invoked.
The CloudFormation error also suggests there is a naming conflict. Serverless should be generating unique Lambda names based on the stage, so if I had to guess, the schedule name full-data-import isn't unique. I would try renaming it to something like
name: full-data-import-${self:provider.stage}
Depending on how you reference your stage parameter.
You could try something like:
custom:
  importEnabled: <set this by config file, command line argument, environment variable, etc>
functions:
  schedule_full_import:
    handler: my_handler
    timeout: 6
    events:
      - schedule:
          name: full-data-import-${self:provider.stage}
          enabled: ${self:custom.importEnabled}
          rate: rate(10 minutes)
See https://serverless.com/framework/docs/providers/aws/guide/variables/ for ways you could set the value of importEnabled.
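If the configuration is generated programmatically, the same two ideas (a stage-specific schedule name and a schedule that is only enabled in prod) can be sketched in plain TypeScript. The ScheduleEvent type and the scheduleEventFor helper are hypothetical and not part of the Serverless Framework; they only mirror the YAML fields used above.

```typescript
// Hypothetical mirror of the schedule fields used in serverless.yml above.
interface ScheduleEvent {
  name: string;
  enabled: boolean;
  rate: string;
}

// Build the schedule event for a stage: the name is unique per stage,
// and the trigger only fires in prod.
function scheduleEventFor(stage: string): ScheduleEvent {
  return {
    name: `full-data-import-${stage}`,
    enabled: stage === "prod",
    rate: "rate(10 minutes)",
  };
}

console.log(scheduleEventFor("staging"));
console.log(scheduleEventFor("prod"));
```

As with the YAML version, the staging trigger is still deployed but stays disabled, and the per-stage name avoids the "already exists in stack" conflict.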

Rollback a build using AWS CodePipeline

What is the best mechanism to roll back a deployment that is orchestrated using CodePipeline? The source comes from an S3 bucket, and we are looking for a one-click rollback mechanism without manual intervention.
CodePipeline doesn't support rollback currently. If you are using CodeDeploy as the deployment action, you can set up rollback on alarm or on failed deployment in the CodeDeploy DeploymentGroup. The CloudFormation template to enable auto-rollback for a CodeDeploy deployment group looks like:
Type: "AWS::CodeDeploy::DeploymentGroup"
Properties:
  ...
  AutoRollbackConfiguration:
    Enabled: true
    Events:
      - "DEPLOYMENT_FAILURE"
      - "DEPLOYMENT_STOP_ON_ALARM"
  AlarmConfiguration:
    Alarms:
      - Name: CloudWatchAlarm1
      - Name: CloudWatchAlarm2
    Enabled: true
You can find more information about it at Deployments and Redeploy
If you are not using AWS CodeDeploy, you can always roll back manually by redeploying the previous stable build or tag.