How to best deploy code changes to ECS using CodeDeploy?

We deploy a Docker image that runs a simple Sinatra API to an ECS Fargate service. Right now our task definition defines the image using a :production tag. We want to use CodeDeploy for a blue/green deployment.
When code is changed - should we push a new image with the :production tag and force a new deployment on our service or instead use specific tags in our task definition (e.g. :97b9d390d869874c35c325632af0fc1c08e013cd) and create a new task revision then update our service to use this new task revision?
Our concern with the second approach is that we don't see any lifecycle rules around task revisions so will they just build up until we have tens/hundreds of thousands?
If we use the first approach, will CodeDeploy be able to roll back a failed deployment in the case there is an issue?

Short answer
In both cases there is no automatic rollback of the task definition: if your new image crashes, your currently running old task should still stay alive. But if you are using health checks and the number of running tasks falls below the required amount (for example due to a spike in user traffic), Fargate will start new tasks from the latest task definition revision, which contains the bad image.
Long answer
Since you are just asking CodeDeploy to start tasks based on your image, it creates a new task definition revision whose container image URI points at that image, and that revision is then always used to start new Fargate tasks.
So whenever Fargate needs to create a task, it will always use the latest revision, which in this scenario is the one with the bad image.
The good news is that if your old-image task is working correctly, it should stay alive: the minimum number of running tasks is 1, and since the new task keeps failing, your old-image task will not be decommissioned.
You can, however, work around this by adding a CloudWatch event that triggers a Lambda which either creates a new task revision pointing at the known-good image tag, or updates the service to the previous task definition revision. Here is an article from AWS about this: https://aws.amazon.com/blogs/compute/automating-rollback-of-failed-amazon-ecs-deployments/
A bit more on how a Fargate deployment works, and why your old tasks keep running when a new deployment fails: ECS first provisions the new tasks, and only once they are all running healthily does it decommission the old ones. So if the new tasks do not run properly, the old tasks should still be alive.
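The Lambda described in the linked article can be sketched roughly as follows. This is a minimal sketch, not the article's actual code: the cluster, service, and family names are yours to supply, and the only non-AWS logic is the helper that picks the previous revision from the (newest-first) list of task definition ARNs.

```python
def previous_revision(arns):
    """Given task definition ARNs sorted newest-first, return the previous revision (or None)."""
    return arns[1] if len(arns) > 1 else None

def rollback_service(cluster, service, family):
    """Point the service back at the revision before the latest one."""
    import boto3  # imported here so the pure helper above stays dependency-free
    ecs = boto3.client("ecs")
    arns = ecs.list_task_definitions(familyPrefix=family, sort="DESC")["taskDefinitionArns"]
    prev = previous_revision(arns)
    if prev is not None:
        ecs.update_service(cluster=cluster, service=service,
                           taskDefinition=prev, forceNewDeployment=True)
    return prev
```

A real implementation would also want to deregister or mark the bad revision so the rollback isn't undone by the next scaling event.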

Related

Django migrations deployment strategy with AWS ECS Fargate?

What is the recommended deployment strategy for running database migrations with ECS Fargate?
I could update the container command to run migrations before starting the gunicorn server. But this can result in concurrent migrations executing at the same time if more than one instance is provisioned.
I also have to consider the fact that images are already running. If I figure out how to run migrations before the new images are up and running, I have to consider the fact that the old images are still running on old code and may potentially break or cause strange data-corruption side effects.
I was thinking of creating a new ECS::TaskDefinition. Have that run a one-off migration script that runs the migrations. Then the container closes. And I update all of the other TaskDefinitions to have a DependsOn for it, so that they won't start until it finishes.
I could update the container command to run migrations before starting the gunicorn server. But this can result in concurrent migrations executing at the same time if more than one instance is provisioned.
That is one possible solution. To avoid the concurrency issue you would have to add some sort of distributed locking in your container script to grab a lock from DynamoDB or something before running migrations. I've seen it done this way.
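For reference, a conditional write against DynamoDB is enough to implement such a lock. A minimal sketch (the table name, key schema, and TTL attribute are all hypothetical, not an established convention):

```python
import time

def lock_request(table, lock_id, owner, hold_seconds=300):
    """Build a PutItem request that acquires the lock only if it is absent or expired."""
    now = int(time.time())
    return {
        "TableName": table,
        "Item": {
            "LockId": {"S": lock_id},
            "Owner": {"S": owner},
            "ExpiresAt": {"N": str(now + hold_seconds)},
        },
        # Succeeds only when no one holds the lock, or the previous holder's lease expired.
        "ConditionExpression": "attribute_not_exists(LockId) OR ExpiresAt < :now",
        "ExpressionAttributeValues": {":now": {"N": str(now)}},
    }

def acquire_lock(request):
    """Returns True if the lock was acquired, False if another holder has it."""
    import boto3
    from botocore.exceptions import ClientError
    try:
        boto3.client("dynamodb").put_item(**request)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```

The container entrypoint would call this before running migrations, and the instances that lose the race simply skip the migration step and start the server.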
Another option I would propose is running your Django migrations from an AWS CodeBuild task. You could either trigger it manually before deployments, or automatically before a deployment as part of a larger CI/CD deployment pipeline. That way you would at least not have to worry about more than one running at a time.
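If you go the pipeline route, that discrete step can run the migrations as a one-off Fargate task and block until it stops before the deployment proceeds. A sketch with boto3 (the cluster, task definition, and subnets are placeholders for your own values):

```python
def migration_succeeded(task):
    """A stopped task succeeded if every container exited with code 0."""
    return all(c.get("exitCode") == 0 for c in task.get("containers", []))

def run_migration(cluster, task_definition, subnets):
    """Run the migration task definition once and wait for it to finish."""
    import boto3
    ecs = boto3.client("ecs")
    task_arn = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        launchType="FARGATE",
        networkConfiguration={"awsvpcConfiguration": {"subnets": subnets}},
    )["tasks"][0]["taskArn"]
    # Block the pipeline until the migration task stops, then check its exit codes.
    ecs.get_waiter("tasks_stopped").wait(cluster=cluster, tasks=[task_arn])
    task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]
    return migration_succeeded(task)
```

The pipeline would fail (and skip the service update) when this returns False.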
I also have to consider the fact that images are already running. If I figure out how to run migrations before the new images are up and running, I have to consider the fact that the old images are still running on old code and may potentially break or cause strange data-corruption side effects.
That's a problem with every database migration in every system that has ever been created. If you are very worried about it you would have to do blue-green deployments with separate databases to avoid this issue. Or you could just accept some down-time during deployments by configuring ECS to stop all old tasks before starting new ones.
I was thinking of creating a new ECS::TaskDefinition. Have that run a one-off migration script that runs the migrations. Then the container closes. And I update all of the other TaskDefinitions to have a DependsOn for it, so that they won't start until it finishes.
This is a good idea, but I'm not aware of any way to set DependsOn for separate tasks. The only DependsOn setting I'm aware of in ECS is for multiple containers in a single task.
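For completeness, within a single task that dependency looks like this in the containerDefinitions (container names, image, and command are hypothetical); the app container only starts after the migration container has exited successfully:

```json
{
  "containerDefinitions": [
    {
      "name": "migrate",
      "image": "myapp:latest",
      "command": ["python", "manage.py", "migrate"],
      "essential": false
    },
    {
      "name": "app",
      "image": "myapp:latest",
      "essential": true,
      "dependsOn": [
        { "containerName": "migrate", "condition": "SUCCESS" }
      ]
    }
  ]
}
```

Note that a container another one depends on with condition SUCCESS must be non-essential, and this still runs the migration once per task, so it doesn't remove the need for locking on its own.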

Using a sidecar to download artefacts in pod

I am rebuilding a system like gitlab where users can configure pipelines with individual jobs running on AWS ECS (Fargate).
One important piece of functionality is downloading and uploading of artefacts (files) generated by these jobs.
I want to solve this by running a sidecar with the logic responsible for the artefacts next to the actual code running the logic of the job.
One important requirement: it needs to be assumed that the "main" container runs custom code which I cannot control.
It seems however there is no clean solution in kubernetes for starting a pod with this order of containers:
Start the sidecar and trigger the download of artefacts.
Upon completion of the artefact download, start the logic in the main container alongside the sidecar and run it to completion.
Upon finish of the main container, upload the new artefacts and end the sidecar.
Any suggestion is welcome
Edit:
I found the attribute of container dependencies on AWS ECS and will try it out now: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/example_task_definitions.html#example_task_definition-containerdependency
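Assuming that feature fits, the download-then-start ordering above could look roughly like this (container names, images, and the sentinel file in the health check are all hypothetical): the main container waits until the sidecar reports healthy, and the sidecar only becomes healthy once its download step has finished.

```json
{
  "containerDefinitions": [
    {
      "name": "artefact-sidecar",
      "image": "my-artefact-agent:latest",
      "essential": true,
      "healthCheck": {
        "command": ["CMD-SHELL", "test -f /work/.download-complete"],
        "interval": 10,
        "retries": 10
      }
    },
    {
      "name": "main",
      "image": "user-supplied-image:latest",
      "essential": true,
      "dependsOn": [
        { "containerName": "artefact-sidecar", "condition": "HEALTHY" }
      ]
    }
  ]
}
```

Because the sidecar stays running rather than exiting (as a SUCCESS dependency would require), it is still alive to upload artefacts after the main container finishes; it can detect that exit, for example, by polling the ECS task metadata endpoint.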

Manage image version in ECS task definition

I saw the post How to manage versions of docker image in AWS ECS? and didn’t get a good answer for the question.
In case of updating the container image version (for example, from alpine:1.0.0 to alpine:1.0.1)
What is the best practice to update the container image in the task definition? I’m using only one container per task definition.
As far as I understand there are two alternatives:
Create new revision of task definition
Create new task definition that its name contains the version of the image.
The pro of the first option is that I'm maintaining only one task definition, but the con is that if I want to create a new revision only when the definition has actually changed, I need to describe the task definition, get the image from the container list, and compare its version with the new image version.
The pro of the second option is that I can see immediately whether a task definition containing my image already exists. The con is that I will create a new task definition for every image version.
In both options, how should I handle the deregister logic?
Probably I missed something so would appreciate your answer.
Thanks!
I've only ever seen the first alternative (Create new revision of task definition) used. If you are using Infrastructure as Code, such as CloudFormation or Terraform, then all the cons you have listed for that are no longer present.
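With Terraform, for instance, the image tag can simply be a variable: changing it produces a new task definition revision automatically, so there is nothing to describe or compare by hand. A minimal sketch (resource and variable names are illustrative):

```hcl
variable "image_tag" {
  default = "1.0.1"
}

resource "aws_ecs_task_definition" "app" {
  family                   = "app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  container_definitions = jsonencode([{
    name      = "app"
    image     = "alpine:${var.image_tag}"
    essential = true
  }])
}
```

Each apply with a new tag registers a new revision; old revisions just sit there inactive and cost nothing.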
"In both options, how should I handle the deregister logic?"
Just update the ECS service to use the latest version of the task definition. ECS will then deploy a new task, migrate the traffic to that task, and shut down the old task. You don't need to do anything else at that point. There is no special logic you need to implement yourself to deregister anything.

How to manage automatic deployment to ECS using Terraform Cloud and CircleCI?

I have an ECS task which has 2 containers using 2 different images, both hosted in ECR. There are 2 GitHub repos for the two images (app and api), and a third repo for my IaC code (infra). I am managing my AWS infrastructure using Terraform Cloud. The ECS task definition is defined there using Cloudposse's ecs-alb-service-task, with the containers defined using ecs-container-definition. Presently I'm using latest as the image tag in the task definition defined in Terraform.
I am using CircleCI to build the Docker containers when I push changes to GitHub. I am tagging each image with latest and the variable ${CIRCLE_SHA1}. Both repos also update the task definition using the aws-ecs orb's deploy-service-update job, setting the tag used by each container image to the SHA1 (not latest). Example:
container-image-name-updates: "container=api,tag=${CIRCLE_SHA1}"
When I push code to the repo for e.g. api, a new version of the task definition is created, the service's version is updated, and the existing task is restarted using the new version. So far so good.
The problem is that when I update the infrastructure with Terraform, the service isn't behaving as I would expect. The ecs-alb-service-task has a boolean called ignore_changes_task_definition, which is true by default.
When I leave it as true, Terraform Cloud successfully creates a new version whenever I Apply changes to the task definition. (A recent example was to update environment variables.) BUT it doesn't update the version used by the service, so the service carries on using the old version. Even if I stop a task, it will respawn using the old version. I have to manually go in and use the Update flow, or push changes to one of the code repos. Then CircleCI will create yet another version of the task definition and update the service.
If I instead set this to false, Terraform Cloud will undo the changes to the service performed by CircleCI. It will reset the task definition version to the last version it created itself!
So I have three questions:
How can I get Terraform to play nice with the task definitions created by CircleCI, while also updating the service itself if I ever change it via Terraform?
Is it a problem to be making changes to the task definition from THREE different places?
Is it a problem that the image tag is latest in Terraform (because I don't know what the SHA1 is)?
I'd really appreciate some guidance on how to properly set up this CI flow. I have found next to nothing online about how to use Terraform Cloud with CI products.
I have learned a bit more about this problem. It seems like the right solution is to use a CircleCI workflow to manage Terraform Cloud, instead of having the two services effectively competing with each other. By default Terraform Cloud will expect you to link a repo with it and it will auto-plan every time you push. But you can turn that off and use the terraform orb instead to run plan/apply via CircleCI.
You would still leave ignore_changes_task_definition set to true. Instead, you'd add another step to the workflow after the terraform/apply step has made the change. This would be aws-ecs/run-task, which should relaunch the service using the most recent task definition, which was (possibly) just created by the previous step. (See the task-definition parameter.)
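Such a workflow might be sketched like this; the orb versions, job names, and parameter names below are illustrative and should be checked against the current orb documentation rather than taken as-is:

```yaml
version: 2.1
orbs:
  terraform: circleci/terraform@3.2
  aws-ecs: circleci/aws-ecs@2.2
workflows:
  infra:
    jobs:
      - terraform/plan:
          path: .
      - terraform/apply:
          path: .
          requires: [terraform/plan]
      - aws-ecs/deploy-service-update:
          cluster-name: my-cluster
          service-name: my-service
          family: my-task-family
          requires: [terraform/apply]
```

The key point is just the ordering: Terraform applies first, then a single ECS update step picks up whatever task definition revision is now latest, so the two tools stop fighting over the service.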
I have decided that this isn't worth the effort for me, at least not at this time. The conflict between Terraform Cloud and CircleCI is annoying, but isn't that acute.

Can I configure Google DataFlow to keep nodes up when I drain a pipeline

I am deploying a pipeline to Google Cloud DataFlow using Apache Beam. When I want to deploy a change to the pipeline, I drain the running pipeline and redeploy it. I would like to make this faster. It appears from the logs that on each deploy DataFlow builds up new worker nodes from scratch: I see Linux boot messages going by.
Is it possible to drain the pipeline without tearing down the worker nodes so the next deployment can reuse them?
rewriting Inigo's answer here:
Answering the original question: no, there's no way to do that. Updating should be the way to go. I was not aware it was marked as experimental (we should probably change that), but the update approach has not changed in the roughly three years I have been using Dataflow. As for the special cases where update doesn't work: even if the feature you're asking for existed, the workers would still need the new code, so there's not really much to save, and update should work in most of the other cases.