I saw the post How to manage versions of docker image in AWS ECS? and didn't find a good answer to the question.
When updating the container image version (for example, from alpine:1.0.0 to alpine:1.0.1), what is the best practice for updating the container image in the task definition? I'm using only one container per task definition.
As far as I understand there are two alternatives:
Create a new revision of the task definition
Create a new task definition whose name contains the version of the image.
The pro of the first option is that I'm only ever maintaining one task definition family. The con is that if I want to create a new revision only when the definition has actually changed, I need to describe the task definition, get the image from the container list, and compare its version with the new image version.
The pro of the second option is that I can tell immediately whether a task definition already exists for a given image. The con is that I will create a new task definition for every image version.
In both options, how should I handle the deregister logic?
I've probably missed something, so I would appreciate your answer.
Thanks!
I've only ever seen the first alternative (create a new revision of the task definition) used. If you are using Infrastructure as Code, such as CloudFormation or Terraform, the cons you listed for that option go away.
"In both options, how should I handle the deregister logic?"
Just update the ECS service to use the latest version of the task definition. ECS will then deploy a new task, migrate the traffic to that task, and shut down the old task. You don't need to do anything else at that point. There is no special logic you need to implement yourself to deregister anything.
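For what it's worth, here is a minimal boto3 sketch of that flow, assuming a single-container task definition and hypothetical names ("my-cluster", "my-service", "my-task", alpine:1.0.1); in practice this would usually be done by your IaC or CI tooling rather than by hand:

import boto3

ecs = boto3.client("ecs")

# Fetch the current task definition and swap in the new image tag.
current = ecs.describe_task_definition(taskDefinition="my-task")["taskDefinition"]
containers = current["containerDefinitions"]
containers[0]["image"] = "alpine:1.0.1"  # single container per task definition

# Carry over the fields that register_task_definition accepts.
carry_over = {
    k: current[k]
    for k in ("family", "taskRoleArn", "executionRoleArn", "networkMode",
              "volumes", "placementConstraints", "requiresCompatibilities",
              "cpu", "memory")
    if k in current
}

# Register a new revision of the same family.
new_arn = ecs.register_task_definition(
    containerDefinitions=containers, **carry_over
)["taskDefinition"]["taskDefinitionArn"]

# Point the service at the new revision; ECS rolls the tasks over on its own.
ecs.update_service(cluster="my-cluster", service="my-service", taskDefinition=new_arn)

# Old revisions can simply be left in place; deregistering them
# (ecs.deregister_task_definition) is optional housekeeping, not required.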
Related
I have an ECS task which has 2 containers using 2 different images, both hosted in ECR. There are 2 GitHub repos for the two images (app and api), and a third repo for my IaC code (infra). I am managing my AWS infrastructure using Terraform Cloud. The ECS task definition is defined there using Cloudposse's ecs-alb-service-task, with the containers defined using ecs-container-definition. Presently I'm using latest as the image tag in the task definition defined in Terraform.
I am using CircleCI to build the Docker containers when I push changes to GitHub. I am tagging each image with latest and the variable ${CIRCLE_SHA1}. Both repos also update the task definition using the aws-ecs orb's deploy-service-update job, setting the tag used by each container image to the SHA1 (not latest). Example:
container-image-name-updates: "container=api,tag=${CIRCLE_SHA1}"
When I push code to the repo for e.g. api, a new version of the task definition is created, the service's version is updated, and the existing task is restarted using the new version. So far so good.
The problem is that when I update the infrastructure with Terraform, the service isn't behaving as I would expect. The ecs-alb-service-task has a boolean called ignore_changes_task_definition, which is true by default.
When I leave it as true, Terraform Cloud successfully creates a new version whenever I apply changes to the task definition (a recent example was updating environment variables). BUT it doesn't update the version used by the service, so the service carries on using the old version. Even if I stop a task, it respawns using the old version. I have to manually go in and use the Update flow, or push changes to one of the code repos, at which point CircleCI creates yet another version of the task definition and updates the service.
If I instead set this to false, Terraform Cloud will undo the changes to the service performed by CircleCI. It will reset the task definition version to the last version it created itself!
So I have three questions:
How can I get Terraform to play nice with the task definitions created by CircleCI, while also updating the service itself if I ever change it via Terraform?
Is it a problem to be making changes to the task definition from THREE different places?
Is it a problem that the image tag is latest in Terraform (because I don't know what the SHA1 is)?
I'd really appreciate some guidance on how to properly set up this CI flow. I have found next to nothing online about how to use Terraform Cloud with CI products.
I have learned a bit more about this problem. It seems like the right solution is to use a CircleCI workflow to manage Terraform Cloud, instead of having the two services effectively competing with each other. By default Terraform Cloud will expect you to link a repo with it and it will auto-plan every time you push. But you can turn that off and use the terraform orb instead to run plan/apply via CircleCI.
You would still leave ignore_changes_task_definition set to true. Instead, you'd add another step to the workflow after the terraform/apply step has made the change. This would be aws-ecs/run-task, which should relaunch the service using the most recent task definition, which was (possibly) just created by the previous step. (See the task-definition parameter.)
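If you'd rather script that last step than use the orb job, the same idea (point the service at the newest revision and force a deployment) looks roughly like this in boto3; the cluster, service, and family names below are placeholders:

import boto3

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"   # placeholder
SERVICE = "my-service"   # placeholder
FAMILY = "my-task"       # placeholder task definition family

# Newest ACTIVE revision of the family, possibly the one Terraform just registered.
latest = ecs.list_task_definitions(
    familyPrefix=FAMILY, status="ACTIVE", sort="DESC", maxResults=1
)["taskDefinitionArns"][0]

# Point the service at it and force a fresh deployment.
ecs.update_service(
    cluster=CLUSTER,
    service=SERVICE,
    taskDefinition=latest,
    forceNewDeployment=True,
)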
I have decided that this isn't worth the effort for me, at least not at this time. The conflict between Terraform Cloud and CircleCI is annoying, but isn't that acute.
We deploy a Docker image that runs a simple Sinatra API to an ECS Fargate service. Right now our task definition defines the image using a :production tag. We want to use CodeDeploy for a blue/green deployment.
When code is changed - should we push a new image with the :production tag and force a new deployment on our service or instead use specific tags in our task definition (e.g. :97b9d390d869874c35c325632af0fc1c08e013cd) and create a new task revision then update our service to use this new task revision?
Our concern with the second approach is that we don't see any lifecycle rules around task revisions, so will they just build up until we have tens or hundreds of thousands?
If we use the first approach, will CodeDeploy be able to roll back a failed deployment in the case there is an issue?
Short answer
In both cases there is no task definition rollback: if the new image crashes, your current (old) tasks should still be alive. But if you are using health checks and the number of running tasks drops below the required count (for example, due to a spike in user traffic), Fargate will start new tasks from the latest task definition revision, which contains the bad image.
Long answer
Since you are asking CodeDeploy to start tasks based on your image, it creates a new task definition containing your image's URI so the correct image can be pulled, and that new task definition is always used to start new Fargate tasks.
So whenever Fargate finds that it needs to create a task, it will try to use the latest revision, which will always be the one with the bad image.
The good news is that if your old-image task is working correctly, it should stay alive: the minimum running task count is 1, and since the new task keeps failing, the old-image task will not be decommissioned.
You can, however, overcome this by adding a CloudWatch event that triggers a Lambda which either registers a new task revision with the good image tag or points the service back at the previous task definition revision. Here is an article from AWS about this: https://aws.amazon.com/blogs/compute/automating-rollback-of-failed-amazon-ecs-deployments/
A bit more on how Fargate deployments work and why your old tasks keep running when a new deployment fails: Fargate first provisions the new tasks, and only once they are all running healthily does it decommission the old ones. So if the new tasks do not run properly, the old tasks should still be alive.
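To make the Lambda idea from that article a bit more concrete, here is a rough sketch (not the article's implementation) of a handler that simply points the service back at the previous ACTIVE revision; the cluster, service, and family names are assumed to come from environment variables, and wiring up the CloudWatch event rule is left out:

import os
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    cluster = os.environ["CLUSTER"]      # placeholder, e.g. "my-cluster"
    service = os.environ["SERVICE"]      # placeholder, e.g. "my-service"
    family = os.environ["TASK_FAMILY"]   # placeholder, e.g. "my-task"

    # Newest-first list of ACTIVE revisions: index 0 is the revision that just
    # failed, index 1 is the previous (presumably good) one.
    arns = ecs.list_task_definitions(
        familyPrefix=family, status="ACTIVE", sort="DESC", maxResults=2
    )["taskDefinitionArns"]
    if len(arns) < 2:
        return  # nothing older to roll back to

    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=arns[1],
        forceNewDeployment=True,
    )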
If I wanted to validate that an ECS service is running the latest image for a tag, how would I do that?
I can:
describe_services to get the task definition
describe_task_definition to get the associated image
But that image reference is in whatever form it appears in the task definition. If the task definition says service:1.1, that's a good start, but what if a new image tagged service:1.1 has been pushed since deployment? There's no way to tell from looking at the image in the task definition.
Maybe that makes sense because it is, after all, the definition, not the task itself. So what about describe_tasks? Looks promising. Except describe_tasks doesn't talk about the image at all. It does have a container ARN, but what good is that? I can't find any API call that uses container ARNs at all -- am I missing something?
Basically -- is there any way to identify the specific image down to the digest level that is running for each task on an ECS service so that you can tell if you should force a new deployment?
Confirmed by Amazon Support, there isn't currently a good way to validate that the image deployed on a given task is the same as the latest image pushed with the tag specified in the task definition.
It's not ideal, but I suppose I could take the image from the task definition and compare the deployment's updatedAt with that image's pushedAt. That won't give me an explicit "which image am I using", but it will tell me "has the image tag been pushed since the service was updated?"
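A rough boto3 sketch of that heuristic, assuming the image lives in ECR, the task definition references it by tag, and the cluster/service names below are placeholders:

import boto3

ecs = boto3.client("ecs")
ecr = boto3.client("ecr")

CLUSTER, SERVICE = "my-cluster", "my-service"  # placeholders

# When the service's current (PRIMARY) deployment was last updated.
svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
deployed_at = next(d for d in svc["deployments"] if d["status"] == "PRIMARY")["updatedAt"]

# The image reference from the task definition, e.g. <account>.dkr.ecr.<region>.amazonaws.com/service:1.1
task_def = ecs.describe_task_definition(taskDefinition=svc["taskDefinition"])["taskDefinition"]
image = task_def["containerDefinitions"][0]["image"]
registry_and_repo, tag = image.rsplit(":", 1)
repo = registry_and_repo.split("/", 1)[1]  # drop the registry hostname

# When that tag was last pushed to ECR.
pushed_at = ecr.describe_images(
    repositoryName=repo, imageIds=[{"imageTag": tag}]
)["imageDetails"][0]["imagePushedAt"]

# If the tag was pushed after the service was last updated, the running tasks
# may be behind and a forced new deployment is probably warranted.
print(f"{repo}:{tag} pushed at {pushed_at}, service updated at {deployed_at}; "
      f"redeploy needed: {pushed_at > deployed_at}")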
I want to create jobs in AWS Batch that vary on the image that is used to launch the container. I'd like to do this without creating a different Job Definition for each image. Is it possible to parameterize the image property using job definition parameters? If not, what's the best way to achieve this or do I have to just create job definitions on the fly in my application?
I would really love this functionality as well. Sadly, it appears the current answer is no.
Batch allows parameters, but they're only for the command.
AWS Batch Parameters
You may be able to find a workaround by using a :latest tag, but then you're buying a ticket to :latest hell.
My current solution is to use my CI pipeline to update all dev job definitions using the aws cli (describe-job-definitions then register-job-definition) on each tagged commit.
To keep my infrastructure-as-code consistent, I've moved the version for batch job definitions into an environment variable that I retrieve before running any terraform commands.
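In boto3 terms that CI step looks roughly like this (the aws cli equivalent is describe-job-definitions followed by register-job-definition); the job definition name and image below are placeholders:

import boto3

batch = boto3.client("batch")

JOB_DEF_NAME = "my-dev-job"                                            # placeholder
NEW_IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:abc123"  # placeholder CI tag

# Grab the latest ACTIVE revision of the job definition.
revisions = batch.describe_job_definitions(
    jobDefinitionName=JOB_DEF_NAME, status="ACTIVE"
)["jobDefinitions"]
latest = max(revisions, key=lambda d: d["revision"])

# Re-register it with the new image; the other container properties carry over.
props = latest["containerProperties"]
props["image"] = NEW_IMAGE
batch.register_job_definition(
    jobDefinitionName=JOB_DEF_NAME,
    type="container",
    containerProperties=props,
    parameters=latest.get("parameters", {}),
)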
Typically you create one job definition per Docker image.
However, that job definition and its Docker image can do anything you've programmed them to do, so the image can be multi-purpose and you can pass in whatever parameters or command line you want to execute.
You can override most of the parameters in a Job definition when you submit the job.
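For example, the command and environment (though not the image) can be overridden per submission; a small boto3 sketch with placeholder names:

import boto3

batch = boto3.client("batch")

# One generic job definition, varied per submission through overrides.
batch.submit_job(
    jobName="process-file-42",          # placeholder
    jobQueue="my-queue",                # placeholder
    jobDefinition="my-generic-job",     # placeholder, uses the latest active revision
    containerOverrides={
        "command": ["python", "run.py", "--input", "s3://my-bucket/file.csv"],
        "environment": [{"name": "MODE", "value": "dev"}],
    },
    parameters={"inputFile": "file.csv"},  # available as Ref::inputFile in the job definition's command
)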
Let's say I've created an AMI from one of my EC2 instances. Now, I can add this manually to the LB or let the Auto Scaling group do it for me (based on the conditions I've provided). Up to this point everything is fine.
Now, let's say my developers have added new functionality and I pull the new code onto the existing instances. Note that the AMI is not updated at this point and still has the old code. My question is how I should handle this situation so that when the Auto Scaling group creates a new instance from my AMI, it will have the latest code.
Two ways come into my mind, please let me know if you have any other solutions:
a) keep AMIs updated all the time; meaning that whenever there's a pull-request, the old AMI should be removed (deleted) and replaced with the new one.
b) have a start-up script (cloud-init) on the AMI that pulls the latest code from the repository on initial launch (by storing the repository credentials on the instance and pulling the code directly from git)
Which of these methods is better? And if neither is good, what's the best practice to achieve this goal?
Given that almost anything in AWS can be automated using the API, it again comes down to the specific use case at hand.
At the outset, people would recommend having a base AMI with the necessary packages installed and configured, plus an init script that downloads the source code so it is always the latest. The very important factor to account for here is the time taken to check out or pull the code, configure the instance, and make it ready for work. If that period is long, it is a bad strategy for auto scaling: the warm-up time, combined with Auto Scaling and CloudWatch statistics, can leave you reacting to a situation that has already changed (maybe, maybe not, but the probability is not zero). That is when you might consider baking a new AMI frequently, which minimizes the time instances need to prepare themselves for the war against the traffic.
I would recommend measuring and seeing which approach is convenient and cost effective. It costs real money to pull down an instance and relaunch it from the AMI; however, that's the trade-off you need to weigh.
I have answered somewhat open-endedly, because the question is also somewhat open-ended.
People have started using Chef, Ansible, and Puppet, which perform configuration management. These tools add a different level of automation altogether; you may want to explore that option as well. A similar approach is to use Docker or other containers.
a) keep AMIs updated all the time; meaning that whenever there's a pull-request, the old AMI should be removed (deleted) and replaced with the new one.
You shouldn't store your source code in the AMI. That introduces a maintenance nightmare and issues with autoscaling as you have identified.
b) have a start-up script (cloud-init) on the AMI that pulls the latest code from the repository on initial launch (by storing the repository credentials on the instance and pulling the code directly from git)
Which of these methods is better? And if neither is good, what's the best practice to achieve this goal?
Your second item, downloading the source on server startup, is the correct way to go about this.
Another option is to use AWS CodeDeploy or some other deployment service to push updates. A deployment service can also deploy updates to existing instances while still allowing new instances to download the latest code automatically at startup.