ECS deployment and matching number of running tasks - amazon-web-services

Scenario:
ECS Fargate.
Say I have a “desired count” of 2 tasks.
The system takes on some load and auto scales to 6 tasks.
If I deploy during this time, ECS seems to kill off my actual running capacity back down to 2 tasks. This causes service failures because the system can no longer handle the actual load and must now scale back up.
All the docs I've come across indicate using "minimum healthy percent" and "maximum percent" to help control deployment sizes, but these refer back to the DESIRED count of tasks, not the actual number running on the system being deployed to.
Any idea if there is a way to say: “please just match the number of tasks running, or some percentage of such when spinning up new tasks from deploy”?
Deploy is Cloudformation via CodePipeline.

The DesiredCount parameter in CFN is now optional. See this issue for background.
From the issue:
We are making the following improvements to the ECS integration with CloudFormation:
- DesiredCount becomes an optional field in CFN CreateService, and DesiredCount=1 will be used as the default value if it is missing
- UpdateService will also omit DesiredCount when it is missing from the CFN template
Customers who expect the current behavior (i.e. UpdateService will use the DesiredCount set in the CFN template) can just add DesiredCount to their CFN templates. Existing customers wanting the new behavior can get it by removing DesiredCount from their CFN templates. The changes will be released soon.
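In practice, the fix for the scenario above is therefore to drop DesiredCount from the AWS::ECS::Service resource. A minimal sketch of what that can look like, using hypothetical resource names (MyService, MyCluster, MyTaskDef, and the subnet/security group references) rather than anything from the actual template:

MyService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref MyCluster
    TaskDefinition: !Ref MyTaskDef
    LaunchType: FARGATE
    # DesiredCount intentionally omitted: per the issue quoted above,
    # UpdateService then leaves the current count alone, so whatever count
    # auto scaling has reached is preserved and "minimum healthy percent" /
    # "maximum percent" effectively apply to the tasks actually running
    # at deploy time.
    NetworkConfiguration:
      AwsvpcConfiguration:
        Subnets:
          - !Ref PrivateSubnetA        # hypothetical
        SecurityGroups:
          - !Ref ServiceSecurityGroup  # hypothetical

With DesiredCount gone from the template, a CodePipeline-driven stack update rolls the new task definition across however many tasks are currently running instead of pulling the service back down to 2.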

Related

AWS Batch: How to check if requested fargate resources match what's provided

I have a job definition in AWS Batch for a job to be run on Fargate Spot. The definition specifies, among other things, the required VCPU and MEMORY values for the task.
Here's an example of how it looks in CloudFormation:
ResourceRequirements:
  - Type: VCPU
    Value: 2
  - Type: MEMORY
    Value: 4096
and in the console (screenshot omitted).
So far so good. When I submit a task with this definition, a Fargate container gets spun up in ECS. But there's no indication as to what resources this container has.
The relevant resource details on the running ECS task are empty/not defined (screenshot omitted),
and the script I'm running indicates that there's more memory available than requested.
So how can I (most easily) confirm that the resources requested for this task are what's actually being provided?
My concern is that once the processing scales up, over-provisioned resources would mean significant excess cost.

Replace ECS tasks in cluster using AWS cli

I'm trying to replace the current tasks in an ECS cluster.
Context:
I have 2 tasks (and a maximum of 4)
Every time I make a change to the docker image, the image is built, tagged, and pushed to ECR (through Jenkins). I wanted to add a timer and after x minutes, replace the current tasks with new ones (also in the CI/CD)
I tried
aws ecs update-service --cluster myCluster --service myService --task-definition myTaskDef
but it didn't work.
I also tried several suggestions that I found on Stack Overflow and in forums, but in the best cases I ended up with 4 tasks, when I just want to replace the current ones with new ones.
Is this possible using the CLI?
First, as mentioned by Marcin, when --force-new-deployment is not specified and there is no change in the task definition revision, the deployment will be ignored by ECS.
Second, the extra replicas you are seeing after deployment are due to minimumHealthyPercent and maximumPercent, as the service scheduler uses these parameters to determine the deployment strategy.
minimumHealthyPercent
If minimumHealthyPercent is below 100%, the scheduler can ignore desiredCount temporarily during a deployment. For example, if desiredCount is four tasks, a minimum of 50% allows the scheduler to stop two existing tasks before starting two new tasks. Tasks for services that do not use a load balancer are considered healthy if they are in the RUNNING state. Tasks for services that use a load balancer are considered healthy if they are in the RUNNING state and the container instance they are hosted on is reported as healthy by the load balancer.
maximumPercent
The maximumPercent parameter represents an upper limit on the number of running tasks during a deployment, which enables you to define the deployment batch size. For example, if desiredCount is four tasks, a maximum of 200% starts four new tasks before stopping the four older tasks (provided that the cluster resources required to do this are available).
(From the UpdateService documentation: "Modifies the parameters of a service.")
So with minimumHealthyPercent set to 50%, the scheduler will stop one existing task before starting one new task. If you set it to 0%, you may see a bad gateway from the LB, as the scheduler can stop both existing tasks before starting the two new ones.
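For reference, these two settings live in the service's deployment configuration. If the service is defined in CloudFormation (as in the question at the top of this page), a sketch of the relevant fragment, using the 50%/200% values from the example above and a hypothetical MyService resource name:

MyService:
  Type: AWS::ECS::Service
  Properties:
    # ... cluster, task definition, etc. ...
    DeploymentConfiguration:
      MinimumHealthyPercent: 50   # allow one of the two running tasks to be stopped first
      MaximumPercent: 200         # allow the deployment to temporarily double the task count

The same pair can also be passed to update-service via --deployment-configuration if the service is managed purely from the CLI.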
If you are still not able to control the rollout, then also pass --desired-count explicitly:
aws ecs update-service --cluster test --service test --task-definition test --force-new-deployment --desired-count 2
Usually you would use the --force-new-deployment parameter of update-service:
Whether to force a new deployment of the service. Deployments are not forced by default. You can use this option to trigger a new deployment with no service definition changes. For example, you can update a service's tasks to use a newer Docker image with the same image/tag combination (my_image:latest) or to roll Fargate tasks onto a newer platform version.

Blue/Green deployments with Auto Scaling Groups, CloudFormation and CodeDeploy

I have tried setting up a Blue/Green deployment by copying the AutoScalingGroup; however, this leaves the CloudFormation stack detached from its original resources, as CodeDeploy creates a new copy and deletes the original. I understand from another post (https://forums.aws.amazon.com/thread.jspa?messageID=861085) that AWS is developing improvements for this, but for now I am trying the following workaround. Any ideas would be really helpful.
CloudFormation creates the following:
Elastic Load Balancer
Target Group
AutoScalingGroup One (with LaunchConfiguration)
AutoScalingGroup Two (same as one but has no instances)
DeploymentGroup (with In-Place DeploymentStyle) which deploys a revision to AutoScalingGroup One
After CloudFormation finishes, I do the following manually in the console:
I update the created Deployment Group to be of Deployment Style Blue/Green and set its original environment to be AutoScalingGroup One.
I add an instance to AutoScalingGroup Two
I create a deployment in CodeDeploy. However, this does not work as when a new instance is attached to AutoScalingGroup Two, it gets added to the TargetGroup immediately and does not pass health checks.
Any ideas on how to implement a set of resources with CloudFormation that will make blue green deployments simple, i.e. one click in CodeDeploy and CloudFormation resources still remaining intact?
With regard to the initial issue you are describing, did you experiment with the Health Check Grace Period? That should prevent the problems you describe with the failing health check when the instance hits the target group.
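For what it's worth, that grace period is just a property on the Auto Scaling group, so it can go straight into the template you already have. A rough sketch (property names are from AWS::AutoScaling::AutoScalingGroup; the 300 seconds and the referenced resources are placeholders, not a recommendation):

AutoScalingGroupTwo:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    LaunchConfigurationName: !Ref LaunchConfigurationTwo  # placeholder
    MinSize: "0"
    MaxSize: "2"
    TargetGroupARNs:
      - !Ref TargetGroup
    HealthCheckType: ELB
    # Give a freshly launched instance time to bootstrap and register with the
    # target group before failed ELB health checks count against it.
    HealthCheckGracePeriod: 300
    VPCZoneIdentifier:
      - !Ref PrivateSubnetA  # placeholder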
An alternative approach (which has plenty of its own downsides) is to adapt the CloudFormation template to compensate for the behavior when CodeDeploy replaces the ASG in a Blue-Green deployment.
Within the ASG template, create a "yes/no" parameter called "ManageAutoScalingGroup". Create the ASG conditionally on the value of this parameter being "yes", and set a deletion policy of Retain on the ASG so that CloudFormation will leave the group in place when the parameter is later changed to "no" (see the sketch after these steps). Spin up the group with a default of "yes" on this parameter.
Once the instances are healthy, and CodeDeploy has completed an initial in-place deployment, you can change the DeploymentGroup to use Blue-Green where CodeDeploy will replace your ASG.
Be sure to update the ASG stack and change ManageAutoScalingGroup to "no". CloudFormation will delete the reference from your stack, but it will leave the resource itself in place.
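A minimal sketch of that parameter/condition/retain arrangement (resource, condition, and parameter names here are illustrative; adapt them to your template):

Parameters:
  ManageAutoScalingGroup:
    Type: String
    AllowedValues: ["yes", "no"]
    Default: "yes"

Conditions:
  ManageASG: !Equals [!Ref ManageAutoScalingGroup, "yes"]

Resources:
  AutoScalingGroupOne:
    Type: AWS::AutoScaling::AutoScalingGroup
    Condition: ManageASG
    # Retain means that when the condition flips to "no" and CloudFormation
    # drops the resource from the stack, the group and its instances are
    # left running for CodeDeploy to manage.
    DeletionPolicy: Retain
    Properties:
      LaunchConfigurationName: !Ref LaunchConfigurationOne  # placeholder
      MinSize: "1"
      MaxSize: "2"
      TargetGroupARNs:
        - !Ref TargetGroup
      VPCZoneIdentifier:
        - !Ref PrivateSubnetA  # placeholder

Flipping ManageAutoScalingGroup to "no" on a stack update then removes the group from CloudFormation's management without terminating it, which is what lets the copy that CodeDeploy creates take over.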
This will give you the one-click deployments you desire through CodeDeploy, but be aware that it comes with some costs:
CodeDeploy will not copy the TargetGroup parameter of your Auto Scaling Group (as described by others in https://forums.aws.amazon.com/thread.jspa?threadID=249406&tstart=0). You should be able to work around this with a clever use of CloudWatch event rules and SSM Automation to mark the instance unhealthy when the ALB changes its status.
The copies that CodeDeploy produces seem to be fairly unreliable. At least once, I've seen my LaunchTemplate version reset to an incorrect value. I've also run into scenarios where the deployment group lost track of which ASG it was supposed to track.
Continuing to apply changes from your template to the ASG is a hassle. The process to "refresh" the group is: 1) Revert the parameter described earlier such that CloudFormation will generate a new group. 2) Modify the deployment group to target this group and complete an in-place deployment. 3) Modify the deployment group to restore Blue-Green deployments and update your stack accordingly.
I'm not too impressed with CodeDeploy in this department. I'd love to see them work in the same fashion as an ASG that is set to replace itself on application of a new LaunchTemplate version. If you are feeling a bit ambitious, you could mimic this behavior by leveraging Step Functions with ASG instance lifecycle hooks. This is a solution that I'm considering once I have the time.

CloudFormation, CodeDeploy, ELB & Auto-Scaling Group

I am trying to build a stack with an ELB, an Auto-Scaling Group and a Pipeline (with CodeBuild and CodeDeploy).
I can't understand how it is supposed to work:
the auto-scaling group starts two instances and waits X minutes before starting to check the instances' state
the CodeDeploy application deployment group is waiting for the Auto-Scaling group to be created and ready
the pipeline takes about 10 minutes to start deploying the application
My issue is that when I create the stack, there appears to be a loop: the AG requires an application from CodeDeploy, and CodeDeploy requires a stabilized AG. To be clear, by the time the application is ready to deploy, my Auto-Scaling group is already terminating instances and starting new ones, so CodeDeploy is trying to deploy to instances that are already terminated or terminating.
I don't really want to configure HealthCheckGracePeriod and PauseTime to be ~10-15 minutes... it is way too long.
Are there any best practices for CloudFormation + ELB + AG + CodeDeploy via a Pipeline?
What should be the steps to achieve that?
Thank you!
This stopping/starting of the instances is most probably linked to the deployment type: in-place vs. blue/green.
I have tried both in my setup, and I will try to summarize how they work.
Let's say that for this example, you have an Autoscaling group which at the time of deploying the application has 2 running instances and the deployment configuration is OneAtATime. Traffic is controlled by the Elastic Load Balancer. Then:
In-place deployment:
CodeDeploy gets notified of a new revision available.
It tells the ELB to stop directing traffic to 1st instance.
Once traffic to one instance is stopped, it starts the deployment process: Stop the application, download bundle etc.
If the deployment is successful (validate service hook returned 0), it tells ELB to resume traffic to that instance.
At this point, 1 instance is running the old code and 1 is running the new code.
Right after that, it tells the ELB to stop traffic to the 2nd instance and repeats the deployment process there.
Important note:
With the ELB enabled, the time it takes to block traffic to an instance before deployment, and the time it takes to allow traffic to it afterwards, depend directly on your health check settings: time = Healthy threshold * Interval.
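For example, assuming a healthy threshold of 5 checks and a 30-second interval (placeholder numbers, not a recommendation), that works out to 5 * 30 = 150 seconds on each side: roughly 2.5 minutes to drain an instance before deploying to it, and another 2.5 minutes before it receives traffic again.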
Blue/green deployment:
CodeDeploy gets notified of a new revision available.
It copies your Auto Scaling group: the same configuration (including scaling policies, scheduled actions, etc.) and the same number of instances (using the same AMI as your original AG) that were there at the start of the deployment - in our case 2.
At this point, there is no traffic going to the new AG.
CodeDeploy performs all the usual installation steps to one machine.
If successful, it deploys to the second machine too.
It directs traffic from the instances in your old AG to the new AG.
Once traffic is completely re-routed, it deletes the old AG and terminates all its instances (after a period specified in Deployment Settings - this option is only available if you select Blue/Green)
Now ELB is serving only the new AG.
From experience:
Blue/green deployment is a bit slower, since you need to wait for the machines to boot up, but you get a much safer and more fail-proof deployment.
In general I would stick with blue/green, with the load balancer enabled and the deployment configuration AllAtOnce (if it fails, customers won't be affected since the instances won't be receiving traffic, and it will be 2x as fast since you deploy in parallel rather than sequentially).
If your health checks and validate-service hooks are thorough enough, you can probably delete the original AG with minimal waiting time (5 minutes at the time of writing this post).

Updating ECS service with Terraform fails to place a new task

After pushing a new image of my container I use Terraform apply to update the task definition. This seems to work fine but in the ECS service list of tasks I can see the task as inactive and I have an event:
service blahblah was unable to place a task because no container instance met all of its requirements. The closest matching container-instance [guid here] is already using a port required by your task.
The thing is, the site is still active and working.
This is more of an ECS issue than a Terraform issue. Terraform is updating your task definition and updating the service to use the new task definition, but ECS is unable to schedule new tasks onto the container instances because you're (presumably) defining a specific port that the container must run on and directly mapping it to the host, or using host networking instead of bridge (or the newer awsvpc networking mode).
ECS has a couple of parameters to control the behaviour of an update to the service: minimum healthy percent and maximum percent. By default these are set to 100% and 200% respectively, meaning that ECS will attempt to deploy a new task matching the new task definition and wait for it to be considered healthy (such as passing ELB health checks) before terminating the old tasks.
In your case you have as many tasks as you have container instances in your cluster and so when it attempts to schedule a new task on to the cluster it is unable to place it because the port is already bound to by the old task. You could also find yourself in this position if you had placement constraints on your task/service.
Because the minimum healthy percent is set to 100% it is unable to schedule the removal of any of the old tasks that would then free up a placement option for a new task.
You could have more container instances in the cluster than you have instances of the task running which would allow ECS to deploy new tasks before removing old tasks from the other instances or you could change the minimum healthy percent (deployment_minimum_healthy_percent in Terraform's ECS service resource) to a number less than 100 that allows deployments to happen.
For example, if you normally deploy 3 instances of the task in the service then setting the minimum healthy percent to 50% would allow ECS to remove one task from the service before scheduling a new task matching the new task definition. It would then proceed with a rolling upgrade, making sure the new task is healthy before replacing the old task.
Setting the minimum healthy percent to 0% would mean that ECS can stop all of the tasks running before starting new tasks but this would obviously lead to a potential (but not guaranteed) service interruption.
Alternatively you could remove the placement constraint by switching away from host networking if that is viable for your service.