Replace ECS container instances in terraform setup

We have a Terraform deployment that creates an auto-scaling group for EC2 instances that we use as Docker hosts in an ECS cluster, with tasks running on that cluster. Replacing the tasks (e.g. with a newer version) works fine: create a new task definition revision and update the service, and AWS will perform a rolling update. However, how can I easily replace the EC2 host instances with newer ones without any downtime?
I'd like to do this so that a change to the ASG launch configuration takes effect, for example switching to a different EC2 instance type.
I've tried a few things; here's what I think gets closest to what I want:
1. Drain one instance. The tasks will be distributed to the remaining instances.
2. Once no tasks are running on that instance anymore, terminate it.
3. Wait for the ASG to spin up a new instance.
4. Repeat steps 1 to 3 until all instances are new.
This almost works. The problem is that:
It's manual and therefore error-prone.
After this process, one of the instances (the last one that was spun up) is running 0 (zero) tasks.
Is there a better, automated way of doing this? Also, is there a way to re-distribute the tasks in an ECS cluster (without creating a new task revision)?

Prior to making changes, make sure the ASG spans multiple Availability Zones, and that the containers do too. This ensures high availability when instances go down in one zone.
You can configure an UpdatePolicy on the Auto Scaling group with AutoScalingRollingUpdate, where you can set MinInstancesInService and MinSuccessfulInstancesPercent to higher values to maintain a slow and safe rolling upgrade.
You may go through this documentation to find further tweaks. To automate this process, you can use Terraform to update the ASG launch configuration; this will update the ASG with a new version of the launch configuration and trigger a rolling upgrade.
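Alternatively, if you want to script the exact drain-and-replace loop from the question, a minimal AWS CLI sketch could look like the following (the cluster name and sleep interval are placeholders, and it assumes the ASG launches a replacement for every instance you terminate):

#!/usr/bin/env bash
# Drain and replace each ECS container instance, one at a time.
CLUSTER=my-ecs-cluster   # placeholder name

for arn in $(aws ecs list-container-instances --cluster "$CLUSTER" \
    --query 'containerInstanceArns[]' --output text); do
  # Stop new task placement and let the service migrate tasks away.
  aws ecs update-container-instances-state --cluster "$CLUSTER" \
    --container-instances "$arn" --status DRAINING
  # Wait until the instance runs zero tasks.
  while [ "$(aws ecs describe-container-instances --cluster "$CLUSTER" \
      --container-instances "$arn" \
      --query 'containerInstances[0].runningTasksCount' --output text)" != "0" ]; do
    sleep 15
  done
  # Terminate the EC2 instance without lowering desired capacity,
  # so the ASG spins up a fresh replacement.
  ec2_id=$(aws ecs describe-container-instances --cluster "$CLUSTER" \
      --container-instances "$arn" \
      --query 'containerInstances[0].ec2InstanceId' --output text)
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$ec2_id" --no-should-decrement-desired-capacity
  # In a real script you would also wait here for the replacement
  # to register with the cluster before draining the next instance.
done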

Related

ECS is there a way to avoid downtime when I change instance type on Cloudformation?

I have created a cluster to run our test environment on AWS ECS. Everything seems to work fine, including zero-downtime deploys, but I realised that when I change instance types in CloudFormation for this cluster, it brings all the instances down, and my ELB starts to fail because there are no instances running to serve the requests.
The cluster is running on Spot Instances, so my question is: is there any way to update instance types for Spot Instances without bringing the whole cluster down?
Do you have an Auto Scaling group? That would allow you to change the launch template or config to use the new instance type. Then you would set the ASG desired and minimum counts to a higher number, let the new instance type spin up and go into service in the target group, and then just delete the old instances and set your Auto Scaling limits back to normal.
Without an ASG, you could launch a new instance manually and place that instance in the ECS target group. Confirm that it joins the cluster and is running your service and task, then delete the old instance.
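For the ASG path, a rough CLI sketch (all names are placeholders, and it assumes a launch template called my-template already carries the new instance type in its latest version):

# Point the ASG at the latest launch template version, then double capacity
# so new instances come up before the old ones go away.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg \
  --launch-template LaunchTemplateName=my-template,Version='$Latest' --max-size 4
aws autoscaling set-desired-capacity --auto-scaling-group-name my-asg --desired-capacity 4
# ...wait for the new instances to show healthy in the target group...
# Then remove an old instance and shrink capacity back in one step:
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 --should-decrement-desired-capacity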
You might want to break this activity into smaller chunks and do it one instance at a time. You can write a small CloudFormation template as well, because by default, if you update the instance type, your instances will be restarted; to avoid downtime you may have to replace them one at a time.
However, there are two other ways that I can think of here, but both will cost you money.
ASG: Create a new Auto Scaling group, or use the existing one and change the launch configuration.
Blue/Green Deployment: Create the exact same set of resources, but this time with the updated instance type, and use Route 53's weighted routing policy to control the traffic.
It solely depends upon the requirements: if you can spend the money, go with the two approaches above; otherwise stick with the small, incremental deployments.
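As an illustration of the blue/green option, shifting half the traffic to the new stack with Route 53 might look like this (the hosted zone ID, record names, and weights are all hypothetical):

# Give the new (green) stack a weighted record alongside the old one.
aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "green",
        "Weight": 50,
        "TTL": 60,
        "ResourceRecords": [{"Value": "green-elb.example.com"}]
      }
    }]
  }'
# Raise the green weight (and lower the blue one) as confidence grows.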

AWS Autoscaling updating

You can create a new launch configuration (updating the AMI or whatever) and attach it to an existing Auto Scaling group. Per the AWS docs: After you change the launch configuration for an Auto Scaling group, any new instances are launched using the new configuration options, but existing instances are not affected.
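To check which instances are still running the old configuration after such a switch, something like the following works (the group name is a placeholder):

# List each instance alongside the launch configuration it was started from.
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names www-asg \
  --query 'AutoScalingGroups[0].Instances[].[InstanceId,LaunchConfigurationName]' \
  --output table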
How do you force this? Meaning: relaunch all instances now (with the new AMI). Do I have to delete the existing Auto Scaling group and create a new one (with the new config)? Or do I simply delete the existing instances (one by one, manually) and let the ASG relaunch them with the new AMI? Any best practices/gotchas?
CloudFormation has the RollingUpdate flag (not sure how to do this outside of CF)
Thanks
AWS has some OOTB solutions for this: CloudFormation (like you say), Elastic Beanstalk (built on top of CF), and CodeDeploy blue-green deployments (I've not tried this).
Personally, for our SQS-polling ASG, we do a "cold deploy", i.e. only "deploy" when there are no messages to process (and hence, due to a scaling policy, no instances). It's been really effective.
A deploy can be done safely whilst there are messages, provided that you set scale-in protection on the instance during message processing (and remove it and wait briefly before polling):
set desired-capacity to 0
wait a bit (for there to be no instances running)
set desired-capacity back to N.
Note: you can do this all in the console, or with a couple of CLI calls as sketched below.
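A minimal sketch of that cold deploy with the AWS CLI (group name and capacity are placeholders):

# Scale to zero, wait for instances to drain away, then scale back up.
aws autoscaling set-desired-capacity --auto-scaling-group-name my-sqs-workers \
  --desired-capacity 0
# ...poll describe-auto-scaling-groups until the Instances list is empty...
aws autoscaling set-desired-capacity --auto-scaling-group-name my-sqs-workers \
  --desired-capacity 3

# And while a worker is processing a message, it can protect itself:
aws autoscaling set-instance-protection --auto-scaling-group-name my-sqs-workers \
  --instance-ids i-0123456789abcdef0 --protected-from-scale-in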
You can code a solution yourself that does this... but I probably wouldn't.
Be careful:
simple delete existing instances (one by one manually)
Whether you can do this depends on whether the instances are still handling requests/processing (usually you can't simply terminate an instance without dropping service).
I recommend Elastic Beanstalk, which gives you a rolling update feature for free and is very easy to get started with. I've not tried the CodeDeploy blue-green, but it looks interesting. If you want more advanced behaviour (or are already using it), look into CloudFormation... do not code your own solution for rolling deployments: just use CloudFormation.
If your issue is with "in-flight" requests, simply enable connection draining or increase the deregistration delay of the ELB or target groups attached to the ASG. You can set a value of up to one hour.
When you enable connection draining, you can specify a maximum time for the load balancer to keep connections alive before reporting the instance as de-registered. The maximum timeout value can be set between 1 and 3,600 seconds (the default is 300 seconds). When the maximum time limit is reached, the load balancer forcibly closes connections to the de-registering instance.
Then you can detach the old instances.
If you detach an instance from an Auto Scaling group that has an attached load balancer, the instance is deregistered from the load balancer. If you detach an instance from an Auto Scaling group that has an attached target group, the instance is deregistered from the target group. If connection draining is enabled for your load balancer, Auto Scaling waits for in-flight requests to complete.
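For reference, both settings are one-liners (the load balancer name and target group ARN are placeholders):

# Classic ELB: enable connection draining with a 300-second timeout.
aws elb modify-load-balancer-attributes --load-balancer-name my-elb \
  --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":300}}'

# ALB/NLB target group: the equivalent knob is the deregistration delay.
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:123456789012:targetgroup/my-tg/abc123 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=300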
If you don't want to do any manual scaling, I guess the best approach is to change the termination policy to OldestInstance and leave the ASG as it is. When scale-in activity happens, the ASG will automatically terminate the old instances (in your case, the instances from the old launch config).
OldestInstance. Auto Scaling terminates the oldest instance in the group. This option is useful when you're upgrading the instances in the Auto Scaling group to a new EC2 instance type. You can gradually replace instances of the old type with instances of the new type.
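Setting that policy is a single call (the group name is a placeholder):

# Terminate the oldest instances first during scale-in.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg \
  --termination-policies OldestInstance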

Best DevOps use of AWS Auto-scaling Groups?

I've been working on a DevOps pipeline for an application hosted on AWS. I want to make an improvement to my current setup, but I'm not sure of the best way to go about doing it. My current setup is as follows:
ASG behind ELB
Desired capacity: 1
Min capacity: 1
Max capacity: 1
Code deployment process:
move deployable to S3
terminate instance in ASG
new instance is automatically provisioned
new instance pulls down deployable in user data
The problem with this setup is that the environment is down from when the instance is terminated to when the new instance has been completely provisioned.
I've been thinking about ways that I can improve this process to eliminate the downtime, and I've come up with two possible solutions:
SOLUTION #1:
ASG behind ELB
Desired capacity: 1
Min capacity: 1
Max capacity: 2
Code deployment process:
move deployable to S3
launch new instance into ASG
new instance pulls down deployable in user data
terminate instance with old deployable
With this solution, there is always at least one instance capable of serving requests in the ASG. The problem is, ASGs don't seem to support a simple operation of manually calling on it to spin up a new instance. (They only launch new instances when the scaling policies call for it.) You can attach existing instances to the group, but this causes the desired capacity value to increase, which I don't want.
SOLUTION #2:
ASG behind ELB
Desired capacity: 2
Min capacity: 2
Max capacity: 2
Code deployment process:
move deployable to S3
terminate instance-A
new instance-A is automatically provisioned
instance-A pulls down new deployable by user data script
terminate instance-B
new instance-B is automatically provisioned
instance-B pulls down new deployable by user data script
Just as with the previous solution, there is always at least one instance available to serve requests. The problem is, there are usually two instances, even when only one is needed. Additionally, the code deployment process seems needlessly complicated.
So which is better: solution #1, solution #2, or some other solution I haven't thought of yet? Also, a quick disclaimer: I understand that I'm using ASGs for something other than their intended purpose, but it seemed the best way to implement automated code deployments in line with AWS's "EC2 instances are cattle" philosophy.
The term you are looking for is "zero-downtime deployment."
The problem is, ASGs don't seem to support a simple operation of manually calling on it to spin up a new instance. (They only launch new instances when the scaling policies call for it.) You can attach existing instances to the group, but this causes the desired capacity value to increase, which I don't want.
If you change desired capacity yourself (e.g. via an API call), the Auto Scaling Group will automatically launch an extra instance for you. For example, here is a simple way to implement zero-downtime deployment for your Auto Scaling Group (ASG):
Run the ASG behind an Elastic Load Balancer (ELB).
Initially, the desired capacity is 1, so you have just one EC2 Instance in the ASG.
To deploy new code, you first create a new launch configuration with the new code (e.g. new AMI or new User Data).
Next, you change the desired capacity from 1 to 2. The ASG will automatically launch a new EC2 Instance with the new launch configuration.
Once the new EC2 Instance is up and running and registered in your ELB, you change the desired capacity from 2 back to 1, and the ASG will automatically terminate the older EC2 Instance.
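A sketch of those steps with the AWS CLI (names, AMI ID, and sizes are placeholders; note that the default termination policy prefers instances with the oldest launch configuration, which is what makes the final scale-in remove the old instance):

# New launch configuration carrying the new code/AMI.
aws autoscaling create-launch-configuration --launch-configuration-name app-v2 \
  --image-id ami-xxxxxxxx --instance-type t3.small
aws autoscaling update-auto-scaling-group --auto-scaling-group-name app-asg \
  --launch-configuration-name app-v2 --max-size 2
# Scale out: a second instance launches with the new configuration.
aws autoscaling set-desired-capacity --auto-scaling-group-name app-asg --desired-capacity 2
# ...wait for the new instance to be InService behind the ELB...
# Scale in: the instance on the old launch configuration is terminated.
aws autoscaling set-desired-capacity --auto-scaling-group-name app-asg --desired-capacity 1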
You can implement this manually or use existing tools to do it for you, such as:
Define your ASG using CloudFormation and specify an UpdatePolicy that does a zero-downtime rolling deployment.
Define your ASG using Terraform and use the create_before_destroy lifecycle property to do a zero-downtime (sort-of) blue-green deployment as described here.
Define your ASG using Ansible and use the serial keyword to do rolling upgrades.
Use the aws-ha-release script.
You can learn more about the trade-offs between tools like Terraform, CloudFormation, Ansible, Chef, and Puppet here.
Even though this is a DevOps pipeline and not a production environment, what you are describing sounds like a blue/green deployment scenario in which you want to be able to switch between environments without downtime. I think the best answer is largely specific to your requirements (which we don't 100% know), but a guide like The DOs and DON'Ts of Blue/Green Deployment will be beneficial in finding the best way to achieve your goals, whether it is #1, #2, or something else.

How to update custom AMIs using packer and integrated with auto-scaling group?

Goal: to keep the startup period for bringing instances into load balancing to a minimum, and to reduce troubleshooting time.
Approach:
Create a base custom AMI for EC2 instances.
Update/rebundle the custom AMI on every release and software patch (code and software updates related to the healthy running instance).
2.a. Is using Packer (or any CI) for the update possible? If so, how? (I'm unable to find a step-by-step approach in the documentation.)
Automate steps 1 and 2 using Chef.
Integrate this AMI into the Auto Scaling group (experimented with this).
Map the load balancer to the ASG [done].
Maintain the desired count of instances by bringing up instances from the updated AMI in the ASG (with the LB) upon failure.
Crux: terminate the unhealthy instance and bring up a healthy instance from the AMI asap.
--
P.S:
I have gone through many posts, from http://blog.kik.com/2016/03/09/using-packer-io-to-optimize-and-manage-ami-creation/ to https://alestic.com/.
Using Docker has been ruled out of the discussion.
But still unable to figure out a clear way to do it.
The simplest way to swap a new AMI into an existing ASG is to update the launch config, then kill any instance using the old AMI ID one by one. The ASG will bring up new instances as needed, which will use the new AMI. If you want to get fancier (like keeping old instances alive for quick rollback), check out tools like Spinnaker, which brings up each new AMI as a new corresponding ASG, remaps the ELB to swap traffic over if no problems are detected, and then, later on, once you are sure the deploy is good, kills off the old ASG and all associated instances.
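In CLI terms, the simple version of that loop might look like this (all names are placeholders, and it assumes a launch configuration my-lc-v2 already points at the AMI baked by packer build):

# Switch the ASG to the launch configuration with the new AMI.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg \
  --launch-configuration-name my-lc-v2
# Recycle the existing (old-AMI) instances one at a time.
for id in $(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names my-asg \
    --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text); do
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$id" --no-should-decrement-desired-capacity
  # ...wait for the replacement to be InService before killing the next one...
done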

Auto Scaling Group launch config changes

I wonder if there is a simple way, or best practices, to ensure that all instances within an Auto Scaling group have been launched with that group's current launch configuration.
To give an example, imagine an auto-scaling group called www-asg with 4 desired instances running webservers behind an ELB. I want to change the AMI or the userdata used to start instances of this auto-scaling group. So I create a new launch configuration www-cfg-v2 and update www-asg to use that.
# create new launch config
as-create-launch-config www-cfg-v2 \
--image-id 'ami-xxxxxxxx' --instance-type m1.small \
--group web,asg-www --user-data "..."
# update my asg to use new config
as-update-auto-scaling-group www-asg --launch-configuration www-cfg-v2
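In case you're using the unified AWS CLI rather than the legacy as-* tools, the same two steps translate roughly to:

# create new launch config
aws autoscaling create-launch-configuration --launch-configuration-name www-cfg-v2 \
  --image-id ami-xxxxxxxx --instance-type m1.small \
  --security-groups web asg-www --user-data "..."
# update my asg to use new config
aws autoscaling update-auto-scaling-group --auto-scaling-group-name www-asg \
  --launch-configuration-name www-cfg-v2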
By now all 4 running instances still use the old launch configuration. I wonder if there is a simple way of replacing all running instances with new instances to enforce the new configuration, while always ensuring that the minimum number of instances is kept running.
My current way of achieving this is as follows:
1. save the list of currently running instances for the given Auto Scaling group
2. temporarily increase the number of desired instances by 1
3. wait for the new instance to be available
4. terminate one instance from the list via
as-terminate-instance-in-auto-scaling-group i-XXXX \
--no-decrement-desired-capacity --force
5. wait for the replacement instance to be available
6. if more than 1 instance is left on the list, repeat from step 4
7. terminate the last instance from the list via
as-terminate-instance-in-auto-scaling-group i-XXXX \
--decrement-desired-capacity --force
8. done; all instances should now run with the same launch config
I have mostly automated this procedure, but I feel there must be some better way of achieving the same goal. Does anyone know a better, more efficient way?
mathias
Also posted this question in the official AWS EC2 Forum.
Old question, I know, but I thought I would share my approach.
I change the launch config for the ASG, then launch the same number of instances as are currently in the ASG; as they become available (after automated testing), they are attached to the ASG. Once the machines have been added, our deployment system updates our Varnish load balancer(s) to use the new instances, and the old instances are terminated.
All of the above is automated, and a full site scale switch takes about 5 minutes, depending on the launch time.
In case you are wondering: we use SNS to handle updating Varnish when instances are added or removed. In the case of our load balancers scaling (which almost never happens), the deployment system updates our Route 53 config instead.
I think that pretty much covers everything.
This isn't a lot different, but you could (see the sketch after the list):
create the new LC
create a new ASG using the new LC
scale down the old ASG
delete the old ASG and LC
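A rough CLI sketch of that roll (every name here is a placeholder):

# Bring up the replacement group on the new launch configuration.
aws autoscaling create-auto-scaling-group --auto-scaling-group-name app-asg-v2 \
  --launch-configuration-name app-lc-v2 --min-size 4 --max-size 4 --desired-capacity 4 \
  --vpc-zone-identifier subnet-aaa,subnet-bbb
# ...wait for the new instances to pass health checks and take traffic...
# Drain and remove the old group and its launch configuration.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name app-asg-v1 \
  --min-size 0 --max-size 0 --desired-capacity 0
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name app-asg-v1
aws autoscaling delete-launch-configuration --launch-configuration-name app-lc-v1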
I do deployments this way, and in my experience it's cleaner to roll from one ASG to another rather than having to jump back and forth. But as I noted, it's not a huge difference.
It might be worth looking at https://github.com/Netflix/asgard, which is a Netflix OSS tool for managing Auto Scaling groups. I ended up not using it, but it's pretty interesting nonetheless.