Scale in Concourse workers without killing tasks

Each worker runs multiple tasks. If we have a lot of tasks, we'll need multiple workers. In order to save resources we'd like to elastically scale workers in and out, according to supply (spare capacity) and demand (pending tasks).
Scaling out is easy: add more nodes, they register themselves with the TSA and start working.
Scaling in is trickier: one needs to wait for a worker's tasks to finish before killing its instance. Otherwise they'll have to restart on another worker. That's fine for small tasks, but for longer ones it might not be acceptable.
One possible solution on AWS would be to use Auto Scaling lifecycle hooks to synchronously tell the worker not to accept any more tasks and to return when all are finished, then kill it. The Concourse worker API doesn't have any such operation though.
Is there a way to implement safe scaling in of Concourse workers?
If the answer is "don't worry, Bosh will take care of it" I'd like to know what those mechanics are as I probably won't be using it.

You have to use the concourse binary from the command line, on the host that runs the ATC (which is the Concourse scheduler and web interface):
concourse --help
Usage:
  concourse [OPTIONS] <command>

Application Options:
  -v, --version  Print the version of Concourse and exit [$CONCOURSE_VERSION]

Help Options:
  -h, --help     Show this help message

Available commands:
  land-worker    Safely drain a worker's assignments for temporary downtime.
  retire-worker  Safely remove a worker from the cluster permanently.
  web            Run the web UI and build scheduler.
  worker         Run and register a worker.
So it looks like you could hook something to the Auto Scaling lifecycle hook that calls land-worker and then retire-worker (not sure whether retire-worker alone would be enough), once you figure out which worker you want to spin down...
When you spin the same worker back up, you might have to be careful with the worker name; I seem to remember that sometimes the ATC gets confused. You will have to experiment with that (whether you can keep the same name or change it).
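For reference, landing a worker from the command line looks roughly like this. This is only a sketch: the flags mirror the retire-worker invocation in the next answer, and the TSA host and key file names are placeholders that may differ between Concourse versions.
# Ask the TSA to land the worker: it stops taking new work so in-flight builds can finish
concourse land-worker \
    --name $HOSTNAME \
    --tsa-host some-tsa-host:2222 \
    --tsa-public-key some_tsa_host_key.pub \
    --tsa-worker-private-key some_worker_key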

You can create a lifecycle hook on your Concourse worker ASG:
Type: AWS::AutoScaling::LifecycleHook
Properties:
  AutoScalingGroupName: !Ref ConcourseWorkerASG
  DefaultResult: CONTINUE # or ABANDON
  HeartbeatTimeout: 900 # 15 minutes, for example
  LifecycleHookName: lchname
  LifecycleTransition: "autoscaling:EC2_INSTANCE_TERMINATING"
Use a script to retire the worker, something along the lines of lch.sh:
#!/bin/bash
TYPE=$(cat /opt/concourse/type)
tsa_host=zz
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id/)
lifecycleState=$(aws autoscaling describe-auto-scaling-instances --instance-ids $instance_id --query 'AutoScalingInstances[0].LifecycleState' --output text --region eu-west-1)

if [ "$TYPE" == "worker" ]; then
    if [ "$lifecycleState" == "Terminating:Wait" ]; then
        asg=$(aws autoscaling describe-auto-scaling-instances --instance-ids $instance_id --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text --region eu-west-1)
        # Retire the worker so it stops accepting new work and drains
        /opt/concourse/concourse/bin/concourse retire-worker \
            --name $HOSTNAME \
            --tsa-host ${tsa_host}:2222 \
            --tsa-public-key some_tsa_host_key.pub \
            --tsa-worker-private-key some_worker_key
        # Give in-flight builds some time to finish draining
        sleep 5m
        systemctl stop your_concourse_service
        # Tell the ASG it can now terminate the instance
        aws autoscaling complete-lifecycle-action \
            --instance-id $instance_id \
            --auto-scaling-group-name $asg \
            --lifecycle-hook-name "lchname" \
            --lifecycle-action-result "CONTINUE" \
            --region eu-west-1
    fi
fi
Then schedule a cron job, for example via Ansible:
- name: List lch.sh as cronjob
  cron:
    name: "check asg lch for retiring the worker"
    minute: "*/5" # run every 5 minutes
    job: "/opt/concourse/lch.sh"

Related

How to avoid downtime when restarting ECS Fargate services?

I have this bash script:
#!/bin/bash
myClusterId="myCluster"
for service in $(aws ecs list-services --cluster $myClusterId --query "serviceArns[*]" | jq -r 'to_entries[] | .value | sub(".*/";"")'); do
    for task in $(aws ecs list-tasks --cluster $myClusterId --service-name $service --desired-status 'RUNNING' --no-paginate --output text --query 'taskArns[*]'); do
        aws ecs stop-task --cluster $myClusterId --task $task --reason "Restarted using bash script" > /dev/null 2>&1
    done
done
In short, it will restart all my ECS Fargate tasks under myCluster (excluding scheduled tasks triggered by CloudWatch Rules). It's working fine so far.
All my services have minHealthyPercent set to 100 and maxHealthyPercent set to 200. But I noticed that it didn't keep any healthy tasks during the restart process. All tasks get killed immediately, and my load balancer throws a 503 Service Temporarily Unavailable error while the new tasks are in the pending/provisioning state.
Am I missing something in my script? How do I correctly perform a no-downtime service restart using the AWS CLI?
The parameters maximumPercent and minimumHealthyPercent are only used during rolling updates of your ECS service:
The number of tasks that Amazon ECS adds or removes from the service during a rolling update is controlled by the deployment configuration. A deployment configuration consists of the minimum and maximum number of tasks allowed during a service deployment.
Restarting a task is not considered a new deployment.
To rectify the issue, there are a few choices:
Include a sleep in your for loop. It's the crudest way, but the fastest to implement for testing.
Use describe-tasks in the for loop to pull the state of the task you just restarted. Proceed with restarting the next task only once the most recently restarted one is RUNNING (see the sketch at the end of this answer).
I think your best option would be to do a blue/green deployment through CodeDeploy, assuming you use an Elastic Load Balancer. The blue/green deployment will automatically detect any error and stop the deployment if required.
https://aws.amazon.com/blogs/devops/use-aws-codedeploy-to-implement-blue-green-deployments-for-aws-fargate-and-amazon-ecs/
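A minimal sketch of the second option above, reusing the question's cluster variable. Instead of polling describe-tasks by hand, it uses the services-stable waiter as a simple health gate between restarts; myService is a placeholder service name.
#!/bin/bash
myClusterId="myCluster"   # same placeholder as in the question
service="myService"       # placeholder service name

for task in $(aws ecs list-tasks --cluster $myClusterId --service-name $service --desired-status RUNNING --output text --query 'taskArns[*]'); do
    # Stop one task; ECS will start a replacement to meet the desired count
    aws ecs stop-task --cluster $myClusterId --task $task --reason "Rolling restart" > /dev/null
    # Block until the service is back at its desired count and stable
    # before stopping the next task, so the load balancer keeps healthy targets
    aws ecs wait services-stable --cluster $myClusterId --services $service
done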

How to simply "recycle/reboot" running tasks in AWS ECS with no change in image/source-code?

I have a simple query: what is the best way to simply recycle/reboot a service with 2 tasks using the AWS ECS console, without any actual change being deployed?
Currently I need to update the service and set the task count from 2 to 0, wait for the tasks to drain out, and then set the task count from 0 to 2 to bring them back up. This is how I currently recycle/reboot the 2 tasks of a service.
I sometimes need to do this due to an internal app error and just want to reboot the tasks without any actual change, which resolves my problem.
AWS provides one option (the Force new deployment checkbox), which is not helping; does it only work if there is a change in the image? I wish AWS provided an option like "Recycle a service (tasks)" which would start 2 new tasks and drain out the 2 existing tasks.
What would be the best and easiest way to do it using the AWS Console or even the AWS API/CLI?
If you stop the tasks, ECS will launch new ones to satisfy the desired count. That's fairly easy in the ECS console: just select the tasks in the list of tasks and choose Stop in the Action dropdown.
Using the aws CLI you can get a list of the tasks to kill using:
aws ecs list-tasks --service-name my-service
To stop each task, use:
aws ecs stop-task --task %1
where %1 is the ARN of the task as provided by the first command.
Here is a command that combines both commands above. It will kill all the tasks of a given service:
SVC=your-service-name-here
aws ecs list-tasks --service-name $SVC --output text | cut -f2 | perl -ne 'system("aws ecs stop-task --task $_")'

Can I access ECS Fargate Task tags after the Task has exited or stopped running?

I want access to the tags of an ECS Task after it has stopped and the container has exited.
I launch a task using aws ecs run-task and attach a tag to it. I'm able to do this using the --tags option of that command, but I can only access the tags while the task is running. Once the task finishes and the container exits, I can't access the tags anymore. Is there a way to get the tags of tasks that are NOT currently running?
This is the aws-cli command I'm using to launch a task with tags:
aws ecs run-task \
    --cluster ${CLUSTER} \
    --task-definition ${TASK_NAME}-${TASK_ENV} \
    --launch-type FARGATE \
    --network-configuration "${AWS_VPC_CONFIGURATION}" \
    --tags key='testKey',value='1' \
    --enable-ecs-managed-tags
I have tried using aws ecs list-tags-for-resource --resource-arn ${ARN}, but it only shows tags if the task is still running. If I try this on a task that has already completed/exited with exit code 0, I get this error:
An error occurred (InvalidParameterException) when calling the ListTagsForResource operation: The specified task is stopped. Specify a running task and try again.
I also tried aws ecs describe-tasks, but that also returns an empty tags array ("tags": []) once the task has exited, with no actual tag values, even though the task was launched with tags.
NOTE: in my use case the task definitions DO NOT have tags; I'm assigning a tag when the run-task command is executed.
Even running ECS on EC2, the tasks are volatile. They don't hang around for long after failure.
You can see info on them, like return codes and such, for maybe a few hours at best after they stop.
On Fargate, they seem to get harvested even more aggressively, so if you are trying to collect metrics or something, it's probably not a good idea to rely on harvesting info from stopped tasks. Rather, store the info somewhere more permanent before the task exits and retrieve it as needed.
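A minimal sketch of that approach, run inside the container before it exits. It assumes the image ships curl, jq and the AWS CLI, the task role allows ecs:ListTagsForResource and s3:PutObject, and my-task-metadata-bucket is a placeholder bucket name:
#!/bin/bash
# ECS_CONTAINER_METADATA_URI_V4 is injected by Fargate (platform version 1.4+)
TASK_ARN=$(curl -s "${ECS_CONTAINER_METADATA_URI_V4}/task" | jq -r '.TaskARN')

# While the task is still RUNNING its tags are retrievable...
aws ecs list-tags-for-resource --resource-arn "$TASK_ARN" > /tmp/task-tags.json

# ...so persist them somewhere durable before the task stops
aws s3 cp /tmp/task-tags.json "s3://my-task-metadata-bucket/${TASK_ARN##*/}.json"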

aws ec2 wait - Max attempts exceeded

I am working on a shell script which does the following:
creates a snapshot of an EBS volume;
creates an AMI image based on this snapshot.
1) I use the following command to create the snapshot:
SNAPSHOT_ID=$(aws ec2 create-snapshot "${DRYRUN}" --volume-id "${ROOT_VOLUME_ID}" --description "${SNAPSHOT_DESCRIPTION}" --query 'SnapshotId')
2) I use a waiter to wait for the completed state:
aws ec2 wait snapshot-completed --snapshot-ids "${SNAPSHOT_ID}"
When I test it with an 8 GB EBS volume, everything goes well.
When it is 40 GB, I get an exception:
Waiter SnapshotCompleted failed: Max attempts exceeded
Probably, a 40 GB volume takes more time than an 8 GB one, and I just need to wait longer.
The AWS docs (http://docs.aws.amazon.com/cli/latest/reference/ec2/wait/snapshot-completed.html) don't mention any timeout or number-of-attempts option.
Maybe some of you have faced the same issue?
So, finally, I used the following approach to solve it:
Create the snapshot
Use a loop to check the exit status of aws ec2 wait snapshot-completed
If the exit status is not 0, print the current state and progress and run the waiter again.
# Create snapshot
SNAPSHOT_DESCRIPTION="Snapshot of Primary frontend instance $(date +%Y-%m-%d)"
SNAPSHOT_ID=$(aws ec2 create-snapshot "${DRYRUN}" --volume-id "${ROOT_VOLUME_ID}" --description "${SNAPSHOT_DESCRIPTION}" --query 'SnapshotId')

# exit_status is unset on the first pass, so the loop always runs at least once;
# it keeps re-running the waiter until it exits successfully
while [ "${exit_status}" != "0" ]
do
    SNAPSHOT_STATE="$(aws ec2 describe-snapshots --filters Name=snapshot-id,Values=${SNAPSHOT_ID} --query 'Snapshots[0].State')"
    SNAPSHOT_PROGRESS="$(aws ec2 describe-snapshots --filters Name=snapshot-id,Values=${SNAPSHOT_ID} --query 'Snapshots[0].Progress')"
    echo "### Snapshot id ${SNAPSHOT_ID} creation: state is ${SNAPSHOT_STATE}, ${SNAPSHOT_PROGRESS}%..."
    aws ec2 wait snapshot-completed --snapshot-ids "${SNAPSHOT_ID}"
    exit_status="$?"
done
If you have something that can improve it, please share with us.
You should probably use until in bash; it looks a bit cleaner and you don't have to repeat yourself.
echo "waiting for snapshot $snapshot"
until aws ec2 wait snapshot-completed --snapshot-ids $snapshot 2>/dev/null
do
    # The waiter gave up; report current progress and run it again
    progress=$(aws ec2 describe-snapshots --snapshot-ids $snapshot --query "Snapshots[*].Progress" --output text)
    printf "\rsnapshot progress: %s" $progress
    sleep 10
done
aws ec2 wait snapshot-completed takes a while to time out. This snippet uses aws ec2 describe-snapshots to get the progress. When it's 100% it calls snapshot-completed.
# create snapshot
SNAPSHOTID=$(aws ec2 create-snapshot --volume-id $VOLUMEID --output text --query "SnapshotId")
echo "Waiting for Snapshot ID: $SNAPSHOTID"
SNAPSHOTPROGRESS=$(aws ec2 describe-snapshots --snapshot-ids $SNAPSHOTID --query "Snapshots[*].Progress" --output text)

while [ $SNAPSHOTPROGRESS != "100%" ]
do
    sleep 15
    echo "Snapshot ID: $SNAPSHOTID $SNAPSHOTPROGRESS"
    SNAPSHOTPROGRESS=$(aws ec2 describe-snapshots --snapshot-ids $SNAPSHOTID --query "Snapshots[*].Progress" --output text)
done

aws ec2 wait snapshot-completed --snapshot-ids "$SNAPSHOTID"
This is essentially the same thing as above, but prints out a progress message every 15 seconds. Snapshots that are completed return 100% immediately.
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-retries.html
You can set an environment variable or use the config file to increase the number of retry attempts:
AWS_MAX_ATTEMPTS=100
Or in ~/.aws/config:
[default]
retry_mode = standard
max_attempts = 6
ISSUE: In CI/CD we had a command waiting for the ECS service to become stable and got this error:
aws ecs wait services-stable \
    --cluster MyCluster \
    --services MyService
ERROR MSG: Waiter ServicesStable failed: Max attempts exceeded
FIX
In order to fix this issue we followed these docs:
-> https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/load-balancer-healthcheck.html
aws elbv2 modify-target-group --target-group-arn <arn of target group> --healthy-threshold-count 2 --health-check-interval-seconds 5 --health-check-timeout-seconds 4
-> https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/load-balancer-connection-draining.html
aws elbv2 modify-target-group-attributes --target-group-arn <arn of target group> --attributes Key=deregistration_delay.timeout_seconds,Value=10
This fixed the issue.
In case you have more target groups to edit, just output the target group ARNs to a file and run the command in a loop, as sketched below.
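A rough sketch of such a loop, assuming tg-arns.txt (a placeholder filename) holds one target group ARN per line:
# Apply the shorter deregistration delay to every target group listed in the file
while read -r tg_arn; do
    aws elbv2 modify-target-group-attributes \
        --target-group-arn "$tg_arn" \
        --attributes Key=deregistration_delay.timeout_seconds,Value=10
done < tg-arns.txt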

AWS Auto Scaling groups and non-ELB health checks

We have Auto Scaling groups for one of our CloudFormation stacks that use a CPU-based alarm for determining when to scale the instances.
This is great, but we recently had it scale up from one node to three, and one of those nodes failed to bootstrap via cfn-init. Once the workload reduced and the group scaled back down to one node, it killed the two good instances and left the partially bootstrapped node as the only remaining instance. This meant that we stopped processing work until someone logged in and re-ran the bootstrap process.
Obviously this is not ideal. What is the best way to notify the Auto Scaling group that a node is not healthy when it does not sit behind an ELB?
Since this is just the initial bootstrap, what I'd really like is to communicate back to the Auto Scaling group that this node failed, and have it terminated and a new node spun up in its place.
A colleague just showed me http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-configure-healthcheck.html which looks handy.
If you have your own health check system, you can use the information from your health check system to set the health state of the instances in the Auto Scaling group.
UPDATE - I managed to get this working during launch.
Here's what my UserData section for the ASG looks like:
#!/bin/bash -v
set -x
export AWS_DEFAULT_REGION=us-west-1

cfn-init --region us-west-1 --stack bapi-prod --resource LaunchConfiguration -v
if [[ $? -ne 0 ]]; then
    # Bootstrap failed: mark this instance unhealthy so the ASG replaces it
    export INSTANCE=`curl http://169.254.169.254/latest/meta-data/instance-id`
    aws autoscaling set-instance-health \
        --instance-id $INSTANCE \
        --health-status Unhealthy
fi
Can also be done as a one-liner. For example, I'm using the following in Terraform:
runcmd:
- /tmp/runcmd-puppet.sh || { export INSTANCE=`curl http://169.254.169.254/latest/meta-data/instance-id`; aws autoscaling --region eu-west-1 set-instance-health --instance-id $INSTANCE --health-status Unhealthy; }