AWS Auto Scaling groups and non-ELB health checks - amazon-web-services

We have an Auto Scaling group in one of our CloudFormation stacks with a CPU-based alarm that determines when to scale the instances.
This works well, but we recently had it scale up from one node to three, and one of those nodes failed to bootstrap via cfn-init. Once the workload dropped and the group scaled back down to one node, it killed the two good instances and left the partially bootstrapped node as the only remaining instance. That meant we stopped processing work until someone logged in and re-ran the bootstrap process.
Obviously this is not ideal. What is the best way to notify the Auto Scaling group that a node is not healthy when it does not sit behind an ELB?
Since this is just the initial bootstrap, what I'd really like is to communicate back to the Auto Scaling group that this node failed, and have it terminated and a new node spun up in its place.

A colleague just showed me http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-configure-healthcheck.html which looks handy.
If you have your own health check system, you can use the information from your health check system to set the health state of the instances in the Auto Scaling group.
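For an ongoing check (rather than just at bootstrap), a minimal sketch of what that could look like, assuming your app exposes a local health endpoint and the script runs from cron; the URL, port, and region are hypothetical:
#!/bin/bash
# Probe the app; if it fails, tell the Auto Scaling group this instance is unhealthy.
INSTANCE=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
if ! curl -sf http://localhost:8080/healthz > /dev/null; then
    aws autoscaling set-instance-health \
        --region us-west-1 \
        --instance-id "$INSTANCE" \
        --health-status Unhealthy
fi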
UPDATE - I managed to get this working during launch.
Here's what my UserData section for the ASG looks like:
#!/bin/bash -v
set -x
export AWS_DEFAULT_REGION=us-west-1

# Bootstrap the instance; on failure, mark it Unhealthy so the
# Auto Scaling group terminates it and launches a replacement.
cfn-init --region us-west-1 --stack bapi-prod --resource LaunchConfiguration -v
if [[ $? -ne 0 ]]; then
    INSTANCE=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws autoscaling set-instance-health \
        --instance-id "$INSTANCE" \
        --health-status Unhealthy
fi

This can also be done as a one-liner. For example, I'm using the following cloud-init runcmd in Terraform:
runcmd:
- /tmp/runcmd-puppet.sh || { export INSTANCE=`curl http://169.254.169.254/latest/meta-data/instance-id`; aws autoscaling --region eu-west-1 set-instance-health --instance-id $INSTANCE --health-status Unhealthy; }

Related

Set Desired Capacity error during the bootstrap of the instance

I've created an ASG with the min size and desired capacity set to 1. The EC2 instance is bound to an Application Load Balancer. I use Ignition to define the user data of the Launch Configuration, and a script defined in Ignition executes these two commands:
# Get the name of this instance's Auto Scaling group from the CoreOS EC2 metadata
ASG_NAME=$(/usr/bin/docker run --rm --net=host \
    "$AWSCLI_IMAGE" aws autoscaling describe-auto-scaling-instances \
    --region="$COREOS_EC2_REGION" --instance-ids="$COREOS_EC2_INSTANCE_ID" \
    --query 'AutoScalingInstances[].AutoScalingGroupName' --output text)

# Set the ASG desired capacity
echo "Check desired capacity of Auto Scaling group..."
# shellcheck disable=SC2154,SC2086
/usr/bin/docker run --rm --net=host \
    $AWSCLI_IMAGE aws autoscaling set-desired-capacity \
    --region="$COREOS_EC2_REGION" --auto-scaling-group-name "$ASG_NAME" \
    --desired-capacity 3 \
    --honor-cooldown
The problem is that I get a ScalingActivityInProgress error, so I can't change the desired capacity.
First I'd like to understand the root cause. Is it maybe because the ALB is not healthy when I run the above commands?
Solved by removing the --honor-cooldown param from the request. With --honor-cooldown, Auto Scaling rejects the set-desired-capacity call while the group's cooldown period is still in effect (which it is right after the launch of the instance running this script), and that rejection is the ScalingActivityInProgress error.
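For reference, a minimal sketch of the fixed call, retried in case another scaling activity still blocks it; the group name and region are hypothetical:
# No --honor-cooldown; retry while a scaling activity is in progress
until aws autoscaling set-desired-capacity \
    --region eu-west-1 \
    --auto-scaling-group-name my-asg \
    --desired-capacity 3
do
    echo "set-desired-capacity rejected, retrying in 30s..."
    sleep 30
done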

Scale-in Concourse workers without killing tasks

Each worker runs multiple tasks. If we have a lot of tasks we'll need multiple workers. In order to save resources we'd like to elastically scale workers in and out, according to supply (spare capacity) and demand (pending tasks).
Scaling out is easy: add more nodes, they register themselves with TSA and start working.
Scaling in is trickier: one needs to wait for a worker's tasks to finish before killing its instance; otherwise they'll have to restart on another worker. That's fine for small tasks, but for longer ones that might not be acceptable.
One possible solution on AWS would be to use Auto Scaling lifecycle hooks to synchronously tell the worker not to accept any more tasks and return once all are finished, then kill it. The Concourse worker API doesn't have any such operation, though.
Is there a way to implement safe scaling in of Concourse workers?
If the answer is "don't worry, Bosh will take care of it" I'd like to know what those mechanics are as I probably won't be using it.
You have to use the concourse binary from the command line, on the host that runs the ATC (which is the Concourse scheduler and web interface):
concourse --help
Usage:
concourse [OPTIONS] <command>
Application Options:
-v, --version Print the version of Concourse and exit [$CONCOURSE_VERSION]
Help Options:
-h, --help Show this help message
Available commands:
land-worker Safely drain a worker's assignments for temporary downtime.
retire-worker Safely remove a worker from the cluster permanently.
web Run the web UI and build scheduler.
worker Run and register a worker.
So it looks like you could hook something up to the Auto Scaling lifecycle that calls land-worker and then retire-worker (not sure whether retire-worker alone would be enough), once you figure out which worker you want to spin down...
When you bring the same worker back up, you might have to be careful with the worker name; I seem to remember that the ATC sometimes gets confused, so you'll have to experiment with whether you can keep the same name or have to change it.
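For example, a rough sketch of draining and then retiring a worker from the instance itself; the TSA host and key paths are hypothetical, and the flags mirror the retire-worker call in the script below:
# Drain first: finish in-flight builds, accept no new ones...
concourse land-worker \
    --name "$HOSTNAME" \
    --tsa-host tsa.example.com:2222 \
    --tsa-public-key tsa_host_key.pub \
    --tsa-worker-private-key worker_key
# ...then remove the worker from the cluster permanently.
concourse retire-worker \
    --name "$HOSTNAME" \
    --tsa-host tsa.example.com:2222 \
    --tsa-public-key tsa_host_key.pub \
    --tsa-worker-private-key worker_key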
You can create a Lifecycle hook on your Concourse worker ASG:
Type: AWS::AutoScaling::LifecycleHook
Properties:
  AutoScalingGroupName: !Ref ConcourseWorkerASG
  DefaultResult: CONTINUE # or ABANDON
  HeartbeatTimeout: 900 # 15 minutes, for example
  LifecycleHookName: lchname
  LifecycleTransition: "autoscaling:EC2_INSTANCE_TERMINATING"
Use a script to retire the worker, something along the lines of
lch.sh
#!/bin/bash
TYPE=$(cat /opt/concourse/type)
tsa_host=zz
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id/)
lifecycleState=$(aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$instance_id" \
    --query 'AutoScalingInstances[0].LifecycleState' \
    --output text --region eu-west-1)

if [ "$TYPE" == "worker" ]; then
    if [ "$lifecycleState" == "Terminating:Wait" ]; then
        asg=$(aws autoscaling describe-auto-scaling-instances \
            --instance-ids "$instance_id" \
            --query 'AutoScalingInstances[0].AutoScalingGroupName' \
            --output text --region eu-west-1)
        # Drain the worker, give in-flight builds time to finish,
        # then stop the service and complete the lifecycle action.
        /opt/concourse/concourse/bin/concourse retire-worker \
            --name "$HOSTNAME" \
            --tsa-host "${tsa_host}:2222" \
            --tsa-public-key some_tsa_host_key.pub \
            --tsa-worker-private-key some_worker_key
        sleep 5m
        systemctl stop your_concourse_service
        aws autoscaling complete-lifecycle-action \
            --instance-id "$instance_id" \
            --auto-scaling-group-name "$asg" \
            --lifecycle-hook-name "lchname" \
            --lifecycle-action-result "CONTINUE" \
            --region eu-west-1
    fi
fi
then schedule it as a cron job, for example via Ansible:
- name: List lch.sh as cronjob
  cron:
    name: "check asg lch for retiring the worker"
    minute: "*/5" # run every 5 minutes
    job: "/opt/concourse/lch.sh"

How do you delete an AWS ECS Task Definition?

Once you've created a task definition in Amazon's EC2 Container Service, how do you delete or remove it?
It's a known issue. Once you de-register a Task Definition it goes into INACTIVE state and clutters up the ECS Console.
If you want to vote for it to be fixed, there is an issue on Github. Simply give it a thumbs up, and it will raise the priority of the request.
I recently found this gist (thanks a lot to the creator for sharing!), which will deregister all task definitions for your specific region; maybe you can adapt it to skip the ones you want to keep: https://gist.github.com/jen20/e1c25426cc0a4a9b53cbb3560a3f02d1
You need to have jq to run it:
brew install jq
I "hard-coded" my region, for me it's eu-central-1, so be sure to adapt it for your use-case:
#!/usr/bin/env bash

get_task_definition_arns() {
    aws ecs list-task-definitions --region eu-central-1 \
        | jq -M -r '.taskDefinitionArns | .[]'
}

delete_task_definition() {
    local arn=$1
    aws ecs deregister-task-definition \
        --region eu-central-1 \
        --task-definition "${arn}" > /dev/null
}

for arn in $(get_task_definition_arns)
do
    echo "Deregistering ${arn}..."
    delete_task_definition "${arn}"
done
Then when I run it, it starts removing them:
Deregistering arn:aws:ecs:REGION:YOUR_ACCOUNT_ID:task-definition/NAME:REVISION...
A one-liner approach, inspired by Anna A's reply:
aws ecs list-task-definitions --region eu-central-1 \
| jq -M -r '.taskDefinitionArns | .[]' \
| xargs -I {} aws ecs deregister-task-definition \
--region eu-central-1 \
--task-definition {} \
| jq -r '.taskDefinition.taskDefinitionArn'
There is no option to delete a task definition in the AWS console.
But you can deregister (delete) a task definition by running the following command for each revision you have:
aws ecs deregister-task-definition --task-definition task_definition_name:revision_no
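To clean up every revision in one go, a small sketch that lists a family's ARNs and deregisters each one; the family name my-task is hypothetical:
for arn in $(aws ecs list-task-definitions --family-prefix my-task \
    --query 'taskDefinitionArns[]' --output text)
do
    echo "Deregistering ${arn}..."
    aws ecs deregister-task-definition --task-definition "${arn}" > /dev/null
done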
I created the following gist to safely review, filter, and deregister AWS task definitions and revisions in bulk (max 100 at a time) using the JS CLI:
https://gist.github.com/shivam-nagar/aa79b02b74f616f8714d51e419bd10de
You can use it to deregister all revisions of a task definition, which results in the task definition itself being marked as INACTIVE.
Now it's supported.
I just went into Task Definitions, clicked Actions, then Deregister, and it was removed from the UI.

aws command line interface - aws ec2 wait - Max attempts exceeded

I am working on a shell script which does the following:
creates a snapshot of an EBS volume;
creates an AMI image based on this snapshot.
1) I use the following command to create the snapshot:
SNAPSHOT_ID=$(aws ec2 create-snapshot "${DRYRUN}" --volume-id "${ROOT_VOLUME_ID}" --description "${SNAPSHOT_DESCRIPTION}" --query 'SnapshotId')
2) I use a waiter to wait for the completed state:
aws ec2 wait snapshot-completed --snapshot-ids "${SNAPSHOT_ID}"
When I test it with an 8 GB EBS volume, everything goes well.
When it is 40 GB, I get an exception:
Waiter SnapshotCompleted failed: Max attempts exceeded
Probably a 40 GB volume just takes more time than an 8 GB one, and I need to wait longer.
The AWS docs (http://docs.aws.amazon.com/cli/latest/reference/ec2/wait/snapshot-completed.html) don't mention any timeout or attempt-count option.
Maybe some of you have faced the same issue?
So, finally, I used the following approach to solve it:
Create the snapshot.
Use a loop to check the exit status of the command aws ec2 wait snapshot-completed.
If the exit status is not 0, print the current state and progress and run the waiter again.
# Create snapshot
SNAPSHOT_DESCRIPTION="Snapshot of Primary frontend instance $(date +%Y-%m-%d)"
SNAPSHOT_ID=$(aws ec2 create-snapshot "${DRYRUN}" --volume-id "${ROOT_VOLUME_ID}" --description "${SNAPSHOT_DESCRIPTION}" --query 'SnapshotId')

# Re-run the waiter until it exits cleanly, printing state and progress between attempts
exit_status=1
while [ "${exit_status}" != "0" ]
do
    SNAPSHOT_STATE="$(aws ec2 describe-snapshots --filters Name=snapshot-id,Values=${SNAPSHOT_ID} --query 'Snapshots[0].State')"
    SNAPSHOT_PROGRESS="$(aws ec2 describe-snapshots --filters Name=snapshot-id,Values=${SNAPSHOT_ID} --query 'Snapshots[0].Progress')"
    echo "### Snapshot id ${SNAPSHOT_ID} creation: state is ${SNAPSHOT_STATE}, ${SNAPSHOT_PROGRESS}%..."
    aws ec2 wait snapshot-completed --snapshot-ids "${SNAPSHOT_ID}"
    exit_status="$?"
done
If you have something that can improve it, please share with us.
You should probably use until in bash; it looks a bit cleaner and you don't have to repeat yourself.
echo "waiting for snapshot $snapshot"
until aws ec2 wait snapshot-completed --snapshot-ids "$snapshot" 2>/dev/null
do
    # fetch progress before printing so the first message isn't blank
    progress=$(aws ec2 describe-snapshots --snapshot-ids "$snapshot" --query "Snapshots[*].Progress" --output text)
    printf "\rsnapshot progress: %s" "$progress"
    sleep 10
done
aws ec2 wait snapshot-completed takes a while to time out. This snippet uses aws ec2 describe-snapshots to get the progress. When it's 100% it calls snapshot-completed.
# create snapshot
SNAPSHOTID=$(aws ec2 create-snapshot --volume-id $VOLUMEID --output text --query "SnapshotId")
echo "Waiting for Snapshot ID: $SNAPSHOTID"
SNAPSHOTPROGRESS=$(aws ec2 describe-snapshots --snapshot-ids $SNAPSHOTID --query "Snapshots[*].Progress" --output text)
while [ "$SNAPSHOTPROGRESS" != "100%" ]
do
    sleep 15
    echo "Snapshot ID: $SNAPSHOTID $SNAPSHOTPROGRESS"
    SNAPSHOTPROGRESS=$(aws ec2 describe-snapshots --snapshot-ids $SNAPSHOTID --query "Snapshots[*].Progress" --output text)
done
aws ec2 wait snapshot-completed --snapshot-ids "$SNAPSHOTID"
This is essentially the same thing as above, but it prints a progress message every 15 seconds. Snapshots that are already complete return 100% immediately.
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-retries.html
You can set an environment variable or use the config file to increase the maximum number of attempts.
AWS_MAX_ATTEMPTS=100
~/.aws/config
[default]
retry_mode = standard
max_attempts = 6
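For a one-off run you can also set the variable just for the waiter invocation; the value 100 is an arbitrary assumption, so size it to your volume:
AWS_MAX_ATTEMPTS=100 aws ec2 wait snapshot-completed --snapshot-ids "$SNAPSHOT_ID"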
ISSUE: In CI/CD we had a command to wait for an ECS service to become stable, and got this error:
aws ecs wait services-stable \
    --cluster MyCluster \
    --services MyService
ERROR MSG: Waiter ServicesStable failed: Max attempts exceeded
FIX
In order to fix this issue we followed these docs:
-> https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/load-balancer-healthcheck.html
aws elbv2 modify-target-group --target-group-arn <arn of target group> --healthy-threshold-count 2 --health-check-interval-seconds 5 --health-check-timeout-seconds 4
-> https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/load-balancer-connection-draining.html
aws elbv2 modify-target-group-attributes --target-group-arn <arn of target group> --attributes Key=deregistration_delay.timeout_seconds,Value=10
This fixed the issue.
If you have more target groups to edit, just output the target group ARNs to a file and run the command in a loop, as sketched below.
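A minimal sketch of that loop, assuming one ARN per line in a file named target-group-arns.txt (a hypothetical name):
while read -r tg_arn
do
    aws elbv2 modify-target-group-attributes \
        --target-group-arn "$tg_arn" \
        --attributes Key=deregistration_delay.timeout_seconds,Value=10
done < target-group-arns.txt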

Stop and Start Elastic Beanstalk Services

I wanted to know if there is an option to STOP Amazon Elastic Beanstalk as an atomic unit, as I can do with EC2 servers, instead of going through each service (e.g. load balancer, EC2, ...) and stopping (and starting) them independently?
The EB command line interface has an eb stop command. Here is a little bit about what the command actually does:
The eb stop command deletes the AWS resources that are running your application (such as the ELB and the EC2 instances). It however leaves behind all of the application versions and configuration settings that you had deployed, so you can quickly get started again. Eb stop is ideal when you are developing and testing your application and don’t need the AWS resources running over night. You can get going again by simply running eb start.
EDIT:
As stated in the comment below, this is no longer a command in the new eb-cli.
If you have a load-balanced environment, you can try the following trick:
$ aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name my-auto-scaling-group \
--min-size 0 --max-size 0 --desired-capacity 0
It will remove all instances from the environment but won't delete the environment itself. Unfortunately, you will still pay for the Elastic Load Balancer, but usually EC2 is the most expensive part.
Does it work for 0?
Yes, it does:
$ aws autoscaling describe-auto-scaling-groups --region us-east-1 \
--auto-scaling-group-name ASG_NAME \
--query "AutoScalingGroups[].{DesiredCapacity:DesiredCapacity,MinSize:MinSize,MaxSize:MaxSize}"
[
{
"MinSize": 2,
"MaxSize": 2,
"DesiredCapacity": 2
}
]
$ aws autoscaling update-auto-scaling-group --region us-east-1 \
--auto-scaling-group-name ASG_NAME \
--min-size 0 --max-size 0 --desired-capacity 0
$ aws autoscaling describe-auto-scaling-groups --region us-east-1 \
--auto-scaling-group-name ASG_NAME \
--query "AutoScalingGroups[].{DesiredCapacity:DesiredCapacity,MinSize:MinSize,MaxSize:MaxSize}"
[
{
"MinSize": 0,
"MaxSize": 0,
"DesiredCapacity": 0
}
]
And then you can check the environment status:
$ eb status -v
Environment details for: test
Application name: TEST
Region: us-east-1
Deployed Version: app-170925_181953
Environment ID: e-1234567890
Platform: arn:aws:elasticbeanstalk:us-east-1::platform/Multi-container Docker running on 64bit Amazon Linux/2.7.4
Tier: WebServer-Standard
CNAME: test.us-east-1.elasticbeanstalk.com
Updated: 2017-09-25 15:23:22.980000+00:00
Status: Ready
Health: Grey
Running instances: 0
In the Beanstalk web console you will see the following message:
INFO Environment health has transitioned from Ok to No Data.
There are no instances. Auto Scaling group desired capacity is set to zero.
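To bring the environment back, restore the previous sizes; the values below are the ones from the example above:
$ aws autoscaling update-auto-scaling-group --region us-east-1 \
    --auto-scaling-group-name ASG_NAME \
    --min-size 2 --max-size 2 --desired-capacity 2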
eb stop is deprecated. I also had the same problem, and the only solution I could come up with was to back up the environment and then restore it.
Here's a blog post in which I explain it:
http://pminkov.github.io/blog/how-to-shut-down-and-restore-an-elastic-beanstalk-environment.html
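For what it's worth, the backup-and-restore flow looks roughly like this with a recent EB CLI; this is a sketch, the environment name and ID are hypothetical, and command availability varies between EB CLI versions:
# Save the environment's configuration as a reusable template
eb config save my-env --cfg my-env-saved
# Terminate the environment and its AWS resources
eb terminate my-env
# Later, restore the terminated environment by its ID
eb restore e-1234567890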