How to avoid downtime when restarting ECS Fargate services? - amazon-web-services

I have this bash script:
#!/bin/bash
myClusterId="myCluster"
# For every service in the cluster (service name extracted from its ARN)...
for service in $(aws ecs list-services --cluster "$myClusterId" --query "serviceArns[*]" | jq -r 'to_entries[] | .value | sub(".*/";"")'); do
  # ...stop every RUNNING task belonging to that service.
  for task in $(aws ecs list-tasks --cluster "$myClusterId" --service-name "$service" --desired-status 'RUNNING' --no-paginate --output text --query 'taskArns[*]'); do
    aws ecs stop-task --cluster "$myClusterId" --task "$task" --reason "Restarted using bash script" > /dev/null 2>&1
  done
done
In short, it will restart all my ECS Fargate tasks under myCluster (excluding scheduled tasks triggered by CloudWatch Rules). It's working fine so far.
All my services have minHealthyPercent set to 100 and maxHealthyPercent set to 200. But I noticed that the script didn't keep any healthy tasks during the restart process. All tasks get killed immediately, and my load balancer throws a 503 Service Temporarily Unavailable error while the new tasks are in the pending/provisioning state.
Am I missing something in my script? How do I correctly perform no-downtime services restart process using AWS CLI?

The parameters maximumPercent and minimumHealthyPercent are only used during rolling updates of your ECS service:
The number of tasks that Amazon ECS adds or removes from the service during a rolling update is controlled by the deployment configuration. A deployment configuration consists of the minimum and maximum number of tasks allowed during a service deployment.
Restarting a task is not considered a new deployment.
To rectify the issue, there are a few choices:
Include a sleep in your for loop. It's the crudest way, but the fastest to implement for testing.
Use describe-tasks (or a waiter) in the for loop to check that the replacement for the task you just stopped has reached RUNNING before you restart the next one (see the sketch below).
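For example, here is a minimal sketch of the second option, reusing the cluster name from your script. It leans on the AWS CLI services-stable waiter rather than polling describe-tasks by hand, which is my own substitution but achieves the same effect:
#!/bin/bash
# Sketch: restart one task at a time and wait for the service to reach a
# steady state again before moving on to the next task.
myClusterId="myCluster"
for service in $(aws ecs list-services --cluster "$myClusterId" --query "serviceArns[*]" --output text); do
  service=${service##*/}   # keep only the service name from the ARN
  for task in $(aws ecs list-tasks --cluster "$myClusterId" --service-name "$service" --desired-status RUNNING --query "taskArns[*]" --output text); do
    aws ecs stop-task --cluster "$myClusterId" --task "$task" --reason "Rolling restart" > /dev/null
    # Block until ECS has launched a replacement and the service is stable,
    # so the remaining tasks keep serving traffic behind the load balancer.
    aws ecs wait services-stable --cluster "$myClusterId" --services "$service"
  done
done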

I think your best option would be to do a blue/green deployment through CodeDeploy, assuming you use an Elastic Load Balancer. The blue/green deployment will automatically detect any error and stop the deployment if required.
https://aws.amazon.com/blogs/devops/use-aws-codedeploy-to-implement-blue-green-deployments-for-aws-fargate-and-amazon-ecs/

Related

Running AWS ECS Task Attached (Not Detached)

Is there an easy way to run an ECS Task attached, or to follow the logs only while the container is running (i.e. detach after displaying all of the associated logs)?
Using the AWS CLI (1.17.0) and ecs-cli (1.21.0), I have gotten decently close with the following two commands:
aws ecs run-task --cluster "mycluster" --task-definition testhelloworldjob --launch-type FARGATE --network-configuration etc.etc.etc.
ecs-cli logs --task-id {TASK_ID_HERE_FROM_OUTPUT_OF_PREVIOUS_COMMAND} --follow
I currently have two issues with the above approach:
There is a race condition: the logs are not available while the task is in a pre-"running" state. Instead of ecs-cli logs waiting for the logs to exist, an error is thrown immediately.
Even after waiting for the task to be in a running state and then issuing ecs-cli logs, the command refuses to detach even AFTER the task is finished and in a post-running status.
For the first issue I could poll until the task is past the activating/pending status before calling logs. For the second issue I could draft some kind of threaded call that polls and stops following the log once the container in question is no longer running... But there has to be an easier way?
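Roughly, the workaround I'm imagining looks like this: a sketch using the AWS CLI's built-in waiters around the two commands above (the network configuration is a placeholder, as before, and ecs-cli is assumed to be configured for the same cluster):
CLUSTER="mycluster"
# Start the task and capture its ARN (network configuration left as a placeholder).
TASK_ARN=$(aws ecs run-task --cluster "$CLUSTER" --task-definition testhelloworldjob \
  --launch-type FARGATE --network-configuration "$NETWORK_CONFIG" \
  --query "tasks[0].taskArn" --output text)
# Issue 1: block until the task is RUNNING, so the log stream exists.
aws ecs wait tasks-running --cluster "$CLUSTER" --tasks "$TASK_ARN"
# Issue 2: follow the logs in the background and kill the follower once the task stops.
ecs-cli logs --task-id "${TASK_ARN##*/}" --follow &
LOGS_PID=$!
aws ecs wait tasks-stopped --cluster "$CLUSTER" --tasks "$TASK_ARN"
kill "$LOGS_PID"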
To clarify, I am coming from numerous other container orchestration tools/technologies that seemingly support this very seamlessly. Here are some examples of tools and their associated commands that would yield my intended results:
Docker CLI:
docker run hello-world
Docker-Compose Yaml:
docker-compose up
Kubernetes kubectl YAML:
kubectl apply -f ./hello-k8.yaml && kubectl logs --follow hello-world
I think ecs-cli is the best option available at the moment.
Apart from that, you can change the log driver of the AWS ECS task to syslog and then watch the log file from a terminal after SSHing into the EC2 container instance on which it is running.
Another thing you can do is SSH into the EC2 container instance on which it was running before, run that ECS task's container yourself with docker run, and once the testing is done, stop and remove the container and start the task again via AWS ECS.
Note: You can use AWS SSM Session Manager to avoid using an EC2 key pair and adding an inbound rule for SSH.
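For example, something like this opens a shell on the container instance without a key pair or port 22 (a sketch; the instance ID is a placeholder, and it assumes the SSM agent and the Session Manager plugin for the AWS CLI are installed):
aws ssm start-session --target i-0123456789abcdef0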

AWS ECS CI/CD Downtime

I'm trying to understand from tutorials and the AWS documentation how the ECS rolling update works. I'm really confused because I can't solve a small problem.
I am using GitLab CI/CD and deploying to ECS. I do not use a Load Balancer yet, but I think I will have to because I'm having a downtime problem. I am using only 1 task for our application, I've set the minimum/maximum healthy percent to 0/200 (which I suspect is wrong), and I deploy using:
- aws ecs update-service --region "${REGION}" --cluster "${CLUSTER_NAME}" --service "${SERVICE_NAME}" --task-definition "$TASK_DEFINITION_NAME":${REV} --force-new-deployment
It first stops the running task and then starts a new task. Until the service reaches a steady state I can't access my website! That's around 30-40 seconds of downtime.
How can I solve this? Should I use Blue/Green Deployment or I am doing something wrong?

How to simply "recycle/reboot" running tasks in AWS ECS with no change in image/source-code?

I have a simple query: what is the best way to simply recycle/reboot a service having 2 tasks using the AWS ECS console, without any actual change being deployed?
Currently I need to update the service, set the task count from 2 to 0, and wait for the tasks to drain out. Then I set the task count from 0 back to 2 to bring it up. This is how I recycle/reboot the 2 tasks of a service.
I need to do this sometimes due to an internal app error, and I just want to reboot the tasks without any actual change, which resolves my problem.
AWS provides one option (the Force new deployment checkbox), which is not helping me; does it only work if there is a change in the image? I wish AWS provided an option like "Recycle a service (tasks)" which would start 2 new tasks and drain out the 2 existing tasks.
What would be the best and easiest way to do it using the AWS Console, or even the AWS API/CLI?
If you stop the tasks, ECS will launch new ones to satisfy the desired count. That's fairly easy in the ECS console: just select the tasks in the task list and choose Stop in the Action dropdown.
Using the AWS CLI, you can get the list of tasks to kill with:
aws ecs list-tasks --service-name my-service
To stop each task, use:
aws ecs stop-task --task %1
where %1 is the ARN of the task as provided by the first command.
Here is a command that combines both commands above. It will kill all the tasks of a given service:
SVC=your-service-name-here
aws ecs list-tasks --service-name $SVC --output text | cut -f2 | perl -ne 'system("aws ecs stop-task --task $_")'
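If your service is not in the default cluster, both calls also need --cluster. Here is the same idea as a plain shell loop, with placeholder names:
CLUSTER=my-cluster
SVC=your-service-name-here
# Stop every task of the service; ECS starts replacements to meet the desired count.
for task in $(aws ecs list-tasks --cluster "$CLUSTER" --service-name "$SVC" --query "taskArns[*]" --output text); do
  aws ecs stop-task --cluster "$CLUSTER" --task "$task" > /dev/null
done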

Can I pause an ECS service instead of deleting it?

I haven't been able to find this in the docs. Can I just pause an ECS service so it stops creating new tasks? Or do I have to delete it to stop that behavior?
I just want to temporarily suspend it from creating new tasks on the cluster
It is enough to set the desired number of tasks for a service to 0.
ECS will automatically remove all running tasks.
aws ecs update-service --desired-count 0 --cluster "ecs-my-ClusterName" --service "service-my-ServiceName-117U7OHVC5NJP"
You can accomplish a "pause" by adjusting your service configuration to match your current number of running tasks. For example, if you currently have 3 running tasks in your service, you'd set the desired count, minimum healthy percent, and maximum percent accordingly.
This tells the service:
The number of tasks I want is [current-count]
I want you to maintain at least [current-count]
I don't want more than [current-count]
Combined, these effectively halt your service from making any changes.
The accepted answer is incorrect.
If you set both "Minimum healthy percent" and "Maximum healthy percent" to 100, AWS will give you an error.
To stop the service from creating new tasks, you have to update the service and set the desired number of tasks to 0. After that you can use the AWS CLI (the fastest option) to stop the existing running tasks, for example:
aws ecs list-services --cluster "ecs-my-ClusterName"
aws ecs list-tasks --cluster "ecs-my-ClusterName" --service "service-my-ServiceName-117U7OHVC5NJP"
After that you will get the list of the running tasks for the service, such as:
{
  "taskArns": [
    "arn:aws:ecs:us-east-1:XXXXXXXXXXX:task/12e13d93-1e75-4088-a7ab-08546d69dc2c",
    "arn:aws:ecs:us-east-1:XXXXXXXXXXX:task/35ed484a-cc8f-4b5f-8400-71e40a185806"
  ]
}
Finally, use the following to stop each task:
aws ecs stop-task --cluster "ecs-my-ClusterName" --task 12e13d93-1e75-4088-a7ab-08546d69dc2c
aws ecs stop-task --cluster "ecs-my-ClusterName" --task 35ed484a-cc8f-4b5f-8400-71e40a185806
UPDATE: If you set the desired number of running tasks to 0, ECS will stop and drain all running tasks in that service. There is no need to stop them individually afterwards using the CLI commands originally posted above.
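Putting it together, a minimal pause/resume sketch using the cluster and service names from the commands above (the resume count of 2 is just an assumed value):
# Pause: scale the service to zero; ECS drains all running tasks.
aws ecs update-service --cluster "ecs-my-ClusterName" \
  --service "service-my-ServiceName-117U7OHVC5NJP" --desired-count 0
# Resume later by restoring the previous desired count (2 assumed here).
aws ecs update-service --cluster "ecs-my-ClusterName" \
  --service "service-my-ServiceName-117U7OHVC5NJP" --desired-count 2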

AWS ECS restart Service with the same task definition and image with no downtime

I am trying to restart an AWS service (basically stop and start all tasks within the service) without making any changes to the task definition.
The reason for this is that the image has the latest tag attached with every build.
I have tried stopping all tasks and having the service recreate them, but this means there is a temporary "service unavailable" error while the tasks are being restarted on my instances (2).
What is the best way to handle this? Say, a blue-green deployment strategy so that there is no downtime?
This is what I have currently. Its shortcoming is that my app will be down for a couple of seconds while the service's tasks are being rebuilt after deleting them.
configure_aws_cli(){
  aws --version
  aws configure set default.region us-east-1
  aws configure set default.output json
}

start_tasks() {
  start_task=$(aws ecs start-task --cluster $CLUSTER --task-definition $DEFINITION --container-instances $EC2_INSTANCE --group $SERVICE_GROUP --started-by $SERVICE_ID)
  echo "$start_task"
}

stop_running_tasks() {
  tasks=$(aws ecs list-tasks --cluster $CLUSTER --service $SERVICE | $JQ ".taskArns | . []")
  tasks=( $tasks )
  for task in "${tasks[@]}"
  do
    [[ ! -z "$task" ]] && stop_task=$(aws ecs stop-task --cluster $CLUSTER --task "$task")
  done
}

push_ecr_image(){
  echo "Push built image to ECR"
  eval $(aws ecr get-login --region us-east-1)
  docker push $AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/repository:$TAG
}
configure_aws_cli
push_ecr_image
stop_running_tasks
start_tasks
Use update-service and the --force-new-deployment flag:
aws ecs update-service --force-new-deployment --service my-service --cluster cluster-name
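If you want a script to block until that deployment has finished rolling out, you can follow it with the services-stable waiter (same placeholder names):
aws ecs update-service --force-new-deployment --service my-service --cluster cluster-name
# Wait until the new tasks are running and the old ones have drained.
aws ecs wait services-stable --cluster cluster-name --services my-service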
Hold on a sec.
If I understood your use case correctly, this is addressed in the official docs:
If your updated Docker image uses the same tag as what is in the existing task definition for your service (for example, my_image:latest), you do not need to create a new revision of your task definition. You can update the service using the procedure below, keep the current settings for your service, and select Force new deployment....
To avoid downtime, you should manipulate 2 parameters: minimum healthy percent and maximum percent:
For example, if your service has a desired number of four tasks and a maximum percent value of 200%, the scheduler may start four new tasks before stopping the four older tasks (provided that the cluster resources required to do this are available). The default value for maximum percent is 200%.
This basically means that, regardless of whether and how much your task definition changed, there can be an "overlap" between the old tasks and the new ones, and this is the way to achieve resilience and reliability.
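As a sketch, both parameters can be set (or re-confirmed) from the CLI at the same time as forcing the new deployment; the cluster and service names here are placeholders:
# Never drop below 100% of the desired count, and allow up to 200% during
# the rollout, so old tasks keep serving until replacements are healthy.
aws ecs update-service --cluster my-cluster --service my-service \
  --deployment-configuration "minimumHealthyPercent=100,maximumPercent=200" \
  --force-new-deployment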
UPDATE:
Amazon has just introduced external deployment controllers for ECS (both EC2 and Fargate). This includes a new level of abstraction called a TaskSet. I haven't tried it myself yet, but such fine-grained control over service and task management (both APIs are supported) can potentially solve problems akin to this one.
After you push your new image to your Docker repository, you can create a new revision of your task definition (it can be identical to the existing task definition) and update your service to use the new task definition revision. This will trigger a service deployment, and your service will pull the new image from your repository.
This way your task definition stays the same (although updating the service to a new task definition revision is required to trigger the image pull), and still uses the "latest" tag of your image, but you can take advantage of the ECS service deployment functionality to avoid downtime.
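A rough sketch of that flow with the CLI, assuming jq is available; the family, cluster, and service names are placeholders:
FAMILY=my-task-family
CLUSTER=my-cluster
SERVICE=my-service
# Copy the current task definition, stripping the read-only fields that
# register-task-definition does not accept.
aws ecs describe-task-definition --task-definition "$FAMILY" \
  --query "taskDefinition" --output json \
  | jq 'del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)' \
  > /tmp/taskdef.json
# Register an identical new revision and capture its ARN.
NEW_TD=$(aws ecs register-task-definition \
  --cli-input-json file:///tmp/taskdef.json \
  --query "taskDefinition.taskDefinitionArn" --output text)
# Point the service at the new revision; this triggers a rolling deployment
# that pulls the image again.
aws ecs update-service --cluster "$CLUSTER" --service "$SERVICE" --task-definition "$NEW_TD"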
The fact that I have to create a new revision of my task definition every time even when there is no change in the task definition itself is not right.
There are a bunch of crude bash implementations of this, which suggests that AWS should have the ECS service scheduler listen for changes/updates to the image, especially for an automated build process.
My crude work-around was to have two identical task definitions and switch between them for every build. That way I don't have redundant revisions.
Here is the specific script snippet that does that.
update_service() {
  echo "change task definition and update service"
  taskDefinition=$(aws ecs describe-services --cluster $CLUSTER --services $SERVICE | $JQ ".services | . [].taskDefinition")
  if [ "$taskDefinition" = "$TASK_DEF_1" ]; then
    newDefinition="$TASK_DEF_2"
  else
    newDefinition="$TASK_DEF_1"
  fi
  rollUpdate=$(aws ecs update-service --cluster $CLUSTER --service $SERVICE --task-definition $newDefinition)
}
Did you get this question solved? Perhaps this will work for you.
With a new release image pushed to ECR with a version tag, e.g. v1.05, plus the latest tag, the image locator in my task definition needed to be explicitly updated to have this version tag appended, like :v1.05.
With :latest, this new image did not get pulled by the new container after aws ecs update-service --force-new-deployment --service my-service.
I was doing tagging and pushing like this:
docker tag ${imageId} ${ecrRepoUri}:v1.05
docker tag ${imageId} ${ecrRepoUri}:latest
docker push ${ecrRepoUri}
...whereas this is the proper way of pushing multiple tags:
docker tag ${imageId} ${ecrRepoUri}:v1.05
docker tag ${imageId} ${ecrRepoUri}:latest
docker push ${ecrRepoUri}:v1.05
docker push ${ecrRepoUri}:latest
This was briefly mentioned in the official docs without a proper example.
This works great: https://github.com/fdfk/ecsServiceRestart
python ecsServiceRestart.py restart --services="app app2" --cluster=test
The quick and dirty way:
Log in to the EC2 instance running the task
Find your container with docker container list
Use docker restart [container]