I have an ECS service with a task (EC2 launch type) running on an EC2 instance. The image in the task comes from an ECR URI, for example 688523422345.dkr.ecr.us-west-1.amazonaws.com/image. When I load this image into my task I just reference the URI directly as 688523422345.dkr.ecr.us-west-1.amazonaws.com/image:latest, because the image URI never changes and I just keep pushing new images to it.
However, when the image is updated on ECR, the task and service running on the EC2 instance don't update. I wondered why and searched Stack Overflow; people suggested using aws ecs update-service --cluster <cluster name> --service <service name> --force-new-deployment to force the service to re-deploy. However, I just got an error about not enough memory left on the instance (it seems the re-deployment creates new tasks alongside the old ones and keeps taking more memory, so it's not a good solution).
How can I solve this?
This could be because of your Deployment configuration and the parameters:
maximumPercent
minimumHealthyPercent
By default minimumHealthyPercent is 100%, which means the replacement operation will first try to start the new tasks before terminating the old ones, potentially resulting in an out-of-memory error. You can set minimumHealthyPercent to 0 and maximumPercent to 100 to force the existing tasks to be terminated first, before the new ones are created.
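For example, a minimal sketch of setting those two parameters together with a forced deployment (the cluster and service names are placeholders):
# Allow ECS to stop all old tasks first (0% must stay healthy) and never
# exceed the desired count (100% maximum) while replacing them
aws ecs update-service \
  --cluster <cluster name> \
  --service <service name> \
  --deployment-configuration maximumPercent=100,minimumHealthyPercent=0 \
  --force-new-deployment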
It's not working; I tried a lot. I found out it's because the EC2 instance has already stored all the information from the task (even if the task is deleted, the instance is still running with the old image). The right way to fix it was to restart the instance.
I used aws-cli: aws ec2 reboot-instances --instance-ids <instance_id>
It worked!
I'm using ECS with Fargate. I have a service running and it is working OK. But after I update the task definition (new deploy), the console (ECS -> Clusters -> Tasks tab) shows that my current task is INACTIVE, which is normal, but it doesn't show any new task being created, nor any stopped task, even after an hour. It is as if ECS is not trying to run my new definition.
If I use the awscli to find information about my service:
aws ecs describe-services --cluster cluster-xxxxxxx --services service-svc-xxxxxxx --region us-east-1
It has two deployments. The first one is fine; it is the running deployment. The most recent one shows:
"desiredCount": 1,
"pendingCount": 0,
"runningCount": 0,
"failedTasks": 7,
...
"rolloutState": "IN_PROGRESS",
"rolloutStateReason": "ECS deployment ecs-svc/XXXXXXXXXXXXXXXXXX in progress."
Again, there is nothing in the ECS console that points to failed tasks. It is as if the task is failing at such a premature state that it isn't even logging anything.
I tried looking at CloudTrail events, but there is nothing there about failed tasks. In CloudWatch, the Container Insights logs (/aws/ecs/containerinsights/cluster-xxxxxxx/performance) also don't mention failing tasks.
How can I troubleshoot this situation?
It turns out that if the account/organization is new, AWS applies quotas that don't seem to be documented anywhere. ECS was not authorized to start more than two tasks.
There is a trick I found in this post, which says that I had to create an EC2 instance in the same region where I was trying to use ECS. Shortly after I created the EC2 instance, I received an email from AWS and everything behaved normally again. https://forums.aws.amazon.com/thread.jspa?threadID=340770
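In similar situations it can also be worth checking from the CLI whether ECS recorded any stopped tasks and why they stopped; a rough sketch, with the cluster name as a placeholder (in a case like this the list may well come back empty):
# List tasks that ECS has already stopped in this cluster
aws ecs list-tasks --cluster cluster-xxxxxxx --desired-status STOPPED

# Ask ECS why a given task stopped (substitute a task ARN from the previous call)
aws ecs describe-tasks --cluster cluster-xxxxxxx --tasks <task ARN> \
  --query 'tasks[].{stoppedReason:stoppedReason,containerReason:containers[].reason}'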
I really need to know the stop time of AWS EC2 instances. I have checked with AWS CloudTrail, but it's not easy to find the exact stopped EC2 instance. Is it possible to see the exact time an EC2 instance was stopped with aws-cli commands or a boto3 script?
You can get this info from StateTransitionReason in describe-instances AWS CLI when you search for stopped instances:
aws ec2 describe-instances --filters Name=instance-state-name,Values=stopped --query 'Reservations[].Instances[*].StateTransitionReason' --output text
Example output:
User initiated (2020-12-03 07:16:35 GMT)
AWS Config keeps track of the state of resources as they change over time.
From What Is AWS Config? - AWS Config:
AWS Config provides a detailed view of the configuration of AWS resources in your AWS account. This includes how the resources are related to one another and how they were configured in the past so that you can see how the configurations and relationships change over time.
Thus, you could look back through the configuration history of the Amazon EC2 instance and extract times for when the instance changed to a Stopped state.
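For example, something along these lines should return the recorded configuration history for one instance (the instance ID is a placeholder, and AWS Config must already be recording EC2 instances in that region):
# Walk back through the instance's recorded configuration items; each item
# carries the instance state at the time it was captured
aws configservice get-resource-config-history \
  --resource-type AWS::EC2::Instance \
  --resource-id i-0123456789abcdef0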
Sometimes the time is missing from StateTransitionReason. In that case you can use CloudTrail and search for Resource Name = instance ID to find the StopInstances API calls.
By default you can track back 90 days, or indefinitely if you create your own trail.
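A rough equivalent from the CLI, assuming the event is still within the CloudTrail event history window (the instance ID is a placeholder):
# Look up events that reference this instance and keep only StopInstances;
# the event time is when the stop was requested
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=i-0123456789abcdef0 \
  --query 'Events[?EventName==`StopInstances`].[EventTime,EventName]' --output text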
I have recently changed the AMI my ECS EC2 instances run on from Amazon Linux to Amazon Linux 2 (in both cases I am using ECS-optimized images). I am deploying my instances using CloudFormation and having a real headache, as the new instances sometimes come up successfully and sometimes don't (same stack, no updates, same code).
On the failed instances I see that there is an issue with the ECS service itself: after executing ecs-logs-collector.sh I see in the ecs log file "warning: The Amazon ECS Container Agent is not running". Also, the directory "/var/log/ecs" doesn't even exist!
I have the correct IAM role attached to the instance.
Also, as mentioned, it is the same code being run, and on 75% of attempts it fails with the ECS service. I have no more ideas where else to look for issues/logs/errors.
AMI: ami-0650e7d86452db33b (eu-central-1)
Solved. If someone runs into this issue, adding this to my user data helped:
# Copy the ECS agent unit file so it can be edited locally
cp /usr/lib/systemd/system/ecs.service /etc/systemd/system/ecs.service
# Drop the ordering dependency on cloud-final.service so the agent can start
sed -i '/After=cloud-final.service/d' /etc/systemd/system/ecs.service
# Make systemd pick up the modified unit
systemctl daemon-reload
I want to make sure that the task is running the latest image.
Within the container, I can get the Docker image ID (such as 183f9552f0a5) by calling http://169.254.170.2/v2/metadata; however, I am looking for a way to get it from my laptop.
Is this possible with AWS CLI or SDK?
You first need to get the Task Definition ARN for the Task using describe_tasks. You can skip this step if you already know the ARN.
aws ecs describe-tasks --tasks TASK_ARN
Then you can use describe_task_definition to get the image name.
aws ecs describe-task-definition --task-definition TASKDEF_ARN
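If you only want the image reference, a query along these lines should work (the ARN is a placeholder; note this shows the image as declared in the task definition, i.e. the repository URI and tag, not the digest running on the host):
# Print just the container image(s) declared in the task definition
aws ecs describe-task-definition --task-definition TASKDEF_ARN \
  --query 'taskDefinition.containerDefinitions[].image' --output text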
I am trying to restart an AWS service (basically stop and start all tasks within the service) without making any changes to the task definition.
The reason is that the image has the latest tag attached with every build.
I have tried stopping all tasks and having the services recreate them but this means that there is some temporarily unavailable error when the services are restarting in my instances (2).
What is the best way to handle this? Say, a blue-green deployment strategy, so that there is no downtime?
This is what I have currently. Its shortcoming is that my app will be down for a couple of seconds as the service's tasks are being rebuilt after deleting them.
configure_aws_cli(){
    aws --version
    aws configure set default.region us-east-1
    aws configure set default.output json
}

start_tasks() {
    start_task=$(aws ecs start-task --cluster $CLUSTER --task-definition $DEFINITION --container-instances $EC2_INSTANCE --group $SERVICE_GROUP --started-by $SERVICE_ID)
    echo "$start_task"
}

stop_running_tasks() {
    tasks=$(aws ecs list-tasks --cluster $CLUSTER --service $SERVICE | $JQ ".taskArns | . []");
    tasks=( $tasks )
    for task in "${tasks[@]}"
    do
        [[ ! -z "$task" ]] && stop_task=$(aws ecs stop-task --cluster $CLUSTER --task "$task")
    done
}

push_ecr_image(){
    echo "Push built image to ECR"
    eval $(aws ecr get-login --region us-east-1)
    docker push $AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/repository:$TAG
}

configure_aws_cli
push_ecr_image
stop_running_tasks
start_tasks
Use update-service and the --force-new-deployment flag:
aws ecs update-service --force-new-deployment --service my-service --cluster cluster-name
Hold on a sec.
If I understood your use case correctly, this is addressed in the official docs:
If your updated Docker image uses the same tag as what is in the existing task definition for your service (for example, my_image:latest), you do not need to create a new revision of your task definition. You can update the service using the procedure below, keep the current settings for your service, and select Force new deployment....
To avoid downtime, you should manipulate 2 parameters: minimum healthy percent and maximum percent:
For example, if your service has a desired number of four tasks and a maximum percent value of 200%, the scheduler may start four new tasks before stopping the four older tasks (provided that the cluster resources required to do this are available). The default value for maximum percent is 200%.
This basically means that, regardless of whether your task definition changed and to what extent, there can be an "overlap" between the old and the new tasks, and this is the way to achieve resilience and reliability.
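As a sketch, the same parameters can be set from the CLI while forcing the redeployment (cluster and service names are placeholders); with 200%/100% the new tasks come up next to the old ones before anything is stopped:
# Keep all current tasks (100% minimum healthy) and let ECS start a full
# extra set (200% maximum) so the replacement overlaps instead of dropping out
aws ecs update-service \
  --cluster cluster-name \
  --service my-service \
  --deployment-configuration maximumPercent=200,minimumHealthyPercent=100 \
  --force-new-deployment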
UPDATE:
Amazon has just introduced External Deployment Controllers for ECS (both EC2 and Fargate). It includes a new level of abstraction called a TaskSet. I haven't tried it myself yet, but such fine-grained control over service and task management (both APIs are supported) can potentially solve problems like this one.
After you push your new image to your Docker repository, you can create a new revision of your task definition (it can be identical to the existing task definition) and update your service to use the new task definition revision. This will trigger a service deployment, and your service will pull the new image from your repository.
This way your task definition stays the same (although updating the service to a new task definition revision is required to trigger the image pull), and still uses the "latest" tag of your image, but you can take advantage of the ECS service deployment functionality to avoid downtime.
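A rough sketch of that flow with the CLI, assuming a task definition family called my-app and placeholder cluster/service names; the read-only fields have to be stripped from the describe output before it can be re-registered:
# Export the current task definition (family name is a placeholder)
aws ecs describe-task-definition --task-definition my-app \
  --query 'taskDefinition' > taskdef.json

# Remove the read-only fields ECS adds (taskDefinitionArn, revision, status,
# requiresAttributes, compatibilities, ...) from taskdef.json, then register
# the unchanged definition again as a new revision
aws ecs register-task-definition --cli-input-json file://taskdef.json

# Point the service at the family; without an explicit revision, ECS picks the
# latest ACTIVE one, which triggers a deployment and a fresh image pull
aws ecs update-service --cluster my-cluster --service my-service --task-definition my-app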
The fact that I have to create a new revision of my task definition every time even when there is no change in the task definition itself is not right.
There are a bunch of crude bash implementations of this, which suggests that AWS should have the ECS service scheduler listen for changes/updates to the image, especially in an automated build process.
My crude workaround to this was to have two identical task definitions and switch between them for every build. That way I don't have redundant revisions.
Here is the specific script snippet that does that.
update_service() {
    echo "change task definition and update service"
    taskDefinition=$(aws ecs describe-services --cluster $CLUSTER --services $SERVICE | $JQ ".services | . [].taskDefinition")
    if [ "$taskDefinition" = "$TASK_DEF_1" ]; then
        newDefinition="$TASK_DEF_2"
    else
        newDefinition="$TASK_DEF_1"
    fi
    rollUpdate=$(aws ecs update-service --cluster $CLUSTER --service $SERVICE --task-definition $newDefinition)
}
Did you get this question solved? Perhaps this will work for you.
With a new release image pushed to ECR with a version tag, e.g. v1.05, plus the latest tag, the image locator in my task definition needed to be explicitly updated to have this version tag appended, like :v1.05.
With :latest, this new image did not get pulled by the new container after aws ecs update-service --force-new-deployment --service my-service.
I was doing tagging and pushing like this:
docker tag ${imageId} ${ecrRepoUri}:v1.05
docker tag ${imageId} ${ecrRepoUri}:latest
docker push ${ecrRepoUri}
...whereas this is the proper way of pushing multiple tags:
docker tag ${imageId} ${ecrRepoUri}:v1.05
docker tag ${imageId} ${ecrRepoUri}:latest
docker push ${ecrRepoUri}:v1.05
docker push ${ecrRepoUri}:latest
This was briefly mentioned in the official docs without a proper example.
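A quick way to double-check what the latest tag actually points to in ECR after a push is to compare image digests; something like the following, where the repository name is a placeholder:
# Show the digest and tags currently attached to the "latest" image in ECR
aws ecr describe-images --repository-name repository \
  --image-ids imageTag=latest \
  --query 'imageDetails[].{digest:imageDigest,tags:imageTags}'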
Works great https://github.com/fdfk/ecsServiceRestart
python ecsServiceRestart.py restart --services="app app2" --cluster=test
The quick and dirty way:
log in to the EC2 instance running the task
find your container with docker container list
use docker restart [container]