AWS Fargate tasks won't start reliably

I have an ECS cluster with a bunch of different tasks in it (using the same Docker image but with different environment variables).
Some of the tasks come up without problems, but others fail a lot even though I've used the same VPC, subnet, and security group. The error message shows ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post https://api.ecr..
What's bizarre is that the same task sometimes comes up if I create a new task definition or delete the ECR repository and re-upload the Docker image.
I'm unable to draw any conclusion from this.
Update: strangely, the task starts successfully when I deregister the task definition and recreate it with the same specs, but only once.

It turns out one has to select the task execution role under Task Role - override and Task Execution Role - override in the run task Advanced Options section when starting the task. I don't know why it worked arbitrarily when I retried at random, or why it worked every time I recreated the task definition.
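For reference, the same overrides can also be supplied from the CLI. Below is a minimal sketch; the cluster, task definition, subnet, security group, and role ARN are placeholders you would replace with your own:
# run a one-off Fargate task, overriding both the task role and the execution role
aws ecs run-task \
--cluster my-cluster \
--task-definition my-task-def \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}" \
--overrides '{"taskRoleArn":"arn:aws:iam::123456789012:role/ecsTaskExecutionRole","executionRoleArn":"arn:aws:iam::123456789012:role/ecsTaskExecutionRole"}'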

Related

Where is ECS Task Stopped Reason now?

I am using the AWS interface to configure my services on ECS. Before the interface change, I used to be able to access a screen that would show me why a task had failed (like in the example below); that screen could be reached from the ECS service events by clicking on the task ID. Does anyone know how to get the task stopped reason data with the new interface?
You can see essentially the same message if you follow these steps:
Select your service from your ECS cluster.
Go to the Configuration and tasks tab.
Scroll down and select a task; you will want to choose one that was stopped by the failing deployment.
You should then see the Stopped reason message.
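If you prefer the CLI to the new console, the same information is available from describe-tasks; a minimal sketch, with placeholder cluster, service, and task identifiers:
# list recently stopped tasks for the service (stopped tasks are only retained for a short time)
aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED
# read the stop reason for one of them
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> \
--query 'tasks[].{stoppedReason:stoppedReason,containerReasons:containers[].reason}'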

Port error while executing update on CloudFormation

I changed some environment variables in the task definition part and executed the change set.
The task definition was updated successfully, but the service update got stuck in CloudFormation.
On checking the events in the cluster I found the following:
It is adding a new task, but the old one is still running and consuming the port, so it is stuck. What can be done to resolve this? I can always delete and re-run the CloudFormation script, but I need to create a pipeline, so I want the stack update to work.
This UPDATE_IN_PROGRESS will last around three hours, until the DescribeService API call times out.
If you can't wait, you need to manually force the Amazon ECS service resource in AWS CloudFormation into a CREATE_COMPLETE state by
setting the service's desired count to zero in the Amazon ECS console to stop the running tasks. AWS CloudFormation then considers the update successful, because the number of running tasks equals the desired count of zero.
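If you'd rather not click through the console, the same thing can be done with a single CLI call (placeholder names):
# scale the stuck service to zero so CloudFormation sees the desired count reached
aws ecs update-service --cluster my-cluster --service my-service --desired-count 0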
These articles explain the cause of the message and its fix in detail:
https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-ecs-service-stabilize/
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-service-stuck-update-status/?nc1=h_ls

AWS ECS task stuck in PROVISIONING state

My ECS cluster has 2 t3.xlarge instances and I created a service with 2 tasks, but the tasks remain in the PROVISIONING state. The container runs on port 5020 and is mapped to host port 5040. There are other services running on the same cluster with the port mapping 5020:5020, and they work fine.
Should I make any changes to move the task to the PENDING/RUNNING state?
I can share one problem-and-solution scenario that is not commonly described and for which you cannot easily find an answer, because there is no error message.
The ECS service is in ACTIVE status, but its task is stuck with Last status PROVISIONING, Desired status RUNNING,
Health status UNKNOWN, and zero logs in CloudWatch, so there is no error message to go on.
It turned out that the developers had not pushed the Docker container image to the corresponding AWS ECR repository! That was it!
The ECR repository was empty and did not contain any image to serve, so the AWS ECS (Fargate) service was stuck in the PROVISIONING state, waiting for a container image to become available for download.
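A quick way to rule this out is to check whether the repository actually contains an image with the tag the task definition references; a minimal sketch with a placeholder repository name:
# lists the image tags present in the repository; an empty result means nothing was pushed
aws ecr describe-images --repository-name my-backend-repo --query 'imageDetails[].imageTags'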

AWS ECS: Monitoring the status of a service update

I am trying to migrate a set of microservices from Docker Swarm to AWS ECS using Fargate.
I have created an ECS cluster. Moreover, I have initialized repositories in ECR, each of which contains the image of one microservice.
I have successfully come up with a way to build new images and push them to ECR. In fact, with each change in the code, a new Docker image is built, tagged, and pushed.
I have also created a task definition that is linked to a service. This task definition contains one container and all the necessary information. Its service defines that the task will run in a VPC, is linked to a load balancer, and has a target group. I am assuming that every new deployment uses the image with the "latest" tag.
So far with what I have explained, everything is clear and is working well.
Below is the part that is confusing me. After every new build, I would like to update the service so that new tasks with the updated image get deployed. I am using the CLI to do so with the following command:
aws ecs update-service --cluster <cluster-name> --service <service-name>
Typically, after running the command, I monitor the deployment logs under the Events tab and check the state of the service using the following command:
aws ecs describe-services --cluster <cluster-name> --service <service-name>
Finally, I tried to simulate a case where the newly created image contains bad code, so the new tasks would not be able to get deployed. What I witnessed is that Fargate keeps trying (without stopping) to deploy the new tasks. Moreover, aside from the event logs, the describe-services output does not contain relevant information other than what Fargate is doing (e.g., registering/deregistering tasks). I am surprised that I could not find any mechanism that instructs Fargate, or the service, to stop the deployment and roll back to the already existing one.
I found this article (https://aws.amazon.com/blogs/compute/automating-rollback-of-failed-amazon-ecs-deployments/ ), which provides a solution. However, it is a fairly complicated one, and assumes that each new deployment is triggered by a new task definition, which is not what I want.
Therefore, considering what I have described above, I hope you can answer the following questions:
1) Using CLI commands (for automation purposes), is there a way to instruct Fargate to automatically stop the current deployment after it fails to deploy the new tasks a few times?
2) Using CLI commands, is there a way to monitor the current status of the deployment? For instance, when performing a service update on Docker Swarm, the terminal generates live logs on the update process.
3) After a failed deployment, is there a way for Fargate to signal an error code, or flag, or message?
At the moment, ECS does not offer deployment status directly. Once you issue a deployment, there is no way to determine its status other than to continually poll for updates until you have enough information to infer from them. Also, unexpected container exits are not logged anywhere; you have to search through the failed tasks. The way I catch them is with a CloudWatch Events rule that triggers a Lambda on task state changes.
I recommend you read: https://medium.com/@aaron.kaz.music/monitoring-the-health-of-ecs-service-deployments-baeea41ae737
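As a rough sketch of such a rule (the cluster ARN and Lambda function name are placeholders, and the Lambda also needs permission to be invoked by CloudWatch Events/EventBridge):
# match ECS task state changes for one cluster and forward them to a Lambda
aws events put-rule \
--name ecs-task-state-change \
--event-pattern '{"source":["aws.ecs"],"detail-type":["ECS Task State Change"],"detail":{"clusterArn":["arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster"]}}'
aws events put-targets \
--rule ecs-task-state-change \
--targets 'Id=task-watcher,Arn=arn:aws:lambda:us-east-1:123456789012:function:my-task-watcher'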
As of now, you have a way to do this:
aws ecs wait services-stable --cluster MyCluster --services MyService
The previous command pauses and returns only after it can confirm that the service running on the cluster is stable. It will return exit code 255 after 40 failed checks.
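In a pipeline this is typically paired with the update command, roughly like this (same placeholder names):
# trigger a new deployment of the latest image, then block until the service stabilizes
aws ecs update-service --cluster MyCluster --service MyService --force-new-deployment
if ! aws ecs wait services-stable --cluster MyCluster --services MyService; then
  echo "Deployment did not stabilize" >&2
  exit 1
fi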
To cancel a deployment, enable ECS Circuit Breaker when creating your service:
aws ecs create-service \
--service-name MyService \
--deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true}" \
{...}
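The circuit breaker can also be turned on for an existing service rather than at creation time; a sketch with the same placeholder names:
aws ecs update-service \
--cluster MyCluster \
--service MyService \
--deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true}"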
References:
Service deployment check.
Circuit Breaker

AWS Fargate 503 Service Temporarily Unavailable

I'm trying to deploy a backend application to AWS Fargate using CloudFormation templates that I found. When I was using the Docker image training/webapp, I was able to deploy it successfully and access it via the externalUrl from the networking stack for the app.
When I try to deploy our backend image, I can see the stacks deploying correctly, but when I go to the externalUrl I get 503 Service Temporarily Unavailable and I'm unable to see it... Another thing that I've noticed is that on Docker Hub I can see the image being pulled continuously while the CloudFormation services are running...
The backend is some kind of Maven project; I don't know exactly what, but I know it works locally. However, getting the container with this backend image up and running takes about 8 minutes... I'm not sure if this affects Fargate? Any idea how to get it working?
It sounds like you need to find the actual error that you're experiencing; the 503 isn't enough information. Can you provide some other context?
I'm not familiar with Fargate, but I have been using ECS quite a bit this year, and I generally find that by going to (in the console) ECS -> cluster -> service -> events. The Events tab gives more specific errors about what is happening.
My ECS deployment problems generally come down to one of the following:
1) The container is not exposing the same port as the one in the task definition; this could be the case if you're deploying from a stack written by someone else.
2) The task definition's memory/CPU limits don't grant enough resources for the application and it has trouble being placed (probably a problem with ECS more than Fargate, but you never know).
3) The timeout in the task definition is not set to cover the 8-minute startup; see this question, it has a lot of this covered.
4) The start command in the task definition does not work as expected with the container you're trying to deploy.
If it is pulling from Docker Hub continuously, my bet would be that it's 1, 3, or 4, and it's attempting to pull the image over and over again; a quick check is sketched below.
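A quick way to check points 1 and 4 without clicking through the console is to dump the relevant parts of the task definition (placeholder name):
# shows the declared port mappings and start command for each container
aws ecs describe-task-definition --task-definition my-backend-task \
--query 'taskDefinition.containerDefinitions[].{name:name,portMappings:portMappings,command:command}'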
Try adding a Health check grace period of 60 by going to ECS -> cluster -> service -> update, under the Network Access section.
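That setting can also be applied from the CLI; a sketch with placeholder names, and note that given the roughly 8-minute startup you may need a value well above 60:
# give the load-balancer health checks time to pass while the app boots
aws ecs update-service \
--cluster my-cluster \
--service my-backend-service \
--health-check-grace-period-seconds 480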