AWS ECS task stuck in PROVISIONING state

My ECS cluster has 2 t3.xlarge instances and I created a service with 2 tasks, but the tasks remain in the PROVISIONING state. The container listens on port 5020 and is mapped to host port 5040. Other services on the same cluster with a 5020:5020 port mapping are working fine.
Should I make any changes to move the tasks to the PENDING/RUNNING state?

I can share one problem-and-solution scenario that is not commonly described, and for which you cannot easily find an answer because there is no error message.
The ECS service is in ACTIVE status, but its task is stuck with last status PROVISIONING, desired status RUNNING, and
health status UNKNOWN, with zero logs in CloudWatch, so there is no error message to search by.
It turned out the developers had not pushed the Docker container image to the corresponding AWS ECR repository. That was it!
The ECR repository was empty and contained no image to serve, so the AWS ECS (Fargate) service was stuck in the PROVISIONING state, waiting for a container image to become available for download.
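If you hit this, a quick check is whether the repository actually holds an image; a minimal sketch, assuming a repository named my-app in us-east-1 (both are placeholders):
aws ecr describe-images --repository-name my-app --region us-east-1
# An empty "imageDetails" list means no image has been pushed yet.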

Related

Getting an error when trying to register EC2 with ECS Cluster

Working on AWS and at a loss with this...
I am trying to register an EC2 instance to an ECS cluster; the EC2 instance was launched as part of a CodeStar project.
Steps I have followed as per AWS documentation:
Go to ECS
Access Cluster
Click on Register External Instances
Click through to the next page
Copy the Curl command for Linux to register the EC2 to the Cluster.
When I run the curl command in the Linux CLI it executes, but stalls on this line:
Trying to wait for ECS agent to start ...
Soon after, I receive an error that states:
Timed out waiting for ECS Agent to start.
Logs show:
===================================================
level=error time=2022-06-14T18:19:25Z msg="Unable to register as a container instance with ECS: InvalidParameterException: The identity document and identity document signature were not valid." module=client.go
level=error time=2022-06-14T18:19:25Z msg="Error registering container instance" error="InvalidParameterException: The identity document and identity document signature were not valid."
===================================================
Can anyone help identify what the issue is?
TIA!
Thanks for your input - Mark B
I had a Eureka moment!
Well, it doesn't resolve the error I posted, but I found another way around this.
Basically, I do not need to use or register the EC2 instance created by CodeStar.
I was able to update the CodePipeline 'Deploy' stage to deploy to ECS (instances), and with a few tweaks to the CodeStar IAM role it is working!
So the matter can be considered closed.
Many thanks
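For reference: the Register External Instances flow is part of ECS Anywhere and is intended for on-premises machines, which is likely why the agent rejects an EC2 instance's identity document. If an EC2 instance does need to join a cluster, the usual route is the ECS container agent itself; a minimal sketch on an ECS-optimized AMI, assuming a placeholder cluster name and an instance profile carrying the AmazonEC2ContainerServiceforEC2Role managed policy:
# Point the ECS agent at the target cluster, then restart it so it re-registers
echo "ECS_CLUSTER=my-cluster" | sudo tee /etc/ecs/ecs.config
sudo systemctl restart ecs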

AWS fargate tasks won't start reliably

I have an ECS cluster with a bunch of different tasks in it (using the same docker image but with different environment variables).
Some of the tasks come up without problems, but others fail a lot, even though I've used the same VPC, subnet, and security group. The error message shows ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post https://api.ecr..
What's bizarre is that the same task sometimes comes up if I create a new task definition or delete the ECR repository and re-upload the Docker image.
I'm unable to draw any conclusions from this..
Update: strange... the task starts successfully when I deregister the task definition and recreate it with the same specs. But only once..
It turns out you have to select the taskExecution role under Task Role - override and Task Execution Role - override in the Advanced Options section of Run Task when starting the task. I don't know why it sometimes worked when I was trying things at random, or why it worked each time I recreated the task definition.
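For the same fix from the CLI, a hedged sketch; the cluster, task definition, subnet, security group, and role ARN are all placeholders, and run-task accepts both role ARNs inside --overrides:
aws ecs run-task \
  --cluster MyCluster \
  --task-definition my-task:1 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}' \
  --overrides '{"taskRoleArn":"arn:aws:iam::123456789012:role/ecsTaskExecutionRole","executionRoleArn":"arn:aws:iam::123456789012:role/ecsTaskExecutionRole"}'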

Upload druid and superset image to ECS

I have created Docker images for Druid and Superset, and now I want to push these images to ECR and start ECS to run the containers. What I have done is build the images by running docker-compose up on my YML file. Now when I type docker image ls I can see the resulting images listed.
I have created an AWS account and created a repository. AWS provides the push commands, and I pushed the Superset image to ECR to start. (I didn't push any dependencies.)
I created a cluster in AWS; in one configuration step it asked for a custom port, where I provided 8088. I don't know what this port is asked for.
Then I created a load balancer with the default configuration.
After some time I could see the container status turn to running.
I navigated to the public IP I mentioned, with port 8088, and could see Superset running.
Now I have two problems:
Superset always shows a login error.
It stops automatically after some time, then restarts, and this cycle continues.
Should I create different ECR repos and push all the dependencies to ECR before creating a cluster in ECS?
For the service going up and down: since you mentioned you have an LB associated with the service, you may have an issue with the health check configuration.
If the health check fails a number of consecutive times, ECS will kill the task and restart it.
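While debugging, it can help to loosen the target group health check; a hedged sketch with a placeholder ARN (Superset typically serves a /health endpoint, but verify that for your image):
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/superset/0123456789abcdef \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 5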

AWS ECS: Monitoring the status of a service update

I am trying to migrate a set of microservices from Docker Swarm, to AWS ECS using Fargate.
I have created an ECS cluster. I have also initialized repositories in ECR, each of which contains the image of one microservice.
I have successfully come up with a way to build new images and push them to ECR. In fact, with each change in the code, a new Docker image is built, tagged, and pushed.
I have also created a task definition that is linked to a service. The task definition contains one container and all the necessary information. Its service specifies that the task will run in a VPC, is linked to a load balancer, and has a target group. I am assuming that every new deployment uses the image with the "latest" tag.
So far, with what I have explained, everything is clear and working well.
Below is the part that is confusing me. After every new build, I would like to update the service so that new tasks with the updated image get deployed. I am using the CLI to do so with the following command:
aws ecs update-service --cluster <cluster-name> --service <service-name> --force-new-deployment
Typically, after performing the command, I monitor the deployment logs under the Events tab and check the state of the service using the following command:
aws ecs describe-services --cluster <cluster-name> --services <service-name>
Finally, I tried to simulate a case where the newly built image contains bad code, so the new tasks cannot be deployed. What I witnessed is that Fargate keeps trying (without stopping) to deploy the new tasks. Moreover, aside from the event logs, the describe-services output does not contain relevant information other than what Fargate is doing (e.g., registering/deregistering tasks). I am surprised that I could not find any mechanism that instructs Fargate, or the service, to stop the deployment and roll back to the already existing one.
I found this article (https://aws.amazon.com/blogs/compute/automating-rollback-of-failed-amazon-ecs-deployments/), which provides a solution. However, it is fairly complicated, and it assumes that each new deployment is triggered by a new task definition, which is not what I want.
Therefore, considering what I have described above, I hope you can answer the following questions:
1) Using CLI commands (for automation purposes), is there a way to instruct Fargate to automatically stop the current deployment after failing to deploy new tasks a few times?
2) Using CLI commands, is there a way to monitor the current status of the deployment? For instance, when performing a service update on Docker Swarm, the terminal generates live logs on the update process.
3) After a failed deployment, is there a way for Fargate to signal an error code, flag, or message?
At the moment, ECS does not offer deployment status directly. Once you issue a deployment, there is no way to determine its status other than to continually poll for updates until you have enough information to infer from them. In addition, unexpected container exits are not logged anywhere; you have to search through failed tasks. The way I capture them is with a CloudWatch rule that triggers a Lambda function upon a task state change.
I recommend you read: https://medium.com/@aaron.kaz.music/monitoring-the-health-of-ecs-service-deployments-baeea41ae737
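A hedged sketch of that kind of rule; the rule name and Lambda ARN are placeholders, and ECS publishes "ECS Task State Change" events to CloudWatch Events/EventBridge:
# Match tasks that transition to STOPPED
aws events put-rule \
  --name ecs-task-stopped \
  --event-pattern '{"source":["aws.ecs"],"detail-type":["ECS Task State Change"],"detail":{"lastStatus":["STOPPED"]}}'
# Send matching events to a Lambda function for inspection/alerting
aws events put-targets \
  --rule ecs-task-stopped \
  --targets '[{"Id":"1","Arn":"arn:aws:lambda:us-east-1:123456789012:function:notify-task-stopped"}]'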
As of now, you have a way to do this:
aws ecs wait services-stable --cluster MyCluster --services MyService
The previous command pauses and returns only after it can confirm that the service running on the cluster is stable. It exits with code 255 after 40 failed checks (the waiter polls every 15 seconds).
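For example, in a deploy script (cluster and service names are placeholders):
aws ecs update-service --cluster MyCluster --service MyService --force-new-deployment
if ! aws ecs wait services-stable --cluster MyCluster --services MyService; then
  echo "Deployment did not stabilize; check the service Events tab" >&2
  exit 1
fi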
To cancel a failed deployment and roll back, enable the ECS deployment circuit breaker when creating your service:
aws ecs create-service \
--service-name MyService \
--deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true}" \
{...}
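If the service already exists, the same setting can be applied with update-service (a hedged sketch; names are placeholders):
aws ecs update-service \
  --cluster MyCluster \
  --service MyService \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true}"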
References:
Service deployment check.
Circuit Breaker

AWS Fargate 503 Service Temporarily Unavailable

I'm trying to deploy a backend application to AWS Fargate using CloudFormation templates that I found. When I was using the Docker image training/webapp, I was able to deploy it successfully and access it with the externalUrl from the networking stack for the app.
When I try to deploy our backend image, I can see the stacks deploying correctly, but when I go to the externalUrl I get 503 Service Temporarily Unavailable and I'm unable to see the app... Another thing I've noticed is that on Docker Hub the image is continuously pulled the whole time the CloudFormation services are running...
The backend is some kind of Maven project. I don't know exactly what, but I know it works locally; getting the container with this backend image up and running takes about 8 minutes, though... I'm not sure whether this affects Fargate? Any idea how to get it working?
It sounds like you need to find the actual error you're experiencing; the 503 isn't enough information. Can you provide some other context?
I'm not familiar with Fargate but have been using ECS quite a bit this year, and I would generally find that by going to (on the dashboard) ECS -> cluster -> service -> Events. The Events tab gives more specific errors as to what is happening.
My ECS deployment problems are generally summarized into:
1. The container is not exposing the same port as in the definition; this could be the case if you're deploying from a stack written by someone else.
2. The task definition memory/CPU restrictions don't grant enough space for the application, and it has trouble placing (probably a problem with ECS more than Fargate, but you never know).
3. Your timeout in the task definition is not set to 8 minutes: see this question, it has a lot of this covered.
4. Your start command in the task definition does not work as expected with the container you're trying to deploy.
If it is pulling from Docker Hub continuously, my bet would be that it's 1, 3, or 4, and it's attempting to pull the image over and over again.
Try adding a health check grace period by going to ECS -> cluster -> service -> Update, in the Network Access section. Given the ~8-minute startup, 60 seconds is likely too short; set the grace period long enough to cover the full startup.
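A hedged CLI equivalent, with placeholder names and 600 seconds as an assumed startup budget:
aws ecs update-service \
  --cluster MyCluster \
  --service MyService \
  --health-check-grace-period-seconds 600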