Port error while executing update on cloudformation - amazon-web-services

I changed some environment variables in the task definition part and executed the changeset.
The task definition got updated successfully, but the service update got stuck in CloudFormation.
On checking the events in the cluster I found the following:
It is adding a new task, but the old one is still running and consuming the port, so the update is stuck. What can be done to resolve this? I can always delete the stack and run the CloudFormation script again, but I need to create a pipeline, so I want the stack update to work.

This UPDATE_IN_PROGRESS will take around 3 hours, until the DescribeService API call times out.
If you can't wait, you need to manually force the Amazon ECS service resource in AWS CloudFormation into a CREATE_COMPLETE state by
setting the desired count of the service to zero in the Amazon ECS console, which stops the running tasks. AWS CloudFormation then considers the update successful, because the number of running tasks equals the desired count of zero.
These articles explain the cause of the message and its fix in detail:
https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-ecs-service-stabilize/
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-service-stuck-update-status/?nc1=h_ls
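If you want to script that recovery step instead of using the console, a minimal boto3 sketch could look like the following. The cluster and service names are placeholders; substitute your own.

```python
def scale_down_params(cluster, service):
    """Build the UpdateService request that forces the desired count to zero."""
    return {"cluster": cluster, "service": service, "desiredCount": 0}

def force_service_stable(cluster="my-cluster", service="my-service"):
    # Stopping all tasks makes the running count match the desired count of
    # zero, so CloudFormation considers the stuck update successful.
    import boto3  # local import keeps the builder above usable without AWS deps
    ecs = boto3.client("ecs")
    ecs.update_service(**scale_down_params(cluster, service))
```

Once the stack settles, you can scale the desired count back up the same way.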

Related

AWS Fargate tasks won't start reliably

I have an ECS cluster with a bunch of different tasks in it (using the same docker image but with different environment variables).
Some of the tasks come up without problems, but others fail a lot, even though I've used the same VPC, subnet, and security group. The error message shows ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post https://api.ecr..
The bizarre thing is that the same task sometimes comes up if I create a new task definition or delete the ECR repository and re-upload the Docker image.
I'm unable to draw any conclusion from this.
Update: strange... the task starts successfully when I deregister the task definition and recreate it with the same specs. But only once.
It turns out one has to select the taskExecution role under Task Role - override and Task Execution Role - override in the run task Advanced Options section when starting the task. I don't know why it sometimes worked when I tried at random, or why it worked each time I recreated the task definition.
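The programmatic equivalent of those console overrides is the `overrides` parameter of RunTask. A hedged boto3 sketch, with all names and ARNs as placeholders:

```python
def run_task_with_roles(cluster, task_def, exec_role_arn):
    # Mirrors the console's "Task Role - override" and
    # "Task Execution Role - override" advanced options.
    return {
        "cluster": cluster,
        "taskDefinition": task_def,
        "launchType": "FARGATE",
        "overrides": {
            "taskRoleArn": exec_role_arn,
            "executionRoleArn": exec_role_arn,
        },
    }

def start_task():
    import boto3  # local import keeps the builder above testable without AWS deps
    ecs = boto3.client("ecs")
    ecs.run_task(**run_task_with_roles(
        "my-cluster", "my-task:1",
        "arn:aws:iam::123456789012:role/ecsTaskExecutionRole"))
```

A more permanent fix is to set the execution role (and task role) in the task definition itself, so no per-run override is needed at all.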

Amazon ECS: how to schedule a container?

I have a very simple ECS cluster using Fargate. I'd like to schedule a container to be run using a cron expression.
I created the task definition and a rule pointing to it using the EventBridge console, but I see nothing getting launched on the cluster. No logs, not even a trace of anything starting apart from the "monitor" tab of the rule which says it was triggered (but then again, I don't see any logs).
I'm guessing this might have to do with the public IP somehow needed for the rule to pull the container using Fargate? When creating the rule there is a setting called auto-assign public IP address, but it only shows the DISABLED option.
Has anyone had the same problem? Should I just schedule a normal service with sleep times of 24 hours between executions and risk a higher cost? Cheers
Since you mention that you have no issues running the task manually in the cluster, it's likely that the problem with EventBridge is that the role associated with the rule does not have enough permissions to run the task.
You can confirm this by checking CloudTrail logs. You'll find a RunTask event with a failure similar to the following:
User: arn:aws:sts::xxxx:assumed-role/Amazon_EventBridge_Invoke_ECS/xxx is not authorized to perform: ecs:RunTask on resource: arn:aws:ecs:us-east-1:xxxx:task-definition/ECS_task
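To fix that, the rule's role needs `ecs:RunTask` on the task definition, and typically `iam:PassRole` so EventBridge can hand the task's roles to ECS. A minimal policy sketch, with the account ID and ARNs as placeholders matching the error above:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecs:RunTask",
      "Resource": "arn:aws:ecs:us-east-1:xxxx:task-definition/ECS_task:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "*",
      "Condition": {
        "StringLike": { "iam:PassedToService": "ecs-tasks.amazonaws.com" }
      }
    }
  ]
}
```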

How to debug "Resource creation timed out waiting for completion" in AWS Cloudformation?

I'm brand new to AWS and I have a script which I believe should create an ECS cluster.
When I run the script, my stack hangs in the CREATE_IN_PROGRESS state for over an hour. Eventually, it fails and goes into ROLLBACK_COMPLETE.
When I'm in Cloudformation in the AWS console, I can go to "Events" and see that two Services which I'm trying to create are causing stack creation to fail. However, the only error message is Resource creation timed out waiting for completion.
I've tried the steps outlined here, including going into CloudTrail, but I'm not really sure what to look for and haven't found anything to help me resolve the issue. Again, I'm an AWS noob.
What are some steps I can take to get a more detailed error message? How do I go about debugging in AWS?
Any help is appreciated, let me know if I need to provide more info.
I was running into the same situation with CDK, where my ECS deployment would fail after 3 hours of CREATE_IN_PROGRESS. A big issue with debugging and troubleshooting is that when the ROLLBACK happens, it wipes your ECS cluster and the event history. However, if you go to the ECS console's Task list, you should see a task, and I bet it's stuck in a PENDING state. There are a lot of reasons for this. When the task fails to reach the desired state, the reason it failed is added to the service's Events. To get there:
On this page there's an Events tab
Select a Task and it shows that it STOPPED. In my case below, it looks like it couldn't find the ECS container template image
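The same information is available from the API, which is handy when the console history has already been wiped by a rollback. A boto3 sketch, with the cluster and service names as placeholders:

```python
def service_event_messages(describe_response):
    """Extract the human-readable event messages from a DescribeServices response."""
    return [e["message"] for e in describe_response["services"][0]["events"]]

def dump_failure_reasons(cluster="my-cluster", service="my-service"):
    import boto3  # local import keeps the extractor above testable without AWS deps
    ecs = boto3.client("ecs")
    resp = ecs.describe_services(cluster=cluster, services=[service])
    for msg in service_event_messages(resp):
        print(msg)
    # Stopped tasks carry a stoppedReason, e.g. an image pull failure.
    arns = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
    if arns:
        for task in ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]:
            print(task.get("stoppedReason"))
```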

Fargate scheduled task FailedInvocation error

I have a Fargate task that I want to run as a scheduled task every n minutes. I have a task definition that works perfectly as expected (with CloudWatch logs as expected and VPC connections working properly). That is, when I run it as a task or a service. However, when I try to run it as a scheduled task, it does not start. I checked the CloudWatch logs, but there are no log entries in the log group. If I look at the metrics page, I see a FailedInvocations entry under the metric name.
I understand that it is a bit tricky to schedule a task in Fargate, as we have to go to CloudWatch rules and update the scheduled task there in order to add subnets and define a security group; this option is not available when creating the scheduled task through the ECS cluster page.
I also have studied the documentation page here, and also checked this question. But I still cannot understand why it does not work. Thank you in advance.
This seems like an issue with the AWS web interface for scheduled tasks, as it doesn't let me set assignPublicIp to ENABLED.
Without this, the Fargate task cannot pull images from the ECR registry. However, when I started the task with boto3 from a Lambda function invoked by CloudWatch rules, it worked fine.
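That Lambda workaround can be sketched roughly as below. The cluster, task definition, subnet, and security-group IDs are placeholders.

```python
def run_task_params(cluster, task_def, subnets, security_groups):
    # assignPublicIp must be ENABLED (in a public subnet), otherwise the
    # Fargate task cannot reach ECR to pull its image.
    return {
        "cluster": cluster,
        "taskDefinition": task_def,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "ENABLED",
            }
        },
    }

def lambda_handler(event, context):
    import boto3  # local import keeps the builder above testable without AWS deps
    ecs = boto3.client("ecs")
    return ecs.run_task(**run_task_params(
        "my-cluster", "my-task:1", ["subnet-xxxx"], ["sg-xxxx"]))
```

Point the CloudWatch/EventBridge rule at this Lambda instead of directly at the ECS task.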

delete ECS task for old revision - only use new revision

As you can see, I have a task definition for revision 4 and a task definition for revision 5. I want to permanently stop running 4 and only run 5:
So in other words, the task that is PROVISIONING - I only want that one. The task that is RUNNING - I don't want that one to run anymore. How can I achieve this?
I tried to replicate the scenario and it went well for me, so I think you need to dig further under the hood.
Your task is in the PROVISIONING state, which I believe is related to your environment rather than to your task, service, or cluster.
From the AWS Documentation :
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-lifecycle.html
**PROVISIONING**
Amazon ECS has to perform additional steps before the task is launched. For example, for tasks that use the awsvpc network mode, the elastic network interface needs to be provisioned.
You might want to check the following to start debugging:
The CloudFormation template that ECS uses to provision your resources.
Look into your VPC to see whether anything has changed since the last deployment.
Check security groups and IAM roles to find out whether anything is blocking your resource creation.
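If the goal is simply to pin the service to revision 5 and retire revision 4, that can also be done through the API. A boto3 sketch, with the cluster, service, and family names as placeholders:

```python
def pin_revision_params(cluster, service, family, revision):
    # forceNewDeployment replaces any tasks still running the old revision.
    return {
        "cluster": cluster,
        "service": service,
        "taskDefinition": f"{family}:{revision}",
        "forceNewDeployment": True,
    }

def switch_revision(cluster="my-cluster", service="my-service",
                    family="my-task", new=5, old=4):
    import boto3  # local import keeps the builder above testable without AWS deps
    ecs = boto3.client("ecs")
    ecs.update_service(**pin_revision_params(cluster, service, family, new))
    # Deregistering the old revision prevents it from being launched again.
    ecs.deregister_task_definition(taskDefinition=f"{family}:{old}")
```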