AWS CloudFormation stack stuck in the state UPDATE_ROLLBACK_IN_PROGRESS

I wanted to update my stack. The update failed with the error Function not found: arn:aws:lambda....
The stack has now been in status UPDATE_ROLLBACK_IN_PROGRESS for more than 5 hours. How do I stop this process?

If you deleted the function outside of CloudFormation, you can manually create a new function with the same name. This sometimes helps.
You can also wait until the rollback times out. It usually does after a while, but the time varies.
Another reason why it gets stuck in this state could be due to nested stacks:
Nested Stacks are Stuck in UPDATE_COMPLETE_CLEANUP_IN_PROGRESS, UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS, or UPDATE_ROLLBACK_IN_PROGRESS
In this case a recommended option is indeed to contact support:
To fix the stack, contact AWS customer support.
A recent AWS blog post also describes the issue and possible solutions:
Why is my AWS CloudFormation stack stuck in the state CREATE_IN_PROGRESS, UPDATE_IN_PROGRESS, UPDATE_ROLLBACK_IN_PROGRESS, or DELETE_IN_PROGRESS?
Regarding the time to wait, the timeout varies:
In most situations, you must wait for your AWS CloudFormation stack to time out. The timeout length varies, and is based on the individual resource stabilization requirements that AWS CloudFormation waits for to reach the desired state.
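To see which resource the rollback is actually waiting on, it can help to reduce the stack events to the latest status per resource. The following is only a sketch: it assumes the events are in the shape returned by the DescribeStackEvents API (dicts with LogicalResourceId and ResourceStatus, newest first), and the function name is made up for illustration.

```python
def resources_in_progress(stack_events):
    """Return {logical_id: status} for resources whose latest status is *_IN_PROGRESS.

    Assumes stack_events is ordered newest first, which is how
    DescribeStackEvents returns them.
    """
    latest = {}
    for event in stack_events:
        # The first event seen for a resource is its most recent one.
        latest.setdefault(event["LogicalResourceId"], event["ResourceStatus"])
    return {
        logical_id: status
        for logical_id, status in latest.items()
        if status.endswith("_IN_PROGRESS")
    }
```

With boto3 the input would typically come from cloudformation.describe_stack_events(StackName=...)["StackEvents"]. The stack itself also appears in its own events, so expect it in the result next to the resource that is really stuck.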

In our case, we mistakenly passed the wrong image name to the CloudFormation template. After realising the mistake, we tried to stop the stack update, which left the stack stuck indefinitely in UPDATE_ROLLBACK_IN_PROGRESS status. So it got stuck during ECS service creation.
Solution:
1) In the stack events, check which step is still in progress (in our case, the ECS service update).
2) Go to the ECS service.
3) Click Update service.
4) Choose the older task definition.
5) Click Update.
The task definition is reset to the previous version, and the rollback completes successfully.
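The same console steps could be scripted. This is a rough sketch, not a tested procedure: it assumes the task definition to return to is simply the revision before the current one (which may not be true in your account), and the helper names are hypothetical. The ecs argument is expected to behave like a boto3 ECS client.

```python
def previous_revision(task_definition):
    """Turn 'family:12' (or a full ARN ending in ':12') into the ':11' form."""
    prefix, revision = task_definition.rsplit(":", 1)
    number = int(revision)
    if number <= 1:
        raise ValueError("no earlier revision exists")
    return f"{prefix}:{number - 1}"


def roll_back_service(ecs, cluster, service):
    """Point the service at the previous task definition revision."""
    described = ecs.describe_services(cluster=cluster, services=[service])
    current = described["services"][0]["taskDefinition"]
    # Assumption: the revision directly before the current one is the good one.
    target = previous_revision(current)
    ecs.update_service(cluster=cluster, service=service, taskDefinition=target)
    return target
```

For example, roll_back_service(boto3.client("ecs"), "my-cluster", "my-service"), where the cluster and service names are placeholders. Check that the computed revision is the one you actually want before running it.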

Related

CloudFormation stack stuck in 'Create-In-Progress'

I have a CloudFormation stack that I deploy via my CDK package. The package contains 3 constructs (a Route53 hostedZone, a dnsValidationCertificate, and an IAM role). On a previous account, the same stack took 5 minutes to deploy. However, my stack has been stuck in a 'Create In Progress' state for the past 3 hours, so something is definitely wrong. Is there something I could do?
It sounds like the certificate is stuck in pending state waiting for domain ownership verification. Are you able to view your stuck stack in the AWS CloudFormation console, check Events, and inspect the Resources created?
https://docs.aws.amazon.com/acm/latest/userguide/domain-ownership-validation.html

How to delete too many Cloudformation stacks with status DELETE_COMPLETE

At my current site there is a very large number of cloudformation stacks in one account.
If we make an AWS CLI call to list all stacks, we get an error message saying the request has been dynamically throttled, and the request fails.
As per AWS documentation advice to avoid dynamic throttling, I implemented a script to download in smaller chunks, using pagination and exponential delays.
This succeeded, but if we could get rid of the many stacks in DELETE_COMPLETE status, this would remove around 800 stacks and the original call would complete successfully.
How can I remove AWS Cloudformation stacks that are in DELETE_COMPLETE status?
We are also seeing problems in the Cloudformation console with the simplest operations timing out due to the large number of stacks. A request has been raised with AWS for this. The console is useful for development and debugging although all our deployments are automated.
I found an old forum post saying these stacks will auto-delete after 90 days, but we have 800+ of these, some much older than that, and they are still there.
If I delete one of the stacks with a CLI call, like this:
aws cloudformation delete-stack --stack-name arn:aws:cloudformation:eu-west-1:123456789:stack/my-stack-name-here/87654321-1aaa-11aa-00a1-0aa1a0000000
The delete call terminates with no errors but the stack remains as it was.
I can see the call has executed in Cloudtrail.
It looks like the delete-stack operation does nothing if the status is already set to DELETE_COMPLETE.
We need to delete these stacks because there are about 800 of them and we have so many stacks that the console is giving us errors for the simplest tasks, like searching for a stack to edit it.
We did increase the quota size (max number of stacks) via an AWS request but the throttling kicks in when we try to list them all, because there are so many of them.
"I found an old forum post saying they will auto-delete after 90 days"
Waiting for deleted stack records to expire is currently the only way they can be removed.
"We have increased the quota size (max number of stacks) via an AWS request"
The quota for the maximum number of stacks only applies to active stacks, so this is unrelated.
"the throttling kicks in when we use the console for ordinary actions because there are so many of them"
The console has a stack status dropdown next to the search bar that filters by stack status.
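Outside the console, the same idea is to ask only for the statuses you care about, so records in DELETE_COMPLETE never come back, and to page through the results with a backoff on throttling. A minimal sketch, assuming a client that behaves like boto3's CloudFormation client (list_stacks with StackStatusFilter and NextToken) and assuming throttling errors can be recognised by the word "Throttling" in the exception text. The status list is illustrative, not exhaustive.

```python
import time

# Illustrative subset of statuses; extend it with whatever you need to see.
ACTIVE_STATUSES = [
    "CREATE_COMPLETE",
    "UPDATE_COMPLETE",
    "UPDATE_ROLLBACK_COMPLETE",
    "ROLLBACK_COMPLETE",
]


def list_stacks_by_status(cfn, statuses, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Return stack names in the given statuses, one page at a time."""
    names = []
    token = None
    while True:
        kwargs = {"StackStatusFilter": statuses}
        if token:
            kwargs["NextToken"] = token
        for attempt in range(max_retries + 1):
            try:
                page = cfn.list_stacks(**kwargs)
                break
            except Exception as exc:
                # Assumption: throttling shows up as "Throttling" in the message.
                if "Throttling" not in str(exc) or attempt == max_retries:
                    raise
                sleep(base_delay * 2 ** attempt)  # exponential backoff
        names.extend(summary["StackName"] for summary in page["StackSummaries"])
        token = page.get("NextToken")
        if not token:
            return names
```

A call might look like list_stacks_by_status(boto3.client("cloudformation"), ACTIVE_STATUSES). This does not delete the DELETE_COMPLETE records; it only keeps them out of the response.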

Creation and Scheduled deletion of AWS Cloud Formation Stack

I am trying to set up an environment on AWS by launching a stack from an AWS CloudFormation template. The stack should be created and then scheduled for automatic deletion based on the TTL parameter in the template. The problem occurs only when the instance is being launched: it errors out with "Failed to receive 1 resource signal(s) within the specified duration"
If anyone could point out what I am doing wrong in the template, it would be great.
Here is the link for the template in YAML: https://s3.ca-central-1.amazonaws.com/rkbucket028/aws-openshit-cf-template_new.yml
I have already followed this article, but there seems to be something wrong with it as well:
https://aws.amazon.com/blogs/devops/scheduling-automatic-deletion-of-application-environments/#
CloudFormation rolls back if any resource fails to be created (i.e. it didn't respond within the predetermined duration). If you believe the process is simply taking longer and this is not a genuine failure, you can either incorporate a wait condition or, better, use the resource's CreationPolicy with its timeout and count settings.
Source:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html
PS: Your template URL is not accessible. Check the bucket and file permissions in S3 and share a public URL.

CloudFormation issue: couldn't delete stack

I created a CloudFormation template for our resources; it includes Lambda functions, API Gateways, roles, etc. To verify the template, I create a CloudFormation stack from it, check the resources I updated, and then delete the stack. But the last time I tried to delete the stack, I got this message:
CloudFormation is waiting for NetworkInterfaces associated with the Lambda Function to be cleaned up.
I tried to stop the deletion process and restart it, but I faced the same issue again. What is the problem and how can I fix it?
This is a well-known issue. There are a couple of things you can do.
1) Wait for the deletion to fail, then try to delete again. It should show you a checkbox to skip the NetworkInterface. Select that.
2) Go to EC2 > Network Interfaces and detach/delete the network interface that was used by your resources. Then delete your CloudFormation stack.
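For option 2, finding the right network interfaces by hand can be tedious. The sketch below sorts interfaces into those that look deletable and those still attached. It assumes the input has the shape returned by the EC2 DescribeNetworkInterfaces API, and it assumes that interfaces created for Lambda carry a description starting with "AWS Lambda VPC ENI"; verify that against your own account before deleting anything.

```python
# Assumed description prefix for Lambda-created interfaces; check your account.
LAMBDA_ENI_PREFIX = "AWS Lambda VPC ENI"


def leftover_lambda_enis(network_interfaces):
    """Split Lambda-created interfaces into (deletable_ids, still_attached_ids)."""
    deletable, attached = [], []
    for eni in network_interfaces:
        if not eni.get("Description", "").startswith(LAMBDA_ENI_PREFIX):
            continue  # not created for Lambda, leave it alone
        if eni.get("Status") == "available":
            deletable.append(eni["NetworkInterfaceId"])  # detached, can be deleted
        else:
            attached.append(eni["NetworkInterfaceId"])  # still in use
    return deletable, attached
```

With boto3, the input would come from ec2.describe_network_interfaces()["NetworkInterfaces"], and each deletable ID could then be passed to ec2.delete_network_interface(NetworkInterfaceId=...).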

For loop in AWS step functions

We have 20 AWS accounts and we create resources in 10 regions in each account. We want to ensure that AWS resources (ELBs, AMIs and EBS snapshots) are properly tagged. We want a service that runs periodically to scan the accounts and delete any of the above-mentioned resources that are not properly tagged. We want this to be serverless and we were looking at using Lambda. However, there are 2 issues with Lambda:
1) Lambda timeout: currently it is 5 minutes.
2) Throttling errors.
We need to ensure that we process the next account only after processing of the first account has completed (we could put in a hard sleep for a few minutes and then start processing the next account).
Has someone faced a similar scenario and, if so, how was it achieved?
Worst case scenario: we will use ECS.
First, can your innermost task reliably complete in under 5 minutes? If so, Lambda is a good fit, and your situation looks like one.
Next, throttling limits are easily raised by requesting a higher limit through a support ticket.
Finally, try breaking this up into several smaller functions. Maybe something like this:
delete-resource -- deletes a single untagged resource
get-untagged-resources -- gets untagged resources in an account and invokes "delete-resource" in an async.each loop
get-accounts -- gets the list of accounts and invokes "get-untagged-resources" in an async.each loop
I actually prefer having my functions triggered by SNS rather than invoking them directly, but you get the idea. Hope this helps.
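The three-level split above can be sketched in plain Python to show the shape of the loops. This is not Lambda code and it runs sequentially rather than with async.each; the function names, the "Id" key, and the EC2-style "Tags" list are assumptions made for the example. Listing and deleting are passed in as callables so each level stays small.

```python
def is_untagged(resource, required_keys):
    """True if the resource is missing any of the required tag keys."""
    present = {tag["Key"] for tag in resource.get("Tags", [])}
    return not set(required_keys) <= present


def clean_account(account, list_resources, delete_resource, required_keys):
    """Middle level: find untagged resources in one account and delete each."""
    deleted = []
    for resource in list_resources(account):
        if is_untagged(resource, required_keys):
            delete_resource(account, resource)  # innermost level: one resource
            deleted.append(resource["Id"])
    return deleted


def clean_all(accounts, list_resources, delete_resource, required_keys):
    """Outer level: process accounts one after another."""
    return {
        account: clean_account(account, list_resources, delete_resource, required_keys)
        for account in accounts
    }
```

Because each account is finished before the next one starts, this also matches the requirement in the question that accounts are processed one at a time.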