How to delete too many Cloudformation stacks with status DELETE_COMPLETE - amazon-web-services

At my current site there is a very large number of cloudformation stacks in one account.
If we make an AWS CLI call to list all stacks, we get an error message saying the request has been dynamically throttled, and the request fails.
As per AWS documentation advice to avoid dynamic throttling, I implemented a script to download in smaller chunks, using pagination and exponential delays.
This succeeded, but if we could get rid of the many stacks in DELETE_COMPLETE status, this would remove around 800 stacks and the request would complete successfully.
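For reference, a minimal sketch of the kind of paginated listing with exponential delays described above. It assumes a boto3-style CloudFormation client is passed in, and that throttling surfaces as an error whose code is "Throttling"; the function name and the retry defaults are illustrative, not the actual script.

```python
import random
import time


def list_all_stacks(client, max_retries=8, base_delay=1.0, sleep=time.sleep):
    """Collect every stack summary page by page, backing off when throttled.

    `client` is assumed to behave like boto3.client("cloudformation").
    """
    summaries = []
    token = None
    while True:
        kwargs = {"NextToken": token} if token else {}
        for attempt in range(max_retries):
            try:
                page = client.list_stacks(**kwargs)
                break
            except Exception as exc:
                # botocore's ClientError exposes the error code via exc.response.
                code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
                if code != "Throttling" or attempt == max_retries - 1:
                    raise
                # Exponential delay plus a little jitter: about 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
        summaries.extend(page.get("StackSummaries", []))
        token = page.get("NextToken")
        if not token:
            return summaries
```

With real credentials this would be called as list_all_stacks(boto3.client("cloudformation")). The sleep parameter exists only so the delays can be replaced in a dry run.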
How can I remove AWS Cloudformation stacks that are in DELETE_COMPLETE status?
We are also seeing problems in the Cloudformation console with the simplest operations timing out due to the large number of stacks. A request has been raised with AWS for this. The console is useful for development and debugging although all our deployments are automated.
I found an old forum post saying these stacks will auto-delete after 90 days, but we have 800+ of these, some much older than that, and they are still there.
If I delete one of the stacks with a CLI call, like this:
aws cloudformation delete-stack --stack-name arn:aws:cloudformation:eu-west-1:123456789:stack/my-stack-name-here/87654321-1aaa-11aa-00a1-0aa1a0000000
The delete call terminates with no errors but the stack remains as it was.
I can see the call has executed in Cloudtrail.
It looks like the delete-stack operation does nothing if the status is already set to DELETE_COMPLETE.
We need to delete these stacks because there are about 800 of them and we have so many stacks that the console is giving us errors for the simplest tasks, like searching for a stack to edit it.
We did increase the quota size (max number of stacks) via an AWS request but the throttling kicks in when we try to list them all, because there are so many of them.

"I found an old forum post saying they will auto-delete after 90 days"
Expiry after a set period is currently the only way deleted stack records are removed.
"We have increased the quota size (max number of stacks) via an AWS request"
The quota for the maximum number of stacks applies only to active stacks, so it is unrelated to this problem.
"the throttling kicks in when we use the console for ordinary actions because there are so many of them"
The console has a stack status dropdown next to the search bar that filters by stack status.
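The same idea can be applied to scripted listing: the ListStacks API accepts a StackStatusFilter, so deleted records can be left out of the response. A rough sketch, assuming a boto3-style client is passed in; the status list is the set I know from the API reference and should be checked against the current documentation, and the function names are made up for illustration.

```python
# Stack statuses as I understand them from the CloudFormation API reference
# (an assumption: verify against the current documentation before relying on it).
ALL_STATUSES = [
    "CREATE_IN_PROGRESS", "CREATE_FAILED", "CREATE_COMPLETE",
    "ROLLBACK_IN_PROGRESS", "ROLLBACK_FAILED", "ROLLBACK_COMPLETE",
    "DELETE_IN_PROGRESS", "DELETE_FAILED", "DELETE_COMPLETE",
    "UPDATE_IN_PROGRESS", "UPDATE_COMPLETE_CLEANUP_IN_PROGRESS",
    "UPDATE_COMPLETE", "UPDATE_FAILED",
    "UPDATE_ROLLBACK_IN_PROGRESS", "UPDATE_ROLLBACK_FAILED",
    "UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS", "UPDATE_ROLLBACK_COMPLETE",
    "REVIEW_IN_PROGRESS",
    "IMPORT_IN_PROGRESS", "IMPORT_COMPLETE",
    "IMPORT_ROLLBACK_IN_PROGRESS", "IMPORT_ROLLBACK_FAILED",
    "IMPORT_ROLLBACK_COMPLETE",
]


def active_status_filter(statuses=ALL_STATUSES):
    """Every status except DELETE_COMPLETE."""
    return [s for s in statuses if s != "DELETE_COMPLETE"]


def list_active_stacks(client):
    """Return the names of all stacks that are not deleted records.

    `client` is assumed to behave like boto3.client("cloudformation").
    """
    paginator = client.get_paginator("list_stacks")
    names = []
    for page in paginator.paginate(StackStatusFilter=active_status_filter()):
        names.extend(s["StackName"] for s in page["StackSummaries"])
    return names
```

This does not remove the deleted records; it only keeps them out of each response, which should mean fewer pages to fetch.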

Related

CloudFormation stack stuck in 'Create-In-Progress'

I have a cloudformation stack, which I am deploying to via my cdk package. My package contains 3 constructs (a Route53 hostedZone, a dnsValidationCertificate, and an IAM role). On a previous account, with the same stack, this took 5 minutes to deploy. However, my stack has been stuck on a 'Create In Progress' state for the past 3 hours, indicating something is definitely wrong. Is there something I could do?
It sounds like the certificate is stuck in pending state waiting for domain ownership verification. Are you able to view your stuck stack in the AWS CloudFormation console, check Events, and inspect the Resources created?
https://docs.aws.amazon.com/acm/latest/userguide/domain-ownership-validation.html
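If the console is awkward to use, the Events check can be approximated in a script. The sketch below assumes you have already fetched the event list (for example the StackEvents field returned by describe_stack_events in boto3) and picks out resources whose most recent event is still an in-progress status. The function name is hypothetical.

```python
def stuck_resources(events):
    """Return (logical id, type, reason) for resources still in progress.

    `events` is assumed to be a list of dicts shaped like the StackEvents
    entries from describe_stack_events. Only the newest event per resource
    is considered, regardless of the order of the input list.
    """
    latest = {}
    for event in events:
        key = event["LogicalResourceId"]
        if key not in latest or event["Timestamp"] > latest[key]["Timestamp"]:
            latest[key] = event
    return [
        (e["LogicalResourceId"], e["ResourceType"], e.get("ResourceStatusReason", ""))
        for e in latest.values()
        if e["ResourceStatus"].endswith("_IN_PROGRESS")
    ]
```

Note that the stack itself appears in its own event list, so it will show up in the result alongside whichever resource is actually holding things up.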

AWS CloudFormation stack stuck in the state UPDATE_ROLLBACK_IN_PROGRESS

I wanted to update my stack. The stack failed with error Function not found: arn:aws:lambda....
The stack has now been in status UPDATE_ROLLBACK_IN_PROGRESS for more than 5 hours. How do I stop this process?
If you deleted the function outside of CloudFormation, then you can manually create a new function of the same name. This sometimes helps.
You can also wait until the rollback times out. It usually does after a while, but the time varies.
Another reason why it gets stuck in this state could be due to nested stacks:
Nested Stacks are Stuck in UPDATE_COMPLETE_CLEANUP_IN_PROGRESS, UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS, or UPDATE_ROLLBACK_IN_PROGRESS
In this case a recommended option is indeed to contact support:
To fix the stack, contact AWS customer support.
A recent AWS blog post also describes the issue and possible solutions:
Why is my AWS CloudFormation stack stuck in the state CREATE_IN_PROGRESS, UPDATE_IN_PROGRESS, UPDATE_ROLLBACK_IN_PROGRESS, or DELETE_IN_PROGRESS?
Regarding the time to wait, the timeout varies:
In most situations, you must wait for your AWS CloudFormation stack to time out. The timeout length varies, and is based on the individual resource stabilization requirements that AWS CloudFormation waits for to reach the desired state.
In our case, we had mistakenly passed the wrong image name to the CloudFormation template. After realising the mistake, we tried to stop the stack update, which left the stack stuck indefinitely in UPDATE_ROLLBACK_IN_PROGRESS status. It was stuck at the ECS service step.
Solution:
In the stack events, check which step is in progress (in our case, the ECS service update).
Go to the ECS service.
Click on Update service.
Choose an older task definition.
Click Update.
Your task definition is reset to the previous version, and the rollback will complete successfully.
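The console steps above can also be scripted. This is only a sketch: it assumes a boto3-style ECS client is passed in, and it assumes that the revision immediately before the current one is the known-good task definition, which may not be true in your account. The function names are made up.

```python
def previous_task_definition(task_definition):
    """Turn 'family:7' (or a full ARN ending in ':7') into the ':6' form."""
    prefix, _, revision = task_definition.rpartition(":")
    number = int(revision)
    if number <= 1:
        raise ValueError("no earlier revision to roll back to")
    return f"{prefix}:{number - 1}"


def roll_back_service(ecs, cluster, service):
    """Point the service at the revision before its current task definition.

    `ecs` is assumed to behave like boto3.client("ecs"). Assumption: the
    previous revision still exists and is the one that worked.
    """
    described = ecs.describe_services(cluster=cluster, services=[service])
    current = described["services"][0]["taskDefinition"]
    target = previous_task_definition(current)
    ecs.update_service(cluster=cluster, service=service, taskDefinition=target)
    return target
```

If the last working revision is not simply the previous number, pass that revision to update_service directly instead of using the helper.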

AWS CloudFormation Rate Exceeded

I am running a multi-branch pipeline in Jenkins for CI/CD that deploys a CloudFormation stack to my AWS account. Occasionally, when multiple developers push to their branches at the same time, I receive this error on one or more branches:
com.amazonaws.services.cloudformation.model.AmazonCloudFormationException:
Rate exceeded (Service: AmazonCloudFormation; Status Code: 400; Error
Code: Throttling;
This seems to be a rate limit that Amazon has imposed on the number of requests to CloudFormation within a specified time frame.
What is the request limit of CloudFormation, and can I request a limit increase?
No, not for requests to the CloudFormation API.
Most likely the issue is that the Jenkins pipeline polls for updates every few seconds to get the current status. When you are deploying multiple stacks at once, you will hit this error.
This is probably a bug in the CloudFormation plugin for Jenkins. You'll need to raise a ticket and ask them to implement a backoff if the stack is taking longer than expected, so that it doesn't keep requesting the stack status as often.
You could also change your Jenkinsfiles to use the aws-cli, which does a better job of managing requests to AWS during stack updates.
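To illustrate what a backoff on status polling could look like, here is a small sketch. It is not the Jenkins plugin's code; it assumes a boto3-style CloudFormation client is passed in, and the delay values are arbitrary.

```python
import time


def wait_for_stack(client, stack_name, initial_delay=5.0, max_delay=60.0,
                   max_polls=50, sleep=time.sleep):
    """Poll the stack status, doubling the wait between polls up to max_delay.

    `client` is assumed to behave like boto3.client("cloudformation").
    Returns the first status that is not an in-progress status.
    """
    delay = initial_delay
    for _ in range(max_polls):
        stack = client.describe_stacks(StackName=stack_name)["Stacks"][0]
        status = stack["StackStatus"]
        if not status.endswith("_IN_PROGRESS"):
            return status
        sleep(delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError(f"{stack_name} still in progress after {max_polls} polls")
```

With several branches deploying at once, each pipeline would then make fewer status calls the longer its stack takes, rather than polling at a fixed interval.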

For loop in AWS step functions

We have 20 AWS accounts and we create resources in 10 regions in each account. We want to ensure that AWS resources - ELB, AMI and EBS snapshots are properly tagged. We want to have a service that runs periodically to scan the accounts and delete any of the above mentioned resource that is not properly tagged. We want this to be serverless and we were looking at using Lambda. However, there are 2 issues with Lambda:
Lambda timeout - currently it is 5 mins.
Throttling errors
We need to ensure that we process the next account after the first account processing is completed (we could put a hard sleep for a few minutes and then start processing the next account).
Has someone faced a similar scenario and if so, how was it achieved?
Worst case scenario: we will use ECS.
First, can your innermost task complete in under 5 minutes reliably? If so Lambda is a good fit. Your situation looks to be a good fit.
Next, throttling is easily raised by requesting a higher limit through a support ticket.
Finally, try breaking this up into several smaller functions. Maybe something like this:
delete-resource -- Deletes a single untagged resource
get-untagged-resources -- gets untagged resources in an account and invokes "delete-resource" in an async.each loop
get-accounts -- gets list of accounts and invokes "get-untagged-resources" in an async.each loop
I actually prefer having my functions triggered by SNS rather than invoking them directly, but you get the idea. Hope this helps.
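A rough outline of that three-level split, written as plain Python functions so the shape is visible. Everything here is hypothetical: the required tag names, the resource dict layout, and the injected list_resources and deleter callables stand in for real AWS calls, and the direct function calls stand in for Lambda invocations or SNS messages.

```python
REQUIRED_TAGS = {"owner", "environment"}  # hypothetical tagging policy


def is_properly_tagged(resource, required=REQUIRED_TAGS):
    """True when every required tag key is present on the resource."""
    return required <= set(resource.get("tags", {}))


def delete_resource(resource, deleter):
    """Innermost unit of work: delete a single untagged resource."""
    deleter(resource["id"])


def get_untagged_resources(account, list_resources, deleter):
    """Scan one account and hand each untagged resource to delete_resource."""
    deleted = []
    for resource in list_resources(account):
        if not is_properly_tagged(resource):
            delete_resource(resource, deleter)
            deleted.append(resource["id"])
    return deleted


def get_accounts(accounts, list_resources, deleter):
    """Outer loop: process the accounts one after another."""
    return {
        account: get_untagged_resources(account, list_resources, deleter)
        for account in accounts
    }
```

In a real deployment each function would be its own Lambda, so only delete_resource needs to finish within the timeout for a single resource.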

My cloud formation template fails to create resource without producing any error

I have a large template for CloudFormation that has hundreds of resources. All are successfully updated, during an update, except one: an SNS alarm topic.
When deploying the stack, I get no errors, but even if the topic is non-existent the topic is never created.
I'm not expecting anyone to be able to provide me with a solution, but I would simply like to know how to troubleshoot the problem. It would be helpful to get output from the deployment, but the events are so few and really don't reflect the amount of resources being updated/created that they rarely help finding out what goes wrong.
Validation of the template is also successful, but that's almost a given since deploying also succeeds.
Regarding troubleshooting live CloudFormation stacks in general, CloudFormation just recently added support for Change Sets, which should help you preview changes and troubleshoot potential issues with updates before you attempt to apply them to your running stack.
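As a sketch of how a change set could help here: once a change set exists, its description lists each resource CloudFormation intends to add, modify, or remove, so you can check whether the missing topic is in the plan at all. The helpers below assume the dict shape returned by boto3's describe_change_set; the function names are illustrative.

```python
def summarize_changes(change_set):
    """Return (action, logical id, resource type) for each planned change.

    `change_set` is assumed to be shaped like a describe_change_set response.
    """
    rows = []
    for change in change_set.get("Changes", []):
        resource_change = change.get("ResourceChange")
        if resource_change:
            rows.append((
                resource_change["Action"],
                resource_change["LogicalResourceId"],
                resource_change["ResourceType"],
            ))
    return rows


def missing_from_change_set(change_set, logical_id):
    """True when the given logical id does not appear in the planned changes."""
    return all(lid != logical_id for _, lid, _ in summarize_changes(change_set))
```

If the topic's logical ID is missing from a change set where you expected an Add, CloudFormation believes the resource already exists or sees no difference, which narrows down where to look. Large change sets are paginated, so a full check would need to follow NextToken.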
Regarding SNS topics specifically, creating an SNS topic from scratch using the AWS::SNS::Topic resource works correctly. However, if you are using a TopicName property in your SNS resource, make sure that the name is unique across your entire AWS Account, as noted in the documentation:
Resource names must be unique across all of your active stacks. If you reuse templates to create multiple stacks, you must change or remove custom names from your template.
So reusing a constant TopicName in a stack deployed multiple times could cause the issue you're describing.
Also, if you're attempting to update existing CloudFormation-created topics with added/removed subscriptions, note the following important notice in the documentation:
Important
After you create an Amazon SNS topic, you cannot update its properties by using AWS CloudFormation. You can modify an Amazon SNS topic by using the AWS Management Console.
As a potential workaround for adding/removing subscriptions to existing SNS topics via CloudFormation, there is a cloudformation-helpers library containing a Custom::SnsSubscription resource (example) that might help.