I'm running into the following vague error messages while attempting to delete an AWS Batch Compute Environment.
In the output from terraform destroy:
Error: error waiting for Batch Compute Environment (xyzabc) delete: timeout while waiting for resource to be gone (last state: 'DELETING', timeout: 20m0s)
Under the AWS console's "Status reason":
INTERNAL_ERROR - Failed compute environment workflow
The resource's status goes from "DELETING" to "INVALID" after some amount of time.
How can I force the deletion of this resource?
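For context, a hedged sketch of the usual workaround (not part of the original question): make sure no job queue still references the environment, disable and delete it through the CLI, and if it still ends up INVALID, drop it from the Terraform state so the destroy can finish. The Terraform resource address below is a placeholder.
# Check which state the compute environment is in
aws batch describe-compute-environments --compute-environments xyzabc
# A compute environment can only be deleted once it is DISABLED
aws batch update-compute-environment --compute-environment xyzabc --state DISABLED
aws batch delete-compute-environment --compute-environment xyzabc
# If Terraform still times out on it, remove it from state (placeholder address)
terraform state rm aws_batch_compute_environment.xyzabc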
Related
I am deploying a service on AWS using an ApplicationLoadBalancedEc2Service.
Sometimes while doing some testing, I deploy a configuration that results in errors. The problem is that instead of cancelling the deployment, the CDK just hangs for hours. The reason is that AWS keeps trying to spin up a task (which fails due to my wrong configuration).
Right now I have to set the task count to 0 through the AWS console. This lets the deployment complete successfully and allows me to spin up a new version.
Is there a way to cancel the deployment and just rollback after X amount of failed tasks?
One way is to configure CodeDeploy to roll back the service to its previous version if the new deployment fails. This won't "cancel the CDK deployment", but will stabilize the service.
Another way is to add a Custom Resource with an asynchronous provider to poll the ECS service status, signaling CloudFormation if your success condition is not met. This will revert the CDK deployment itself.
You're looking for the Circuit Breaker feature:
declare const cluster: ecs.Cluster;

const loadBalancedEcsService = new ecsPatterns.ApplicationLoadBalancedEc2Service(this, 'Service', {
  cluster,
  memoryLimitMiB: 1024,
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry('test'),
  },
  desiredCount: 2,
  circuitBreaker: { rollback: true },
});
The circuit breaker gives your deployment between 10 and 200 failed task launches (0.5 times your desired task count, clamped to those minimum and maximum values) before it cancels the deployment. The rollback option then re-launches tasks with the previous task definition.
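The same setting can also be enabled outside CDK. A minimal sketch with the AWS CLI, assuming placeholder cluster and service names:
# Turn on the deployment circuit breaker with automatic rollback for an existing service
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true}"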
To reproduce:
Create a CloudFormation stack containing an RDS instance.
Attempt to delete the stack with
aws cloudformation delete-stack --stack-name=[stack name]
aws cloudformation wait stack-delete-complete --stack-name=[stack name]
Once the second command fails because the stack is in DELETE_FAILED, check the stack events list to find this message:
One or more database instances are still members of this parameter group […], so the group cannot be deleted
Manually force deletion of the database instance with
aws rds delete-db-instance --db-instance-identifier=[DB physical ID] --skip-final-snapshot
aws rds wait db-instance-deleted --db-instance-identifier=[DB physical ID]
Repeat step 2.
Once the second command again fails with DELETE_FAILED, check the stack events list to find these messages:
Secrets Manager can't find the specified secret.
and
The following resource(s) failed to delete: [DB logical ID].
Now what? The secret and DB are gone, but the stack can't be deleted.
Last time this happened, I was told by AWS Support to simply wait until the stack "caught up" with the fact that the database instance had been deleted, but that's not ideal, as it takes more than 12 hours.
Sort of relevant: How do I delete an AWS CloudFormation stack that's stuck in DELETE_FAILED status?
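One hedged way out of that state (an assumption, not something from the original post): once the stack is in DELETE_FAILED, you can retry the deletion and tell CloudFormation to skip the resource it can no longer reconcile by passing its logical ID:
# --retain-resources is only accepted while the stack is in DELETE_FAILED
aws cloudformation delete-stack --stack-name=[stack name] --retain-resources [DB logical ID]
aws cloudformation wait stack-delete-complete --stack-name=[stack name]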
AWS EKS cluster 1.18 with the AWS EBS CSI driver. Some pods had statically provisioned EBS volumes, and everything was working.
Then, at some point, all the pods using EBS volumes stopped responding: services waited indefinitely, and the proxy pod killed the connections because of the timeout.
Logs (CloudWatch) for kube-controller-manager were filled with such messages:
kubernetes.io/csi: attachment for vol-00c1763<removed-by-me> failed:
rpc error:
code = NotFound desc = Instance "i-0c356612<removed-by-me>" not found
and
event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"podname-65df9bc5c4-2vtj8", UID:"ad4d30b7-<removed-by-me>", APIVersion:"v1", ResourceVersion:"4414194", FieldPath:""}):
type: 'Warning'
reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "ebs-volumename" :
rpc error: code = NotFound desc = Instance "i-0c356<removed-by-me>" not found
The instance is there; we checked about 20 times. We tried killing the instance so that CloudFormation would create a new one for us, but the error persists, only with a different instance ID.
Next, we started deleting pods, unmounting volumes, and deleting the StorageClass/PVC/PV objects.
kubectl got stuck at the very end, while deleting the PV.
We were only able to get the objects out of this state by patching the finalizers to null on both the PV and the VolumeAttachment objects.
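The patch looked roughly like this (a hedged sketch; object names are placeholders):
# Clear the finalizers so the stuck objects can finally be removed
kubectl patch pv <pv-name> --type merge -p '{"metadata":{"finalizers":null}}'
kubectl patch volumeattachment <attachment-name> --type merge -p '{"metadata":{"finalizers":null}}'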
The logs contain the following:
csi_attacher.go:662] kubernetes.io/csi: detachment for VolumeAttachment for volume [vol-00c176<removed-by-me>] failed:
rpc error: code = Internal desc = Could not detach volume "vol-00c176<removed-by-me>" from node "i-0c3566<removed-by-me>":
error listing AWS instances:
"WebIdentityErr: failed to retrieve credentials\n
caused by: ExpiredTokenException: Token expired: current date/time 1617384213 must be before the expiration date/time1616724855\n\
tstatus code: 400, request id: c1cf537f-a14d<removed-by-me>"
I've read about Kubernetes tokens, but in our case everything is managed by EKS. Googling ExpiredTokenException only brings up pages about how to solve the issue in your own applications; again, we manage everything on AWS using kubectl.
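Not from the original post, but a hedged first step for this symptom: the ExpiredTokenException suggests the CSI controller is still using a stale IRSA web-identity token, and restarting it forces a fresh token to be projected. The deployment name and label assume a standard aws-ebs-csi-driver install.
# Restart the EBS CSI controller so its service-account token is re-issued
kubectl -n kube-system rollout restart deployment ebs-csi-controller
kubectl -n kube-system get pods -l app=ebs-csi-controller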
export TF_WARN_OUTPUT_ERRORS=1
terraform destroy
Error: Error applying plan:
2 error(s) occurred:
module.dev_vpc.aws_internet_gateway.eks_vpc_ig_gw (destroy): 1 error(s) occurred:
aws_internet_gateway.eks_vpc_ig_gw: Error waiting for internet gateway (0980f3434343410c209) to detach: timeout while waiting for state to become 'detached' (last state: 'detaching', timeout: 15m0s)
module.dev_vpc.aws_subnet.production_public_subnets[1] (destroy): 1 error(s) occurred:
aws_subnet.production_public_subnets.1: error deleting subnet (subnet-04ad0a3a0171c861c): timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 20m0s)
There was a load balancer attached to an EC2 instance. I logged on to the AWS console, manually removed the load balancer, and ran terraform destroy again. Everything was destroyed successfully.
I faced the same issue: I deployed the whole infrastructure with Terraform but deployed the application manually, and that process created an ALB. Small, but painful. So first remove the infrastructure that is not managed by Terraform, and then the destroy will work fine.
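In CLI terms, the cleanup amounts to something like this (a hedged sketch; the load balancer ARN is a placeholder):
# List load balancers so you can spot the one Terraform doesn't manage
aws elbv2 describe-load-balancers --query 'LoadBalancers[].{Name:LoadBalancerName,Arn:LoadBalancerArn}'
# Delete the orphaned load balancer, then retry the destroy
aws elbv2 delete-load-balancer --load-balancer-arn <alb-arn>
terraform destroy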
I used to deploy a Java web application to Elastic Beanstalk (EC2) as the root user without this problem. Now I'm using the recommended way of deploying as an IAM service user, and I get the following errors. I suspect it's because of a lack of permissions (policies), but I don't know which policies I should assign to the IAM user.
QUESTION: Could you help me in finding the right policies?
commands:
eb init --profile eb_admin
eb create --single
output of the 2nd command:
Printing Status:
2019-05-26 12:08:58 INFO createEnvironment is starting.
2019-05-26 12:08:59 INFO Using elasticbeanstalk-eu-central-1-726173845157 as Amazon S3 storage bucket for environment data.
2019-05-26 12:09:26 INFO Created security group named: awseb-e-ire9qdzahd-stack-AWSEBSecurityGroup-L5VUAQLDAA9F
2019-05-26 12:09:42 ERROR Stack named 'awseb-e-ire9qdzahd-stack' aborted operation. Current state: 'CREATE_FAILED' Reason: The following resource(s) failed to create: [MountTargetSecurityGroup, AWSEBEIP, sslSecurityGroupIngress, FileSystem].
2019-05-26 12:09:42 ERROR Creating security group failed Reason: The vpc ID 'vpc-7166611a' does not exist (Service: AmazonEC2; Status Code: 400; Error Code: InvalidVpcID.NotFound; Request ID: c1d0ce4d-830d-4b0c-9f84-85d8da4f7243)
2019-05-26 12:09:42 ERROR Creating EIP: 54.93.84.166 failed. Reason: Resource creation cancelled
2019-05-26 12:09:42 ERROR Creating security group ingress named: sslSecurityGroupIngress failed Reason: Resource creation cancelled
2019-05-26 12:09:44 INFO Launched environment: stack-overflow-dev. However, there were issues during launch. See event log for details.
Important!
I use a few .ebextensions scripts in order to initialize the environment:
nginx
https-instance-securitygroup
storage-efs-createfilesystem
storage-efs-mountfilesystem
After reviewing the logs, I also noticed that I forgot to create the VPC that is required for the EFS filesystem. Could it be that one failed script (storage-efs-createfilesystem) is the root cause of the subsequent failing operations?
Yes, the lack of a VPC caused the other resources to fail to create. Elastic Beanstalk and the storage-efs-createfilesystem extension use CloudFormation underneath.
The storage-efs-createfilesystem CloudFormation template creates the MountTargetSecurityGroup security group, and that failed due to the missing VPC. Creation of the AWSEBEIP, sslSecurityGroupIngress, and FileSystem resources was then cancelled.
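If the account really has no VPC left (as the InvalidVpcID.NotFound error suggests), one hedged way to unblock the environment creation is to recreate the default VPC before running eb create again:
# Recreate the default VPC (only succeeds if the account currently has none)
aws ec2 create-default-vpc --profile eb_admin
eb create --single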