How do I resolve PodEvictionFailure error in AWS EKS?

I am trying to upgrade my node group in AWS EKS.
I am using CDK, and I am getting the following error:
Resource handler returned message: "[ErrorDetail(ErrorCode=PodEvictionFailure, ErrorMessage=Reached max retries while trying to evict pods from nodes in node group <node-group-name>, ResourceIds=[<node-name>])] (Service: null, Status Code: 0, Request ID: null)" (RequestToken: <request-token>, HandlerErrorCode: GeneralServiceException)
According to the AWS documentation, PodEvictionFailure can occur if a deployment tolerates every taint, because the node can then never become empty.
https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html#managed-node-update-upgrade
Deployment tolerating all the taints – Once every pod is evicted, it's expected for the node to be empty because the node is tainted in the earlier steps. However, if the deployment tolerates every taint, then the node is more likely to be non-empty, leading to pod eviction failure.
I checked my nodes and all the pods running on them, and found the following pods, which tolerate every taint.
Both of the following pods have the tolerations shown below.
Pod: kube-system/aws-node-pdmbh
Pod: kube-system/kube-proxy-7n2kf
{
...
...
"tolerations": [
{
"operator": "Exists"
},
{
"key": "node.kubernetes.io/not-ready",
"operator": "Exists",
"effect": "NoExecute"
},
{
"key": "node.kubernetes.io/unreachable",
"operator": "Exists",
"effect": "NoExecute"
},
{
"key": "node.kubernetes.io/disk-pressure",
"operator": "Exists",
"effect": "NoSchedule"
},
{
"key": "node.kubernetes.io/memory-pressure",
"operator": "Exists",
"effect": "NoSchedule"
},
{
"key": "node.kubernetes.io/pid-pressure",
"operator": "Exists",
"effect": "NoSchedule"
},
{
"key": "node.kubernetes.io/unschedulable",
"operator": "Exists",
"effect": "NoSchedule"
},
{
"key": "node.kubernetes.io/network-unavailable",
"operator": "Exists",
"effect": "NoSchedule"
}
]
}
Do I need to change the tolerations of these pods so they no longer tolerate all taints? If so, how, given that these pods are managed by AWS?
How can I avoid PodEvictionFailure?

As suggested by @Ola Ekdahl, and also in the Amazon AWS doc you shared, it's better to use the force flag rather than change the tolerations of the pods. See: https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html ("Upgrade phase" #2)
You can add the force flag as shown below; with forceUpdate: true the node group update proceeds and terminates the old nodes even if the pods on them cannot be drained, instead of failing with PodEvictionFailure:
new eks.Nodegroup(this, 'myNodeGroup', {
cluster: this.cluster,
forceUpdate: true,
releaseVersion: '<AMI ID obtained from changelog>',
...
});

Related

How do I successfully retrieve an ALB ListenerArn with CloudFormation to setup ListenerRules?

I'm starting to think there is a fundamental flaw in AWS CloudFormation template validation/resource lookup related to "Type": "AWS::ElasticLoadBalancingV2::ListenerRule" resources.
Specifically, every time I try to create a new ListenerRule for known working Listeners, CloudFormation errors out with
Unable to retrieve ListenerArn attribute for AWS::ElasticLoadBalancingV2::Listener, with error message One or more listeners not found (Service: ElasticLoadBalancingV2, Status Code: 400, Request ID: c6914f71-074c-4367-983a-bcf1d8fd1350, Extended Request ID: null)
Upon testing, I can make it work by hardcoding the ListenerArn attribute in my template, but that's not a solution, since the template is used for multiple stacks with different resources.
Below are the relevant parts of the template:
"WLBListenerHttp": {
"Type": "AWS::ElasticLoadBalancingV2::Listener",
"Properties": {
"DefaultActions": [{
"Type": "forward",
"TargetGroupArn": { "Ref": "WLBTargetGroupHttp" }
}],
"LoadBalancerArn": { "Ref": "WebLoadBalancer" },
"Port": 80,
"Protocol": "HTTP"
}
},
"ListenerRuleHttp": {
"DependsOn": "WLBListenerHttp",
"Type": "AWS::ElasticLoadBalancingV2::ListenerRule",
"Properties": {
"Actions": [{
"Type": "fixed-response",
"FixedResponseConfig": { "StatusCode": "200" }
}],
"Conditions": [{
"Field": "host-header",
"HostHeaderConfig": { "Values": ["domain*"] }
}, {
"Field": "path-pattern",
"PathPatternConfig": { "Values": ["/path/to/respond/to"] }
}],
"ListenerArn": { "Fn::GetAtt": ["WLBListenerHttp", "ListenerArn"] },
"Priority": 1
}
},
Per the documentation on listeners, Fn::GetAtt or Ref should both return the ListenerARN:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticloadbalancingv2-listener.html
"Return values
Ref
When you pass the logical ID of this resource to the intrinsic Ref function, Ref returns the Amazon Resource Name (ARN) of the listener.
For more information about using the Ref function, see Ref.
Fn::GetAtt
The Fn::GetAtt intrinsic function returns a value for a specified attribute of this type. The following are the available attributes and sample return values.
For more information about using the Fn::GetAtt intrinsic function, see Fn::GetAtt.
ListenerArn
The Amazon Resource Name (ARN) of the listener."
I've tried both "ListenerArn": { "Fn::GetAtt": ["WLBListenerHttp", "ListenerArn"] }, and "ListenerArn": { "Ref": "WLBListenerHttp"}, with no success, resulting in the error noted. If I hardcode the Arn "ListenerArn": "arn::", with the full Arn, it works fine.
As it turns out, my syntax was perfectly fine. However, what I didn't realize is that while the WLBListenerHttp resource existed, it was not actually the same ARN as the one created by CloudFormation. Apparently, someone accidentally deleted it at some point without telling us and then manually recreated it. This left the account in a broken state where CloudFormation had an ARN recorded for the listener from when it was created, but it was truly no longer valid since the new resource had a new ARN.
The solution was to delete the offending resource manually, then change its logical name slightly in our CloudFormation templates so that CloudFormation would create a new one.
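For anyone who hits the same symptom, this kind of out-of-band change can often be surfaced up front with CloudFormation drift detection. A rough sketch using the AWS SDK for JavaScript v3 (TypeScript); the stack name, region, and polling interval are placeholders, not values from the post:
import {
  CloudFormationClient,
  DetectStackDriftCommand,
  DescribeStackDriftDetectionStatusCommand,
  DescribeStackResourceDriftsCommand,
} from "@aws-sdk/client-cloudformation";

const cfn = new CloudFormationClient({ region: "us-east-1" }); // placeholder region

async function listDriftedResources(stackName: string): Promise<void> {
  // Start a drift detection run for the whole stack.
  const { StackDriftDetectionId } = await cfn.send(
    new DetectStackDriftCommand({ StackName: stackName })
  );
  if (!StackDriftDetectionId) throw new Error("No drift detection id returned");

  // Poll until the detection run finishes.
  let status = "DETECTION_IN_PROGRESS";
  while (status === "DETECTION_IN_PROGRESS") {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const res = await cfn.send(
      new DescribeStackDriftDetectionStatusCommand({ StackDriftDetectionId })
    );
    status = res.DetectionStatus ?? "DETECTION_FAILED";
  }

  // Report resources whose live state no longer matches what CloudFormation recorded,
  // including resources deleted or recreated outside of CloudFormation.
  const drifts = await cfn.send(
    new DescribeStackResourceDriftsCommand({
      StackName: stackName,
      StackResourceDriftStatusFilters: ["MODIFIED", "DELETED"],
    })
  );
  for (const d of drifts.StackResourceDrifts ?? []) {
    console.log(`${d.LogicalResourceId}: ${d.StackResourceDriftStatus}`);
  }
}

listDriftedResources("my-alb-stack"); // hypothetical stack name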

CDK Unable to Add CodeStarNotification to CodePipeline

I use CDK to deploy a CodePipeline. It works fine until I try to add notifications for CodePipeline success/fail events. It then gives a CREATE_FAILED error with the message Resource handler returned message: "Invalid request provided: AWS::CodeStarNotifications::NotificationRule" (RequestToken: bb566fd0-1ac9-5d61-03fe-f9c27b4196fa, HandlerErrorCode: InvalidRequest). What could be the reason? Thanks.
import * as codepipeline from "@aws-cdk/aws-codepipeline";
import * as codepipeline_actions from "@aws-cdk/aws-codepipeline-actions";
import * as codestar_noti from "@aws-cdk/aws-codestarnotifications";
import * as sns from "@aws-cdk/aws-sns";
const pipeline = new codepipeline.Pipeline(...);
const topicArn = props.sns_arn_for_developer;
const targetTopic = sns.Topic.fromTopicArn(
this,
"sns-notification-topic",
topicArn
);
new codestar_noti.NotificationRule(this, "Notification", {
detailType: codestar_noti.DetailType.BASIC,
events: [
"codepipeline-pipeline-pipeline-execution-started",
"codepipeline-pipeline-pipeline-execution-failed",
"codepipeline-pipeline-pipeline-execution-succeeded",
"codepipeline-pipeline-pipeline-execution-canceled",
],
source: pipeline,
targets: [targetTopic],
});
Here is the snippet of the generated CloudFormation template.
"Notification2267453E": {
"Type": "AWS::CodeStarNotifications::NotificationRule",
"Properties": {
"DetailType": "BASIC",
"EventTypeIds": [
"codepipeline-pipeline-pipeline-execution-started",
"codepipeline-pipeline-pipeline-execution-failed",
"codepipeline-pipeline-pipeline-execution-succeeded",
"codepipeline-pipeline-pipeline-execution-canceled"
],
"Name": "sagemakerbringyourownNotification36194CEC",
"Resource": {
"Fn::Join": [
"",
[
"arn:",
{
"Ref": "AWS::Partition"
},
":codepipeline:ap-southeast-1:305326993135:",
{
"Ref": "sagemakerbringyourownpipeline0A8C43B1"
}
]
]
},
"Targets": [
{
"TargetAddress": "arn:aws:sns:ap-southeast-1:305326993135:whitespace_alerts",
"TargetType": "SNS"
}
]
},
"Metadata": {
"aws:cdk:path": "sagemaker-bring-your-own/Notification/Resource"
}
},
FWIW, I got the exact same error "Invalid request provided: AWS::CodeStarNotifications::NotificationRule" from a CDK app where the Topic was created (not imported). It turned out to be a transient issue, because it succeeded the second time without any changes. I suspect it was due to a very large ECR image which was built the first time as part of the deploy and which took quite some time. My speculation is that the Topic timed out and got into some kind of weird state waiting for the NotificationRule to be created.
This is because imported resources cannot be modified. As you pointed out in the comments, setting up the notification involves modifying the Topic resource, specifically its access policy.
Reference: https://docs.aws.amazon.com/cdk/v2/guide/resources.html#resources_importing
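If you control the topic, another way around this is to define the Topic in the same stack instead of importing it by ARN, so the CDK can manage the topic's resource policy itself. A minimal sketch, assuming CDK v1 to match the imports in the question, with pipeline being the pipeline from the snippet above and the topic name a placeholder:
import * as sns from "@aws-cdk/aws-sns";
import * as codestar_noti from "@aws-cdk/aws-codestarnotifications";

// Creating the Topic here (rather than Topic.fromTopicArn) lets the CDK attach the
// statement that allows codestar-notifications.amazonaws.com to publish to it.
const targetTopic = new sns.Topic(this, "PipelineNotificationTopic", {
  topicName: "pipeline-notifications", // hypothetical name
});

new codestar_noti.NotificationRule(this, "Notification", {
  detailType: codestar_noti.DetailType.BASIC,
  events: ["codepipeline-pipeline-pipeline-execution-failed"],
  source: pipeline,
  targets: [targetTopic],
});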
I was able to solve this by doing the following, in this order:
First, remove the statement below from the resource policy of the SNS topic.
Then deploy the stack (which, interestingly, doesn't add anything to the resource policy).
Once the stack deployment finishes, manually update the resource policy to add the statement back:
{
"Sid": "AWSCodeStarNotifications_publish",
"Effect": "Allow",
"Principal": {
"Service": "codestar-notifications.amazonaws.com"
},
"Action": "SNS:Publish",
"Resource": "arn:aws:sns:ap-south-1:xxxxxxxxx:test"
}

In CloudFormation, does "A DependsOn B" ensure that A is deleted before B?

We are using CloudFormation to set up a role and a policy for it. The policy is set to depend on the role using the "DependsOn" property like so:
Role definition:
"LambdaExecutionRole": {
"Type": "AWS::IAM::Role",
"Properties": {
[...]
Policy definition:
"lambdaexecutionpolicy": {
"DependsOn": [
"LambdaExecutionRole"
],
"Roles": [
{
"Ref": "LambdaExecutionRole"
}
],
[...]
From the official documentation, I understand that this DependsOn relation between the two entities should ensure that the policy is always deleted before the role.
Resource A is deleted before resource B.
However, we encounter an error where it appears that the system tries to delete the role before the policy:
Resource Name: [...] (AWS::IAM::Role)
Event Type: delete
Reason: Cannot delete entity, must delete policies first. (Service: AmazonIdentityManagement; Status Code: 409; Error Code: DeleteConflict; Request ID: [...]; Proxy: null)
I'm not sure how that's even possible, as I would have expected "A DependsOn B" to ensure that the system never tries to delete B before deleting A. Is my understanding wrong here? Can there be a situation where the system tries to delete B before A?
And yes, I understand that in this case the obvious solution is to use an inline policy, as the policy is only used for this specific role. But as this behavior seems to conflict with my intuitive understanding of the official documentation, I want to properly understand what the "DependsOn" property actually means.
TL;DR Unable to replicate the error. DependsOn does not seem to be the culprit.
I used the CDK to create two versions of a minimal test stack with only two resources, an AWS::IAM::Role and an AWS::IAM::ManagedPolicy. V1 had no explicit policy dependency set on the role. V2, like the OP, did.
Version 1 (CDK-generated default): no DependsOn in the template.
Version 2 (as in the OP): explicit dependency, the policy depends on the role. The CDK added one line to the template: "DependsOn": [ "TestRole6C9272DF" ] under "TestPolicyCC05E598".
The two versions differed only by that single DependsOn. Both versions deployed and were destroyed as expected, without error.
// resource section of CDK-generated Cloud Formation Template
"Resources": {
"TestRole6C9272DF": {
"Type": "AWS::IAM::Role",
"Properties": {
"AssumeRolePolicyDocument": {
"Statement": [
{
"Action": "sts:AssumeRole",
"Effect": "Allow",
"Principal": {
"Service": "lambda.amazonaws.com"
}
}
],
"Version": "2012-10-17"
}
},
"Metadata": {
"aws:cdk:path": "TsCdkPlaygroundIamDependencyStack/TestRole/Resource"
}
},
"TestPolicyCC05E598": {
"Type": "AWS::IAM::ManagedPolicy",
"Properties": {
"PolicyDocument": {
"Statement": [
{
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Effect": "Allow",
"Resource": "*"
}
],
"Version": "2012-10-17"
},
"Description": "",
"Path": "/",
"Roles": [
{
"Ref": "TestRole6C9272DF"
}
]
},
"DependsOn": [
"TestRole6C9272DF" // <-- The difference that makes no difference
],
"Metadata": {
"aws:cdk:path": "TsCdkPlaygroundIamDependencyStack/TestPolicy/Resource"
}
},
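For reference, a minimal sketch of the kind of two-resource test stack described above (CDK v2 assumed, construct names hypothetical); commenting the addDependency call in or out switches between the two versions:
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as iam from "aws-cdk-lib/aws-iam";

export class IamDependencyTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const role = new iam.Role(this, "TestRole", {
      assumedBy: new iam.ServicePrincipal("lambda.amazonaws.com"),
    });

    const policy = new iam.ManagedPolicy(this, "TestPolicy", {
      statements: [
        new iam.PolicyStatement({
          actions: ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
          resources: ["*"],
        }),
      ],
      roles: [role], // attaches the policy to the role, as in the template above
    });

    // Version 2 only: the explicit dependency that emits "DependsOn" in the template.
    policy.node.addDependency(role);
  }
}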

Cloudformation template properties documentation discrepancy

I'm creating my first CloudFormation template using an archived GitHub project from an AWS blog:
https://aws.amazon.com/blogs/devops/part-1-develop-deploy-and-manage-for-scale-with-elastic-beanstalk-and-cloudformation-series/
https://github.com/amazon-archives/amediamanager
The template amm-elasticbeanstalk.cfn.json declares an Elastic Beanstalk resource, outlined here:
"Resources": {
"Application": {
"Type": "AWS::ElasticBeanstalk::Application",
"Properties": {
"ConfigurationTemplates": [{...}],
"ApplicationVersions": [{...}]
}
}
}
From the documentation I'm under the impression that AWS::ElasticBeanstalk::ApplicationVersion and AWS::ElasticBeanstalk::ConfigurationTemplate must be defined as separate resources, yet the example I'm working from is using the same AWSTemplateFormatVersion as the documentation. Is this a "shorthand" where namespaces can be nested if they have the same parent (i.e. AWS::ElasticBeanstalk)? Is it documented somewhere?
In the same file AWS::ElasticBeanstalk::Environment is defined as a separate resource - is this just a stylistic choice, perhaps because the environment configuration is so long?
Elastic Beanstalk consists of two components: applications and environments. Each environment runs only one application version at a time, but you can run the same application version in many environments at the same time. Application versions and saved configurations are part of the application, which is why it's possible to define them within the AWS::ElasticBeanstalk::Application resource properties. An environment, however, is a separate logical component of Elastic Beanstalk, so it's impossible to declare it from within the Application resource.
For better readability I would suggest declaring all the resources separately, as in this example. With that approach you can also directly reference the TemplateName and VersionLabel in the AWS::ElasticBeanstalk::Environment resource.
Alternatively, if you want to stick to the GitHub example, you can adjust the above example to look like this:
{
"AWSTemplateFormatVersion": "2010-09-09",
"Resources": {
"sampleApplication": {
"Type": "AWS::ElasticBeanstalk::Application",
"Properties": {
"Description": "AWS Elastic Beanstalk Sample Application",
"ApplicationVersions": [{
"VersionLabel": "Initial Version",
"Description": "Initial Version",
"SourceBundle": {
"S3Bucket": {
"Fn::Sub": "elasticbeanstalk-samples-${AWS::Region}"
},
"S3Key": "php-newsample-app.zip"
}
}],
"ConfigurationTemplates": [{
"TemplateName": "DefaultConfiguration",
"Description": "AWS ElasticBeanstalk Sample Configuration Template",
"OptionSettings": [
{
"Namespace": "aws:autoscaling:asg",
"OptionName": "MinSize",
"Value": "2"
},
{
"Namespace": "aws:autoscaling:asg",
"OptionName": "MaxSize",
"Value": "6"
},
{
"Namespace": "aws:elasticbeanstalk:environment",
"OptionName": "EnvironmentType",
"Value": "LoadBalanced"
},
{
"Namespace": "aws:autoscaling:launchconfiguration",
"OptionName": "IamInstanceProfile",
"Value": {
"Ref": "MyInstanceProfile"
}
}
],
"SolutionStackName": "64bit Amazon Linux 2018.03 v2.9.11 running PHP 5.5"
}]
}
},
"sampleEnvironment": {
"Type": "AWS::ElasticBeanstalk::Environment",
"Properties": {
"ApplicationName": {
"Ref": "sampleApplication"
},
"Description": "AWS ElasticBeanstalk Sample Environment",
"TemplateName": "DefaultConfiguration",
"VersionLabel": "Initial Version"
}
},
"MyInstanceRole": {
"Type": "AWS::IAM::Role",
"Properties": {
"AssumeRolePolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"ec2.amazonaws.com"
]
},
"Action": [
"sts:AssumeRole"
]
}
]
},
"Description": "Beanstalk EC2 role",
"ManagedPolicyArns": [
"arn:aws:iam::aws:policy/AWSElasticBeanstalkWebTier",
"arn:aws:iam::aws:policy/AWSElasticBeanstalkMulticontainerDocker",
"arn:aws:iam::aws:policy/AWSElasticBeanstalkWorkerTier"
]
}
},
"MyInstanceProfile": {
"Type": "AWS::IAM::InstanceProfile",
"Properties": {
"Roles": [
{
"Ref": "MyInstanceRole"
}
]
}
}
}
}
Just want to point out that AWS silently phased out the option of having the ApplicationVersions key under an AWS::ElasticBeanstalk::Application's Properties. It was still working in July 2022, but I noticed it stopped some time in August 2022, giving the following error in the CloudFormation stack's Events tab:
Properties validation failed for resource TheEBAppResName with message: #: extraneous key [ApplicationVersions] is not permitted
where TheEBAppResName is the name of your AWS::ElasticBeanstalk::Application resource.
The only solution now is to follow the current AWS example and use a separate AWS::ElasticBeanstalk::ApplicationVersion resource.
Interestingly, I can't seem to find any documentation on the obsolete ApplicationVersions property anymore, and the AWS blog that you linked to is no longer available, but I did find it cached on the Wayback Machine. Even the earliest AWS doc on GitHub for AWS::ElasticBeanstalk::Application doesn't mention the ApplicationVersions property. It seems AWS silently deprecated it sometime between when the blog was posted in April 2014 and that earliest GitHub doc page in December 2017, but didn't actually remove the option until last month, August 2022.
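For illustration only, here is how the now-required separate-resource layout might look when sketched with CDK L1 constructs (TypeScript), which map one-to-one onto the CloudFormation resources discussed above; the application name, bucket, and key are placeholders:
import { App, Stack } from "aws-cdk-lib";
import * as eb from "aws-cdk-lib/aws-elasticbeanstalk";

const app = new App();
const stack = new Stack(app, "EbSeparateResourcesStack");

// Application and ApplicationVersion as two separate resources, since the
// ApplicationVersions key under the Application's Properties is no longer accepted.
const ebApp = new eb.CfnApplication(stack, "Application", {
  applicationName: "sample-app", // placeholder
});

const ebVersion = new eb.CfnApplicationVersion(stack, "AppVersion", {
  applicationName: "sample-app", // placeholder, must match the application above
  sourceBundle: {
    s3Bucket: `elasticbeanstalk-samples-${stack.region}`,
    s3Key: "php-newsample-app.zip", // placeholder key
  },
});

// Make sure the application exists before the version is created.
ebVersion.node.addDependency(ebApp);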

ElasticSearch not joining nodes in AWS Cluster

I am having issues forming a cluster on AWS using Elasticsearch:
Software:
ES: elasticsearch-1.4.1.zip
AWS-Cloud: elasticsearch-cloud-aws/2.4.1
Both are run on AWS EC2 micro instances (Ubuntu 64-bit). Both instances use the same security group with everything open, no restrictions at all.
I have created two instances in us-west Oregon (us-west-2b) and I am using this configuration file:
{
"cluster.name": "mycluster",
"http": {
"cors.enabled" : true,
"cors.allow-origin": "*"
},
"node.name": "LosAngeles-node",
"node.master": "false",
"cloud": {
"aws": {
"access_key": "xxxxxxxxxxxx",
"secret_key": "xxxxxxxxxxxxxxxxxxxx",
"region": "us-west"
}
},
"discovery": {
"type": "ec2",
"ec2" : {
"groups": "esallaccess"
},
"zen": {
"ping": {
"multicast": {
"enabled": "false"
}
}
}
}
}
The LosAngeles node should be a workhorse for the cluster, hence node.master = false.
When I start this node, it pings constantly and never stops. This is in the log after I start it:
...
[2014-11-28 15:18:30,593][TRACE][discovery.ec2 ] [LosAngeles-node] building dynamic
unicast discovery nodes...
[2014-11-28 15:18:30,593][DEBUG][discovery.ec2 ] [LosAngeles-node] using dynamic
discovery nodes []
[2014-11-28 15:18:32,170][TRACE][discovery.ec2 ] [LosAngeles-node] building dynamic
unicast discovery nodes...
[2014-11-28 15:18:32,170][DEBUG][discovery.ec2 ] [LosAngeles-node] using dynamic
discovery nodes []
[2014-11-28 15:18:32,170][TRACE][discovery.ec2 ] [LosAngeles-node] full ping responses:
{none}
[2014-11-28 15:18:32,170][DEBUG][discovery.ec2 ] [LosAngeles-node] filtered ping
responses: (filter_client[true], filter_data[false]) {none}
[2014-11-28 15:18:32,170][TRACE][discovery.ec2 ] [LosAngeles-node] starting to ping
...
I am thinking this is a problem with the region. Any help is appreciated.
PS: The master node (NewYork) has the same configuration file, with a different name and node.master = true.
Try adding the master node's address to the new node's configuration.
In elasticsearch.yml, verify the following parameters:
cluster.name: your-cluster-name
node.master: false
node.data: false
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["your-master.dns.domain.com"]
If you use multicast, disable it. It doesn't work in AWS EC2.
In any case, check your security group.
Your instances need to be able to retrieve information about each other so that your nodes can discover the available cluster and join it.
The cloud-aws plugin automatically handles joining a node to the cluster once a master is nominated.
Setting the discovery permissions in a policy and applying it to your IAM role should fix this. Here is the policy I used:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "whatever",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeInstances",
"ec2:DescribeRegions",
"ec2:DescribeSecurityGroups",
"ec2:DescribeTags"
],
"Resource": [
"*"
]
}
]
}