AWS EKS NodeGroup "Create failed": Instances failed to join the kubernetes cluster - amazon-web-services

I am able to create an EKS cluster, but when I try to add node groups, I receive a "Create failed" error with the details:
"NodeCreationFailure": Instances failed to join the kubernetes cluster
I tried a variety of instance types and larger volume sizes (60 GB) without luck.
Looking at the EC2 instances, I only see the problem below. However, it is difficult to do anything about it, since I'm not launching the EC2 instances directly (the EKS NodeGroup UI wizard does that).
How would one move forward, given that the failure happens before I can even jump onto the EC2 machines and "fix" them?
Amazon Linux 2
Kernel 4.14.198-152.320.amzn2.x86_64 on an x86_64
ip-187-187-187-175 login: [   54.474668] cloud-init[3182]: One of the configured repositories failed (Unknown),
[   54.475887] cloud-init[3182]: and yum doesn't have enough cached data to continue. At this point the only
[   54.478096] cloud-init[3182]: safe thing yum can do is fail. There are a few ways to work "fix" this:
[   54.480183] cloud-init[3182]: 1. Contact the upstream for the repository and get them to fix the problem.
[   54.483514] cloud-init[3182]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working
[   54.485198] cloud-init[3182]: upstream. This is most often useful if you are using a newer
[   54.486906] cloud-init[3182]: distribution release than is supported by the repository (and the
[   54.488316] cloud-init[3182]: packages for the previous distribution release still work).
[   54.489660] cloud-init[3182]: 3. Run the command with the repository temporarily disabled
[   54.491045] cloud-init[3182]: yum --disablerepo=<repoid> ...
[   54.491285] cloud-init[3182]: 4. Disable the repository permanently, so yum won't use it by default. Yum
[   54.493407] cloud-init[3182]: will then just ignore the repository until you permanently enable it
[   54.495740] cloud-init[3182]: again or use --enablerepo for temporary usage:
[   54.495996] cloud-init[3182]: yum-config-manager --disable <repoid>

Adding another reason to the list:
In my case the nodes were running in private subnets, and I hadn't configured a private endpoint under API server endpoint access.
After updating the cluster, the node groups weren't updated automatically, so I had to recreate them.
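For reference, a minimal sketch of enabling private endpoint access with the AWS CLI (the cluster name is a placeholder):
# Enable private API endpoint access on an existing cluster
aws eks update-cluster-config \
  --name my-cluster \
  --resources-vpc-config endpointPrivateAccess=true,endpointPublicAccess=true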

In my case, the problem was that I was deploying my node group in a private subnet, but this private subnet had no NAT gateway associated, hence no internet access. What I did was:
1. Create a NAT gateway.
2. Create a new route table with the following routes (the second one is the internet access route, through the NAT gateway):
Destination: VPC-CIDR-block Target: local
Destination: 0.0.0.0/0 Target: NAT-gateway-id
3. Associate the private subnet with the route table created in the second step.
After that, the node groups joined the cluster without problems.
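For anyone scripting this, here is a rough AWS CLI sketch of the same three steps (all IDs are placeholders):
# 1. Create a NAT gateway in a PUBLIC subnet, using an allocated Elastic IP
aws ec2 create-nat-gateway --subnet-id subnet-PUBLIC --allocation-id eipalloc-EXAMPLE

# 2. Create a route table and add the default route through the NAT gateway
#    (the local VPC route is added automatically)
aws ec2 create-route-table --vpc-id vpc-EXAMPLE
aws ec2 create-route --route-table-id rtb-EXAMPLE \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-EXAMPLE

# 3. Associate the PRIVATE subnet with the new route table
aws ec2 associate-route-table --route-table-id rtb-EXAMPLE --subnet-id subnet-PRIVATE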

I noticed there was no answer here, but about 2k visits to this question over the last six months. There seem to be a number of reasons why you could be seeing these failures. To regurgitate the AWS documentation found here:
https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html
- The aws-auth-cm.yaml file does not have the correct IAM role ARN for your nodes. Ensure that the node IAM role ARN (not the instance profile ARN) is specified in your aws-auth-cm.yaml file. For more information, see Launching self-managed Amazon Linux nodes.
- The ClusterName in your node AWS CloudFormation template does not exactly match the name of the cluster you want your nodes to join. Passing an incorrect value to this field results in an incorrect configuration of the node's /var/lib/kubelet/kubeconfig file, and the nodes will not join the cluster.
- The node is not tagged as being owned by the cluster. Your nodes must have the following tag applied to them, where <cluster-name> is replaced with the name of your cluster: Key kubernetes.io/cluster/<cluster-name>, Value owned.
- The nodes may not be able to access the cluster using a public IP address. Ensure that nodes deployed in public subnets are assigned a public IP address. If not, you can associate an Elastic IP address with a node after it's launched. For more information, see Associating an Elastic IP address with a running instance or network interface. If the public subnet is not set to automatically assign public IP addresses to instances deployed to it, then we recommend enabling that setting. For more information, see Modifying the public IPv4 addressing attribute for your subnet. If the node is deployed to a private subnet, then the subnet must have a route to a NAT gateway that has a public IP address assigned to it.
- The STS endpoint for the Region that you're deploying the nodes to is not enabled for your account. To enable the Region, see Activating and deactivating AWS STS in an AWS Region.
- The worker node does not have a private DNS entry, resulting in the kubelet log containing a node "" not found error. Ensure that the VPC where the worker node is created has values set for domain-name and domain-name-servers as Options in a DHCP options set. The default values are domain-name:<region>.compute.internal and domain-name-servers:AmazonProvidedDNS. For more information, see DHCP options sets in the Amazon VPC User Guide.
I myself had an issue with the tagging where I needed an uppercase letter. In reality, if you can use another avenue to deploy your EKS cluster I would recommend it (eksctl, aws cli, terraform even).
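If it helps, here is a rough sketch of CLI commands to verify a few of the items above (all names and IDs are placeholders):
# Check the node IAM role ARN recorded in the aws-auth ConfigMap
kubectl -n kube-system get configmap aws-auth -o yaml

# Verify the cluster-ownership tag on a worker instance
aws ec2 describe-tags --filters \
  "Name=resource-id,Values=i-0123456789abcdef0" \
  "Name=key,Values=kubernetes.io/cluster/my-cluster"

# Inspect the VPC's DHCP options set (domain-name / domain-name-servers)
aws ec2 describe-dhcp-options --dhcp-options-ids dopt-0123456789abcdef0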

I will try to keep the answer short by highlighting a few things that commonly go wrong up front.
1. Add the IAM role that is attached to the EKS worker nodes to the aws-auth ConfigMap in the kube-system namespace. Ref
2. Log in to the worker node that was created but failed to join the cluster, and try connecting to the API server from inside using nc. E.g.: nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443
3. If you are not using the EKS node from the drop-down in the AWS Console (which means you are using a launch template or launch configuration in EC2), don't forget to add the user data section to the launch template. Ref
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh ${ClusterName} ${BootstrapArguments}
4. Check the EKS worker node IAM role policy and see that it has the appropriate permissions added. AmazonEKS_CNI_Policy is a must. (A quick check is sketched after the references below.)
5. Your nodes must have the following tag applied to them, where cluster-name is replaced with the name of your cluster:
kubernetes.io/cluster/cluster-name: owned
I hope your problem lies within this list.
Ref: https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html
https://aws.amazon.com/premiumsupport/knowledge-center/resolve-eks-node-failures/
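For point 4, a hedged way to check what is attached to the node role (the role name is a placeholder):
# List managed policies attached to the worker node role; for managed node
# groups you would typically expect AmazonEKSWorkerNodePolicy,
# AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly
aws iam list-attached-role-policies --role-name my-eks-node-role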

Firstly, I had the NAT gateway in my private subnet. Then I moved the NAT gateway to a public subnet, which worked fine (a NAT gateway has to live in a public subnet so it can reach the internet gateway).
The Terraform code is as follows:
resource "aws_internet_gateway" "gw" {
vpc_id = aws_vpc.dev-vpc.id
tags = {
Name = "dev-IG"
}
}
resource "aws_eip" "lb" {
depends_on = [aws_internet_gateway.gw]
vpc = true
}
resource "aws_nat_gateway" "natgw" {
allocation_id = aws_eip.lb.id
subnet_id = aws_subnet.dev-public-subnet.id
depends_on = [aws_internet_gateway.gw]
tags = {
Name = "gw NAT"
}
}

Try adding a tag to your private subnets where the worker nodes are deployed.
kubernetes.io/cluster/<cluster_name> = shared
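For example, with the AWS CLI (subnet ID and cluster name are placeholders):
# Tag the private subnet so the cluster can discover and use it
aws ec2 create-tags --resources subnet-0123456789abcdef0 \
  --tags "Key=kubernetes.io/cluster/my-cluster,Value=shared"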

Check what connectivity type your NAT gateway is configured with. It should be public, but in my case I had configured it as private.
Once I changed it from private to public, the issue was resolved.
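A quick way to check this from the CLI (the ConnectivityType field exists since private NAT gateways were introduced):
# Show each NAT gateway's connectivity type; it must be "public"
# for worker nodes that need internet access
aws ec2 describe-nat-gateways \
  --query 'NatGateways[].[NatGatewayId,ConnectivityType]' --output table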

The Auto Scaling group logs showed that we hit a quota limit:
Launching a new EC2 instance. Status Reason: You've reached your quota for maximum Fleet Requests for this account. Launching EC2 instance failed.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/fleet-quotas.html
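If you want to inspect the quota from the CLI, a rough sketch (quota names vary, so this simply filters for fleet-related entries):
# List EC2 service quotas and filter for fleet-related ones
aws service-quotas list-service-quotas --service-code ec2 \
  --query 'Quotas[].[QuotaName,Value]' --output text | grep -i fleet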

I had a similar issue, and none of the provided solutions worked. After some investigation, I ran:
journalctl -f -u kubelet
and the log contained:
Error: failed to run Kubelet: running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false
So naturally, the solution is to disable swap with:
swapoff -a
And then it worked fine; the node was registered, and the output was fine when checked with journalctl and systemctl status kubelet.
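To keep swap disabled across reboots as well, a common sketch (double-check your /etc/fstab before and after editing):
# Turn swap off now
sudo swapoff -a
# Comment out swap entries in fstab so swap stays off after a reboot
sudo sed -ri '/\sswap\s/s/^#?/#/' /etc/fstab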

The main problem here is the networking: public vs. private subnets. Check that your private subnets' route tables point to a NAT gateway; if not, add that route. Also check that the public subnets are attached to an internet gateway.

Related

Why is communication from GKE to a private ip in GCP not working?

I have what I think is a reasonably straightforward setup in Google Cloud - A GKE cluster, a Cloud SQL instance, and a "Click-To-Deploy" Kafka VM instance.
All of the resources are in the same VPC, with firewall rules to allow all traffic to the internal VPC CIDR blocks.
The pods in the GKE cluster have no problem accessing the Cloud SQL instance via its private IP address. But they can't seem to access the Kafka instance via its private IP address:
# kafkacat -L -b 10.1.100.2
% ERROR: Failed to acquire metadata: Local: Broker transport failure
I've launched another VM manually into the VPC, and it has no problem connecting to the Kafka instance:
# kafkacat -L -b 10.1.100.2
Metadata for all topics (from broker -1: 10.1.100.2:9092/bootstrap):
1 brokers:
broker 0 at ....us-east1-b.c.....internal:9092
1 topics:
topic "notifications" with 1 partitions:
partition 0, leader 0, replicas: 0, isrs: 0
I can't seem to see any real difference in the networking between the containers in GKE and the manually launched VM, especially since both can access the Cloud SQL instance at 10.10.0.3.
Where do I go looking for what's blocking the connection?
I have seen that the error is related to the network.
If you are using GKE on the same VPC network, make sure the Internal Load Balancer is configured properly. Note that this feature is in beta, which means it is not yet guaranteed to work as expected. Another suggestion: ensure you are not using any policy that might block the connection. I found an article in the community that may help you solve it.
This gave me what I needed: https://serverfault.com/a/924317
The networking rules in GCP still seem wonky to me coming from a long time working with AWS. I had rules that allowed anything in the VPC CIDR blocks to contact anything else in those same CIDR blocks, but that wasn't enough. Explicitly adding the worker nodes subnet as a source for a new rule opened it up.
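The rule that fixed it looked roughly like this (network name, source range, and port are placeholders for my setup):
# Allow the GKE node subnet to reach the Kafka broker port
gcloud compute firewall-rules create allow-gke-nodes-to-kafka \
  --network my-vpc --direction INGRESS \
  --source-ranges 10.2.0.0/16 --allow tcp:9092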

Not able to update EC2 Linux instance with command 'sudo yum update'

When I try to update EC2 Amazon Linux instance, I get following error:
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Could not retrieve mirrorlist http://amazonlinux.ap-south-1.amazonaws.com/2/core/latest/x86_64/mirror.list error was
12: Timeout on http://amazonlinux.ap-south-1.amazonaws.com/2/core/latest/x86_64/mirror.list: (28, 'Connection timed out after 5000 milliseconds')
One of the configured repositories failed (Unknown),
and yum doesn't have enough cached data to continue. At this point the only
safe thing yum can do is fail. There are a few ways to work "fix" this:
1. Contact the upstream for the repository and get them to fix the problem.
2. Reconfigure the baseurl/etc. for the repository, to point to a working
upstream. This is most often useful if you are using a newer
distribution release than is supported by the repository (and the
packages for the previous distribution release still work).
3. Run the command with the repository temporarily disabled
yum --disablerepo=<repoid> ...
4. Disable the repository permanently, so yum won't use it by default. Yum
will then just ignore the repository until you permanently enable it
again or use --enablerepo for temporary usage:
yum-config-manager --disable <repoid>
or
subscription-manager repos --disable=<repoid>
5. Configure the failing repository to be skipped, if it is unavailable.
Note that yum will try to contact the repo. when it runs most commands,
so will have to try and fail each time (and thus. yum will be be much
slower). If it is a very temporary problem though, this is often a nice
compromise:
yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Cannot find a valid baseurl for repo: amzn2-core/2/x86_64
Any help would be much appreciated.
Your instance does not have access to the internet.
You can resolve this in the following ways:
1. If your instance is running in a public subnet, make sure it has a public IP attached. Also check that the route table for the public subnet is associated with this subnet and has a route 0.0.0.0/0 pointing to an internet gateway.
2. If you are running your instance in a private subnet, make sure you have created a NAT gateway in a public subnet. Check that the route table has a route 0.0.0.0/0 pointing to the NAT gateway and that the subnet is associated with the private route table.
3. Check whether the security group associated with the instance has outbound traffic enabled.
A CLI sketch for checking the subnet's routes is below.
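A rough sketch (the subnet ID is a placeholder):
# Show the routes for the subnet's route table; look for a 0.0.0.0/0 route
# whose target is an igw- (public subnet) or nat- (private subnet) resource
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-0123456789abcdef0" \
  --query 'RouteTables[].Routes'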
You are probably in a private subnet (i.e. a subnet without a 0.0.0.0/0 route to the outside world).
If you want to connect to the outside world, you need a NAT gateway in a public subnet, which in turn has a route to an internet gateway:
EC2 -> NAT -> IGW
This is the best AWS troubleshooting page I've found (early 2021).
If you don't want to connect to the outside world, you need a VPC endpoint, which allows connectivity to specific AWS services from a private subnet. I have never got this to work.
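For what it's worth, creating a gateway endpoint for S3 looks roughly like this (IDs and region are placeholders); the Amazon Linux repositories are backed by S3, so in principle this can be enough for yum without a NAT gateway:
# Create an S3 gateway endpoint and attach it to the private subnet's route table
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0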
Verify that the security group attached to the instance is allowing all inbound and outbound connections.
I don't know what specific network protocol is needed for these updates, but public SSH, HTTP, and HTTPS weren't enough for me. So I simply allowed all traffic for a brief time to run the updates.
(For the record: the Amazon Linux repositories are served over HTTP/HTTPS, and security groups are stateful, so allowing outbound HTTP/HTTPS should be sufficient; if that alone doesn't work, the problem is more likely routing than ports.)
If you have an S3 endpoint on your subnet's route table, a restrictive endpoint policy can cause yum to fail. To fix this, try adding the following statement to the S3 endpoint policy:
{
  "Statement": [
    {
      "Sid": "Amazon Linux AMI Repository Access",
      "Principal": "*",
      "Action": [
        "s3:GetObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::packages.*.amazonaws.com/*",
        "arn:aws:s3:::repo.*.amazonaws.com/*"
      ]
    }
  ]
}
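To apply it, you can save the document as policy.json and update the endpoint (the endpoint ID is a placeholder):
# Attach the policy to the existing S3 VPC endpoint
aws ec2 modify-vpc-endpoint --vpc-endpoint-id vpce-0123456789abcdef0 \
  --policy-document file://policy.json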

GCE/GKE NAT gateway route kills ssh connection

I'm trying to set up a NAT gateway for Kubernetes nodes on GKE/GCE.
I followed the instructions in the tutorial (https://cloud.google.com/vpc/docs/special-configurations, chapter "Configure an instance as a NAT gateway") and also tried the tutorial with Terraform (https://github.com/GoogleCloudPlatform/terraform-google-nat-gateway).
But with both tutorials (even in newly created Google projects) I get the same two errors:
1. The NAT isn't working at all; traffic is still going out via the nodes.
2. I can't SSH into my GKE nodes -> timeout. I already tried setting up a rule with priority 100 that allows all tcp:22 traffic.
As soon as I tag the GKE node instances so that the configured route applies to them, the SSH connection is no longer possible.
You've already found the solution to the first problem: tag the nodes with the correct tag, or manually create a route targeting the instance group that is managing your GKE nodes.
Regarding the SSH issue:
This is answered under "Caveats" in the README for the NAT Gateway for GKE example in the terraform tutorial repo you linked (reproduced here to comply with StackOverflow rules).
The web console mentioned below uses the same ssh mechanism as kubectl exec internally. The short version is that as of time of posting it's not possible to both route all egress traffic through a NAT gateway and use kubectl exec to interact with pods running on a cluster.
Update # 2018-09-25:
There is a workaround available if you only need to route specific traffic through the NAT gateway, for example, if you have a third party whose service requires whitelisting your IP address in their firewall.
Note that this workaround requires strong alerting and monitoring on your part as things will break if your vendor's public IP changes.
If you specify a strict destination IP range when creating your Route in GCP then only traffic bound for those addresses will be routed through the NAT Gateway. In our case we have several routes defined in our VPC network routing table, one for each of our vendor's public IP addresses.
In this case the various kubectl commands including exec and logs will continue to work as expected.
A potential workaround is to use the command in the snippet below to connect to a node and use docker exec on the node to enter a container. This of course means you will need to first locate the node your pod is running on before jumping through the gateway onto the node and running docker exec.
Caveats
The web console SSH will no longer work, you have to jump through the NAT gateway machine to SSH into a GKE node:
eval $(ssh-agent $SHELL)
ssh-add ~/.ssh/google_compute_engine
CLUSTER_NAME=dev
REGION=us-central1
gcloud compute ssh $(gcloud compute instances list --filter=name~nat-gateway-${REGION} --uri) \
  --ssh-flag="-A" -- \
  ssh $(gcloud compute instances list --filter=name~gke-${CLUSTER_NAME}- --limit=1 --format='value(name)') \
  -o StrictHostKeyChecking=no
Source: https://github.com/GoogleCloudPlatform/terraform-google-nat-gateway/tree/master/examples/gke-nat-gateway
You can use kubeip in order to assign IP addresses
https://blog.doit-intl.com/kubeip-automatically-assign-external-static-ips-to-your-gke-nodes-for-easier-whitelisting-without-2068eb9c14cd

Unable to validate Kubernetes cluster using Kops

I am new to Kubernetes. I am using Kops to deploy my Kubernetes application on AWS. I have already registered my domain on AWS and also created a hosted zone and attached it to my default VPC.
Creating my Kubernetes cluster through kops succeeds. However, when I try to validate my cluster using kops validate cluster, it fails with the following error:
unable to resolve Kubernetes cluster API URL dns: lookup api.ucla.dt-api-k8s.com on 149.142.35.46:53: no such host
I have tried debugging this error but failed. Can you please help me out? I am very frustrated now.
From what you describe, you created a Private Hosted Zone in Route 53. The validation is probably failing because Kops is trying to access the cluster API from your machine, which is outside the VPC, but private hosted zones only respond to requests coming from within the VPC. Specifically, the hostname api.ucla.dt-api-k8s.com is where the Kubernetes API lives, and is the means by which you can communicate and issue commands to the cluster from your computer. Private Hosted Zones wouldn't allow you to access this API from the outside world (your computer).
A way to resolve this is to make your hosted zone public. Kops will automatically create a VPC for you (unless configured otherwise), but you can still access the API from your computer.
I encountered this last night using a kops-based cluster creation script that had worked previously. I thought maybe switching regions would help, but it didn't. This morning it is working again. This feels like an intermittency on the AWS side.
So the answer I'm suggesting is:
When this happens, you may need to give it a few hours to resolve itself. In my case, I rebuilt the cluster from scratch after waiting overnight. I don't know whether or not it was necessary to start from scratch -- I hope not.
This is all I had to run:
kops export kubecfg (cluster name) --admin
This imports the "new" kubeconfig needed to access the kops cluster.
I came across this problem with an Ubuntu box. What I did was add the DNS record from the Route 53 hosted zone to /etc/hosts.
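Something along these lines (the IP is a placeholder for the API server's address; the hostname comes from the question above):
# Map the cluster API hostname to its IP locally so lookups succeed
echo "203.0.113.10 api.ucla.dt-api-k8s.com" | sudo tee -a /etc/hosts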
Here is how I resolved the issue:
It looks like there is a bug in the kops library: even though it shows
**Validation failed: unexpected error during validation: unable to resolve Kubernetes cluster API URL dns: lookup api **
when you try kops validate cluster after waiting 10-15 minutes, behind the scenes the Kubernetes cluster is up! You can verify this by SSHing into the master node of your Kubernetes cluster as follows:
1. Go to the EC2 console page where your k8s instances are running.
2. Copy the "Public IPv4 address" of your master k8s node.
3. From a command prompt, log in to the master node:
ssh ubuntu@<"Public IPv4 address" of your master k8s node>
4. Verify that you can see all nodes of the k8s cluster with the command below; it should list your master node and worker nodes:
kubectl get nodes

Can a bastion be assigned a specific AWS Elastic IP with Terraform?

We need to whitelist some Elastic IPs from the corporate firewall as allowed destination IPs for SSH. Is there a way to configure a bastion instance with Terraform and assign it a specific Elastic IP? And, likewise, have it return that EIP to the provisioned pool when the bastion is destroyed? Obviously, we don't want EIPs to be deallocated from our AWS account.
Existing answer is outdated. Associating existing Elastic IPs is now possible thanks to this change: https://github.com/hashicorp/terraform/pull/5236
Docs: https://www.terraform.io/docs/providers/aws/r/eip_association.html
Excerpt:
aws_eip_association: Provides an AWS EIP Association as a top level resource, to associate and disassociate Elastic IPs from AWS Instances and Network Interfaces.
NOTE: aws_eip_association is useful in scenarios where EIPs are either pre-existing or distributed to customers or users and therefore cannot be changed.
Currently Terraform only supports attaching Elastic IPs to EC2 instances upon EIP creation when you can choose to optionally attach it to an instance or an Elastic Network Interface. NAT Gateways currently allow you to associate an EIP with it upon the NAT Gateway being created but that's a slightly special case.
The instance module itself only allows a boolean choice of whether the instance gets a normal public IP address or not. There's a GitHub issue around allowing instances to be associated with pre-existing EIPs but at the time of writing no pull request to support it.
If it's simply a case of wanting to open up a port on your corporate firewall once and not having to touch it for a bastion box that is torn down regularly and you're open to allowing Terraform to create and manage the EIP for you then you could do something like the following:
resource "aws_instance" "bastion" {
ami = "ami-abcdef12"
instance_type = "t2.micro"
tags {
Name = "bastion"
}
}
output "bastion_id" {
value = "${aws_instance.bastion.id}"
}
And in a separate folder altogether you could have your EIP definition and also lookup the outputted instance ID from a remote state file for the bastion host and use that when applying the EIP:
resource "terraform_remote_state" "remote_state" {
backend = "s3"
config {
bucket = "mybucketname"
key = "name_of_key_file"
}
}
resource "aws_eip" "bastion_eip" {
vpc = true
instance = "${terraform_remote_state.remote_state.output.bastion_id}"
lifecycle {
prevent_destroy = true
}
}
In the above example I've used @BMW's approach so that you should get an error in any plan that attempts to destroy the EIP, just as a fail-safe.
This at least should allow you to use Terraform to build and destroy short lived instances but apply the same EIP to the instance each time so you don't have to change anything on your firewall.
A slightly simpler approach using just Terraform would be to put the EIP definition in the same .tf file/folder as the bastion instance but you would be unable to use Terraform to destroy anything in that folder (including the bastion instance itself) if you kept the lifecycle configuration block as it simply causes an error during the plan. Removing the block simply gets you back to destroying the EIP everytime you destroy the instance.
I spent some time working through this problem and found the other answers helpful, but incomplete.
For those people trying to reallocate an AWS elastic IP using Terraform, we can do so using a combination of terraform_remote_state and the aws_eip_association. Let me explain.
We should use two separate root modules, themselves within a parent folder:
parent_folder
├--elasticip
| └main.tf
└--server
└main.tf
In elasticip/main.tf you can use the following code which will create an elastic IP, and store the state in a local backend so that you can access its output from the server module. The output variable name cannot be 'id', as this will clash with the remote state variable id and it will not work. Just use a different name, such as eip_id.
terraform {
  backend "local" {
    path = "../terraform-eip.tfstate"
  }
}

resource "aws_eip" "main" {
  vpc = true

  lifecycle {
    prevent_destroy = true
  }
}

output "eip_id" {
  value = "${aws_eip.main.id}"
}
Then in server/main.tf the following code will create a server and associate the elastic IP with it.
data "terraform_remote_state" "eip" {
backend = "local"
config = {
path = "../terraform-eip.tfstate"
}
}
resource "aws_eip_association" "eip_assoc" {
instance_id = "${aws_instance.web.id}"
allocation_id = "${data.terraform_remote_state.eip.eip_id}"
#For >= 0.12
#allocation_id = "${data.terraform_remote_state.eip.outputs.eip_id}"
}
resource "aws_instance" "web" {
ami = "insert-your-AMI-ref"
}
With that all set up, you can go into the elasticip folder, run terraform init, and terraform apply to get your elastic IP. Then go into the server folder, and run the same two commands to get your server with its associated elastic IP. From within the server folder you can run terraform destroy and terraform apply and the new server will get the same elastic IP.
we don't want EIPs to be deallocated from our AWS account.
Yes, you can prevent that: set prevent_destroy to true.
resource "aws_eip" "bastion_eip" {
  count = "${var.num_bastion}"

  lifecycle {
    prevent_destroy = true
  }
}
Regarding EIP assignment, please refer to @ydaetskcoR's reply.
If you're using an autoscaling group you can do it in the user data. https://forums.aws.amazon.com/thread.jspa?threadID=52601
#!/bin/bash
# configure AWS
aws configure set aws_access_key_id {MY_ACCESS_KEY}
aws configure set aws_secret_access_key {MY_SECRET_KEY}
aws configure set region {MY_REGION}
# associate Elastic IP
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
ALLOCATION_ID={MY_EIP_ALLOC_ID}
aws ec2 associate-address --instance-id $INSTANCE_ID --allocation-id $ALLOCATION_ID --allow-reassociation