EKS: Unhealthy nodes in the kubernetes cluster - amazon-web-services

I'm getting an error when using Terraform to provision a node group on AWS EKS:
Error: error waiting for EKS Node Group (xxx) creation: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster.
I went to the console and inspected a node. There is a message: “runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker network plugin is not ready: cni config uninitialized”.
I have 5 private subnets that connect to the Internet via a NAT gateway.
Can someone give me a hint on how to debug this?
Here are some details on my env.
Kubernetes version: 1.18
Platform version: eks.3
AMI type: AL2_x86_64
AMI release version: 1.18.9-20201211
Instance types: m5.xlarge
There are three workloads set up in the cluster:
coredns STATUS (2 Desired, 0 Available, 0 Ready)
aws-node STATUS (5 Desired, 5 Scheduled, 0 Available, 0 Ready)
kube-proxy STATUS (5 Desired, 5 Scheduled, 5 Available, 5 Ready)
Going inside coredns, both pods are in Pending state, and the conditions show “Available=False, Deployment does not have minimum availability” and “Progressing=False, ReplicaSet xxx has timed out progressing”.
Going inside one of the aws-node pods, the status shows “Waiting - CrashLoopBackOff”.
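A crashing aws-node pod is usually the place to start debugging. A minimal inspection sketch, assuming the standard VPC CNI labels; the pod name aws-node-xxxxx is a placeholder, substitute one from your cluster:

```shell
# List the aws-node (VPC CNI) pods and note which ones are crashing
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# Describe a crashing pod and read the events at the bottom
kubectl describe pod -n kube-system aws-node-xxxxx

# Logs from the current and the previous (crashed) run
kubectl logs -n kube-system aws-node-xxxxx
kubectl logs -n kube-system aws-node-xxxxx --previous
```

Until aws-node is healthy on a node, the kubelet keeps reporting "cni config uninitialized" and the node stays NotReady, which is what the Terraform NodeCreationFailure surfaces.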

Add a pod network add-on:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml
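Note that on EKS the CNI is normally the Amazon VPC CNI, shipped as the aws-node DaemonSet, so before swapping in flannel it may be worth checking whether that DaemonSet is healthy; a sketch:

```shell
# Check the managed VPC CNI DaemonSet and its image version
kubectl get daemonset aws-node -n kube-system
kubectl describe daemonset aws-node -n kube-system | grep Image

# A restart sometimes clears a stuck "cni config uninitialized" state
kubectl rollout restart daemonset aws-node -n kube-system
```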

Related

EKS cluster upgrade fail with Kubelet version of Fargate pods must be updated to match cluster version

I have an EKS cluster v1.23 with Fargate nodes. The cluster and nodes are both on v1.23.x:
$ kubectl version --short
Server Version: v1.23.14-eks-ffeb93d
Fargate nodes are also in v1.23.14
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fargate-ip-x-x-x-x.region.compute.internal Ready <none> 7m30s v1.23.14-eks-a1bebd3
fargate-ip-x-x-x-xx.region.compute.internal Ready <none> 7m11s v1.23.14-eks-a1bebd3
When I tried to upgrade the cluster to 1.24 from the AWS console, it gave this error:
Kubelet version of Fargate pods must be updated to match cluster version 1.23 before updating cluster version; Please recycle all offending pod replicas
What are the other things I have to check?
From your question you only have 2 nodes, so likely you are running only CoreDNS. Try kubectl scale deployment coredns --namespace kube-system --replicas 0, then upgrade. You can scale it back to 2 once the control-plane upgrade has completed. Also, make sure you have selected the correct cluster in the console.
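The suggested sequence, as a sketch; the cluster name is a placeholder, and the upgrade can equally be started from the console:

```shell
# 1. Scale CoreDNS to zero so no Fargate pod is left on the old kubelet
kubectl scale deployment coredns --namespace kube-system --replicas 0

# 2. Upgrade the control plane
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.24

# 3. Scale CoreDNS back once the upgrade has finished
kubectl scale deployment coredns --namespace kube-system --replicas 2
```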

EKS Connector Pods stuck in Init:CrashLoopBackOff

I have a single-node Kubernetes cluster set up on AWS. I am currently running a VPC with one public and one private subnet.
The master node is in the public subnet and the worker node is in the private subnet.
On the AWS console I can successfully register a cluster and download the connector manifest. I then apply the manifest on my master node, but unfortunately the pods don't start. Below is what I observed:
kubectl get pods
NAME               READY      STATUS              RESTARTS            AGE
eks-connector-0   0/2  Init:CrashLoopBackOff   7 (4m36s ago)       19m
kubectl logs eks-connector-0
Defaulted container "connector-agent" out of: connector-agent, connector-proxy, connector-init (init)
Error from server (BadRequest): container "connector-agent" in pod "eks-connector-0" is waiting to start: PodInitializing
The pods are failing to start with the above logged errors.
I would suggest providing the output of kubectl get pod eks-connector-0 -o yaml and kubectl logs -p eks-connector-0.
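Since the pod is stuck in Init:CrashLoopBackOff, the init container's logs are usually the most informative. A sketch of what to collect, using the container names shown in the "Defaulted container" message above:

```shell
# Full pod spec and status, including init container states
kubectl get pod eks-connector-0 -o yaml

# Logs of the failing init container (plain `kubectl logs` defaults to
# connector-agent, which never started, so select connector-init explicitly)
kubectl logs eks-connector-0 -c connector-init

# Previous run of the init container, if it keeps restarting
kubectl logs -p eks-connector-0 -c connector-init
```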

network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container issue in EKS. Help, anybody, please.

When pods are scaled up through HPA, the following error occurs and pod creation is not possible.
If I manually change the replicas of the deployment, the pods run normally.
It seems to be a CNI-related problem; the same phenomenon occurs even if I install CNI 1.7.10 for the 1.20 cluster with the add-on.
200 IPs per subnet is sufficient, and the outbound security group is also open.
The issue does not occur when the number of pods is scaled via kubectl.
7s Warning FailedCreatePodSandBox pod/b4c-ms-test-develop-5f64db58f-bm2vc Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1" network for pod "b4c-ms-test-develop-5f64db58f-bm2vc": networkPlugin cni failed to set up pod "b4c-ms-test-develop-5f64db58f-bm2vc_b4c-test" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1"
Region: eu-west-1
Cluster Name: dev-pangaia-b4c-eks
For AWS VPC CNI issue, have you attached node logs?: No
For DNS issue, have you attached CoreDNS pod log?:
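For sandbox errors like this, the VPC CNI logs on the affected node are usually the first thing to collect, since a scale-up burst can exhaust the node's ENI/IP warm pool. A sketch; node and pod names are placeholders:

```shell
# Find the aws-node pod running on the node that failed to create the sandbox
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# Its logs often show whether the node ran out of ENIs/IPs during scale-up
kubectl logs -n kube-system aws-node-xxxxx

# Check the installed VPC CNI version against the recommended one
kubectl describe daemonset aws-node -n kube-system | grep Image
```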

AWS EKS fargate coredns ImagePullBackOff

I'm trying to deploy a simple tutorial app to a new Fargate-based Kubernetes cluster.
Unfortunately I'm stuck on ImagePullBackOff for the coredns pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning LoggingDisabled 5m51s fargate-scheduler Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
Normal Scheduled 4m11s fargate-scheduler Successfully assigned kube-system/coredns-86cb968586-mcdpj to fargate-ip-172-31-55-205.eu-central-1.compute.internal
Warning Failed 100s kubelet Failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": failed to do request: Head "https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/coredns/manifests/v1.8.0-eksbuild.1": dial tcp 3.122.9.124:443: i/o timeout
Warning Failed 100s kubelet Error: ErrImagePull
Normal BackOff 99s kubelet Back-off pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1"
Warning Failed 99s kubelet Error: ImagePullBackOff
Normal Pulling 87s (x2 over 4m10s) kubelet Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1"
While googling I found https://aws.amazon.com/premiumsupport/knowledge-center/eks-ecr-troubleshooting/
It contains the following list:
To resolve this error, confirm the following:
- The subnet for your worker node has a route to the internet. Check the route table associated with your subnet.
- The security group associated with your worker node allows outbound internet traffic.
- The ingress and egress rule for your network access control lists (ACLs) allows access to the internet.
Since I created both my private subnets as well as their NAT gateways manually, I tried to locate an issue there but couldn't find anything. They, as well as the security groups and ACLs, look fine to me.
I even added the AmazonEC2ContainerRegistryReadOnly policy to my EKS role, but after issuing kubectl rollout restart -n kube-system deployment coredns the result is unfortunately the same: ImagePullBackOff.
Unfortunately I've run out of ideas and I'm stuck. Any help troubleshooting this would be greatly appreciated. Thanks
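Since the pull times out reaching the ECR endpoint, it can help to confirm that the private subnet's route table actually points at the NAT gateway. A sketch with the AWS CLI; the subnet ID is a placeholder:

```shell
# Route table associated with the private subnet: look for a
# 0.0.0.0/0 route whose target is a nat-... gateway
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-0123456789abcdef0 \
  --query 'RouteTables[].Routes[]'

# Fargate pods use the subnet and security group from the Fargate profile
# directly, so that security group must allow outbound 443 to reach ECR
```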
edit>
After creating a new cluster via eksctl, as @mreferre suggested in his comment, I get an RBAC error with this link: https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting_iam.html#security-iam-troubleshoot-cannot-view-nodes-or-workloads
I'm not sure what is going on since I already have
edit>>
The cluster created via the AWS Console (web interface) doesn't have the configmap aws-auth. I retrieved the configmap below using kubectl edit configmap aws-auth -n kube-system:
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::370179080679:role/eksctl-tutorial-cluster-FargatePodExecutionRole-1J605HWNTGS2Q
      username: system:node:{{SessionName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2021-04-08T18:42:59Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "918"
  selfLink: /api/v1/namespaces/kube-system/configmaps/aws-auth
  uid: d9a21964-a8bf-49e9-800f-650320b7444e
Creating an answer to sum up the discussion in the comments that was deemed acceptable. The most common (and arguably easiest) way to set up an EKS cluster with Fargate support is to use eksctl and create the cluster with eksctl create cluster --fargate. This builds all the plumbing for you, and you get a cluster with no EC2 instances nor managed node groups, with the two CoreDNS pods deployed on two Fargate instances. Note that when you deploy with eksctl via the command line you may end up using different roles/users between your CLI and the console, which can result in access-denied issues. The best course of action would be to use a non-root user to log into the AWS console and use CloudShell to deploy with eksctl (CloudShell will inherit the same console user identity). {More info in the comments}
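The one-liner from the answer, slightly expanded; the cluster name and region are placeholders:

```shell
# Creates the VPC, the Fargate profiles for default/kube-system, and
# schedules the two CoreDNS pods on Fargate
eksctl create cluster --name tutorial --region eu-central-1 --fargate
```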

AWS ECS service Tasks getting replaced with (reason Request timed out)

We have been running ECS as our container-orchestration layer for more than 2 years, but there is one problem we have not been able to figure out the reason for. In a few of our (Node.js) services we have started observing errors in ECS events such as:
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
This causes our dependent services to start experiencing 504 gateway timeouts, which impacts them in a big way.
What we have tried so far:
1. Upgraded the Docker storage driver from devicemapper to overlay2
2. Increased the resources for all ECS instances, including CPU, RAM and EBS storage
3. Increased the health check grace period for the service from 0 to 240 secs
4. Increased KeepAliveTimeout and SocketTimeout to 180 secs
5. Enabled awslogs on containers instead of stdout, but there was no unusual behavior
6. Enabled ECSMetaData at the container and pipelined all of it into our application logs; this helped us look at the logs for the problematic container only
7. Enabled Container Insights for better container-level debugging
Of these, the things that helped most were upgrading devicemapper to the overlay2 storage driver and increasing the health check grace period.
The number of errors has come down dramatically with these two, but we still get this issue once in a while.
We have looked at all the graphs for the instance and container that went down; below are the logs:
ECS Container Insights logs for the victim container:
Query:
fields CpuUtilized, MemoryUtilized, #message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"
Example log entry returned:
{
"Version":"0",
"Type":"Container",
"ContainerName":"example-service",
"TaskId":"dac7a872-5536-482f-a2f8-d2234f9db6df",
"TaskDefinitionFamily":"example-service",
"TaskDefinitionRevision":"2048",
"ContainerInstanceId":"74306e00-e32a-4287-a201-72084d3364f6",
"EC2InstanceId":"i-016b0a460d9974567",
"ServiceName":"example-service",
"ClusterName":"example-service-cluster",
"Timestamp":1569227760000,
"CpuUtilized":1024.144923245614,
"CpuReserved":1347.0,
"MemoryUtilized":871,
"MemoryReserved":1857,
"StorageReadBytes":0,
"StorageWriteBytes":577536,
"NetworkRxBytes":14441583,
"NetworkRxDropped":0,
"NetworkRxErrors":0,
"NetworkRxPackets":17324,
"NetworkTxBytes":6136916,
"NetworkTxDropped":0,
"NetworkTxErrors":0,
"NetworkTxPackets":16989
}
None of the logs showed CPU or memory utilization that was ridiculously high.
We stopped getting responses from the victim container at, say, t1; we got errors in dependent services at t1+2mins, and the container was taken away by ECS at t1+3mins.
Our health check configuration is below:
Protocol HTTP
Path /healthcheck
Port traffic port
Healthy threshold 10
Unhealthy threshold 2
Timeout 5
Interval 10
Success codes 200
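With these numbers, the detection timeline can be worked out directly. A sketch, under the simplifying assumption that we count from the first failed check:

```shell
interval=10     # seconds between health checks
unhealthy=2     # consecutive failures before the target is marked unhealthy
timeout=5       # each individual check waits at most this long for a response

# A hung container is marked unhealthy roughly unhealthy * interval
# seconds after its first missed check
echo $((unhealthy * interval))   # 20
```

That ~20-second detection window, plus deregistration, is consistent with the observed gap between t1 and the container being taken away.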
Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:
docker info
Containers: 11
Running: 11
Paused: 0
Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this. But as mentioned, we found nothing that we could identify as the cause.
Your steps 1 through 7 have almost nothing to do with the error.
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
The error is very clear: your ECS service is not reachable by the load balancer health check.
Target Group Unhealthy
When this is the case, go straight to checking the container security group, the port, the application status and the health status code.
Possible reasons:
- There is no route for the path /healthcheck in the backend service
- The status code returned from /healthcheck is not 200
- The target port is invalid; configure it correctly: if the application is running on port 8080 or 3000, the target port should be 8080 or 3000
- The security group is not allowing traffic to the target group
- The application is not running in the container
These are the possible reasons for a timeout from the health check.
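Each of these can be checked quickly from the container instance itself; a sketch, with port 1047 taken from the error message above:

```shell
# From the ECS container instance: does the app answer on the mapped port,
# and with which status code?
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:1047/healthcheck

# Anything other than 200 (the configured success code), or a response
# slower than the 5s timeout, is counted as a failed check
```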
I faced the same issue of (reason Request timed out).
I managed to solve it by updating my security group inbound rules.
There were no inbound rules defined at all, so I added a general allow-all-traffic IPv4 rule for the time being, because I was in development at that time.
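For reference, an inbound rule like that can also be added with the AWS CLI; a sketch, where the security-group ID is a placeholder and allow-all is only reasonable in development:

```shell
# Allow all inbound IPv4 traffic (development only; in production, scope
# this down to the load balancer's security group and the traffic port)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol -1 \
  --cidr 0.0.0.0/0
```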