Error while launching the efs csi driver pod in EKS

Error while launching the efs csi driver pod in EKS - amazon-web-services

I am using aws EKS to deploy my application. There is a public subnet with a nodegroup in it. Then there is one private subnet with a node group as well. Everything was fine until I started using EFS. The pods for the efs csi driver would not launch on the private subnet node. The pod would give the following description:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m44s default-scheduler Successfully assigned kube-system/efs-csi-node-t79j6 to ip-192-168-93-186.ap-south-1.compute.internal
Normal Pulling 4m44s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.0.0"
Warning Failed 2m28s kubelet Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.0.0":
rpc error: code = Unknown desc = Error response from daemon: Get https://602401143452.dkr.ecr.us-west-2.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 2m28s kubelet Error: ErrImagePull
Normal Pulling 2m28s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/csi-node-driver-registrar:v1.3.0"
Warning Failed 13s kubelet Failed to pull image
"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/csi-node-driver-registrar:v1.3.0": rpc error: code = Unknown desc = Error response from daemon:
Get https://602401143452.dkr.ecr.us-west-2.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 13s kubelet Error: ErrImagePull
Normal Pulling 13s kubelet Pulling image
"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/livenessprobe:v2.0.0"
While another efs CSI driver pod launched successfully over the public subnet node. I thought it is not able to pull the image because of no internet connectivity on the private node but even after giving the internet gateway, i see the same issue. Please suggest some solution for this.
This was the command used for launching the driver pod :
kubectl apply -k
"github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/ecr/?ref=release-1.0"

All you need to do is to update the region , the repo is having the default region us-west-2, so if you are accessing the image from a different region you will be getting the imagepullbackoff error.
kubectl edit daemonsets/efs-csi-node -n kube-system
And then update it with your eks region for all the 3 images specified in the efs-csi daemonsets .

Related

Deploying JupyterHub on AWS EKS: Readiness probe failed, connection refused

I want to deploy JupyterHub on AWS' EKS service, and I am following the Zero to JupyterHub with Kubernetes guide to achieve this.
I am using the eksctl tool to deploy one cluster with a node group with one node that is represented by a t3.medium EC2 instance. After I deployed JupyterHub according to the instructions given in the guide, I get the following output when running kubectl get pods:
NAME READY STATUS RESTARTS AGE
continuous-image-puller-kl67x 1/1 Running 0 56s
hub-84b6467ff8-spjws 0/1 Pending 0 56s
proxy-79d75ddf8d-76rqm 1/1 Running 0 56s
user-scheduler-795f7d845f-7b8bn 1/1 Running 0 56s
user-scheduler-795f7d845f-mgks9 1/1 Running 0 56s
One pod, hub-84b6467ff8-spjws, remains in peding mode. kubectl describe pods outputs the following at the end:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m9s default-scheduler Successfully assigned jh/user-scheduler-795f7d845f-mgks9 to ip-192-168-20-191.eu-west-1.compute.internal
Normal Pulling 3m8s kubelet Pulling image "k8s.gcr.io/kube-scheduler:v1.23.10"
Normal Pulled 3m6s kubelet Successfully pulled image "k8s.gcr.io/kube-scheduler:v1.23.10" in 2.371033007s
Normal Created 3m3s kubelet Created container kube-scheduler
Normal Started 3m3s kubelet Started container kube-scheduler
Warning Unhealthy 3m3s kubelet Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused
I am having troubles understanding what "Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused" really means. I know there are similar questions relating to this, but so far their answers didn't work for me. I tried to have multiple nodes in the node group with nodes that have more storage, and I made sure the role have the right permission (according to the guide).
I am clearly missing something here, and I am more happy if someone could shed some light on this situation for me.

network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container issue in EKS. help anybody, please

When pods are increased through hpa, the following error occurs and pod creation is not possible.
If I manually change the replicas of the deployments, the pods are running normally.
It seems to be a CNI-related problem, and the same phenomenon occurs even if you install 1.7.10 cni for 1.20 cluster with add on .
200 IPs per subnet is sufficient, and the outbound security group is also open.
By default, that issue does not occur when the number of pods is scaled via kubectl .
7s Warning FailedCreatePodSandBox pod/b4c-ms-test-develop-5f64db58f-bm2vc Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1" network for pod "b4c-ms-test-develop-5f64db58f-bm2vc": networkPlugin cni failed to set up pod "b4c-ms-test-develop-5f64db58f-bm2vc_b4c-test" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1"
Region: eu-west-1
Cluster Name: dev-pangaia-b4c-eks
For AWS VPC CNI issue, have you attached node logs?: No
For DNS issue, have you attached CoreDNS pod log?:

AWS EKS - CSR Pending

I recently deploy an AWS EKS cluster with Managed nodes group. I notice when I run
kubectl get csr
All the CSRs listed are in Pending state. There are about 200 CSRs. When I try to SSH to pod it is giving "Error from server: error dialing backend: remote error: tls: internal error"
Do you guys face this issue?

Fail to init aws cluster (kubeadm init) with the message "could not init cloud provider "aws": error finding instance ... timeout

The issue I have is that kubeadm will never fully initialize. The output:
...
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
and journalctl -xeu kubelet shows the following interesting info:
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: W1203 17:54:08.017925 14709 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated. The AWS provider is deprecated and will be removed in a future release
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018044 14709 aws.go:1235] Building AWS cloudprovider
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018112 14709 aws.go:1195] Zone not specified in configuration file; querying AWS metadata service
Dec 03 17:56:08 ip-10-83-62-10.ec2.internal kubelet[14709]: F1203 17:56:08.332951 14709 server.go:265] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post \\\"https://ec2.us-east-1.amazonaws.com/\\\": dial tcp 10.83.60.11:443: i/o timeout
The context is: it's a fully private AWS VPC. There is a proxy that is propagated to k8s manifests.
the kubeadm.yaml config is pretty innocent and looks like this
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
extraArgs:
cloud-provider: aws
clusterName: cdspidr
controlPlaneEndpoint: ip-10-83-62-10.ec2.internal
controllerManager:
extraArgs:
cloud-provider: aws
configure-cloud-routes: "false"
kubernetesVersion: stable
networking:
dnsDomain: cluster.local
podSubnet: 10.83.62.0/24
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
name: ip-10-83-62-10.ec2.internal
kubeletExtraArgs:
cloud-provider: was
I'm looking for help to figure out a couple of things here:
why does kubeadm use this address (https://ec2.us-east-1.amazonaws.com) to retrieve availability zones? It does not look correct. IMO, it should be something like http://169.254.169.254/latest/dynamic/instance-identity/document
why does it fail? With the same proxy settings, a curl request from the terminal returns the web page.
To workaround it, how can I specify availability zones on my own in kubeadm.yaml or via a command like for kubeadm?
I would appreciate any help or thoughts.

You can create a VPC endpoint for accessing Ec2 (service name - com.amazonaws.us-east-1.ec2), this will allow the kubelet to talk to Ec2 without internet and fetch the required info.
While creating the VPC endpoint please make sure to enable private DNS resolution option.
Also from the error it looks like that kubelet is trying to fetch the instance not just availability zone. ("aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances).

Endpoints are not being updated with new IP address of a pod

Platform: AWS EKS
Output of helm version:
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.2", GitCommit:"a8b13cc5ab6a7dbef0a58f5061bcc7c0c61598e7", GitTreeState:"clean"}
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-2e569f", GitCommit:"2e569fd887357952e506846ed47fc30cc385409a", GitTreeState:"clean", BuildDate:"2019-07-25T23:13:33Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Cloud Provider/Platform (AKS, GKE, Minikube etc.): AWS EKS
The problem:
After jenkins pod restart, the pod got a new IP address and ReadinesProbe supposed to update the endpoints but it doesn't.
kubectl get endpoints
jenkins <none>
jenkins-agent <none>
Error:
Readiness probe failed: Get http://192.168.0.109:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I can successfully access above URL from all pods and worker nodes and I get correct Headers.
This happened after helm failed to upgrade jenkins and then I rollback the release, and it was successful (apart from now not updating endpoints)
Now I need to edit endpoints manually to point Endpoints to the correct IP address of a pod.
Current ReadinesProbe from deployment is:
readinessProbe:
failureThreshold: 3
httpGet:
path: /login
port: http
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
Log from Jenkins pod is:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m13s default-scheduler Successfully assigned default/jenkins-pod-id to <ip>.<region>.compute.internal
Normal SuccessfulAttachVolume 8m6s attachdetach-controller AttachVolume.Attach succeeded for volume "jenkins"
Normal Pulling 8m4s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m57s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m56s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m56s kubelet, <ip>.<region>.compute.internal Started container
Normal Pulling 7m43s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m42s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m42s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m42s kubelet, <ip>.<region>.compute.internal Started container
Warning Unhealthy 6m40s kubelet, <ip>.<region>.compute.internal Readiness probe failed: Get http://<IP>:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The pod got IP almost instantly but it takes a few minutes for the container to start. How can I get the ReadinesProbe to update Endpoints or even get ReadinesProbe logs? This is running in AWS so no access to the controller to get more logs.
If I update the endpoints fast enough, the ReadinesProbe won't fail but this doesn't help when the pod restarts next time.
Update:
Just enabled EKS logs and got this:
deployment_controller.go:484] Error syncing deployment default/jenkins: Operation cannot be fulfilled on deployments.apps "jenkins": the object has been modified; please apply your changes to the latest version and try again

Below helped. The Readiness probe is still failing but this is due to Jenkins taking 90s to start. I will update this.
helm delete jenkins
release "jenkins" deleted
helm rollback jenkins 25
Rollback was a success! Happy Helming!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js