I am using AWS EKS to deploy my application. The cluster has a public subnet and a private subnet, each with its own node group. Everything was fine until I started using EFS: the EFS CSI driver pods would not start on the private-subnet node. Describing the pod shows the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m44s default-scheduler Successfully assigned kube-system/efs-csi-node-t79j6 to ip-192-168-93-186.ap-south-1.compute.internal
Normal Pulling 4m44s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.0.0"
Warning Failed 2m28s kubelet Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.0.0": rpc error: code = Unknown desc = Error response from daemon: Get https://602401143452.dkr.ecr.us-west-2.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 2m28s kubelet Error: ErrImagePull
Normal Pulling 2m28s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/csi-node-driver-registrar:v1.3.0"
Warning Failed 13s kubelet Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/csi-node-driver-registrar:v1.3.0": rpc error: code = Unknown desc = Error response from daemon: Get https://602401143452.dkr.ecr.us-west-2.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 13s kubelet Error: ErrImagePull
Normal Pulling 13s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/livenessprobe:v2.0.0"
Meanwhile, another EFS CSI driver pod launched successfully on the public-subnet node. I assumed the private node could not pull the image because it has no internet connectivity, but even after adding an internet gateway I see the same issue. Please suggest a solution.
This is the command I used to deploy the driver:
kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/ecr/?ref=release-1.0"
All you need to do is update the region. The manifests in the repo default to the us-west-2 registry, so if your cluster pulls the images from a different region you can end up with an ImagePullBackOff error.
kubectl edit daemonsets/efs-csi-node -n kube-system
Then replace us-west-2 with your EKS region in all three images specified in the efs-csi-node DaemonSet.
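To make the substitution concrete, here is a minimal Python sketch of the region swap on an image reference. It is not part of the driver or of kubectl; the helper name with_region is hypothetical, and it assumes the same registry account ID also hosts the images in your region, which you should verify for your region before relying on it.

```python
import re

# Matches the registry host of an ECR image reference, for example
# 602401143452.dkr.ecr.us-west-2.amazonaws.com (must be followed by "/").
_ECR_HOST = re.compile(r"^(\d{12})\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com(?=/)")


def with_region(image: str, region: str) -> str:
    """Return `image` with its ECR region replaced by `region`.

    Hypothetical helper: references that do not look like ECR images
    are returned unchanged. The account ID is kept as-is (assumption).
    """
    return _ECR_HOST.sub(
        lambda m: f"{m.group(1)}.dkr.ecr.{region}.amazonaws.com",
        image,
        count=1,
    )


if __name__ == "__main__":
    # The three images named in the events from the question.
    images = [
        "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.0.0",
        "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/csi-node-driver-registrar:v1.3.0",
        "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/livenessprobe:v2.0.0",
    ]
    for image in images:
        print(with_region(image, "ap-south-1"))
```

The rewritten references are what you would paste into the three image fields when editing the DaemonSet.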
I recently deployed an AWS EKS cluster with a managed node group. I noticed that when I run
kubectl get csr
all the CSRs listed are in the Pending state. There are about 200 CSRs. When I try to SSH into a pod, it fails with "Error from server: error dialing backend: remote error: tls: internal error".
Has anyone else faced this issue?
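For anyone triaging a backlog like this, a small Python sketch can pull the names of undecided requests out of the JSON form of the list (kubectl get csr -o json). It assumes a CSR counts as pending when its status carries neither an Approved nor a Denied condition; the function name and the sample names below are hypothetical, and the sketch only lists requests, it does not approve anything.

```python
import json
from typing import List


def pending_csr_names(csr_list_json: str) -> List[str]:
    """Return names of CSRs with neither an Approved nor a Denied condition."""
    items = json.loads(csr_list_json).get("items", [])
    pending = []
    for item in items:
        # A CSR nobody has acted on has no conditions at all.
        conditions = item.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            pending.append(item["metadata"]["name"])
    return pending


# Hypothetical sample in the shape of `kubectl get csr -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "csr-aaaaa"}, "status": {}},
        {"metadata": {"name": "csr-bbbbb"},
         "status": {"conditions": [{"type": "Approved"}]}},
    ]
})

if __name__ == "__main__":
    print(pending_csr_names(SAMPLE))
```

Knowing which requests are pending, and which signer they target, is usually the first step before deciding whether manual approval is appropriate.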
Platform: AWS EKS
Output of helm version:
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.2", GitCommit:"a8b13cc5ab6a7dbef0a58f5061bcc7c0c61598e7", GitTreeState:"clean"}
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-2e569f", GitCommit:"2e569fd887357952e506846ed47fc30cc385409a", GitTreeState:"clean", BuildDate:"2019-07-25T23:13:33Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Cloud Provider/Platform (AKS, GKE, Minikube etc.): AWS EKS
The problem:
After the Jenkins pod restarted, it got a new IP address. The readinessProbe is supposed to update the endpoints, but it doesn't.
kubectl get endpoints
jenkins <none>
jenkins-agent <none>
Error:
Readiness probe failed: Get http://192.168.0.109:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I can successfully access the above URL from all pods and worker nodes, and I get the correct headers.
This started after helm failed to upgrade Jenkins and I rolled back the release. The rollback succeeded, apart from the endpoints no longer updating.
Now I have to edit the endpoints manually to point them at the correct pod IP address.
The current readinessProbe from the deployment is:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /login
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
The events from the Jenkins pod are:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m13s default-scheduler Successfully assigned default/jenkins-pod-id to <ip>.<region>.compute.internal
Normal SuccessfulAttachVolume 8m6s attachdetach-controller AttachVolume.Attach succeeded for volume "jenkins"
Normal Pulling 8m4s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m57s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m56s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m56s kubelet, <ip>.<region>.compute.internal Started container
Normal Pulling 7m43s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m42s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m42s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m42s kubelet, <ip>.<region>.compute.internal Started container
Warning Unhealthy 6m40s kubelet, <ip>.<region>.compute.internal Readiness probe failed: Get http://<IP>:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The pod got an IP almost instantly, but the container takes a few minutes to start. How can I get the readinessProbe to update the endpoints, or at least get the readinessProbe logs? This is running on AWS, so I have no access to the controller to get more logs.
If I update the endpoints fast enough, the readinessProbe won't fail, but this doesn't help the next time the pod restarts.
Update:
I just enabled EKS logs and got this:
deployment_controller.go:484] Error syncing deployment default/jenkins: Operation cannot be fulfilled on deployments.apps "jenkins": the object has been modified; please apply your changes to the latest version and try again
The steps below helped. The readiness probe is still failing, but that is because Jenkins takes 90s to start. I will update this.
helm delete jenkins
release "jenkins" deleted
helm rollback jenkins 25
Rollback was a success! Happy Helming!
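To see how the probe settings quoted earlier interact with a roughly 90s startup, here is a simplified Python model. It assumes probes fire exactly at initialDelaySeconds and then every periodSeconds, and that a probe succeeds once the application has been up for the startup time. The real kubelet adds jitter and applies timeoutSeconds per request, so treat the numbers as an approximation; the function name is hypothetical.

```python
from typing import Tuple


def first_ready_second(
    startup_seconds: int,
    initial_delay: int = 60,
    period: int = 10,
    success_threshold: int = 1,
) -> Tuple[int, int]:
    """Return (second at which the pod becomes ready, failed probes before that).

    Simplified model: probe k fires at initial_delay + k * period and
    succeeds if the app has finished starting by then.
    """
    t = initial_delay
    failed = 0
    successes = 0
    while True:
        if t >= startup_seconds:
            successes += 1
            if successes >= success_threshold:
                return t, failed
        else:
            failed += 1
        t += period


if __name__ == "__main__":
    # Defaults mirror the readinessProbe from the deployment above.
    ready_at, failures = first_ready_second(startup_seconds=90)
    print(f"ready at ~{ready_at}s after {failures} failed probes")
```

Under this model a 90s startup produces a few "Readiness probe failed" events before the first success, which matches seeing the warning even when nothing else is wrong. Raising initialDelaySeconds above the startup time would avoid those early failures.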
I started a Kubernetes cluster in AWS using the AWS Heptio-Kubernetes Quickstart about a month ago. I had been merrily installing applications onto it until recently, when I noticed that some of my pods weren't behaving correctly, and some were stuck in "Terminating" status or wouldn't initialize.
After reading through some of the troubleshooting guides, I realized that some of the core system pods in the "kube-system" namespace were not running: kube-apiserver, kube-controller-manager, and kube-scheduler. This would explain why my deployments were no longer scaling as expected and why terminating pods are never deleted. I can, however, still run commands and view cluster status with kubectl. See the screenshot below:
I'm not sure where to start to fix this. I've tried rebooting the server, I've stopped and restarted kubeadm with systemctl, and I've tried manually deleting the pods in /var/lib/kubelet/pods. Any help is greatly appreciated.
EDIT: I just realized some of my traffic might be blocked by Twistlock, the container security tool we installed on our worker nodes. I will consult with the vendor, as it may be blocking connectivity on the nodes.
I realized it might be a connectivity issue while gathering logs for each of the Kubernetes pods. See below for log excerpts (I have redacted the IPs):
kubectl logs kube-controller-manager-ip-*************.us-east-2.compute.internal -n kube-system
E0723 18:33:37.056730 1 route_controller.go:117] Couldn't reconcile node routes: error listing routes: unable to find route table for AWS cluster: kubernetes
kubectl -n kube-system logs kube-apiserver-ip-***************.us-east-2.compute.internal
I0723 18:38:23.380163 1 logs.go:49] http: TLS handshake error from ********: EOF
I0723 18:38:27.511654 1 logs.go:49] http: TLS handshake error from ********: EOF
kubectl -n kube-system logs kube-scheduler-ip-*******.us-east-2.compute.internal
E0723 15:31:54.397921 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1beta1.ReplicaSet: Get https://**********:6443/apis/extensions/v1beta1/replicasets?limit=500&resourceVersion=0: dial tcp ************: getsockopt: connection refused
E0723 15:31:54.398008 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Node: Get https://*********/api/v1/nodes?limit=500&resourceVersion=0: dial tcp ********:6443: getsockopt: connection refused
E0723 15:31:54.398075 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.ReplicationController: Get https://************8:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp ***********:6443: getsockopt: connection refused
E0723 15:31:54.398207 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Service: Get https://************:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp ***********:6443: getsockopt: connection refused
Edit: After contacting our Twistlock vendors I have verified that the connectivity issues are not due to Twistlock as there are no policies set in place to actually block or isolate the containers yet. My issue with the cluster still stands.
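Since the scheduler log shows "connection refused" on port 6443, one quick check from the affected node is whether anything accepts TCP connections on that port at all. The following is a minimal Python sketch using only the standard library; the function name can_connect and the placeholder address are hypothetical, and a successful connect only shows that something is listening, not that the API server is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: substitute the (redacted) API server IP.
    api_host = "127.0.0.1"
    print("port 6443 reachable:", can_connect(api_host, 6443))
```

If this returns False on the master node itself, the kube-apiserver process is probably not listening, which points away from a network policy problem and toward the API server container failing to start.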
I'm trying to create a pod on a Kubernetes cluster. Here is the YAML file I am creating the pod from:
kind: Pod
apiVersion: v1
metadata:
  name: task-pv-pod2
spec:
  containers:
    - name: task-pv-container2
      image: <<image_name>>
The pod hangs in ContainerCreating. Here is the output of kubectl describe pod:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
10s 10s 1 default-scheduler Normal Scheduled Successfully assigned task-pv-pod2 to ip-10-205-234-170.ec2.internal
8s 8s 1 kubelet, ip-10-205-234-170.ec2.internal Warning FailedSync Error syncing pod, skipping: failed to "SetupNetwork" for "task-pv-pod2_default" with SetupNetworkError: "NetworkPlugin cni failed to set up pod \"task-pv-pod2_default\" network: client: etcd cluster is unavailable or misconfigured; error #0: x509: cannot validate certificate for 10.205.234.170 because it doesn't contain any IP SANs\n; error #1: x509: cannot validate certificate for 10.205.235.160 because it doesn't contain any IP SANs\n; error #2: x509: cannot validate certificate for 10.205.234.162 because it doesn't contain any IP SANs\n"
7s 6s 2 kubelet, ip-10-205-234-170.ec2.internal Warning FailedSync Error syncing pod, skipping: failed to "TeardownNetwork" for "task-pv-pod2_default" with TeardownNetworkError: "NetworkPlugin cni failed to teardown pod \"task-pv-pod2_default\" network: client: etcd cluster is unavailable or misconfigured; error #0: x509: cannot validate certificate for 10.205.234.170 because it doesn't contain any IP SANs\n; error #1: x509: cannot validate certificate for 10.205.235.160 because it doesn't contain any IP SANs\n; error #2: x509: cannot validate certificate for 10.205.234.162 because it doesn't contain any IP SANs\n"
Does anyone know what might be causing this? In order for Kubernetes to work with AWS as a cloud provider, I had to set a proxy variable in the hyperkube container. Co
It seems your etcd certificates are not trusted for the name (or IP) you use to reach etcd. I suggest checking your cluster health with kubectl get cs and changing how Kubernetes talks to etcd if needed.
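The "doesn't contain any IP SANs" error means the client connected by IP address, but the certificate lists no matching IP in its subject alternative names. As an illustration only, this Python sketch checks for that condition on a certificate dict shaped like the one returned by ssl.SSLSocket.getpeercert(); the function name and the sample DNS name are hypothetical.

```python
import ipaddress


def cert_covers_ip(cert: dict, ip: str) -> bool:
    """Return True if `ip` appears as an IP SAN in `cert`.

    `cert` is assumed to look like the dict from ssl.SSLSocket.getpeercert(),
    where subjectAltName is a sequence of (kind, value) pairs.
    """
    wanted = ipaddress.ip_address(ip)
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "IP Address":
            continue  # DNS and other SAN types do not satisfy an IP lookup
        try:
            if ipaddress.ip_address(value.strip()) == wanted:
                return True
        except ValueError:
            continue  # skip entries that are not parseable addresses
    return False


if __name__ == "__main__":
    # Hypothetical certificate that only carries a DNS SAN.
    dns_only = {"subjectAltName": (("DNS", "etcd-0.example.internal"),)}
    print(cert_covers_ip(dns_only, "10.205.234.170"))
```

A certificate like the DNS-only one above would fail verification for any client that dials etcd by IP, so the usual options are to reissue the certificates with the node IPs as SANs or to point the client at a hostname the certificate already covers.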