Platform: AWS EKS
Output of helm version:
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.2", GitCommit:"a8b13cc5ab6a7dbef0a58f5061bcc7c0c61598e7", GitTreeState:"clean"}
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-2e569f", GitCommit:"2e569fd887357952e506846ed47fc30cc385409a", GitTreeState:"clean", BuildDate:"2019-07-25T23:13:33Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Cloud Provider/Platform (AKS, GKE, Minikube etc.): AWS EKS
The problem:
After a restart of the Jenkins pod, the pod got a new IP address, and the readinessProbe is supposed to get the Endpoints updated, but it doesn't.
kubectl get endpoints
jenkins <none>
jenkins-agent <none>
Error:
Readiness probe failed: Get http://192.168.0.109:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I can successfully access the above URL from all pods and worker nodes, and I get the correct headers.
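For example, something like this from another pod or a worker node returns the headers fine (the IP is the one from the probe error above):
curl -sI http://192.168.0.109:8080/login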
This happened after Helm failed to upgrade Jenkins; I then rolled back the release, and the rollback was successful (apart from the Endpoints now not being updated).
Now I have to edit the Endpoints manually to point them to the correct pod IP address.
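For reference, the endpoints controller only adds the pod IP back to the Endpoints once the pod's Ready condition is True, which can be checked and compared with something like this (a sketch; substitute the actual pod name):
kubectl get pods -o wide
kubectl get endpoints jenkins -o yaml
kubectl describe pod <jenkins-pod-name>    # the Conditions section shows whether Ready is True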
The current readinessProbe from the deployment is:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /login
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
The events from the Jenkins pod are:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m13s default-scheduler Successfully assigned default/jenkins-pod-id to <ip>.<region>.compute.internal
Normal SuccessfulAttachVolume 8m6s attachdetach-controller AttachVolume.Attach succeeded for volume "jenkins"
Normal Pulling 8m4s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m57s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m56s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m56s kubelet, <ip>.<region>.compute.internal Started container
Normal Pulling 7m43s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m42s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m42s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m42s kubelet, <ip>.<region>.compute.internal Started container
Warning Unhealthy 6m40s kubelet, <ip>.<region>.compute.internal Readiness probe failed: Get http://<IP>:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The pod gets an IP almost instantly, but it takes a few minutes for the container to start. How can I get the readinessProbe to update the Endpoints, or at least get readinessProbe logs? This is running on AWS EKS, so I have no access to the control plane to get more logs.
If I update the Endpoints fast enough, the readinessProbe won't fail, but this doesn't help the next time the pod restarts.
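As far as I can tell, probe results only surface as pod events and in the kubelet log on the node running the pod, so this is roughly what is available without control-plane access (a sketch):
kubectl describe pod <jenkins-pod-name>
kubectl get events --field-selector involvedObject.name=<jenkins-pod-name> --sort-by=.lastTimestamp
# on the worker node that runs the pod
journalctl -u kubelet | grep -i probe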
Update:
I just enabled EKS control-plane logging and got this:
deployment_controller.go:484] Error syncing deployment default/jenkins: Operation cannot be fulfilled on deployments.apps "jenkins": the object has been modified; please apply your changes to the latest version and try again
The commands below helped. The readiness probe is still failing, but that is because Jenkins takes about 90s to start; I will update the probe accordingly.
helm delete jenkins
release "jenkins" deleted
helm rollback jenkins 25
Rollback was a success! Happy Helming!
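For what it's worth, this is roughly the probe change I have in mind, assuming Jenkins needs about 90s before /login responds (the exact numbers are guesses):
readinessProbe:
  failureThreshold: 5
  httpGet:
    path: /login
    port: http
    scheme: HTTP
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5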
Related
I want to deploy JupyterHub on AWS EKS, and I am following the Zero to JupyterHub with Kubernetes guide to achieve this.
I am using the eksctl tool to deploy a cluster with one node group containing a single node, a t3.medium EC2 instance. After deploying JupyterHub according to the instructions in the guide, I get the following output when running kubectl get pods:
NAME READY STATUS RESTARTS AGE
continuous-image-puller-kl67x 1/1 Running 0 56s
hub-84b6467ff8-spjws 0/1 Pending 0 56s
proxy-79d75ddf8d-76rqm 1/1 Running 0 56s
user-scheduler-795f7d845f-7b8bn 1/1 Running 0 56s
user-scheduler-795f7d845f-mgks9 1/1 Running 0 56s
One pod, hub-84b6467ff8-spjws, remains in the Pending state. kubectl describe pods outputs the following at the end:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m9s default-scheduler Successfully assigned jh/user-scheduler-795f7d845f-mgks9 to ip-192-168-20-191.eu-west-1.compute.internal
Normal Pulling 3m8s kubelet Pulling image "k8s.gcr.io/kube-scheduler:v1.23.10"
Normal Pulled 3m6s kubelet Successfully pulled image "k8s.gcr.io/kube-scheduler:v1.23.10" in 2.371033007s
Normal Created 3m3s kubelet Created container kube-scheduler
Normal Started 3m3s kubelet Started container kube-scheduler
Warning Unhealthy 3m3s kubelet Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused
I am having trouble understanding what "Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused" really means. I know there are similar questions about this, but so far their answers haven't worked for me. I tried using multiple nodes in the node group, with nodes that have more storage, and I made sure the role has the right permissions (according to the guide).
I am clearly missing something here, and I would be more than happy if someone could shed some light on this situation for me.
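For reference, the events quoted above belong to the user-scheduler pod rather than to the Pending hub pod; the scheduling reason for a Pending pod (insufficient CPU/memory, an unbound PersistentVolumeClaim, etc.) shows up in that pod's own events, e.g. (a sketch, using the jh namespace visible in the output above):
kubectl describe pod hub-84b6467ff8-spjws -n jh
kubectl get events -n jh --sort-by=.lastTimestamp
kubectl get pvc -n jh    # the hub pod mounts a PVC by default; an unbound claim keeps it Pending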
I followed this guide to set up K8S on an EC2 instance, and I can't get it to work. I am not using EKS; instead I want to make my own K8S installation by running three EC2 instances, making one the master and the other two the workers.
I installed Docker, which works fine:
docker version
Client: Docker Engine - Community
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:02:46 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:00:51 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.7
GitCommit: 0197261a30bf81f1ee8e6a4dd2dea0ef95d67ccb
runc:
Version: 1.1.3
GitCommit: v1.1.3-0-g6724737
docker-init:
Version: 0.19.0
GitCommit: de40ad0
I also installed kubectl, kubeadm, and kubelet.
I ran kubeadm init to initialize my cluster, but after doing so the cluster doesn't manage to start successfully. It is possible to run kubectl cluster-info for a while, but after some time kubelet seems to give up and commands respond with The connection to the server 172.31.10.55:6443 was refused - did you specify the right host or port?.
When looking at the output of kubectl describe pod kube-apiserver-master -n=kube-system I can see the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 30m kubelet Stopping container kube-apiserver
Normal SandboxChanged 29m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29m kubelet Container image "k8s.gcr.io/kube-apiserver:v1.24.4" already present on machine
Normal Created 29m kubelet Created container kube-apiserver
Normal Started 29m kubelet Started container kube-apiserver
Warning Unhealthy 29m kubelet Readiness probe failed: Get "https://172.31.10.55:6443/readyz": dial tcp 172.31.10.55:6443: connect: connection refused
Warning Unhealthy 29m kubelet Liveness probe failed: Get "https://172.31.10.55:6443/livez": dial tcp 172.31.10.55:6443: connect: connection refused
Normal SandboxChanged 29m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29m kubelet Container image "k8s.gcr.io/kube-apiserver:v1.24.4" already present on machine
Normal Created 29m kubelet Created container kube-apiserver
Normal Started 29m kubelet Started container kube-apiserver
Warning Unhealthy 27m (x3 over 27m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 27m (x15 over 27m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 25m (x2 over 25m) kubelet Startup probe failed: Get "https://172.31.10.55:6443/livez": dial tcp 172.31.10.55:6443: connect: connection refused
Normal Killing 3m52s (x7 over 29m) kubelet Stopping container kube-apiserver
Looking at this, it seems the issue is that K8S fails to access its own API on port 6443, but that port is not restricted in AWS, and these should be internal requests anyway, never leaving the internal network.
I feel like I might be missing something obvious, but other people with similar problems (port 6443 reporting connection refused) always seem to have issues unrelated to mine. I assume it's due to some network constraint in AWS, but I have no idea which, since this is the first time I'm managing my own EC2 instances.
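For a self-managed kubeadm cluster, the next place to look is usually the kubelet and container runtime logs on the master node itself rather than anything on the AWS side, e.g. (a sketch, assuming the control-plane containers are visible to Docker on this setup):
sudo journalctl -u kubelet --no-pager | tail -n 100
sudo docker ps -a | grep kube-apiserver
sudo docker logs <kube-apiserver-container-id>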
I am using AWS EKS to deploy my application. There is a public subnet with a node group in it, and there is also a private subnet with a node group. Everything was fine until I started using EFS. The pods for the EFS CSI driver would not launch on the private subnet node. The pod gives the following description:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m44s default-scheduler Successfully assigned kube-system/efs-csi-node-t79j6 to ip-192-168-93-186.ap-south-1.compute.internal
Normal Pulling 4m44s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.0.0"
Warning Failed 2m28s kubelet Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.0.0":
rpc error: code = Unknown desc = Error response from daemon: Get https://602401143452.dkr.ecr.us-west-2.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 2m28s kubelet Error: ErrImagePull
Normal Pulling 2m28s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/csi-node-driver-registrar:v1.3.0"
Warning Failed 13s kubelet Failed to pull image
"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/csi-node-driver-registrar:v1.3.0": rpc error: code = Unknown desc = Error response from daemon:
Get https://602401143452.dkr.ecr.us-west-2.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 13s kubelet Error: ErrImagePull
Normal Pulling 13s kubelet Pulling image
"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/livenessprobe:v2.0.0"
Meanwhile, another EFS CSI driver pod launched successfully on the public subnet node. I thought it was not able to pull the image because of missing internet connectivity on the private node, but even after adding an internet gateway, I see the same issue. Please suggest a solution.
This was the command used for launching the driver pods:
kubectl apply -k
"github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/ecr/?ref=release-1.0"
All you need to do is update the region. The repo uses the default region us-west-2, so if you are pulling the images from a different region you will get an ImagePullBackOff error.
kubectl edit daemonsets/efs-csi-node -n kube-system
Then update all three images specified in the efs-csi-node DaemonSet to use your EKS region, as shown below.
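For example, with the cluster in ap-south-1 (as the node names in the events suggest), the three image references would become something like the lines below; the ECR account ID can differ in a few regions, so treat these as illustrative:
image: 602401143452.dkr.ecr.ap-south-1.amazonaws.com/eks/aws-efs-csi-driver:v1.0.0
image: 602401143452.dkr.ecr.ap-south-1.amazonaws.com/eks/csi-node-driver-registrar:v1.3.0
image: 602401143452.dkr.ecr.ap-south-1.amazonaws.com/eks/livenessprobe:v2.0.0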
I have set up an SAP Vora 2.1 installation on AWS using kops. It is a 4-node cluster with 1 master and 3 workers. The persistent volume requirement of vsystem-vrep is provided by AWS EFS, and those of the other stateful components by AWS EBS. The installation goes through fine and runs for a few days, but after 3-4 days the following 5 Vora pods start showing issues:
vora-catalog
Vora-relational
Vora-timeseries
vora-tx-coordinator
vora-disk
Each of these pods has 2 containers, and both should be up and running. However, after 3-4 days one of the containers goes down on its own, although the Kubernetes cluster itself is up and running. I tried various ways to bring these pods back up with all required containers, but they do not come up.
I have captured the events for vora-disk as a sample, but all of the pods show the same trace:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 7m 21 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning Unhealthy Liveness probe failed: dial tcp 100.96.7.21:10002: getsockopt: connection refused
1h 2m 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Killing Killing container with id docker://disk:pod "vora-disk-0_vora(2f5ea6df-545b-11e8-90fd-029979a0ef92)" container "disk" is unhealthy, it will be killed and re-created.
1h 58s 51 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal Warning FailedSync Error syncing pod
1h 58s 41 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning BackOff Back-off restarting failed container
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Started Started container
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Pulled Container image "ip-172-31-13-236.ap-southeast-2.compute.internal:5000/vora/dqp:2.1.32.19-vora-2.1" already present on machine
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Created Created container
1h 1s 988 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning Unhealthy Readiness probe failed: HTTP probe failed with statuscode: 503
I would appreciate any pointers to resolve this issue.
Thanks Frank for your suggestion and pointer. It has definitely helped to overcome a few issues, but not all.
We have specifically observed issues with Vora services going down for no apparent reason. While we understand that there may be some reason why Vora goes down, the recovery procedure is not documented in the admin guide or anywhere on the internet. We have seen the Vora services created by vora-operator going down (each of these pods contains one security container and one service-specific container; the service-specific container goes down and does not come up). We tried various options, like restarting all Vora pods or only restarting the pods related to the Vora deployment operator, but those pods do not come up. We end up re-deploying Vora in such cases, but that essentially means all previous work is lost. Is there any command or way to bring the Vora pods up with all containers running?
This issue is described in SAP Note 2631736 ("Liveness and Readiness issue in Vora 2.x"), which suggests increasing the health check interval.
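A minimal sketch of what a relaxed liveness probe on the disk container could look like (the port is taken from the failing probe in the events above; the other values are illustrative, and the actual probe definition is owned by the Vora operator/chart):
livenessProbe:
  tcpSocket:
    port: 10002          # port seen in the failing probe above
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 5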
So I'm trying to create a pod on a Kubernetes cluster. Here is the YAML file from which I am creating the pod:
kind: Pod
apiVersion: v1
metadata:
  name: task-pv-pod2
spec:
  containers:
    - name: task-pv-container2
      image: <<image_name>>
The pod hangs in ContainerCreating. Here is the output of kubectl describe pod:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
10s 10s 1 default-scheduler Normal Scheduled Successfully assigned task-pv-pod2 to ip-10-205-234-170.ec2.internal
8s 8s 1 kubelet, ip-10-205-234-170.ec2.internal Warning FailedSync Error syncing pod, skipping: failed to "SetupNetwork" for "task-pv-pod2_default" with SetupNetworkError: "NetworkPlugin cni failed to set up pod \"task-pv-pod2_default\" network: client: etcd cluster is unavailable or misconfigured; error #0: x509: cannot validate certificate for 10.205.234.170 because it doesn't contain any IP SANs\n; error #1: x509: cannot validate certificate for 10.205.235.160 because it doesn't contain any IP SANs\n; error #2: x509: cannot validate certificate for 10.205.234.162 because it doesn't contain any IP SANs\n"
7s 6s 2 kubelet, ip-10-205-234-170.ec2.internal Warning FailedSync Error syncing pod, skipping: failed to "TeardownNetwork" for "task-pv-pod2_default" with TeardownNetworkError: "NetworkPlugin cni failed to teardown pod \"task-pv-pod2_default\" network: client: etcd cluster is unavailable or misconfigured; error #0: x509: cannot validate certificate for 10.205.234.170 because it doesn't contain any IP SANs\n; error #1: x509: cannot validate certificate for 10.205.235.160 because it doesn't contain any IP SANs\n; error #2: x509: cannot validate certificate for 10.205.234.162 because it doesn't contain any IP SANs\n"
Does anyone know what might be causing this? In order for Kubernetes to work with AWS as a cloud provider, I had to set a proxy variable in the hyperkube container.
It seems your etcd certificates are not trusted for the name (or IP) you access them on. I suggest checking your cluster health with kubectl get cs and, if needed, modifying the way k8s talks to etcd.
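A rough way to confirm that (a sketch; the node IP is taken from the error above, and the default etcd client port 2379 is an assumption about this cluster's setup):
kubectl get cs
# check which names/IPs the etcd server certificate is actually valid for
openssl s_client -connect 10.205.234.170:2379 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
If the node IPs are missing from the Subject Alternative Names, the etcd certificates need to be re-issued with those IPs included, or the CNI/etcd client configured to use a name that is on the certificate.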