Deploying JupyterHub on AWS EKS: Readiness probe failed, connection refused - amazon-web-services

I want to deploy JupyterHub on AWS' EKS service, and I am following the Zero to JupyterHub with Kubernetes guide to achieve this.
I am using the eksctl tool to deploy one cluster with a node group with one node that is represented by a t3.medium EC2 instance. After I deployed JupyterHub according to the instructions given in the guide, I get the following output when running kubectl get pods:
NAME READY STATUS RESTARTS AGE
continuous-image-puller-kl67x 1/1 Running 0 56s
hub-84b6467ff8-spjws 0/1 Pending 0 56s
proxy-79d75ddf8d-76rqm 1/1 Running 0 56s
user-scheduler-795f7d845f-7b8bn 1/1 Running 0 56s
user-scheduler-795f7d845f-mgks9 1/1 Running 0 56s
One pod, hub-84b6467ff8-spjws, remains in peding mode. kubectl describe pods outputs the following at the end:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m9s default-scheduler Successfully assigned jh/user-scheduler-795f7d845f-mgks9 to ip-192-168-20-191.eu-west-1.compute.internal
Normal Pulling 3m8s kubelet Pulling image "k8s.gcr.io/kube-scheduler:v1.23.10"
Normal Pulled 3m6s kubelet Successfully pulled image "k8s.gcr.io/kube-scheduler:v1.23.10" in 2.371033007s
Normal Created 3m3s kubelet Created container kube-scheduler
Normal Started 3m3s kubelet Started container kube-scheduler
Warning Unhealthy 3m3s kubelet Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused
I am having troubles understanding what "Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused" really means. I know there are similar questions relating to this, but so far their answers didn't work for me. I tried to have multiple nodes in the node group with nodes that have more storage, and I made sure the role have the right permission (according to the guide).
I am clearly missing something here, and I am more happy if someone could shed some light on this situation for me.

Related

Setting up K8S on AWS machine fails: api server not starting

I followed this guide to set up K8S on an EC2 instance, and I can't get it to work. I am not using EKS, but instead I want to make my own installation of K8S by running three EC2 instances and make one the master and the two others the workers.
I installed Docker, which works fine:
docker version
Client: Docker Engine - Community
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:02:46 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:00:51 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.7
GitCommit: 0197261a30bf81f1ee8e6a4dd2dea0ef95d67ccb
runc:
Version: 1.1.3
GitCommit: v1.1.3-0-g6724737
docker-init:
Version: 0.19.0
GitCommit: de40ad0
And kubectl, kubeadm and kubelet.
I ran kubeadm init to initialize my cluster, but after doing so the cluster doesn't manage to start successfully. It is possible to run kubectl cluster-info for a while, but after some time kubelet seems to just give up and commands answer with The connection to the server 172.31.10.55:6443 was refused - did you specify the right host or port?.
When looking at the output of kubectl describe pod kube-apiserver-master -n=kube-system I can see the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 30m kubelet Stopping container kube-apiserver
Normal SandboxChanged 29m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29m kubelet Container image "k8s.gcr.io/kube-apiserver:v1.24.4" already present on machine
Normal Created 29m kubelet Created container kube-apiserver
Normal Started 29m kubelet Started container kube-apiserver
Warning Unhealthy 29m kubelet Readiness probe failed: Get "https://172.31.10.55:6443/readyz": dial tcp 172.31.10.55:6443: connect: connection refused
Warning Unhealthy 29m kubelet Liveness probe failed: Get "https://172.31.10.55:6443/livez": dial tcp 172.31.10.55:6443: connect: connection refused
Normal SandboxChanged 29m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29m kubelet Container image "k8s.gcr.io/kube-apiserver:v1.24.4" already present on machine
Normal Created 29m kubelet Created container kube-apiserver
Normal Started 29m kubelet Started container kube-apiserver
Warning Unhealthy 27m (x3 over 27m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 27m (x15 over 27m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 25m (x2 over 25m) kubelet Startup probe failed: Get "https://172.31.10.55:6443/livez": dial tcp 172.31.10.55:6443: connect: connection refused
Normal Killing 3m52s (x7 over 29m) kubelet Stopping container kube-apiserver
Looking at this, it seems that the issue is that K8S fails to access its own API on port 6443, but the port is not restricted in AWS and anyway this should be internal requests, never coming out of the internal network.
I feel like I might be missing something obvious, but looking at other people with similar problems (6443 port reporting a connection refused) always seems unrelated to the issue I have here. I assume it's due to some network constraint in AWS but no idea which, since it's the first time I manage my own EC2 instances.

How to work with AWS cloud controller manager

I am trying to expose my applications running in my kubernetes cluster through AWS load balancer.
I followed the document https://cloudyuga.guru/blog/cloud-controller-manager and got till the point where i added --cloud-provider=external in kubeadm.conf file.
But this document is based on Digitial Ocean cloud and i'm working on AWS, i'm confused if i have to run any deployment.yaml file to get the pods running which are in pending status if so please provide me the link, i'm stuck at this point.
NAME READY STATUS RESTARTS AGE
coredns-66bff467f8-dlx76 0/1 Pending 0 3m32s
coredns-66bff467f8-svb6z 0/1 Pending 0 3m32s
etcd-ip-172-31-74-144.ec2.internal 1/1 Running 0 3m38s
kube-apiserver-ip-172-31-74-144.ec2.internal 1/1 Running 0 3m38s
kube-controller-manager-ip-172-31-74-144.ec2.internal 1/1 Running 0 3m37s
kube-proxy-rh8g4 1/1 Running 0 3m32s
kube-proxy-vsvlt 1/1 Running 0 3m28s
kube-scheduler-ip-172-31-74-144.ec2.internal 1/1 Running 0 3m37s
The coredns pods are pending because you have not installed a Pod Network add-on yet. From the docs here you can choose any supported Pod Network add-on. For example to use calico
kubectl apply -f https://docs.projectcalico.org/v3.14/manifests/calico.yaml
After the Pod Network add-on is installed the coredns pods should come up.

Endpoints are not being updated with new IP address of a pod

Platform: AWS EKS
Output of helm version:
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.2", GitCommit:"a8b13cc5ab6a7dbef0a58f5061bcc7c0c61598e7", GitTreeState:"clean"}
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-2e569f", GitCommit:"2e569fd887357952e506846ed47fc30cc385409a", GitTreeState:"clean", BuildDate:"2019-07-25T23:13:33Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Cloud Provider/Platform (AKS, GKE, Minikube etc.): AWS EKS
The problem:
After jenkins pod restart, the pod got a new IP address and ReadinesProbe supposed to update the endpoints but it doesn't.
kubectl get endpoints
jenkins <none>
jenkins-agent <none>
Error:
Readiness probe failed: Get http://192.168.0.109:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I can successfully access above URL from all pods and worker nodes and I get correct Headers.
This happened after helm failed to upgrade jenkins and then I rollback the release, and it was successful (apart from now not updating endpoints)
Now I need to edit endpoints manually to point Endpoints to the correct IP address of a pod.
Current ReadinesProbe from deployment is:
readinessProbe:
failureThreshold: 3
httpGet:
path: /login
port: http
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
Log from Jenkins pod is:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m13s default-scheduler Successfully assigned default/jenkins-pod-id to <ip>.<region>.compute.internal
Normal SuccessfulAttachVolume 8m6s attachdetach-controller AttachVolume.Attach succeeded for volume "jenkins"
Normal Pulling 8m4s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m57s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m56s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m56s kubelet, <ip>.<region>.compute.internal Started container
Normal Pulling 7m43s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m42s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m42s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m42s kubelet, <ip>.<region>.compute.internal Started container
Warning Unhealthy 6m40s kubelet, <ip>.<region>.compute.internal Readiness probe failed: Get http://<IP>:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The pod got IP almost instantly but it takes a few minutes for the container to start. How can I get the ReadinesProbe to update Endpoints or even get ReadinesProbe logs? This is running in AWS so no access to the controller to get more logs.
If I update the endpoints fast enough, the ReadinesProbe won't fail but this doesn't help when the pod restarts next time.
Update:
Just enabled EKS logs and got this:
deployment_controller.go:484] Error syncing deployment default/jenkins: Operation cannot be fulfilled on deployments.apps "jenkins": the object has been modified; please apply your changes to the latest version and try again
Below helped. The Readiness probe is still failing but this is due to Jenkins taking 90s to start. I will update this.
helm delete jenkins
release "jenkins" deleted
helm rollback jenkins 25
Rollback was a success! Happy Helming!

Kubernetes Scheduler, API Server, and Controller Manager Containers not running in cluster

I started a kubernetes cluster in AWS using the AWS Heptio-Kubernetes Quickstart about a month ago. I had been merrily installing applications onto it until recently when I noticed that some of my pods weren't behaving correctly, and some were stuck in "terminating" status or wouldn't initialize.
After reading through some of the troubleshooting guides I realized that so of the core system pods in the "kube-system" namespace were not running: kube-apiserver, kube-controller-manager, and kube-scheduler. This would explain why my deployments were no longer scaling as expected and why terminating pods will not delete. I can however still run commands and view cluster status with kubectl. See the screenshot below:
Not sure where to start to mitigate this. I've tried rebooting the server, I've stopped and restarted kubeadm with systemctl, and I've tried manually deleting the pods in /var/lib/kubelet/pods. Any help is greatly appreciated.
EDIT: I just realized some of my traffic might be blocked by the container security tool we installed on our worker nodes called Twistlock. I will consult with them as it may be blocking connectivity on the nodes.
I realized it might be connectivity issues when gathering logs for each of the kubernetes pods, see below for log excerpts ( i have redacted the IPs):
kubectl logs kube-controller-manager-ip-*************.us-east-2.compute.internal -n kube-system
E0723 18:33:37.056730 1 route_controller.go:117] Couldn't reconcile node routes: error listing routes: unable to find route table for AWS cluster: kubernetes
kubectl -n kube-system logs kube-apiserver-ip-***************.us-east-2.compute.internal
I0723 18:38:23.380163 1 logs.go:49] http: TLS handshake error from ********: EOF
I0723 18:38:27.511654 1 logs.go:49] http: TLS handshake error from ********: EOF
kubectl -n kube-system logs kube-scheduler-ip-*******.us-east-2.compute.internal
E0723 15:31:54.397921 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1beta1.ReplicaSet: Get https://**********:6443/apis/extensions/v1beta1/replicasets?limit=500&resourceVersion=0: dial tcp ************: getsockopt: connection refused
E0723 15:31:54.398008 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Node: Get https://*********/api/v1/nodes?limit=500&resourceVersion=0: dial tcp ********:6443: getsockopt: connection refused
E0723 15:31:54.398075 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.ReplicationController: Get https://************8:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp ***********:6443: getsockopt: connection refused
E0723 15:31:54.398207 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Service: Get https://************:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp ***********:6443: getsockopt: connection refused
Edit: After contacting our Twistlock vendors I have verified that the connectivity issues are not due to Twistlock as there are no policies set in place to actually block or isolate the containers yet. My issue with the cluster still stands.

SAP Vora2.1 on AWS intermittently goes down

I have set up a SAP Vora2.1 installation on AWS using kops. It is a 4 node cluster with 1 master and 3 nodes. the persistent volume requirements for vsystem-vrep is provided using AWS-EFS and for other stateful components by using AWS-EBS. While the installation goes through fine and runs for few days but after 3-4 days following 5 vora pods starts showing some issues,
vora-catalog
Vora-relational
Vora-timeseries
vora-tx-coordinator
vora-disk
Each of these pods has 2 containers and both should be up and running. However after 3-4 days one of the containers goes down on its own although kubernetes cluster is up and running. I tried various ways to bring these pods up and running with all required containers in it but it does not come up.
I have captured events for vora-disk as sample but all of pods show same trace,
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 7m 21 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning Unhealthy Liveness probe failed: dial tcp 100.96.7.21:10002: getsockopt: connection refused
1h 2m 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Killing Killing container with id docker://disk:pod "vora-disk-0_vora(2f5ea6df-545b-11e8-90fd-029979a0ef92)" container "disk" is unhealthy, it will be killed and re-created.
1h 58s 51 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal Warning FailedSync Error syncing pod
1h 58s 41 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning BackOff Back-off restarting failed container
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Started Started container
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Pulled Container image "ip-172-31-13-236.ap-southeast-2.compute.internal:5000/vora/dqp:2.1.32.19-vora-2.1" already present on machine
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Created Created container
1h 1s 988 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning Unhealthy Readiness probe failed: HTTP probe failed with statuscode: 503
Appreciate if any pointers to resolve this issue.
Thanks Frank for you suggestion and pointer. Definitely this has helped to overcome few issues but not all.
We have specifically observed issues related to Vora services going down for no reason. While we understand that there may be some reason why Vora goes down however the recovery procedure is not available either in admin guide or anywhere on internet. We have seen Vora services created by vora-operator going down (each of these pods contains one security container and other service specific container. Service specific container goes down and does not come up). we tried various options like restarting all vora pods or only restarting pods related to vora deployment operator but these pods do not come up. We are re-deploying Vora in such cases but that essentially means all previous work goes away. Is there any command or way so that Vora pods comes up with all container?
This issue is described in SAP Note 2631736 - Liveness and Readiness issue in Vora 2.x - it is suggested to increase the health check interval.