I started a Kubernetes cluster in AWS using the AWS Heptio Kubernetes Quickstart about a month ago. I had been merrily installing applications onto it until recently, when I noticed that some of my pods weren't behaving correctly: some were stuck in "Terminating" status and others wouldn't initialize.
After reading through some of the troubleshooting guides, I realized that some of the core system pods in the "kube-system" namespace were not running: kube-apiserver, kube-controller-manager, and kube-scheduler. This would explain why my deployments were no longer scaling as expected and why terminating pods would not delete. I can, however, still run commands and view cluster status with kubectl.
I'm not sure where to start to mitigate this. I've tried rebooting the server, stopping and restarting the kubelet with systemctl, and manually deleting the pods under /var/lib/kubelet/pods. Any help is greatly appreciated.
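For reference, these are roughly the checks I've been running on the master so far (I'm assuming the Quickstart uses a kubeadm-style layout with static pod manifests, which I haven't confirmed):
kubectl get pods -n kube-system -o wide
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 100
sudo ls /etc/kubernetes/manifests          # static pod manifests for the apiserver, controller-manager and scheduler
sudo docker ps -a | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler'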
EDIT: I just realized some of my traffic might be blocked by Twistlock, the container security tool we installed on our worker nodes. I will consult with the vendor, as it may be blocking connectivity on the nodes.
I started suspecting connectivity issues while gathering logs for each of the Kubernetes control-plane pods; see the log excerpts below (I have redacted the IPs):
kubectl logs kube-controller-manager-ip-*************.us-east-2.compute.internal -n kube-system
E0723 18:33:37.056730 1 route_controller.go:117] Couldn't reconcile node routes: error listing routes: unable to find route table for AWS cluster: kubernetes
kubectl -n kube-system logs kube-apiserver-ip-***************.us-east-2.compute.internal
I0723 18:38:23.380163 1 logs.go:49] http: TLS handshake error from ********: EOF
I0723 18:38:27.511654 1 logs.go:49] http: TLS handshake error from ********: EOF
kubectl -n kube-system logs kube-scheduler-ip-*******.us-east-2.compute.internal
E0723 15:31:54.397921 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1beta1.ReplicaSet: Get https://**********:6443/apis/extensions/v1beta1/replicasets?limit=500&resourceVersion=0: dial tcp ************: getsockopt: connection refused
E0723 15:31:54.398008 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Node: Get https://*********/api/v1/nodes?limit=500&resourceVersion=0: dial tcp ********:6443: getsockopt: connection refused
E0723 15:31:54.398075 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.ReplicationController: Get https://************8:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp ***********:6443: getsockopt: connection refused
E0723 15:31:54.398207 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Service: Get https://************:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp ***********:6443: getsockopt: connection refused
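Since every scheduler and controller-manager error above boils down to the API server refusing connections on port 6443, I've also been sanity-checking on the master whether anything is actually listening there (nothing Quickstart-specific, just basic checks):
sudo ss -tlnp | grep 6443                  # is kube-apiserver bound to 6443?
curl -k https://localhost:6443/healthz     # does the apiserver answer locally?
sudo docker ps -a | grep kube-apiserver    # is the apiserver container crash-looping?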
Edit: After contacting our Twistlock vendor, I have verified that the connectivity issues are not due to Twistlock, as no policies are in place yet that would block or isolate the containers. My issue with the cluster still stands.
Related
I want to deploy JupyterHub on AWS EKS, and I am following the Zero to JupyterHub with Kubernetes guide to achieve this.
I am using the eksctl tool to deploy a cluster with a single node group containing one node, backed by a t3.medium EC2 instance. After deploying JupyterHub according to the instructions in the guide, I get the following output when running kubectl get pods:
NAME                              READY   STATUS    RESTARTS   AGE
continuous-image-puller-kl67x     1/1     Running   0          56s
hub-84b6467ff8-spjws              0/1     Pending   0          56s
proxy-79d75ddf8d-76rqm            1/1     Running   0          56s
user-scheduler-795f7d845f-7b8bn   1/1     Running   0          56s
user-scheduler-795f7d845f-mgks9   1/1     Running   0          56s
One pod, hub-84b6467ff8-spjws, remains in Pending state. kubectl describe pods outputs the following at the end:
Events:
  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Scheduled  3m9s   default-scheduler  Successfully assigned jh/user-scheduler-795f7d845f-mgks9 to ip-192-168-20-191.eu-west-1.compute.internal
  Normal   Pulling    3m8s   kubelet            Pulling image "k8s.gcr.io/kube-scheduler:v1.23.10"
  Normal   Pulled     3m6s   kubelet            Successfully pulled image "k8s.gcr.io/kube-scheduler:v1.23.10" in 2.371033007s
  Normal   Created    3m3s   kubelet            Created container kube-scheduler
  Normal   Started    3m3s   kubelet            Started container kube-scheduler
  Warning  Unhealthy  3m3s   kubelet            Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused
I am having trouble understanding what "Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused" really means. I know there are similar questions about this, but so far their answers haven't worked for me. I tried using multiple nodes in the node group, with more storage per node, and I made sure the role has the right permissions (according to the guide).
I am clearly missing something here, and I would be more than happy if someone could shed some light on this situation for me.
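In case it helps, these are the extra checks I've been running; I also wonder whether the hub pod simply cannot be scheduled because of the t3.medium's pod limit (17 pods with the default VPC CNI settings, if I understand that correctly):
kubectl describe pod hub-84b6467ff8-spjws -n jh          # the Events section should name the reason it is Pending
kubectl get events -n jh --sort-by=.lastTimestamp
kubectl describe node ip-192-168-20-191.eu-west-1.compute.internal | grep -A 8 "Allocated resources"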
When pods are scaled up through the HPA, the following error occurs and the new pods cannot be created.
If I manually change the replicas of the deployment, the pods run normally.
It seems to be a CNI-related problem, and the same phenomenon occurs even after installing the VPC CNI 1.7.10 add-on on the 1.20 cluster.
The 200 IPs per subnet should be sufficient, and the outbound security group is also open.
Notably, the issue does not occur when the number of pods is scaled via kubectl.
7s Warning FailedCreatePodSandBox pod/b4c-ms-test-develop-5f64db58f-bm2vc Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1" network for pod "b4c-ms-test-develop-5f64db58f-bm2vc": networkPlugin cni failed to set up pod "b4c-ms-test-develop-5f64db58f-bm2vc_b4c-test" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1"
Region: eu-west-1
Cluster Name: dev-pangaia-b4c-eks
For AWS VPC CNI issue, have you attached node logs?: No
For DNS issue, have you attached CoreDNS pod log?:
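So far I have mostly been looking at the VPC CNI pods and the ipamd logs on the affected node, roughly with these commands (the log paths are the defaults for the AWS VPC CNI, as far as I know):
kubectl -n kube-system get daemonset aws-node
kubectl -n kube-system logs -l k8s-app=aws-node --tail=100
# on the worker node where the sandbox creation failed:
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log
sudo tail -n 100 /var/log/aws-routed-eni/plugin.log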
I recently deployed an AWS EKS cluster with a managed node group. I noticed that when I run
kubectl get csr
all the CSRs listed are in Pending state. There are about 200 of them. When I try to SSH into a pod, I get "Error from server: error dialing backend: remote error: tls: internal error".
Has anyone else faced this issue?
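As far as I understand, the "error dialing backend: tls: internal error" is what you see while the kubelet serving certificates are still unsigned, so one stopgap (after verifying the CSRs are legitimate) is to approve the pending ones:
kubectl get csr
kubectl certificate approve <csr-name>                        # approve a single CSR
kubectl get csr -o name | xargs kubectl certificate approve   # approve all pending CSRs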
The issue I have is that kubeadm init never fully completes. The output:
...
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
and journalctl -xeu kubelet shows the following interesting info:
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: W1203 17:54:08.017925 14709 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated. The AWS provider is deprecated and will be removed in a future release
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018044 14709 aws.go:1235] Building AWS cloudprovider
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018112 14709 aws.go:1195] Zone not specified in configuration file; querying AWS metadata service
Dec 03 17:56:08 ip-10-83-62-10.ec2.internal kubelet[14709]: F1203 17:56:08.332951 14709 server.go:265] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post \\\"https://ec2.us-east-1.amazonaws.com/\\\": dial tcp 10.83.60.11:443: i/o timeout
For context: this is a fully private AWS VPC, and there is a proxy whose settings are propagated into the k8s manifests.
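One thing I'm not sure about: the kubelet itself runs as a systemd service, so I don't think it inherits proxy settings that are only injected into the manifests. A drop-in along these lines (the proxy address is just a placeholder) is what I believe would be needed if the EC2 API has to go through the proxy, while keeping the metadata endpoint and the pod CIDR in NO_PROXY:
sudo mkdir -p /etc/systemd/system/kubelet.service.d
sudo tee /etc/systemd/system/kubelet.service.d/10-proxy.conf <<'EOF'
[Service]
Environment="HTTP_PROXY=http://proxy.internal:3128"
Environment="HTTPS_PROXY=http://proxy.internal:3128"
Environment="NO_PROXY=169.254.169.254,10.83.62.0/24,localhost,127.0.0.1,.ec2.internal"
EOF
sudo systemctl daemon-reload && sudo systemctl restart kubelet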
The kubeadm.yaml config is pretty innocent and looks like this:
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    cloud-provider: aws
clusterName: cdspidr
controlPlaneEndpoint: ip-10-83-62-10.ec2.internal
controllerManager:
  extraArgs:
    cloud-provider: aws
    configure-cloud-routes: "false"
kubernetesVersion: stable
networking:
  dnsDomain: cluster.local
  podSubnet: 10.83.62.0/24
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  name: ip-10-83-62-10.ec2.internal
  kubeletExtraArgs:
    cloud-provider: aws
I'm looking for help to figure out a couple of things here:
Why does kubeadm use this address (https://ec2.us-east-1.amazonaws.com) to retrieve availability zones? It does not look correct; IMO, it should be something like http://169.254.169.254/latest/dynamic/instance-identity/document
Why does it fail? With the same proxy settings, a curl request from the terminal returns the page just fine.
To work around it, how can I specify the availability zone myself, either in kubeadm.yaml or via a command-line option to kubeadm?
I would appreciate any help or thoughts.
You can create a VPC endpoint for accessing EC2 (service name: com.amazonaws.us-east-1.ec2); this will allow the kubelet to talk to EC2 without internet access and fetch the required info.
While creating the VPC endpoint, please make sure to enable the private DNS resolution option.
Also, from the error it looks like the kubelet is trying to look up the instance itself, not just the availability zone ("aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances").
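In case a concrete example helps, the endpoint can be created with something along these lines in the AWS CLI (the VPC, subnet, and security group IDs are placeholders):
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ec2 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled
The endpoint's security group also needs to allow inbound 443 from the node's subnet, otherwise the calls will still time out.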
I'm working on a k8s setup with 1 master node and 1 worker node. I'm done with the master setup and now I'm trying to join the worker node to the cluster:
sudo kubeadm join master_ip:6443 --token [token] --discovery-token-ca-cert-hash sha256:[key]
But I got this error:
[discovery] Trying to connect to API Server "master_ip:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://master_ip:6443"
[discovery] Failed to request cluster info, will try again: [Get https://master_ip:6443/api/v1/namespaces/kube-public/configmaps/cluster-info: dial tcp master_ip:6443: i/o timeout]
I'm using two EC2 instances with CentOS 7 (one for the master and one for the worker). I'm able to telnet to master_ip 6443 from within the master, but it fails from the worker.
What's going wrong here?
I solved this by opening the port in the AWS security group.
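Concretely, that means allowing inbound TCP 6443 on the master's security group from the worker; with the AWS CLI it would look roughly like this (the group ID and source CIDR are placeholders):
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 6443 \
  --cidr 10.0.0.0/16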