AWS EKS - CSRs Pending

I recently deployed an AWS EKS cluster with a managed node group. I noticed that when I run
kubectl get csr
all the CSRs listed are in the Pending state - there are about 200 of them. When I try to SSH into a pod I get "Error from server: error dialing backend: remote error: tls: internal error".
Has anyone else faced this issue?
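If those are kubelet serving-certificate CSRs awaiting approval (a common cause of the "dialing backend" TLS error), a minimal first check might look like this - approve only after verifying the requests are legitimate:
kubectl get csr -o wide   # inspect which signer and node each CSR targets
kubectl get csr -o name | xargs kubectl certificate approve   # bulk-approve; assumes you have verified the pending CSRs are valid node certs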

Related

CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container (EKS)

When pods are scaled up through the HPA, the following error occurs and pod creation is not possible.
If I manually change the replicas of the deployments, the pods run normally.
It seems to be a CNI-related problem; the same thing happens even after installing CNI 1.7.10 on the 1.20 cluster via the add-on.
There are 200 IPs per subnet, which should be sufficient, and the outbound security group is also open.
The issue does not occur when the number of pods is scaled directly via kubectl.
7s Warning FailedCreatePodSandBox pod/b4c-ms-test-develop-5f64db58f-bm2vc Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1" network for pod "b4c-ms-test-develop-5f64db58f-bm2vc": networkPlugin cni failed to set up pod "b4c-ms-test-develop-5f64db58f-bm2vc_b4c-test" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1"
Region: eu-west-1
Cluster Name: dev-pangaia-b4c-eks
For AWS VPC CNI issue, have you attached node logs?: No
For DNS issue, have you attached CoreDNS pod log?:
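No node logs were attached, but a first diagnostic pass on the VPC CNI side could look like this (assumes the default aws-node DaemonSet that ships with EKS):
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide   # one aws-node pod per node; all should be Running
kubectl -n kube-system logs -l k8s-app=aws-node --tail=50     # ipamd/CNI errors usually show up here
kubectl -n kube-system describe daemonset aws-node | grep Image   # confirm the CNI version actually deployed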

AWS EKS Fargate CoreDNS ImagePullBackOff

I'm trying to deploy a simple tutorial app to a new Fargate-based Kubernetes cluster.
Unfortunately I'm stuck on ImagePullBackOff for the coredns pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning LoggingDisabled 5m51s fargate-scheduler Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
Normal Scheduled 4m11s fargate-scheduler Successfully assigned kube-system/coredns-86cb968586-mcdpj to fargate-ip-172-31-55-205.eu-central-1.compute.internal
Warning Failed 100s kubelet Failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": failed to do request: Head "https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/coredns/manifests/v1.8.0-eksbuild.1": dial tcp 3.122.9.124:443: i/o timeout
Warning Failed 100s kubelet Error: ErrImagePull
Normal BackOff 99s kubelet Back-off pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1"
Warning Failed 99s kubelet Error: ImagePullBackOff
Normal Pulling 87s (x2 over 4m10s) kubelet Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1"
While googling I found https://aws.amazon.com/premiumsupport/knowledge-center/eks-ecr-troubleshooting/
It contains the following list (a couple of commands for verifying these items appear just after the list):
To resolve this error, confirm the following:
- The subnet for your worker node has a route to the internet. Check the route table associated with your subnet.
- The security group associated with your worker node allows outbound internet traffic.
- The ingress and egress rule for your network access control lists (ACLs) allows access to the internet.
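One way to verify the first item from the CLI; the subnet ID is a placeholder for your worker-node subnet:
aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=subnet-0abc123 --query 'RouteTables[].Routes'   # for a private subnet, expect a 0.0.0.0/0 route pointing at a NAT gateway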
Since I created both my private subnets and their NAT gateways manually, I tried to locate an issue there but couldn't find anything; the subnets, security groups, and ACLs all look fine to me.
I even added the AmazonEC2ContainerRegistryReadOnly policy to my EKS role, but after issuing kubectl rollout restart -n kube-system deployment coredns the result is unfortunately the same: ImagePullBackOff.
Unfortunately I've run out of ideas and I'm stuck. Any help troubleshooting this would be greatly appreciated. Thanks
Edit:
After creating a new cluster via eksctl, as mreferre suggested in his comment, I get an RBAC error with a link: https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting_iam.html#security-iam-troubleshoot-cannot-view-nodes-or-workloads
I'm not sure what is going on since I already have
Edit 2:
The cluster created via the AWS Console (web interface) doesn't have the aws-auth ConfigMap. I retrieved the ConfigMap below using kubectl edit configmap aws-auth -n kube-system:
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::370179080679:role/eksctl-tutorial-cluster-FargatePodExecutionRole-1J605HWNTGS2Q
      username: system:node:{{SessionName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2021-04-08T18:42:59Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "918"
  selfLink: /api/v1/namespaces/kube-system/configmaps/aws-auth
  uid: d9a21964-a8bf-49e9-800f-650320b7444e
Creating an answer to sum up the discussion in the comments, which was deemed acceptable. The most common (and arguably easiest) way to set up an EKS cluster with Fargate support is to use eksctl and create the cluster with eksctl create cluster --fargate. This builds all the plumbing for you, and you get a cluster with no EC2 instances nor managed node groups, with the two CoreDNS pods deployed on two Fargate instances. Note that when you deploy with eksctl from the command line, you may end up using different roles/users between your CLI and the console, which can result in access-denied issues. The best course of action is to log into the AWS console as a non-root user and use CloudShell to deploy with eksctl (CloudShell inherits the same console user identity). {More info in the comments}
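A minimal end-to-end sketch of that flow; the cluster name and region are placeholders:
eksctl create cluster --name tutorial-fargate --region eu-central-1 --fargate   # creates the VPC, Fargate profiles, and the cluster
kubectl get nodes                                     # should list fargate-ip-... nodes once CoreDNS is scheduled
kubectl -n kube-system get pods -l k8s-app=kube-dns   # the two CoreDNS pods should reach Running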

Failed to init AWS cluster (kubeadm init) with the message "could not init cloud provider "aws": error finding instance ... timeout"

The issue I have is that kubeadm will never fully initialize. The output:
...
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
and journalctl -xeu kubelet shows the following interesting info:
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: W1203 17:54:08.017925 14709 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018044 14709 aws.go:1235] Building AWS cloudprovider
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018112 14709 aws.go:1195] Zone not specified in configuration file; querying AWS metadata service
Dec 03 17:56:08 ip-10-83-62-10.ec2.internal kubelet[14709]: F1203 17:56:08.332951 14709 server.go:265] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post \\\"https://ec2.us-east-1.amazonaws.com/\\\": dial tcp 10.83.60.11:443: i/o timeout
The context: it's a fully private AWS VPC, and there is a proxy that is propagated to the k8s manifests.
The kubeadm.yaml config is pretty innocuous and looks like this:
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    cloud-provider: aws
clusterName: cdspidr
controlPlaneEndpoint: ip-10-83-62-10.ec2.internal
controllerManager:
  extraArgs:
    cloud-provider: aws
    configure-cloud-routes: "false"
kubernetesVersion: stable
networking:
  dnsDomain: cluster.local
  podSubnet: 10.83.62.0/24
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  name: ip-10-83-62-10.ec2.internal
  kubeletExtraArgs:
    cloud-provider: aws
I'm looking for help to figure out a couple of things here:
Why does kubeadm use this address (https://ec2.us-east-1.amazonaws.com) to retrieve availability zones? It does not look correct; IMO it should be something like http://169.254.169.254/latest/dynamic/instance-identity/document
Why does it fail? With the same proxy settings, a curl request from the terminal returns the web page.
To work around it, how can I specify availability zones on my own in kubeadm.yaml or via a command-line flag for kubeadm?
I would appreciate any help or thoughts.
You can create a VPC endpoint for accessing EC2 (service name com.amazonaws.us-east-1.ec2); this will allow the kubelet to talk to EC2 without internet access and fetch the required info.
While creating the VPC endpoint, please make sure to enable the private DNS resolution option.
Also, from the error it looks like the kubelet is trying to fetch the instance itself, not just the availability zone ("aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances").
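A sketch of creating that endpoint from the CLI; the VPC, subnet, and security-group IDs are placeholders. (This also speaks to the first question above: the metadata service at 169.254.169.254 only serves instance-local data, while the error shows the cloud provider additionally listing instances through the regional EC2 API, which is why a private VPC needs this endpoint.)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ec2 \
  --subnet-ids subnet-0abc123 \
  --security-group-ids sg-0abc123 \
  --private-dns-enabled   # makes ec2.us-east-1.amazonaws.com resolve to the endpoint's private IPs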

Kubernetes on AWS using Kops - kube-apiserver authentication for kubectl

I have set up a basic 2-node k8s cluster on AWS using kops. I had issues connecting to and interacting with the cluster using kubectl, and I keep getting the error
The connection to the server api.euwest2.dev.avi.k8s.com was refused - did you specify the right host or port? when trying to run any kubectl command.
I have done a basic kops export kubecfg --name xyz.hhh.kjh.k8s.com --config=~$KUBECONFIG to export the kubeconfig for the cluster I have created. Not sure what else I'm missing to make a successful connection to the kube-apiserver and get kubectl working.
Sounds like one of the following (a few check commands follow the list):
- Your kube-apiserver is not running. Check with docker ps -a | grep apiserver on your Kubernetes master.
- api.euwest2.dev.avi.k8s.com is resolving to an IP address where nothing is listening (208.73.210.217?).
- You have the wrong port configured for your kube-apiserver in your ~/.kube/config (server: https://api.euwest2.dev.avi.k8s.com:6443?).
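A few commands to check each possibility in turn; the hostname comes from the question above:
dig +short api.euwest2.dev.avi.k8s.com                                     # what does the name actually resolve to?
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'   # which server/port is kubectl using?
curl -k https://api.euwest2.dev.avi.k8s.com/healthz                        # is anything answering on that endpoint?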

Kubernetes Scheduler, API Server, and Controller Manager Containers not running in cluster

I started a Kubernetes cluster in AWS using the AWS Heptio-Kubernetes Quickstart about a month ago. I had been merrily installing applications onto it until recently, when I noticed that some of my pods weren't behaving correctly and some were stuck in "Terminating" status or wouldn't initialize.
After reading through some of the troubleshooting guides I realized that some of the core system pods in the "kube-system" namespace were not running: kube-apiserver, kube-controller-manager, and kube-scheduler. This would explain why my deployments were no longer scaling as expected and why terminating pods would not delete. I can, however, still run commands and view cluster status with kubectl (screenshot omitted).
Not sure where to start to mitigate this. I've tried rebooting the server, I've stopped and restarted kubeadm with systemctl, and I've tried manually deleting the pods in /var/lib/kubelet/pods. Any help is greatly appreciated.
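A hedged checklist of things to inspect on the control-plane node, assuming a kubeadm-style static-pod layout (which the kubeadm/systemctl details above suggest):
ls /etc/kubernetes/manifests/                      # static pod manifests for apiserver, controller-manager, and scheduler should live here
sudo systemctl status kubelet                      # the kubelet (not kubeadm) is the service that runs the static pods
sudo journalctl -u kubelet --since "1 hour ago" | tail -50   # look for manifest or certificate errors
docker ps -a | grep -E 'apiserver|controller-manager|scheduler'   # are the containers crash-looping or absent?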
EDIT: I just realized some of my traffic might be blocked by Twistlock, the container security tool we installed on our worker nodes. I will consult with them, as it may be blocking connectivity on the nodes.
I realized it might be a connectivity issue when gathering logs for each of the Kubernetes pods; see below for log excerpts (I have redacted the IPs):
kubectl logs kube-controller-manager-ip-*************.us-east-2.compute.internal -n kube-system
E0723 18:33:37.056730 1 route_controller.go:117] Couldn't reconcile node routes: error listing routes: unable to find route table for AWS cluster: kubernetes
kubectl -n kube-system logs kube-apiserver-ip-***************.us-east-2.compute.internal
I0723 18:38:23.380163 1 logs.go:49] http: TLS handshake error from ********: EOF
I0723 18:38:27.511654 1 logs.go:49] http: TLS handshake error from ********: EOF
kubectl -n kube-system logs kube-scheduler-ip-*******.us-east-2.compute.internal
E0723 15:31:54.397921 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1beta1.ReplicaSet: Get https://**********:6443/apis/extensions/v1beta1/replicasets?limit=500&resourceVersion=0: dial tcp ************: getsockopt: connection refused
E0723 15:31:54.398008 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Node: Get https://*********/api/v1/nodes?limit=500&resourceVersion=0: dial tcp ********:6443: getsockopt: connection refused
E0723 15:31:54.398075 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.ReplicationController: Get https://************8:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp ***********:6443: getsockopt: connection refused
E0723 15:31:54.398207 1 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Service: Get https://************:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp ***********:6443: getsockopt: connection refused
Edit: After contacting our Twistlock vendors I have verified that the connectivity issues are not due to Twistlock, as there are no policies in place yet that would actually block or isolate the containers. My issue with the cluster still stands.