AWS EKS Fargate CoreDNS ImagePullBackOff

I'm trying to deploy a simple tutorial app to a new Fargate-based Kubernetes cluster.
Unfortunately I'm stuck on ImagePullBackOff for the CoreDNS pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning LoggingDisabled 5m51s fargate-scheduler Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
Normal Scheduled 4m11s fargate-scheduler Successfully assigned kube-system/coredns-86cb968586-mcdpj to fargate-ip-172-31-55-205.eu-central-1.compute.internal
Warning Failed 100s kubelet Failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1": failed to do request: Head "https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/coredns/manifests/v1.8.0-eksbuild.1": dial tcp 3.122.9.124:443: i/o timeout
Warning Failed 100s kubelet Error: ErrImagePull
Normal BackOff 99s kubelet Back-off pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1"
Warning Failed 99s kubelet Error: ImagePullBackOff
Normal Pulling 87s (x2 over 4m10s) kubelet Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.0-eksbuild.1"
While googling I found https://aws.amazon.com/premiumsupport/knowledge-center/eks-ecr-troubleshooting/, which contains the following list:
To resolve this error, confirm the following:
- The subnet for your worker node has a route to the internet. Check the route table associated with your subnet.
- The security group associated with your worker node allows outbound internet traffic.
- The ingress and egress rule for your network access control lists (ACLs) allows access to the internet.
Since I created both my private subnets and their NAT gateways manually, I tried to locate an issue there but couldn't find anything. The subnets as well as the security groups and network ACLs look fine to me.
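For reference, the route check in question looks roughly like this with the AWS CLI (the subnet ID below is a placeholder):
# Show the routes of the route table associated with one of the private subnets
# (if nothing is returned, the subnet uses the VPC's main route table)
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-0123456789abcdef0 \
  --query 'RouteTables[].Routes[]'
# For a private subnet used by Fargate, expect a 0.0.0.0/0 route targeting a NAT gateway (nat-...).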
I even added the AmazonEC2ContainerRegistryReadOnly policy to my EKS role, but after issuing kubectl rollout restart -n kube-system deployment coredns the result is unfortunately the same: ImagePullBackOff.
Unfortunately I've run out of ideas and I'm stuck. Any help troubleshooting this would be greatly appreciated. Thanks
Edit:
After creating a new cluster via eksctl, as mreferre suggested in his comment, I get an RBAC error pointing to https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting_iam.html#security-iam-troubleshoot-cannot-view-nodes-or-workloads
I'm not sure what is going on since I already have
Edit 2:
The cluster created via the AWS Console (web interface) doesn't have the aws-auth ConfigMap. I've retrieved the ConfigMap below using kubectl edit configmap aws-auth -n kube-system:
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::370179080679:role/eksctl-tutorial-cluster-FargatePodExecutionRole-1J605HWNTGS2Q
      username: system:node:{{SessionName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2021-04-08T18:42:59Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "918"
  selfLink: /api/v1/namespaces/kube-system/configmaps/aws-auth
  uid: d9a21964-a8bf-49e9-800f-650320b7444e
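For the console error linked in the first edit ("cannot view nodes or workloads"), the usual fix described on that page is to map the IAM identity you use in the console into the cluster's aws-auth configuration. A hedged sketch with eksctl, assuming the console identity is an IAM user (the cluster name, account ID and user name are placeholders):
# Map the console IAM user into the cluster so the console can list nodes and workloads
eksctl create iamidentitymapping \
  --cluster tutorial \
  --region eu-central-1 \
  --arn arn:aws:iam::<account-id>:user/<console-user> \
  --group system:masters \
  --username admin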

Creating an answer to sum up the discussion in the comments that was deemed acceptable. The most common (and arguably easiest) way to set up an EKS cluster with Fargate support is to use eksctl and create the cluster with eksctl create cluster --fargate. This will build all the plumbing for you, and you will get a cluster with no EC2 instances or managed node groups, with the two CoreDNS pods deployed on two Fargate instances. Note that when you deploy with eksctl from the command line you may end up using different roles/users between your CLI and the console, which may result in access-denied issues. The best course of action would be to use a non-root user to log into the AWS console and use CloudShell to deploy with eksctl (CloudShell will inherit the same console user identity). {More info in the comments}
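For reference, a minimal sketch of that invocation (the cluster name and region are placeholders):
# Create a Fargate-only EKS cluster; eksctl provisions the VPC, subnets,
# Fargate profiles and the pod execution role for you
eksctl create cluster \
  --name tutorial \
  --region eu-central-1 \
  --fargate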

Related

Cluster auto scaler pod crashing timeout sts.us-west-1.amazonaws.com

I am following this document to deploy the Cluster Autoscaler in EKS: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html
The EKS version is 1.24. Public traffic is allowed to the open internet, and we have whitelisted the .amazonaws.com domain in the Squid proxy.
I feel there might be something wrong with the role or policy configuration.
Error in pod:
F0208 05:39:52.442470 1 aws_cloud_provider.go:386] Failed to generate AWS EC2 Instance Types: WebIdentityErr: failed to retrieve credentials caused by: RequestError: send request failed caused by: Post "https://sts.us-west-1.amazonaws.com/": dial tcp 176.32.112.54:443: i/o timeout
The service account has the annotation in place to make use of the IAM role.
kubectl describe output for the cluster-autoscaler service account:
Name: cluster-autoscaler
Namespace: kube-system
Labels: k8s-addon=cluster-autoscaler.addons.k8s.io
k8s-app=cluster-autoscaler
Annotations: eks.amazonaws.com/role-arn: arn:aws:iam::<ID>:role/irsa-clusterautoscaler
Image pull secrets: <none>
Mountable secrets: <none>
Tokens: <none>
Events: <none>
It was solved by adding the proxy details to the container env of the deployment, which is missing in the actual documentation; they could add it as a hint. The pod was not picking up the proxy settings available on the node; it expected them to be configured explicitly.
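A minimal sketch of that fix, assuming a Squid proxy reachable at http://proxy.internal:3128 (the proxy address and the NO_PROXY list are placeholders for your environment):
# Inject the proxy settings into the cluster-autoscaler container environment
kubectl -n kube-system set env deployment/cluster-autoscaler \
  HTTP_PROXY=http://proxy.internal:3128 \
  HTTPS_PROXY=http://proxy.internal:3128 \
  NO_PROXY=10.0.0.0/8,172.20.0.0/16,169.254.169.254,.svc,.cluster.local
The regional STS endpoint (sts.us-west-1.amazonaws.com) then has to be reachable through the proxy, which matches the .amazonaws.com whitelist mentioned in the question.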

network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container issue in EKS. help anybody, please

When pods are scaled up through HPA, the following error occurs and pod creation is not possible.
If I manually change the replicas of the deployment, the pods run normally.
It seems to be a CNI-related problem, and the same phenomenon occurs even if I install CNI 1.7.10 for the 1.20 cluster as an add-on.
200 IPs per subnet is sufficient, and the outbound security group is also open.
The issue does not occur when the number of pods is scaled manually via kubectl.
7s Warning FailedCreatePodSandBox pod/b4c-ms-test-develop-5f64db58f-bm2vc Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1" network for pod "b4c-ms-test-develop-5f64db58f-bm2vc": networkPlugin cni failed to set up pod "b4c-ms-test-develop-5f64db58f-bm2vc_b4c-test" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1"
Region: eu-west-1
Cluster Name: dev-pangaia-b4c-eks
For AWS VPC CNI issue, have you attached node logs?: No
For DNS issue, have you attached CoreDNS pod log?:

403 Forbidden on ESPv2, GKE AutoPilot, WIF

I'm following the Getting started with Endpoints for GKE with ESPv2. I'm using Workload Identity Federation and Autopilot on the GKE cluster.
I've been running into the error:
F0110 03:46:24.304229 8 server.go:54] fail to initialize config manager: http call to GET https://servicemanagement.googleapis.com/v1/services/name:bookstore.endpoints.<project>.cloud.goog/rollouts?filter=status=SUCCESS returns not 200 OK: 403 Forbidden
Which ultimately leads to a transport failure error and shut down of the Pod.
My first step was to investigate permission issues, but I could really use some outside perspective on this as I've been going around in circles on this.
Here's my config:
>> gcloud container clusters describe $GKE_CLUSTER_NAME \
--zone=$GKE_CLUSTER_ZONE \
--format='value[delimiter="\n"](nodePools[].config.oauthScopes)'
['https://www.googleapis.com/auth/devstorage.read_only',
'https://www.googleapis.com/auth/logging.write',
'https://www.googleapis.com/auth/monitoring',
'https://www.googleapis.com/auth/service.management.readonly',
'https://www.googleapis.com/auth/servicecontrol',
'https://www.googleapis.com/auth/trace.append']
>> gcloud container clusters describe $GKE_CLUSTER_NAME \
--zone=$GKE_CLUSTER_ZONE \
--format='value[delimiter="\n"](nodePools[].config.serviceAccount)'
default
default
Service-Account-Name: test-espv2
Roles
Cloud Trace Agent
Owner
Service Account Token Creator
Service Account User
Service Controller
Workload Identity User
I've associated the WIF service account with the cluster with the following YAML:
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: test-espv2@<project>.iam.gserviceaccount.com
  name: test-espv2
  namespace: eventing
And then I've associated the pod with the test-espv2 service account:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: esp-grpc-bookstore
  namespace: eventing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: esp-grpc-bookstore
  template:
    metadata:
      labels:
        app: esp-grpc-bookstore
    spec:
      serviceAccountName: test-espv2
Since the gcr.io/endpoints-release/endpoints-runtime:2 image is limited, I created a test container and deployed it into the same eventing namespace.
Within the container, I'm able to retrieve the endpoint service config with the following command:
curl --fail -o "service.json" -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://servicemanagement.googleapis.com/v1/services/${SERVICE}/configs/${CONFIG_ID}?view=FULL"
And also within the container, I'm running as the impersonated service account, tested with:
curl -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/service-accounts/
Are there any other tests I can run to help me debug this issue?
Thanks in advance,
Around debugging - I've often found my mistakes by following one of the other methods/programming languages in the Google tutorials.
Have you looked at the OpenAPI notes and tried to follow along?
I've finally figured out the issue. It came down to the following:
- redeploying the app, paying special attention to and verifying the kubectl annotate serviceaccount commands
- add-iam-policy-binding for both serviceController and cloudtrace.agent
- omitting nodeSelector: iam.gke.io/gke-metadata-server-enabled: "true" due to Autopilot
Doing this enabled a successful kube deployment, as displayed by the logs (a sketch of the bindings follows).
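A hedged sketch of those bindings, reusing the project placeholder, namespace (eventing) and service account name (test-espv2) from the question; the exact role set may differ in your setup:
# Let the Kubernetes service account impersonate the Google service account (Workload Identity)
gcloud iam service-accounts add-iam-policy-binding \
  test-espv2@<project>.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<project>.svc.id.goog[eventing/test-espv2]"
# Grant the two roles mentioned above (Service Control reporting and Cloud Trace)
gcloud projects add-iam-policy-binding <project> \
  --member "serviceAccount:test-espv2@<project>.iam.gserviceaccount.com" \
  --role roles/servicemanagement.serviceController
gcloud projects add-iam-policy-binding <project> \
  --member "serviceAccount:test-espv2@<project>.iam.gserviceaccount.com" \
  --role roles/cloudtrace.agent
# Annotate the Kubernetes service account (as in the ServiceAccount YAML above)
kubectl annotate serviceaccount test-espv2 -n eventing \
  iam.gke.io/gcp-service-account=test-espv2@<project>.iam.gserviceaccount.com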
The next error I had was:
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
This was fixed by turning my attention back to my Kubernetes cluster.
Looking through the events on my Ingress: since I was in a shared VPC and my security policies only allowed firewall management from the host project, the deployment was failing to update the firewall rules.
Manually provisioning them, as shown here:
https://cloud.google.com/kubernetes-engine/docs/concepts/ingress#manually_provision_firewall_rules_from_the_host_project
solved my issues.
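That page walks you through creating the missing rule in the Shared VPC host project; a hedged sketch of what such a rule typically looks like (the host project, network, target tags and ports are placeholders; 130.211.0.0/22 and 35.191.0.0/16 are Google's documented load balancer health-check ranges):
# Allow Google load balancer / health-check traffic to reach the GKE nodes in the shared VPC
gcloud compute firewall-rules create allow-glbc-health-checks \
  --project <host-project> \
  --network <shared-vpc-network> \
  --direction INGRESS \
  --source-ranges 130.211.0.0/22,35.191.0.0/16 \
  --target-tags <node-tag> \
  --allow tcp:80,tcp:443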

Fail to init aws cluster (kubeadm init) with the message "could not init cloud provider "aws": error finding instance ... timeout

The issue I have is that kubeadm will never fully initialize. The output:
...
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
and journalctl -xeu kubelet shows the following interesting info:
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: W1203 17:54:08.017925 14709 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated. The AWS provider is deprecated and will be removed in a future release
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018044 14709 aws.go:1235] Building AWS cloudprovider
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018112 14709 aws.go:1195] Zone not specified in configuration file; querying AWS metadata service
Dec 03 17:56:08 ip-10-83-62-10.ec2.internal kubelet[14709]: F1203 17:56:08.332951 14709 server.go:265] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post \\\"https://ec2.us-east-1.amazonaws.com/\\\": dial tcp 10.83.60.11:443: i/o timeout
The context is: it's a fully private AWS VPC. There is a proxy that is propagated to k8s manifests.
The kubeadm.yaml config is pretty innocent and looks like this:
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    cloud-provider: aws
clusterName: cdspidr
controlPlaneEndpoint: ip-10-83-62-10.ec2.internal
controllerManager:
  extraArgs:
    cloud-provider: aws
    configure-cloud-routes: "false"
kubernetesVersion: stable
networking:
  dnsDomain: cluster.local
  podSubnet: 10.83.62.0/24
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  name: ip-10-83-62-10.ec2.internal
  kubeletExtraArgs:
    cloud-provider: aws
I'm looking for help to figure out a couple of things here:
Why does kubeadm use this address (https://ec2.us-east-1.amazonaws.com) to retrieve availability zones? It does not look correct; IMO it should be something like http://169.254.169.254/latest/dynamic/instance-identity/document.
Why does it fail? With the same proxy settings, a curl request from the terminal returns the web page.
To work around it, how can I specify availability zones on my own in kubeadm.yaml or via a command-line option for kubeadm?
I would appreciate any help or thoughts.
You can create a VPC endpoint for accessing EC2 (service name com.amazonaws.us-east-1.ec2); this will allow the kubelet to talk to EC2 without internet access and fetch the required info.
While creating the VPC endpoint, please make sure to enable the private DNS resolution option.
Also, from the error it looks like the kubelet is trying to fetch the instance, not just the availability zone ("aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances").
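A minimal sketch of creating that interface endpoint with the AWS CLI (the VPC, subnet and security group IDs are placeholders):
# Interface endpoint so instances in the fully private VPC can reach the EC2 API without internet access
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ec2 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled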

How does kubernetes select nodes to add to the load balancers on AWS?

Some info:
Kubernetes (1.5.1)
AWS
1 master and 1 node (both ubuntu 16.04)
k8s installed via kubeadm
Terraform made by me
Please don't reply "use kube-up, kops or similar". This is about understanding how k8s works under the hood. There is far too much unexplained magic in the system and I want to understand it.
== Question:
When creating a Service of type LoadBalancer on k8s on AWS (for example):
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-dashboard
  namespace: kube-system
  labels:
    k8s-addon: kubernetes-dashboard.addons.k8s.io
    k8s-app: kubernetes-dashboard
    kubernetes.io/cluster-service: "true"
    facing: external
spec:
  type: LoadBalancer
  selector:
    k8s-app: kubernetes-dashboard
  ports:
  - port: 80
I successfully create an internal or external facing ELB but none of the machines are added to the ELB (I can taint the master too but nothing changes). My problem is basically this:
https://github.com/kubernetes/kubernetes/issues/29298#issuecomment-260659722
The subnets and nodes (but not the VPC) are all tagged with "KubernetesCluster" (again, the ELBs are created in the right place). However, no nodes are added.
In the logs
kubectl logs kube-controller-manager-ip-x-x-x-x -n kube-system
after:
aws_loadbalancer.go:63] Creating load balancer for kube-system/kubernetes-dashboard with name: acd8acca0c7a111e69ca306f22de69ae
There is no other output (it should print the nodes added or removed). I tried to understand the code at:
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws_loadbalancer.go
But whatever the reason is, this function does not add the nodes.
The documentation doesn't go to any great length to explain the "process" behind k8s decisions. To try to understand k8s I tried/used kops, kube-up, kubeadm, the Kubernetes The Hard Way repo, and reading the damn code, but I am still unable to understand how k8s on AWS SELECTS the nodes to add to the ELB.
As a consequence, also no security group is changed anywhere.
Is it a tag on the EC2 instances?
A kubelet setting?
Anything else?
Any help is greatly appreciated.
Thanks,
F.
I think Steve is on the right track. Make sure your kubelets, apiserver, and controller-manager components all include --cloud-provider=aws in their argument lists.
You mention your subnets and instances all have matching KubernetesCluster tags. Do your controller & worker security groups? K8s will modify the worker SG in particular to allow traffic to/from the service ELBs it creates. I tag my VPC as well, though I guess it's not required and may prohibit another cluster from living in the same VPC.
I also tag my private subnets with kubernetes.io/role/internal-elb=true and public ones with kubernetes.io/role/elb=true to identify where internal and public ELBs can be created.
The full list (AFAIK) of tags and annotations lives in https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go
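A hedged sketch of that tagging with the AWS CLI (the subnet IDs and cluster name are placeholders; the same tags can just as well be set in your Terraform config):
# Tag the subnets so the AWS cloud provider can discover where to place internal and public ELBs
aws ec2 create-tags \
  --resources subnet-0aaaaaaaaaaaaaaaa \
  --tags Key=KubernetesCluster,Value=mycluster Key=kubernetes.io/role/internal-elb,Value=true
aws ec2 create-tags \
  --resources subnet-0bbbbbbbbbbbbbbbb \
  --tags Key=KubernetesCluster,Value=mycluster Key=kubernetes.io/role/elb,Value=true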
I think the node registration is being managed outside of Kubernetes. I'm using kops and if I edit the size of my ASG in AWS the new nodes are not registered with my service ELBs. But if I edit the number of nodes using kops the new nodes are there.
In the docs, a kops instance group maps to an ASG when running on AWS. In the code it looks like it's calling AWS rather than a k8s API.
I know you're not using kops but I think in Terraform you need to replicate the AWS API calls that kops is making.
Make sure you are setting the correct cloud provider settings with kubeadm (http://kubernetes.io/docs/admin/kubeadm/).
The AWS cloud provider automatically syncs the available nodes with the ELB. I created a Service of type LoadBalancer, then scaled my cluster, and the new node was eventually added to the ELB: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws_loadbalancer.go#L376