Pods are stuck in CrashLoopBackOff after updating my EKS to 1.16

I just updated my EKS cluster from 1.15 to 1.16 and I can't get the deployments in my namespaces up and running. When I run kubectl get po to list my pods, they're all stuck in the CrashLoopBackOff state. I tried describing one pod and this is what I get in the events section:
Events:
Type     Reason   Age                  From     Message
----     ------   ----                 ----     -------
Normal   Pulling  56m (x8 over 72m)    kubelet  Pulling image "xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/xxx-xxxx-xxxx:master.697.7af45fff8e0"
Warning  BackOff  75s (x299 over 66m)  kubelet  Back-off restarting failed container
Kubernetes version -
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.15-eks-e1a842", GitCommit:"e1a8424098604fa0ad8dd7b314b18d979c5c54dc", GitTreeState:"clean", BuildDate:"2021-07-31T01:19:13Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

It seems like your container is stuck in an image pull state. Here are some things you can check:
Ensure the image is present in ECR.
Ensure the EKS cluster is able to connect to ECR - if it's a private repository, it requires credentials.
Run a docker pull and see if it's able to pull the image directly (most likely it will fail or ask for credentials if they aren't already configured) - see the sketch below.
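For the last check, something along these lines should work (a sketch that reuses the redacted registry from the events above and assumes AWS CLI v2; older CLI versions use aws ecr get-login instead of get-login-password):
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin xxxxxxx.dkr.ecr.us-west-2.amazonaws.com
docker pull xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/xxx-xxxx-xxxx:master.697.7af45fff8e0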

So the problem was that I was trying to deploy x86 containers on an ARM node instance. Everything worked once I changed the launch template image for my node group.
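For anyone hitting the same mismatch: a quick way to spot it is to compare the node architecture with the image architecture, for example (the image name is the redacted one from the question; docker manifest inspect may need experimental CLI features enabled on older Docker releases):
kubectl get nodes -L kubernetes.io/arch
docker manifest inspect --verbose xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/xxx-xxxx-xxxx:master.697.7af45fff8e0 | grep -i architecture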


kubectl wait - error: no matching resources found

I am installing metallb, but need to wait for resources to be created.
kubectl wait --for=condition=ready --timeout=60s -n metallb-system --all pods
But I get:
error: no matching resources found
If I don't wait I get:
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": dial tcp 10.106.91.126:443: connect: connection refused
Do you know how to wait for the resources to be created before actually being able to wait for the condition?
Info:
kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:29:58Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/arm64"}
For the error “no matching resources found”:
Wait a minute and try again; the error just means the pods haven't been created yet, so kubectl wait has nothing to match. You can also poll until the pods exist, as in the sketch below.
You can find an explanation of that error at the following link: Setting up Config Connector
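A minimal retry loop, assuming the pods land in the metallb-system namespace as in the question:
# Wait until at least one pod exists in the namespace, then wait for readiness.
until kubectl get pods -n metallb-system --no-headers 2>/dev/null | grep -q .; do
  echo "waiting for metallb pods to be created..."
  sleep 2
done
kubectl wait --for=condition=ready --timeout=120s -n metallb-system --all pods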
For the "STDIN" error:
Follow the steps below.
You are getting this error because the API server is not able to connect to the webhook.
1) Check whether your firewall rules allow TCP port 443.
2) Temporarily disable the operator:
kubectl -n config-management-system scale deployment config-management-operator --replicas=0
deployment.apps/config-management-operator scaled
Delete the deployment:
kubectl delete deployments.apps -n <namespace>-system <namespace>-controller-manager
deployment.apps "namespace-controller-manager" deleted
3) Create a ConfigMap in the default namespace:
kubectl create configmap foo
configmap/foo created
4) Check that creating the ConfigMap fails when the debug-force-validation-webhook label is set on the object:
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    configmanagement.gke.io/debug-force-validation-webhook: "true"
  name: foo
EOF
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "debug-validation.namespace.sh": failed to call webhook: Post "https://namespace-webhook-service.namespace-system.svc:443/v1/admit?timeout=3s": no endpoints available for service "namespace-webhook-service"
5) Finally, clean up using the commands below:
kubectl delete configmap foo
configmap "foo" deleted
kubectl -n config-management-system scale deployment config-management-operator --replicas=1
deployment.apps/config-management-operator scaled
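Since the "STDIN" error in the question actually comes from MetalLB's own validation webhook not being reachable yet, an equivalent check for MetalLB itself is to wait for the webhook-backing deployment to become Available before applying IPAddressPool objects. A sketch (the deployment name controller matches the upstream MetalLB manifests, but verify it in your install; the manifest file name is a placeholder):
kubectl wait --for=condition=Available --timeout=120s -n metallb-system deployment/controller
kubectl apply -f ipaddresspool.yaml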

How to check if the latest Cloud Run revision is ready to serve

I've been using Cloud Run for a while and the entire user experience is simply amazing!
Currently I'm using Cloud Build to build the container image, push it to GCR, and then create a new Cloud Run revision.
Now I want to call a script to purge caches from the CDN after the latest revision is successfully deployed to Cloud Run; however, the $ gcloud run deploy command can't tell you whether traffic has started pointing to the latest revision.
Is there any command, or an event I can subscribe to, that lets me make sure no traffic is pointing to the old revision, so that I can safely purge all caches?
@Dustin's answer is correct; however, "status" messages are an indirect result of the Route configuration, as those things are updated separately (and you might see a few seconds of delay between them). The status message will still be able to tell you the Revision has been taken out of rotation, if you don't mind this.
To answer this specific question (emphasis mine) using API objects directly:
Is there any command or the event that I can subscribe to to make sure no traffic is pointing to the old revision?
You need to look at Route objects on the API. This is a Knative API (it's available on Cloud Run) but it doesn't have a gcloud command: https://cloud.google.com/run/docs/reference/rest/v1/namespaces.routes
For example, assume you did a 50%-50% traffic split on your Cloud Run service. When you do this, you'll find your Service object (which you can see on Cloud Console → Cloud Run → YAML tab) has the following spec.traffic field:
spec:
  traffic:
  - revisionName: hello-00002-mob
    percent: 50
  - revisionName: hello-00001-vat
    percent: 50
This is "desired configuration" but it actually might not reflect the status definitively. Changing this field will go and update Route object –which decides how the traffic is splitted.
To see the Route object under the covers (sadly I'll have to use curl here because no gcloud command for this:)
TOKEN="$(gcloud auth print-access-token)"
curl -vH "Authorization: Bearer $TOKEN" \
https://us-central1-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/GCP_PROJECT/routes/SERVICE_NAME
This command will show you the output:
"spec": {
"traffic": [
{
"revisionName": "hello-00002-mob",
"percent": 50
},
{
"revisionName": "hello-00001-vat",
"percent": 50
}
]
},
(which you might notice is identical to the Service's spec.traffic, because it's copied from there), and this tells you definitively which revisions are currently serving traffic for that particular Service.
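If you want to automate this (for example in the Cloud Build step that purges the CDN), one approach is to poll the Service until its status reports the latest ready revision at 100% of traffic. A rough sketch, assuming jq is available and using placeholder service/region names (add --platform=managed if your gcloud version requires it):
SERVICE=my-service     # placeholder
REGION=us-central1     # placeholder
NEW_REV="$(gcloud run services describe "$SERVICE" --region "$REGION" \
  --format='value(status.latestReadyRevisionName)')"
# Keep polling until status.traffic shows only the new revision at 100%.
until gcloud run services describe "$SERVICE" --region "$REGION" --format=json \
  | jq -e --arg rev "$NEW_REV" \
      '(.status.traffic // []) | (length > 0) and all(.revisionName == $rev and .percent == 100)' >/dev/null; do
  echo "waiting for $NEW_REV to take all traffic..."
  sleep 5
done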
You can use gcloud run revisions list to get a list of all revisions:
$ gcloud run revisions list --service helloworld
   REVISION          ACTIVE  SERVICE     DEPLOYED                 DEPLOYED BY
✔  helloworld-00009  yes     helloworld  2019-08-17 02:09:01 UTC  email@email.com
✔  helloworld-00008          helloworld  2019-08-17 01:59:38 UTC  email@email.com
✔  helloworld-00007          helloworld  2019-08-13 22:58:18 UTC  email@email.com
✔  helloworld-00006          helloworld  2019-08-13 22:51:18 UTC  email@email.com
✔  helloworld-00005          helloworld  2019-08-13 22:46:14 UTC  email@email.com
✔  helloworld-00004          helloworld  2019-08-13 22:41:44 UTC  email@email.com
✔  helloworld-00003          helloworld  2019-08-13 22:39:16 UTC  email@email.com
✔  helloworld-00002          helloworld  2019-08-13 22:36:06 UTC  email@email.com
✔  helloworld-00001          helloworld  2019-08-13 22:30:03 UTC  email@email.com
You can also use gcloud run revisions describe to get details about a specific revision, which will contain a status field. For example, an active revision:
$ gcloud run revisions describe helloworld-00009
...
status:
  conditions:
  - lastTransitionTime: '2019-08-17T02:09:07.871Z'
    status: 'True'
    type: Ready
  - lastTransitionTime: '2019-08-17T02:09:14.027Z'
    status: 'True'
    type: Active
  - lastTransitionTime: '2019-08-17T02:09:07.871Z'
    status: 'True'
    type: ContainerHealthy
  - lastTransitionTime: '2019-08-17T02:09:05.483Z'
    status: 'True'
    type: ResourcesAvailable
And an inactive revision:
$ gcloud run revisions describe helloworld-00008
...
status:
  conditions:
  - lastTransitionTime: '2019-08-17T01:59:45.713Z'
    status: 'True'
    type: Ready
  - lastTransitionTime: '2019-08-17T02:39:46.975Z'
    message: Revision retired.
    reason: Retired
    status: 'False'
    type: Active
  - lastTransitionTime: '2019-08-17T01:59:45.713Z'
    status: 'True'
    type: ContainerHealthy
  - lastTransitionTime: '2019-08-17T01:59:43.142Z'
    status: 'True'
    type: ResourcesAvailable
You'll specifically want to check the type: Active condition.
This is all available via the Cloud Run REST API as well: https://cloud.google.com/run/docs/reference/rest/v1/namespaces.revisions
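If you want to script that check, one option is to read the conditions as JSON and test the Active condition with jq. A sketch (the revision name is a placeholder; add --region/--platform flags as your setup requires, and install jq if it isn't present):
REVISION=helloworld-00008   # placeholder
if gcloud run revisions describe "$REVISION" --format=json \
  | jq -e '.status.conditions[] | select(.type == "Active") | .status == "True"' >/dev/null; then
  echo "$REVISION is still active (may be receiving traffic)"
else
  echo "$REVISION has been retired"
fi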
By default, the traffic is routed to the latest revision. You can see this in the deployment logs:
Deploying container to Cloud Run service [SERVICE_NAME] in project [YOUR_PROJECT] region [YOUR_REGION]
✓ Deploying... Done.
✓ Creating Revision...
✓ Routing traffic...
Done.
Service [SERVICE_NAME] revision [SERVICE_NAME-00012-yic] has been deployed and is serving 100 percent of traffic at https://SERVICE_NAME-vqg64v3fcq-uc.a.run.app
If you want to be sure, you can explicitly call the update-traffic command:
gcloud run services update-traffic --platform=managed --region=YOUR_REGION --to-latest YOUR_SERVICE

Istio 1.4.7 - Kiali pod fails to start

After installing Istio 1.4.7, the Kiali pod is not coming up cleanly. It's failing with the error - signing key for login tokens is invalid
kubectl get po -n istio-system | grep kiali
NAME                     READY   STATUS             RESTARTS   AGE
kiali-7ff568c949-v2qmq   0/1     CrashLoopBackOff   56         4h22m
kubectl describe po kiali-7ff568c949-v2qmq -n istio-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 29s default-scheduler Successfully assigned istio-system/kiali-774d68d9c7-4trpd to ip-10-75-64-5.eu-west-2.compute.internal
Normal Pulling 28s kubelet, ip-10-75-64-5.eu-west-2.compute.internal Pulling image "quay.io/kiali/kiali:v1.15.2"
Normal Pulled 27s kubelet, ip-10-75-64-5.eu-west-2.compute.internal Successfully pulled image "quay.io/kiali/kiali:v1.15.2"
Normal Created 12s (x3 over 27s) kubelet, ip-10-75-64-5.eu-west-2.compute.internal Created container kiali
Normal Pulled 12s (x2 over 26s) kubelet, ip-10-75-64-5.eu-west-2.compute.internal Container image "quay.io/kiali/kiali:v1.15.2" already present on machine
Normal Started 11s (x3 over 26s) kubelet, ip-10-75-64-5.eu-west-2.compute.internal Started container kiali
Warning BackOff 5s (x5 over 25s) kubelet, ip-10-75-64-5.eu-west-2.compute.internal Back-off restarting failed container
kubectl logs -n istio-system kiali-7ff568c949-v2qmq
I0429 21:23:11.024691 1 kiali.go:66] Kiali: Version: v1.15.2, Commit: 718aedca76e612e2f95498d022fab1e116613792
I0429 21:23:11.025039 1 kiali.go:205] Using authentication strategy [login]
F0429 21:23:11.025057 1 kiali.go:83] signing key for login tokens is invalid
As @Joel mentioned in the comments,
see this issue and in particular this comment,
and as mentioned here:
Istio 1.4.7 release does not contain ISTIO-SECURITY-2020-004 fix
The release notes for Istio 1.4.7 state that the security vulnerability relating to Kiali has been fixed; however, the commit to fix this is not present in the release.
As far as I understand from this comment, if you use istioctl it should work:
The istioctl installer was fixed.
but
If you installed with the old helm charts, then it wasn't fixed there. I thought the helm charts were deprecated. In any event, add these two lines to the kiali configmap template in the helm chart:
login_token:
  signing_key: {{ randAlphaNum 10 | quote }}
If that doesn't work, I would recommend upgrading to Istio 1.5.1; it should fix the issue.
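If re-rendering the Helm chart isn't an option, a workaround is to add the signing key to the existing Kiali ConfigMap directly. A sketch, assuming the ConfigMap is named kiali and keeps its settings under the config.yaml key in istio-system (check with kubectl -n istio-system get cm kiali -o yaml before running anything):
# Generate a short random alphanumeric key and append it to Kiali's config.
SIGNING_KEY="$(head -c 64 /dev/urandom | LC_ALL=C tr -dc 'a-zA-Z0-9' | head -c 10)"
kubectl -n istio-system get configmap kiali -o jsonpath='{.data.config\.yaml}' > kiali-config.yaml
printf '\nlogin_token:\n  signing_key: "%s"\n' "$SIGNING_KEY" >> kiali-config.yaml
# Re-create the ConfigMap in place (--dry-run=client needs kubectl 1.18+; older clients use --dry-run).
kubectl -n istio-system create configmap kiali --from-file=config.yaml=kiali-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl -n istio-system delete pod -l app=kiali   # restart Kiali so it picks up the new key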

CockroachDB on AWS EKS cluster - [n?] no stores bootstrapped

I am attempting to deploy CockroachDB:v2.1.6 to a new AWS EKS cluster. Everything is deployed successfully; statefulset, services, pv's & pvc's are created. The AWS EBS volumes are created successfully too.
The issue is the pods never get to a READY state.
pod/cockroachdb-0 0/1 Running 0 14m
pod/cockroachdb-1 0/1 Running 0 14m
pod/cockroachdb-2 0/1 Running 0 14m
If I 'describe' the pods I get the following:
Normal Pulled 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Container image "cockroachdb/cockroach:v2.1.6" already present on machine
Normal Created 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Created container cockroachdb
Normal Started 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Started container cockroachdb
Warning Unhealthy 1s (x8 over 36s) kubelet, ip-10-5-109-70.eu-central-1.compute.internal Readiness probe failed: HTTP probe failed with statuscode: 503
If I examine the logs of a pod I see this:
I200409 11:45:18.073666 14 server/server.go:1403 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W200409 11:45:18.076826 87 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {cockroachdb-0.cockroachdb:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host". Reconnecting...
W200409 11:45:18.076942 21 gossip/client.go:123 [n?] failed to start gossip client to cockroachdb-0.cockroachdb:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host"
I came across this comment from the CockroachDB forum (https://forum.cockroachlabs.com/t/http-probe-failed-with-statuscode-503/2043/6)
Both the cockroach_out.log and cockroach_output1.log files you sent me (corresponding to mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2) print out no stores bootstrapped during startup and prefix all their log lines with n?, indicating that they haven’t been allocated a node ID. I’d say that they may have never been properly initialized as part of the cluster.
I have deleted everything, including the PVs, PVCs and AWS EBS volumes, with kubectl delete and reapplied, but hit the same issue.
Any thoughts would be very much appreciated. Thank you
I was not aware that you had to initialize the CockroachDB cluster after creating it. I did the following to resolve my issue:
kubectl exec -it cockroachdb-0 -n <namespace> -- /bin/sh
/cockroach/cockroach init
See here for more details - https://www.cockroachlabs.com/docs/v19.2/cockroach-init.html
After this the pods started running correctly.
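For reference, the same initialization can be done in one shot from outside the pod. A sketch assuming an insecure test cluster and the service names from the standard CockroachDB StatefulSet manifests (secure clusters need --certs-dir instead of --insecure):
kubectl exec -it cockroachdb-0 -n <namespace> -- \
  /cockroach/cockroach init --insecure --host=cockroachdb-0.cockroachdb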

Intermittent DNS issues while pulling docker image from ECR repository

Has anyone faced this issue with docker pull? We recently upgraded Docker to 18.03.1-ce and have been seeing the issue since then. We are not exactly sure whether this is related to Docker, but we'd like to know if anyone has faced this problem.
We have done some troubleshooting using tcpdump; the DNS queries being made were under the permissible limit of 1024 packets per second, which is a limit on EC2. We also tried working around the issue by modifying the /etc/resolv.conf file to use higher retry/timeout values, but that didn't seem to help.
We went through a packet capture line by line and found something: some of the responses are negative. If you use Wireshark, you can use 'udp.stream eq 12' as a filter to view one of the negative answers; you can see the resolver sending the answer "No such name". All the requests that get a negative response use the following name in the request:
354XXXXX.dkr.ecr.us-east-1.amazonaws.com.ec2.internal
Would any of you happen to know why ec2.internal is being added to the end of the DNS name? If I run a dig against this name it fails, so it appears that a wrong name is being sent to the server, which responds with 'no such host'. Is Docker sending a wrong DNS name for resolution?
We see this issue happening intermittently. Looking forward to your help. Thanks in advance.
Expected behaviour
5.0.25_61: Pulling from rrg
Digest: sha256:50bbce4af6749e9a976f0533c3b50a0badb54855b73d8a3743473f1487fd223e
Status: Downloaded newer image for XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/rrg:5.0.25_61
Actual behaviour
docker-compose up -d rrg-node-1
Creating rrg-node-1
ERROR: for rrg-node-1 Cannot create container for service rrg-node-1: Error response from daemon: Get https://XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/v2/: dial tcp: lookup XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com on 10.5.0.2:53: no such host
Steps to reproduce the issue
docker pull XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/rrg:5.0.25_61
Output of docker version:
Docker version 18.03.1-ce, build 3dfb8343b139d6342acfd9975d7f1068b5b1c3d3
Output of docker info:
[ec2-user@ip-10-5-3-45 ~]$ docker info
Containers: 37
Running: 36
Paused: 0
Stopped: 1
Images: 60
Server Version: swarm/1.2.5
Role: replica
Primary: 10.5.4.172:3375
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 12
Plugins:
Volume:
Network:
Log:
Swarm:
NodeID:
Is Manager: false
Node Address:
Kernel Version: 4.14.51-60.38.amzn1.x86_64
Operating System: linux
Architecture: amd64
CPUs: 22
Total Memory: 80.85GiB
Name: mgr1
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Live Restore Enabled: false
WARNING: No kernel memory limit support
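A likely explanation (not confirmed in the thread, so treat it as an assumption): the ec2.internal suffix usually comes from the search line that EC2's DHCP configuration writes into /etc/resolv.conf. When a lookup fails, or the name has fewer dots than the ndots option, the resolver retries the name with each search domain appended, which would produce exactly 354XXXXX.dkr.ecr.us-east-1.amazonaws.com.ec2.internal. A quick way to check this on an affected host (dig ships in the bind-utils package on Amazon Linux):
cat /etc/resolv.conf   # look for a "search ec2.internal" line and any ndots option
dig 354XXXXX.dkr.ecr.us-east-1.amazonaws.com.ec2.internal   # the appended name from the capture; expect NXDOMAIN
dig 354XXXXX.dkr.ecr.us-east-1.amazonaws.com                # the absolute name should resolve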