Kops cluster on AWS timeout

This is really annoying me and I can't seem to find any answers on the internet.
I created a cluster using kops on AWS yesterday and everything worked fine. But for some reason (and this is about the fifth time it has happened), I come back one or two days later and simply cannot access the cluster. All the other times, my solution was to delete everything manually and create the cluster again.
Here's my kubectl client version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
Here's what I tried:
kubectl get nodes/pods/services/etc -v 7
I1116 22:17:09.368841 1689 loader.go:372] Config loaded from file: /home/ubuntu/.kube/config
I1116 22:17:09.369482 1689 round_trippers.go:432] GET https://<apiUrl>/api?timeout=32s
I1116 22:17:09.369501 1689 round_trippers.go:438] Request Headers:
I1116 22:17:09.369519 1689 round_trippers.go:442] Accept: application/json, */*
I1116 22:17:09.369535 1689 round_trippers.go:442] User-Agent: kubectl/v1.22.3 (linux/amd64) kubernetes/c920368
I1116 22:18:31.932298 1696 round_trippers.go:457] Response Status: in 30003 milliseconds
I1116 22:18:31.932372 1696 cached_discovery.go:121] skipped caching discovery info due to Get "https://<apiUrl>/api?timeout=32s": dial tcp <apiIP>: i/o timeout
I also tried updating the cluster:
kops update cluster
Nothing happened; no changes needed to be applied.
Does anyone have any idea what's happening? What am I missing here?
I'm still a K8s noob, so if you need more info, please ask; I'm not quite sure what information is relevant here.
Thank you

For future reference: the problem was that I was using small, burstable instances for both the master and the nodes. Those didn't meet the hardware requirements for Kubernetes.
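For anyone landing here later, here is a rough sketch of how I would size the cluster with kops so the control plane meets the minimum requirements (the cluster name, zone, state bucket, and instance group name are placeholders, not my actual setup):

# Recreate with larger, non-burstable instance types:
kops create cluster \
  --name=mycluster.example.com \
  --state=s3://my-kops-state-store \
  --zones=us-east-1a \
  --master-size=m5.large \
  --node-size=m5.large \
  --node-count=2 \
  --yes

# Or, for an existing cluster, change machineType in the instance group and roll it out:
kops edit ig master-us-east-1a --state=s3://my-kops-state-store
kops update cluster --state=s3://my-kops-state-store --yes
kops rolling-update cluster --state=s3://my-kops-state-store --yes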

Related

CLI plugin installation problem on Grafana

I'm getting the following error when trying to install a plugin from the Grafana CLI installed on Kubernetes. I deleted and rebuilt the pod; it works fine, but the error persists. Other Grafana features are working fine. What can I do?
Error: ✗ Failed to send request: Get "https://grafana.com/api/plugins/repo": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
BR
You say Grafana itself is working and Kubernetes reports no error for the pod. The error says "Failed to send request: Get", which most likely means the pod can reach the internet but DNS resolution is failing. If ping 8.8.8.8 works but ping google.com does not, you need to add a nameserver.
For this, you can add something like the following to the /etc/resolv.conf file.
nameserver 8.8.8.8
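As a quick sanity check before editing anything, something like the following shows whether DNS is really the culprit (a sketch only: the namespace and deployment name are assumptions, and nslookup has to exist in the image):

kubectl exec -n monitoring deploy/grafana -- cat /etc/resolv.conf
kubectl exec -n monitoring deploy/grafana -- nslookup grafana.com
# If resolv.conf looks sane but the lookup times out, the problem is DNS rather than connectivity.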

pods are stuck in CrashLoopBackOff after updating my eks to 1.16

I just updated my EKS cluster from 1.15 to 1.16 and I couldn't get the deployments in my namespaces up and running. When I do kubectl get po and try to list my pods, they're all stuck in the CrashLoopBackOff state. I tried describing one pod, and this is what I get in the events section:
Events:
Type     Reason   Age                  From     Message
----     ------   ----                 ----     -------
Normal   Pulling  56m (x8 over 72m)    kubelet  Pulling image "xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/xxx-xxxx-xxxx:master.697.7af45fff8e0"
Warning  BackOff  75s (x299 over 66m)  kubelet  Back-off restarting failed container
kubernetes version -
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.15-eks-e1a842", GitCommit:"e1a8424098604fa0ad8dd7b314b18d979c5c54dc", GitTreeState:"clean", BuildDate:"2021-07-31T01:19:13Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
It seems like your container is stuck in an image pull state. Here are some things you can check (a rough sketch of the commands follows this list):
Ensure image is present in ECR
Ensure the EKS cluster is able to connect to ECR - If it's a private repo it would require credentials.
Run a docker pull and see if it's able to pull it directly (most likely it will fail or ask for credentials if not already passed)
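A sketch of those checks from a workstation with AWS credentials (the account ID, repository name, and tag are the redacted values from the question; AWS CLI v2 syntax assumed):

aws ecr describe-images --repository-name xxx-xxxx-xxxx \
  --image-ids imageTag=master.697.7af45fff8e0 --region us-west-2
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin xxxxxxx.dkr.ecr.us-west-2.amazonaws.com
docker pull xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/xxx-xxxx-xxxx:master.697.7af45fff8e0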
So the problem was that I was trying to deploy x86 containers on an ARM node instance. Everything worked once I changed the launch template image for my node group.
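If you suspect the same architecture mismatch, a quick way to compare image platforms against node architecture (a sketch; docker manifest inspect may require registry login and a reasonably recent Docker CLI):

# Architectures the image was built for:
docker manifest inspect xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/xxx-xxxx-xxxx:master.697.7af45fff8e0 | grep architecture
# Architecture of each node in the cluster:
kubectl get nodes -L kubernetes.io/arch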

kubectl commands timeout without details

I'm running a Kubernetes cluster, which has worked fine for several months. Today, when I was about to deploy some updates, I started getting timeouts from the server.
Running $ kubectl get nodes yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
Running $ kubectl get pods --all-namespaces yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
Running $ kubectl get deployments yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.extensions)
Running $ kubectl get svc yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get services)
Running $ kubectl cluster-info yields (note no output after the master)
Kubernetes master is running at https://cluster.mysite.com
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
As I get these timeouts for every command, troubleshooting is impossible.
How can I continue from here to access my servers? I'm using kube-aws, and an AWS CloudFormation VPC.
Thanks for your time.
EDIT:
As per request, I ran $ kubectl get pods -v 7 and after a bunch of cache returns got this:
I0103 16:51:32.196859 25644 round_trippers.go:414] GET cluster.mysite.com/api/v1/nodes
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894 25644 round_trippers.go:424] Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424] User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I also ran $ kubectl cluster-info dump -v 7 and got:
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894 25644 round_trippers.go:424] Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424] User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I0103 16:52:32.242362 25644 helpers.go:207] server response object: [{
    "metadata": {},
    "status": "Failure",
    "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)",
    "reason": "Timeout",
    "details": {
        "kind": "nodes",
        "causes": [
            {
                "reason": "UnexpectedServerResponse",
                "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against nodes could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"nodes\"},\"code\":500}"
            }
        ]
    },
    "code": 504
}]
EDIT 2:
Okay, now I'm just getting Unable to connect to the server: EOF on every request, and I'm starting to get scared. This is a production cluster and I can't even access it to troubleshoot. Does anyone have a hint on how to proceed?
EDIT 3:
I've gotten as far as realizing that the etcd cluster was not working properly, with 2/3 nodes out of sync. Restarting one node had it properly joining the cluster again, but the second one can't start the services. The services that don't start are:
etcdadm-check.service
etcdadm-save.service
etcdadm-update-status.service
user#0.service
The first three all give the error etcdadm-check.service: Control process exited, code=exited status=3 and the last one gives user#0.service: Start request repeated too quickly..
Any hints on how to handle this?
Also, after restoring the second etcd node, I get Unable to connect to the server: x509: certificate signed by unknown authority when running any kubectl command. Does this signify data loss? My certificates are still valid for over half a year, and I haven't changed anything about them.
EDIT 4:
I still have the etcd issue, but I am following the instructions in camil's answer at this time and will update with the result. However, I solved the issue with the certificates not being valid simply by re-running $ kube-aws render credentials with the proper paths to my intermediate root CA.
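For reference, the certificate fix was roughly the following (the paths are placeholders for my intermediate CA files; flag names as in the kube-aws docs of that era, so double-check against your version):

kube-aws render credentials \
  --ca-cert-path=./credentials/ca.pem \
  --ca-key-path=./credentials/ca-key.pem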
To avoid the timeouts, you can pass the --request-timeout='1s' flag. This will allow further debugging.
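For example:

kubectl get nodes --request-timeout='1s' -v=7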
I see you are running kube-aws, so it should be safe to terminate the master instances (at least one, if you run multiple masters). The ASG will replace them automatically. You can also do this with the etcd nodes.
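One way to do that and let the ASG handle the replacement (the instance ID here is a placeholder; confirm you picked a master before terminating):

aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity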
If the issue still persists, then you have to ssh into masters and check the logs and services by running commands like:
journalctl -xe
systemctl status -l kubelet.service
systemctl status -l flanneld.service
systemctl status -l docker.service
rkt list
You can also use this function to debug using kubectl from inside the masters:
kubectl() {
  /usr/bin/docker run --rm --net=host \
    -v /etc/resolv.conf:/etc/resolv.conf \
    -v /srv/kube-aws/plugins:/srv/kube-aws/plugins \
    quay.io/coreos/hyperkube:v1.9.0_coreos.0 /hyperkube kubectl "$@"
}
Then try these commands:
kubectl get componentstatus
kubectl cluster-info
kubectl get pods -n kube-system
kubectl get events -n kube-system
Check the connectivity to ETCD from masters
export $(cat /etc/etcd-environment | tr -d "'")
/usr/bin/etcdctl \
--ca-file=/etc/kubernetes/ssl/etcd-trusted-ca.pem \
--cert-file=/etc/kubernetes/ssl/etcd-client.pem \
--key-file=/etc/kubernetes/ssl/etcd-client-key.pem \
--endpoints="${ETCD_ENDPOINTS}" \
cluster-health
rm -r ~/.kube/cache/discovery worked for me.
My timeout messages looked different from yours, though:
E0528 20:32:29.191243 1730 request.go:975] Unexpected error when reading response body: net/http: request canceled (Client.Timeout exceeded while reading body)

Confd error: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

While debugging, I realised that confd doesn't pick up the keys, and my journal looks like this:
Sep 18 18:31:50 ip-10-171-54-76.ec2.internal docker[24891]: [nginx] waiting for confd to refresh nginx.conf
Sep 18 18:31:56 ip-10-171-54-76.ec2.internal docker[24891]: 2014-09-18T18:31:56Z 9122c7a54edc confd[9572]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I used nsenter to log in to the running container and run some experiments for debugging purposes. I ran this command:
confd -onetime -node 172.17.42.1:4001 -config-file /etc/confd/conf.d/nginx.toml
Then I received the same error as above:
confd[12894]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I am totally clueless at this point. I am using EC2 with the stable version of CoreOS and I am sure that etcd is running on the host. Also, I can ping the host from inside the container successfully.
Any ideas on what's wrong?
Assistance will be much appreciated.
This error indicates that your etcd cluster isn't operating correctly, so confd has nothing to watch. It has probably lost quorum. The logs (journalctl -u etcd) should indicate what happened.
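A few commands on the CoreOS host usually confirm this (a sketch assuming etcd v2-era tooling; the 172.17.42.1:4001 endpoint is the one from your confd flags):

systemctl status etcd
journalctl -u etcd --no-pager | tail -n 50
etcdctl cluster-health
# Or query the HTTP API that confd is pointed at:
curl -s http://172.17.42.1:4001/v2/stats/self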

Spark 0.9.0 standalone connection refused

I am using Spark 0.9.0 in standalone mode.
When I try to run a streaming application in standalone mode, I get a connection refused exception.
I added the hostname to /etc/hosts and also tried with the IP alone. In both cases the worker registered with the master without any issues.
Is there a way to solve this issue?
14/02/28 07:15:01 INFO Master: akka.tcp://driverClient@127.0.0.1:55891 got disassociated, removing it.
14/02/28 07:15:04 INFO Master: Registering app Twitter Streaming
14/02/28 07:15:04 INFO Master: Registered app Twitter Streaming with ID app-20140228071504-0000
14/02/28 07:34:42 INFO Master: akka.tcp://spark@127.0.0.1:33688 got disassociated, removing it.
14/02/28 07:34:42 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.165.35.96%3A38903-6#-1146558090] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/02/28 07:34:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@10.165.35.96:8910] -> [akka.tcp://spark@127.0.0.1:33688]: Error [Association failed with [akka.tcp://spark@127.0.0.1:33688]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@127.0.0.1:33688]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /127.0.0.1:33688
I had a similar issue when running Spark in cluster mode. My problem was that the master was started with the hostname 'fluentd:7077' and not the FQDN. I edited
/sbin/start-master.sh
to reflect how my remote nodes connect, using the --ip flag.
/usr/lib/jvm/jdk1.7.0_51/bin/java -cp :/home/vagrant/spark-0.9.0-incubating-bin-hadoop2/conf:/home/vagrant/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip fluentd.alex.dev --port 7077 --webui-port 8080
Hope this helps.
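An alternative to editing start-master.sh directly is to set the bind address in conf/spark-env.sh (a sketch, assuming the Spark 0.9-era standalone scripts, which read SPARK_MASTER_IP; the hostname is the one from this answer):

# conf/spark-env.sh
export SPARK_MASTER_IP=fluentd.alex.dev
export SPARK_MASTER_PORT=7077
# then restart the master:
./sbin/start-master.sh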