kubectl commands timeout without details - amazon-web-services

I'm running a Kubernetes cluster that has worked fine for several months. Today, when I was about to deploy some updates, I started getting timeouts from the server.
Running $ kubectl get nodes yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
Running $ kubectl get pods --all-namespaces yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
Running $ kubectl get deployments yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.extensions)
Running $ kubectl get svc yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get services)
Running $ kubectl cluster-info yields (note there is no output after the master line)
Kubernetes master is running at https://cluster.mysite.com
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Since I get these timeouts for every command, troubleshooting is impossible.
How can I continue from here to regain access to my cluster? I'm using kube-aws with an AWS CloudFormation VPC.
Thanks for your time.
EDIT:
As per request, I ran $ kubectl get pods -v 7 and, after a number of cached discovery responses, got this:
I0103 16:51:32.196859 25644 round_trippers.go:414] GET cluster.mysite.com/api/v1/nodes
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894 25644 round_trippers.go:424] Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424] User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I also ran $ kubectl cluster-info dump -v 7 and got:
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894 25644 round_trippers.go:424] Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424] User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I0103 16:52:32.242362 25644 helpers.go:207] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)",
  "reason": "Timeout",
  "details": {
    "kind": "nodes",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against nodes could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"nodes\"},\"code\":500}"
      }
    ]
  },
  "code": 504
}]
EDIT 2:
Okay, now I'm just getting Unable to connect to the server: EOF on every request and I'm starting to get scared. This is a production cluster and I can't even access it to try to troubleshoot. Anyone have a hint on how to proceed?
EDIT 3:
I've gotten as far as realizing that the etcd cluster was not working properly: 2 of its 3 nodes were out of sync. Restarting one node got it to rejoin the cluster properly, but the second one cannot start its services. The services that don't start are:
etcdadm-check.service
etcdadm-save.service
etcdadm-update-status.service
user#0.service
The first three all give the error etcdadm-check.service: Control process exited, code=exited status=3 and the last one gives user#0.service: Start request repeated too quickly..
Any hints on how to handle this?
Also, after restoring the second etcd node, I get Unable to connect to the server: x509: certificate signed by unknown authority when running any kubectl command. Does this signify data loss? My certificates are still valid for over half a year, and I haven't changed anything about them.
EDIT 4:
I still have the etcd issue, but I am following the instructions in camil's answer at this time and will update with the result. However, I solved the issue of the certificates not being valid simply by re-running $ kube-aws render credentials with the proper paths to my intermediate root CA.
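For reference, this is roughly the sequence I used for the certificates (a sketch; the intermediate CA paths are placeholders, and it assumes kube-aws picks the CA up from the credentials/ directory):
cp /path/to/intermediate-ca.pem credentials/ca.pem          # placeholder path
cp /path/to/intermediate-ca-key.pem credentials/ca-key.pem  # placeholder path
kube-aws render credentials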

To avoid the timeouts, you can pass the flag --request-timeout='1s' to kubectl. This will allow further debugging.
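For example (the resource and verbosity level here are only illustrative):
kubectl get nodes --request-timeout='1s' -v=7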
I see you are running kube-aws, so it should be safe to terminate the master instances (at least one, if you run multiple masters). The ASG will replace them automatically. You can do the same with the etcd nodes.
If the issue still persists, then you have to SSH into the masters and check the logs and services by running commands like:
journalctl -xe
systemctl status -l kubelet.service
systemctl status -l flanneld.service
systemctl status -l docker.service
rkt list
You can also use this function to debug using kubectl from inside the masters:
kubectl() {
  /usr/bin/docker run --rm --net=host \
    -v /etc/resolv.conf:/etc/resolv.conf \
    -v /srv/kube-aws/plugins:/srv/kube-aws/plugins \
    quay.io/coreos/hyperkube:v1.9.0_coreos.0 /hyperkube kubectl "$@"
}
Then try these commands:
kubectl get componentstatus
kubectl cluster-info
kubectl get pods -n kube-system
kubectl get events -n kube-system
Check the connectivity to etcd from the masters:
export $(cat /etc/etcd-environment | tr -d "'")
/usr/bin/etcdctl \
--ca-file=/etc/kubernetes/ssl/etcd-trusted-ca.pem \
--cert-file=/etc/kubernetes/ssl/etcd-client.pem \
--key-file=/etc/kubernetes/ssl/etcd-client-key.pem \
--endpoints="${ETCD_ENDPOINTS}" \
cluster-health
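If cluster-health reports an unhealthy member, a rough sketch for inspecting and replacing it with the same client certificates (the member ID, name and peer URL below are placeholders):
ETCDCTL_FLAGS="--ca-file=/etc/kubernetes/ssl/etcd-trusted-ca.pem --cert-file=/etc/kubernetes/ssl/etcd-client.pem --key-file=/etc/kubernetes/ssl/etcd-client-key.pem --endpoints=${ETCD_ENDPOINTS}"
/usr/bin/etcdctl $ETCDCTL_FLAGS member list                              # note the ID of the broken member
/usr/bin/etcdctl $ETCDCTL_FLAGS member remove <member-id>                # placeholder ID from the list output
/usr/bin/etcdctl $ETCDCTL_FLAGS member add etcd1 https://<peer-ip>:2380  # placeholder name and peer URL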

rm -r ~/.kube/cache/discovery worked for me.
My timeout messages looked different from yours, though:
E0528 20:32:29.191243 1730 request.go:975] Unexpected error when reading response body: net/http: request canceled (Client.Timeout exceeded while reading body)

Related

Kops cluster on AWS timeout

This is really annoying me and I can't seem to find any answers on the internet.
I created a cluster using kops on AWS yesterday and everything worked fine. But for some reason (and this is like the 5th time it has happened), I come back 1 or 2 days later and simply cannot access the cluster. Every previous time, my solution was to delete everything manually and create the cluster again.
Here's my kubectl client version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
Here's what I tried:
kubectl get nodes/pods/services/etc -v 7
I1116 22:17:09.368841 1689 loader.go:372] Config loaded from file: /home/ubuntu/.kube/config
I1116 22:17:09.369482 1689 round_trippers.go:432] GET https://<apiUrl>/api?timeout=32s
I1116 22:17:09.369501 1689 round_trippers.go:438] Request Headers:
I1116 22:17:09.369519 1689 round_trippers.go:442] Accept: application/json, */*
I1116 22:17:09.369535 1689 round_trippers.go:442] User-Agent: kubectl/v1.22.3 (linux/amd64) kubernetes/c920368
I1116 22:18:31.932298 1696 round_trippers.go:457] Response Status: in 30003 milliseconds
I1116 22:18:31.932372 1696 cached_discovery.go:121] skipped caching discovery info due to Get "https://<apiUrl>/api?timeout=32s": dial tcp <apiIP>: i/o timeout
Update kops cluster:
kops update cluster
Nothing happened; it reported that no changes need to be applied.
Does anyone have any idea what's happening? What am I missing in here?
I'm still a K8S noob, so if you need more info please ask, I'm not quite sure what information can be relevant here.
Thank you
For future reference, the problem was that I was using small, burstable instances for both the master and the nodes. They didn't meet the hardware requirements for Kubernetes.
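For anyone hitting the same thing, a sketch of creating the cluster with larger instance types (the cluster name, state bucket, zone and instance types are just examples):
kops create cluster \
  --name=mycluster.example.com \
  --state=s3://my-kops-state-bucket \
  --zones=us-east-1a \
  --master-size=m5.large \
  --node-size=m5.large \
  --node-count=2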

Debezium with AWS MSK NOT_ENOUGH_REPLICAS

I have a Debezium cluster running in AWS with no issues. I want to give AWS MSK a try, so I launched an MSK cluster and then an EC2 instance for running my connectors.
Then I installed the Confluent platform:
sudo apt-get update && sudo apt-get install confluent-platform-2.12
By default, AWS MSK doesn't come with a schema registry, so I configured one on the connector EC2 instance.
Schema registry conf file:
kafkastore.connection.url=z-1.bhuvi-XXXXXXXXX.amazonaws.com:2181,z-3.bhuvi-XXXXXXXXX.amazonaws.com:2181,z-2.bhuvi-XXXXXXXXX.amazonaws.com:2181
kafkastore.bootstrap.servers=PLAINTEXT://b-2.bhuvi-XXXXXXXXX.amazonaws.com:9092,PLAINTEXT://b-4.bhuvi-XXXXXXXXX.amazonaws.com:9092,PLAINTEXT://b-1.bhuvi-XXXXXXXXX.amazonaws.com:9092
Then /etc/kafka/connect-distributed.properties file
bootstrap.servers=b-4.bhuvi-XXXXXXXXX.amazonaws.com:9092,b-3.bhuvi-XXXXXXXXX.amazonaws.com:9092,b-2.bhuvi-XXXXXXXXX.amazonaws.com:9092
plugin.path=/usr/share/java,/usr/share/confluent-hub-components
Install connector:
confluent-hub install debezium/debezium-connector-mysql:latest
Start the services:
systemctl start confluent-schema-registry
systemctl start confluent-connect-distributed
Now everything is started. Then I created a mysql.json file:
{
  "name": "mysql-connector-db01",
  "config": {
    "name": "mysql-connector-db01",
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.server.id": "1",
    "tasks.max": "3",
    "database.history.kafka.bootstrap.servers": "172.31.47.152:9092,172.31.38.158:9092,172.31.46.207:9092",
    "database.history.kafka.topic": "schema-changes.mysql",
    "database.server.name": "mysql-db01",
    "database.hostname": "172.31.84.129",
    "database.port": "3306",
    "database.user": "bhuvi",
    "database.password": "my_stong_password",
    "database.whitelist": "proddb,test",
    "internal.key.converter.schemas.enable": "false",
    "key.converter.schemas.enable": "false",
    "internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "internal.value.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.source.fields": "ts_ms"
  }
}
Create the Debezium connector:
curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8083/connectors -d @mysql.json
Then it started giving this error on the connector EC2 instance:
Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,290] WARN [Producer clientId=producer-3] Got error produce response with correlation id 844 on topic-partition connect-configs-0, retrying (2147482809 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)
Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,391] WARN [Producer clientId=producer-3] Got error produce response with correlation id 845 on topic-partition connect-configs-0, retrying (2147482808 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)
Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,492] WARN [Producer clientId=producer-3] Got error produce response with correlation id 846 on topic-partition connect-configs-0, retrying (2147482807 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)
Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,593] WARN [Producer clientId=producer-3] Got error produce response with correlation id 847 on topic-partition connect-configs-0, retrying (2147482806 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)
This error message never stops.
Output of describing the connect-configs topic:
Topic:connect-configs PartitionCount:1 ReplicationFactor:1 Configs:cleanup.policy=compact
Topic: connect-configs Partition: 0 Leader: 2 Replicas: 2 Isr: 2
MSK sets min.insync.replicas to 2 for all topics by default (see https://docs.aws.amazon.com/msk/latest/developerguide/msk-default-configuration.html).
It's possible that Kafka Connect is producing with acks=all and, since you only have one replica of your topic, it never achieves enough of a quorum.
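If that is the case, one possible fix is to give Kafka Connect's internal topics enough replicas to satisfy min.insync.replicas=2, e.g. in /etc/kafka/connect-distributed.properties (a sketch; the existing single-replica topics would have to be deleted or reassigned so they get recreated with the new factor):
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3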

Jenkins CLI download fails with 500 Error or 503 Error

I am using an EC2 instance to set up Jenkins. In the instance's bootstrap script, I download the Jenkins CLI with:
sudo wget http://127.0.0.1:8080/jenkins/jnlpJars/jenkins-cli.jar
But when this command executes, I receive the following error:
http://127.0.0.1:8080/jenkins/jnlpJars/jenkins-cli.jar Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
ERROR 503: Service Unavailable.
I googled and found that it might be because Jenkins is still bootstrapping, so I added the following function to the script to wait for Jenkins to start.
function jenkinsCreationWait () {
  echo "Jenkins Creation Wait"
  while ! test -f "/var/lib/jenkins/secrets/initialAdminPassword"; do
    sleep 5
    echo "Jenkins is booting, please wait..."
  done
  echo "End"
}
I placed this code before executing
sudo wget http://127.0.0.1:8080/jenkins/jnlpJars/jenkins-cli.jar
After receiving the error during bootstrap, I SSHed into the EC2 instance and tried executing the above command manually; I received a 500:
http://127.0.0.1:8080/jenkins/jnlpJars/jenkins-cli.jar
Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... 500 Server Error
ERROR 500: Server Error.
Although this was working fine some hours ago.
I assume you're using user data. Here is a shell snippet that waits until you get a 200 response code:
while [[ "$(curl --insecure -s -o /dev/null -w "%{http_code}" http://localhost:8080/jnlpJars/jenkins-cli.jar)" != "200" ]]; do sleep 15; done
wget http://127.0.0.1:8080/jenkins/jnlpJars/jenkins-cli.jar
Reference: wait for http 200

cf create-service-broker fails with connection refused

I'm experimenting with CF in my local bosh-lite setup.
The apps that I deploy into it work well. I am now trying to follow the steps here
https://github.com/cf-platform-eng/cf-community-workshop/blob/master/demos/service-broker-lab.adoc
to try out the custom service broker setup.
The https://github.com/mstine/haash-broker application starts and is running fine:
$ cf apps
name requested state instances memory disk urls
haash-broker started 1/1 768M 1G haash-broker.vbox.mojito, haash-broker.192.168.50.6.xip.io
I can access it from my host machine browser well:
http://haash-broker.192.168.50.6.xip.io/v2/catalog
But when I execute the
cf create-service-broker haash-broker warreng natedogg http://haash-broker.192.168.50.6.xip.io
I get
$ cf create-service-broker haash-broker warreng natedogg http://haash-broker.192.168.50.6.xip.io
Creating service broker haash-broker as admin...
FAILED
Server error, status code: 502, error code: 10001, message: The service broker could not be reached: http://haash-broker.192.168.50.6.xip.io/v2/catalog
When I log in into the CC VM:
$ bosh -e vbox -f cf ssh api/eb4cec99-bab1-4513-a980-fb92775ac2d8
I can ping the hostname:
api/eb4cec99-bab1-4513-a980-fb92775ac2d8:~$ sudo ping haash-broker.192.168.50.6.xip.io
PING haash-broker.192.168.50.6.xip.io (192.168.50.6) 56(84) bytes of data.
64 bytes from 192.168.50.6: icmp_seq=1 ttl=64 time=0.080 ms
But wget connection gets refused:
api/eb4cec99-bab1-4513-a980-fb92775ac2d8:~$ wget http://warreng:natedogg@haash-broker.192.168.50.6.xip.io/v2/catalog
--2018-04-06 04:19:05-- http://warreng:*password*@haash-broker.192.168.50.6.xip.io/v2/catalog
Resolving haash-broker.192.168.50.6.xip.io (haash-broker.192.168.50.6.xip.io)... 192.168.50.6
Connecting to haash-broker.192.168.50.6.xip.io (haash-broker.192.168.50.6.xip.io)|192.168.50.6|:80... failed: Connection refused.
The firewall permits everything on that VM (sudo iptables -L).
The hostname gets resolved properly. The ping works and port 80 is open on the target IP, since I can reach it from my host browser.
How can it be that wget doesn't work in this situation?
This also seems to be the reason why cf create-service-broker fails for me.
UPDATE
I've managed to execute the cf create-service-broker command with the URL of an nginx reverse proxy running outside of my bosh-lite environment. The proxy redirects to the same initial URL http://haash-broker.192.168.50.6.xip.io
and the command succeeds in this way.
But the subsequent
cf create-service-broker haash-broker warreng natedogg http://haash-broker.192.168.50.1.xip.io:9999
cf enable-service-access haash
cf create-service HaaSh basic my-hash
(where haash-broker.192.168.50.1.xip.io:9999 is my nginx proxy) fails with
Server error, status code: 502, error code: 10001, message: The service broker rejected the request to http://haash-broker.192.168.50.1.xip.io:9999/v2/service_instances/4ef19154-d238-4cb3-8003-803fba53af3f?accepts_incomplete=true. Status Code: 400 Bad Request, Body: {"timestamp":1523008856993,"error":"Bad Request","status":400,"message":""}
I can see in both the nginx and broker app logs that the request reaches the broker and it answers with 400.
Now debugging why.
Can you post the result of the --server-response option used with wget? Also, what happens when you try to curl the broker?
The broker requires credentials, but it would be interesting to see whether it responds with 401 or 500 on the first request that wget makes without credentials.
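For example, something like this (using the credentials from your question) would show the raw status codes:
wget --server-response http://warreng:natedogg@haash-broker.192.168.50.6.xip.io/v2/catalog
curl -v -u warreng:natedogg http://haash-broker.192.168.50.6.xip.io/v2/catalog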

Kubeadm why does my node not show up though kubelet says it joined?

I am setting up a Kubernetes deployment using auto-scaling groups and Terraform. The kube master node is behind an ELB to get some reliability in case of something going wrong. The ELB has the health check set to tcp 6443, and tcp listeners for 8080, 6443, and 9898. All of the instances and the load balancer belong to a security group that allows all traffic between members of the group, plus public traffic from the NAT Gateway address. I created my AMI using the following script (from the getting started guide)...
# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
# cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF
# apt-get update
# # Install docker if you don't have it already.
# apt-get install -y docker.io
# apt-get install -y kubelet kubeadm kubectl kubernetes-cni
I use the following user data scripts...
kube master
#!/bin/bash
rm -rf /etc/kubernetes/*
rm -rf /var/lib/kubelet/*
kubeadm init \
--external-etcd-endpoints=http://${etcd_elb}:2379 \
--token=${token} \
--use-kubernetes-version=${k8s_version} \
--api-external-dns-names=kmaster.${master_elb_dns} \
--cloud-provider=aws
until kubectl cluster-info
do
sleep 1
done
kubectl apply -f https://git.io/weave-kube
kube node
#!/bin/bash
rm -rf /etc/kubernetes/*
rm -rf /var/lib/kubelet/*
until kubeadm join --token=${token} kmaster.${master_elb_dns}
do
sleep 1
done
Everything seems to work properly. The master comes up and responds to kubectl commands, with pods for discovery, dns, weave, controller-manager, api-server, and scheduler. kubeadm has the following output on the node...
Running pre-flight checks
<util/tokens> validating provided token
<node/discovery> created cluster info discovery client, requesting info from "http://kmaster.jenkins.learnvest.net:9898/cluster-info/v1/?token-id=eb31c0"
<node/discovery> failed to request cluster info, will try again: [Get http://kmaster.jenkins.learnvest.net:9898/cluster-info/v1/?token-id=eb31c0: EOF]
<node/discovery> cluster info object received, verifying signature using given token
<node/discovery> cluster info signature and contents are valid, will use API endpoints [https://10.253.129.106:6443]
<node/bootstrap> trying to connect to endpoint https://10.253.129.106:6443
<node/bootstrap> detected server version v1.4.4
<node/bootstrap> successfully established connection with endpoint https://10.253.129.106:6443
<node/csr> created API client to obtain unique certificate for this node, generating keys and certificate signing request
<node/csr> received signed certificate from the API server:
Issuer: CN=kubernetes | Subject: CN=system:node:ip-10-253-130-44 | CA: false
Not before: 2016-10-27 18:46:00 +0000 UTC Not After: 2017-10-27 18:46:00 +0000 UTC
<node/csr> generating kubelet configuration
<util/kubeconfig> created "/etc/kubernetes/kubelet.conf"
Node join complete:
* Certificate signing request sent to master and response
received.
* Kubelet informed of new secure connection details.
Run 'kubectl get nodes' on the master to see this machine join.
Unfortunately, running kubectl get nodes on the master only returns itself as a node. The only interesting thing I see in /var/log/syslog is
Oct 27 21:19:28 ip-10-252-39-25 kubelet[19972]: E1027 21:19:28.198736 19972 eviction_manager.go:162] eviction manager: unexpected err: failed GetNode: node 'ip-10-253-130-44' not found
Oct 27 21:19:31 ip-10-252-39-25 kubelet[19972]: E1027 21:19:31.778521 19972 kubelet_node_status.go:301] Error updating node status, will retry: error getting node "ip-10-253-130-44": nodes "ip-10-253-130-44" not found
I am really not sure where to look...
The hostnames of the two machines (the master and the node) should be different. You can check them by running cat /etc/hostname. If they do happen to be the same, edit that file to make them different and then do a sudo reboot to apply the change. Otherwise kubeadm will not be able to differentiate between the two machines and they will show up as a single node in kubectl get nodes.
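For example, on a systemd-based distro (the new hostname is just an example):
cat /etc/hostname                           # run on both machines and compare
sudo hostnamectl set-hostname kube-node-1   # give the node a unique name
sudo reboot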
Yes, I faced the same problem.
I resolved it by:
killall kubelet
running the kubeadm join command again
and starting the kubelet service
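Roughly, that sequence looks like this (a sketch; the token and master address are the same placeholders used in the question):
killall kubelet
kubeadm join --token=<token> kmaster.<master_elb_dns>   # re-run the join against the master
systemctl start kubelet                                 # bring the kubelet back up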