I have started to receive a 402 error when accessing my CoreOS cluster - amazon-web-services

I have started to receive a 402 error when accessing my CoreOS cluster. It had been working fine up until a day ago. Does anybody have any idea why I'm receiving this error? I am using the stable channel on EC2.
$ fleetctl list-machines
E0929 09:43:14.823081 00979 fleetctl.go:151] error attempting to check latest fleet version in Registry: 402: Standby Internal Error () [0]
Error retrieving list of active machines: 402: Standby Internal Error () [0]

In this case etcd does not currently have quorum. The "Standby Internal Error" means that the node is attempting to act as a standby but is failing to redirect you to the active node. Repairing the etcd issue will fix the problem. Check the status of etcd by running journalctl -u etcd.service on each of the nodes; its output should give you the information you need to repair etcd.
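If you want a quick check before digging through the journal, etcd's HTTP API can also be queried directly on each node; a minimal sketch, assuming the default client port 4001 that CoreOS used at the time:
curl -s http://127.0.0.1:4001/version           # is etcd responding at all?
curl -s http://127.0.0.1:4001/v2/stats/self     # this node's view of its own state (leader/follower/standby)
curl -s http://127.0.0.1:4001/v2/stats/leader   # only meaningful on the current leader; lists follower health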

Related

apiserver pod unable to load configmap based request-header-client-ca-file

I am running an OKD 4.5 cluster with 3 master nodes on AWS, installed using openshift-install.
In attempting to update the cluster to 4.5.0-0.okd-2020-09-04-180756 I have run into numerous issues.
The current issue is that the console and apiserver pods on one master node are in CrashLoopBackOff due to what appears to be an internal networking issue.
The logs of the apiserver pod are as follows:
Copying system trust bundle
I0911 15:59:15.763716 1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
F0911 15:59:19.556715 1 cmd.go:72] unable to load configmap based request-header-client-ca-file: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: no route to host
I have tried deleting the pods, and the new ones crash-loop as well.
Update
I removed the troubled master and added a new machine to build a new master. The apiserver and console are no longer failing, but now etcd is.
#### attempt 9
member={name="ip-172-99-6-251.ec2.internal", peerURLs=[https://172.99.6.251:2380}, clientURLs=[https://172.99.6.251:2379]
member={name="ip-172-99-6-200.ec2.internal", peerURLs=[https://172.99.6.200:2380}, clientURLs=[https://172.99.6.200:2379]
member={name="ip-172-99-6-249.ec2.internal", peerURLs=[https://172.99.6.249:2380}, clientURLs=[https://172.99.6.249:2379]
target=nil, err=<nil>
#### sleeping...
*Note: 172.99.6.251 is the IP of the master node this one replaced.
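If the replaced master (172.99.6.251) is still listed as an etcd member, as the attempt log above suggests, removing the stale member is usually what lets the operator bring etcd back up. A rough sketch, not a verified fix for this cluster; the pod name below is illustrative (list the real etcd pods with oc get pods -n openshift-etcd):
oc rsh -n openshift-etcd etcd-ip-172-99-6-200.ec2.internal
etcdctl member list -w table
etcdctl member remove <MEMBER_ID>   # the ID of the entry still pointing at 172.99.6.251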

CockroachDB on AWS EKS cluster - [n?] no stores bootstrapped

I am attempting to deploy CockroachDB:v2.1.6 to a new AWS EKS cluster. Everything deploys successfully: the statefulset, services, PVs, and PVCs are created. The AWS EBS volumes are created successfully too.
The issue is the pods never get to a READY state.
pod/cockroachdb-0 0/1 Running 0 14m
pod/cockroachdb-1 0/1 Running 0 14m
pod/cockroachdb-2 0/1 Running 0 14m
If I 'describe' the pods I get the following:
Normal Pulled 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Container image "cockroachdb/cockroach:v2.1.6" already present on machine
Normal Created 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Created container cockroachdb
Normal Started 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Started container cockroachdb
Warning Unhealthy 1s (x8 over 36s) kubelet, ip-10-5-109-70.eu-central-1.compute.internal Readiness probe failed: HTTP probe failed with statuscode: 503
If I examine the logs of a pod I see this:
I200409 11:45:18.073666 14 server/server.go:1403 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W200409 11:45:18.076826 87 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {cockroachdb-0.cockroachdb:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host". Reconnecting...
W200409 11:45:18.076942 21 gossip/client.go:123 [n?] failed to start gossip client to cockroachdb-0.cockroachdb:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host"
I came across this comment from the CockroachDB forum (https://forum.cockroachlabs.com/t/http-probe-failed-with-statuscode-503/2043/6)
Both the cockroach_out.log and cockroach_output1.log files you sent me (corresponding to mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2) print out no stores bootstrapped during startup and prefix all their log lines with n?, indicating that they haven’t been allocated a node ID. I’d say that they may have never been properly initialized as part of the cluster.
I have deleted everything, including the PVs, PVCs, and AWS EBS volumes, via kubectl delete and reapplied, but I hit the same issue.
Any thoughts would be very much appreciated. Thank you
I was not aware that you had to initialize the CockroachDB cluster after creating it. I did the following to resolve my issue:
kubectl exec -it cockroachdb-0 -- /bin/sh
/cockroach/cockroach init
See here for more details - https://www.cockroachlabs.com/docs/v19.2/cockroach-init.html
After this the pods started running correctly.
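For reference, the init can also be run as a one-liner instead of opening a shell first; something like this, where --insecure and the --host value are assumptions about this particular (non-TLS) deployment rather than details taken from the question:
kubectl exec -it cockroachdb-0 -- /cockroach/cockroach init --insecure --host=cockroachdb-0.cockroachdb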

Spark shuts down after 10 seconds of running

I'm trying to set up a Spark cluster in my AWS (Amazon) account. I followed this tutorial to set it up. I ran into some problems regarding ports, but I finally got it to work until... it shut down after 10 seconds, giving me no more than this error:
16/05/12 12:52:46 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:06 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:26 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/05/12 12:53:26 ERROR scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
This is the command I ran to start it:
bin/spark-shell --master spark://ip-to-my-machine:7077
I opened TCP port 7077, so what could the problem be?
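One thing worth ruling out first: confirm that port 7077 on the master is actually reachable from the machine running spark-shell (the hostname below is the same placeholder used above):
nc -zv ip-to-my-machine 7077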

Confd error: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

While debugging I realised that confd doesn't pick up the keys and my journal looks like this:
Sep 18 18:31:50 ip-10-171-54-76.ec2.internal docker[24891]: [nginx] waiting for confd to refresh nginx.conf
Sep 18 18:31:56 ip-10-171-54-76.ec2.internal docker[24891]: 2014-09-18T18:31:56Z 9122c7a54edc confd[9572]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I used nsenter to log in to the running container and run some experiments for debugging purposes. I ran this command:
confd -onetime -node 172.17.42.1:4001 -config-file /etc/confd/conf.d/nginx.toml
and then received the same error as above:
confd[12894]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I am totally clueless at this point. I am using EC2 with the stable version of CoreOS and I am sure that etcd is running on the host. Also, I can ping the host from inside the container successfully.
Any ideas on what's wrong?
Assistance will be much appreciated.
This error indicates that your etcd cluster isn't operating correctly, so confd has nothing to watch. It has probably lost quorum. The logs (journalctl -u etcd) should indicate what happened.
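A quick way to confirm that from inside the container is to hit etcd's HTTP API at the docker bridge address used in the confd command above; a sketch, assuming etcd is listening on the default client port 4001:
curl -s http://172.17.42.1:4001/version
curl -s http://172.17.42.1:4001/v2/stats/leader
If those time out or refuse the connection while the same requests succeed on the host, the problem is container networking rather than etcd itself.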

I got some errors when installing in single-node mode

log:
error: gnutls_handshake() failed: A TLS packet with unexpected length was received.
while accessing https://github.com/cloudfoundry/vcap/info/refs
fatal: http request failed
.. unable to clone cloudfoundry vcap repo
What can I do?
Thanks tt64
Are you behind a firewall? Are you using vcap dev_setup to install your CF instance? If so, make sure there are no firewalls or proxies blocking the installer from grabbing what it needs from the net.
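A quick way to test that theory is to hit the same URL git is failing on directly and, if you do sit behind an HTTP proxy, to tell git about it (the proxy address below is a placeholder, not something from the question):
curl -v "https://github.com/cloudfoundry/vcap/info/refs?service=git-upload-pack"
git config --global http.proxy http://your-proxy-host:3128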