CockroachDB on AWS EKS cluster - [n?] no stores bootstrapped

I am attempting to deploy CockroachDB v2.1.6 to a new AWS EKS cluster. Everything deploys successfully: the StatefulSet, services, PVs, and PVCs are created. The AWS EBS volumes are created successfully too.
The issue is the pods never get to a READY state.
pod/cockroachdb-0 0/1 Running 0 14m
pod/cockroachdb-1 0/1 Running 0 14m
pod/cockroachdb-2 0/1 Running 0 14m
If I 'describe' the pods I get the following:
Normal Pulled 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Container image "cockroachdb/cockroach:v2.1.6" already present on machine
Normal Created 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Created container cockroachdb
Normal Started 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Started container cockroachdb
Warning Unhealthy 1s (x8 over 36s) kubelet, ip-10-5-109-70.eu-central-1.compute.internal Readiness probe failed: HTTP probe failed with statuscode: 503
If I examine the logs of a pod I see this:
I200409 11:45:18.073666 14 server/server.go:1403 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W200409 11:45:18.076826 87 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {cockroachdb-0.cockroachdb:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host". Reconnecting...
W200409 11:45:18.076942 21 gossip/client.go:123 [n?] failed to start gossip client to cockroachdb-0.cockroachdb:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host"
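For context, the 503 from the readiness probe is consistent with the [n?] state: the standard CockroachDB StatefulSet manifests point the readiness probe at the node's HTTP health endpoint, which keeps reporting not-ready until the cluster has been initialized. A hedged way to see the same response yourself, assuming those manifests' /health?ready=1 path and the default HTTP port 8080:
kubectl port-forward cockroachdb-0 8080
curl -i 'http://localhost:8080/health?ready=1'   # returns 503 while the node is still awaiting init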
I came across this comment from the CockroachDB forum (https://forum.cockroachlabs.com/t/http-probe-failed-with-statuscode-503/2043/6)
Both the cockroach_out.log and cockroach_output1.log files you sent me (corresponding to mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2) print out no stores bootstrapped during startup and prefix all their log lines with n?, indicating that they haven’t been allocated a node ID. I’d say that they may have never been properly initialized as part of the cluster.
I have deleted everything, including the PVs, PVCs, and AWS EBS volumes, with kubectl delete and reapplied the manifests, but the issue persists.
Any thoughts would be very much appreciated. Thank you

I was not aware that you had to initialize the CockroachDB cluster after creating it. I did the following to resolve my issue:
kubectl exec -it cockroachdb-0 -n <namespace> -- /bin/sh
/cockroach/cockroach init
See here for more details - https://www.cockroachlabs.com/docs/v19.2/cockroach-init.html
After this the pods started running correctly.
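For reference, the initialization can also be done in a single step without an interactive shell. This is only a sketch assuming the insecure example manifests and the default namespace; a secure deployment would pass --certs-dir instead of --insecure:
kubectl exec -it cockroachdb-0 -- /cockroach/cockroach init --insecure   # allocates node IDs and brings the nodes to READY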

Related

apiserver pod unable to load configmap based request-header-client-ca-file

I am running an OKD 4.5 cluster with 3 master nodes on AWS, installed using openshift-install.
In attempting to update the cluster to 4.5.0-0.okd-2020-09-04-180756 I have run into numerous issues.
The current issue is that the console and apiserver pods on one master node are in CrashLoopBackOff due to what appears to be an internal networking issue.
The logs of the apiserver pod are as follows:
Copying system trust bundle
I0911 15:59:15.763716 1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
F0911 15:59:19.556715 1 cmd.go:72] unable to load configmap based request-header-client-ca-file: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: no route to host
I have tried deleting the pods, but the replacement pods crash-loop as well.
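A quick sanity check from the affected master, as a hedged sketch (172.30.0.1 is the in-cluster service VIP taken from the log above):
ip route get 172.30.0.1                   # which route/interface would be used for the service network?
curl -k https://172.30.0.1:443/version    # does the VIP answer at all, or do we get the same "no route to host"?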
Update:
I removed the troubled master and added a new machine to build a new master. The apiserver and console are no longer failing, but now etcd is.
#### attempt 9
member={name="ip-172-99-6-251.ec2.internal", peerURLs=[https://172.99.6.251:2380], clientURLs=[https://172.99.6.251:2379]}
member={name="ip-172-99-6-200.ec2.internal", peerURLs=[https://172.99.6.200:2380], clientURLs=[https://172.99.6.200:2379]}
member={name="ip-172-99-6-249.ec2.internal", peerURLs=[https://172.99.6.249:2380], clientURLs=[https://172.99.6.249:2379]}
target=nil, err=<nil>
#### sleeping...
Note: 172.99.6.251 is the IP of the master node that this one replaced.
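Since the old master's peer URL (172.99.6.251) still shows up in the member list above, the usual next step on OKD/OpenShift 4.x is to check for, and remove, the stale etcd member so the replacement can join. A hedged sketch only (pod and member names are placeholders; inside the etcd pods, etcdctl is already configured with the right endpoints and certificates):
oc -n openshift-etcd get pods                     # find an etcd pod on a healthy master
oc -n openshift-etcd rsh etcd-<healthy-master>    # open a shell in that pod
etcdctl member list -w table                      # look for the member still pointing at 172.99.6.251
etcdctl member remove <member-id>                 # drop the stale entry so the new master can join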

Cannot reach containers from codebuild

I've been having issues reaching containers from within CodeBuild. I have an exposed GraphQL service with a downstream auth service and a PostgreSQL database, all started through Docker Compose. Running and testing them locally works fine; however, I cannot get the right combination of host names in CodeBuild.
It looks like my test is able to run if I hit the GraphQL endpoint at 0.0.0.0:8000, but once my GraphQL container attempts to reach the downstream service I get a connection refused. I've tried reaching the auth service from inside the GraphQL service at auth:8001, at 0.0.0.0:8001, with port 8001 exposed, and by setting up a bridged network. I always get a connection refused error.
I've attached part of my codebuild logs.
Any ideas what I might be missing?
Container 2018/08/28 05:37:17 Running command docker ps
CONTAINER ID   IMAGE                    COMMAND                  CREATED         STATUS                  PORTS                    NAMES
6c4ab1fdc980   docker-compose_graphql   "app"                    1 second ago    Up Less than a second   0.0.0.0:8000->8000/tcp   docker-compose_graphql_1
5c665f5f812d   docker-compose_auth      "/bin/sh -c app"         2 seconds ago   Up Less than a second   0.0.0.0:8001->8001/tcp   docker-compose_auth_1
b28148784c04   postgres:10.4            "docker-entrypoint..."   2 seconds ago   Up 1 second             0.0.0.0:5432->5432/tcp   docker-compose_psql_1
Container 2018/08/28 05:37:17 Running command go test ; cd ../..
Register panic: [{"message":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 0.0.0.0:8001: connect: connection refused\"","path":
From the "host" machine my exposed GraphQL service could only be reached using the IP address 0.0.0.0. The internal networking was set up correctly and each service could be reached at <NAME>:<PORT> as expected, however, upon error the IP address would be shown (172.27.0.1) instead of the host name.
My problem was that all internal connections were not yet ready, leading to the "connection refused" error. The command sleep 5 after docker-compose up gave my services time to fully initialize before testing.
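For illustration only (the exact command sequence and the 5-second figure are assumptions rather than a copy of the original buildspec), the build commands end up looking roughly like:
docker-compose up -d    # start the graphql, auth and postgres containers in the background
sleep 5                 # give every container time to finish starting before the tests hit them
go test                 # only now run the integration tests against the running services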

istio-ingress can't start up

When I start minikube and apply istio.yaml, the ingress can't start up:
eumji@eumji:~$ kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
istio-ca-76dddbd695-bdwm9 1/1 Running 5 2d
istio-ingress-85fb769c4d-qtbcx 0/1 CrashLoopBackOff 67 2d
istio-mixer-587fd4bbdb-ldvhb 3/3 Running 15 2d
istio-pilot-7db8db896c-9znqj 2/2 Running 10 2d
When I try to see the log I get following output:
eumji@eumji:~$ kubectl logs -f istio-ingress-85fb769c4d-qtbcx -n istio-system
ERROR: logging before flag.Parse: I1214 05:04:26.193386 1 main.go:68] Version root@24c944bda24b-0.3.0-24ec6a3ac3a1d592d1873d2d8198278a849b8301
ERROR: logging before flag.Parse: I1214 05:04:26.193463 1 main.go:109] Proxy role: proxy.Node{Type:"ingress", IPAddress:"", ID:"istio-ingress-85fb769c4d-qtbcx.istio-system", Domain:"istio-system.svc.cluster.local"}
ERROR: logging before flag.Parse: I1214 05:04:26.193480 1 resolve.go:35] Attempting to lookup address: istio-mixer
ERROR: logging before flag.Parse: I1214 05:04:41.195879 1 resolve.go:42] Finished lookup of address: istio-mixer
Error: lookup failed for udp address: i/o timeout
Usage:
agent proxy [flags]
--serviceregistry string Select the platform for service registry, options are {Kubernetes, Consul, Eureka} (default "Kubernetes")
--statsdUdpAddress string IP Address and Port of a statsd UDP listener (e.g. 10.75.241.127:9125)
--zipkinAddress string Address of the Zipkin service (e.g. zipkin:9411)
Global Flags:
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
-v, --v Level log level for V logs (default 0)
--vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
ERROR: logging before flag.Parse: E1214 05:04:41.198640 1 main.go:267] lookup failed for udp address: i/o timeout
What could be the reason?
There is not enough information in your post to figure out what may be wrong; in particular, it seems that your ingress isn't able to resolve istio-mixer, which is unexpected.
Can you file a detailed issue at https://github.com/istio/issues/issues/new and we can take it from there?
Thanks
Are you using something like minikube? The quick-start docs give this hint: "Note: If your cluster is running in an environment that does not support an external load balancer (e.g., minikube), the EXTERNAL-IP of istio-ingress says <pending>. You must access the application using the service NodePort, or use port-forwarding instead."
https://istio.io/docs/setup/kubernetes/quick-start.html
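As a hedged sketch of that NodePort route on minikube (once the ingress pod is actually Running; the service name is taken from the output above and the port mapping depends on your istio-ingress service definition):
kubectl -n istio-system get svc istio-ingress   # note the NodePort mapped to the ingress HTTP port
curl http://$(minikube ip):<nodeport>/          # <nodeport> is the value from the column above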

Kernel error when trying to start an NFS server in a container

I was trying to run through the NFS example in the Kubernetes codebase on Container Engine, but I couldn't get the shares to mount. It turns out that every time the nfs-server pod is launched, the kernel throws an error:
Apr 27 00:11:06 k8s-cluster-6-node-1 kernel: [60165.482242] ------------[ cut here ]------------
Apr 27 00:11:06 k8s-cluster-6-node-1 kernel: [60165.483060] WARNING: CPU: 0 PID: 7160 at /build/linux-50mAO0/linux-3.16.7-ckt4/fs/nfsd/nfs4recover.c:1195 nfsd4_umh_cltrack_init+0x4a/0x60 nfsd
Full output here: http://pastebin.com/qLzCFpAa
Any thoughts on how to solve this?
The NFS example doesn't work because GKE (by default) doesn't support running privileged containers, such as the nfs-server. I just tested this with a v0.16.0 cluster and kubectl v0.15.0 (the current gcloud default) and got a nice error message when I tried to start the nfs-server pod:
$ kubectl create -f nfs-server-pod.yaml
Error: Pod "nfs-server" is invalid: spec.containers[0].privileged: forbidden 'true'

I have started to receive a 402 error when accessing my CoreOS cluster

I have started to receive a 402 error when accessing my CoreOS cluster. It had been working fine up until a day ago. Does anybody have any idea why I'm receiving this error? I am using the stable channel on EC2.
$ fleetctl list-machines
E0929 09:43:14.823081 00979 fleetctl.go:151] error attempting to check latest fleet version in Registry: 402: Standby Internal Error () [0]
Error retrieving list of active machines: 402: Standby Internal Error () [0]
In this case etcd does not currently have quorum. The "Standby Internal Error" signifies that the node is attempting to act as a standby but is failing to redirect you to the active node. Repairing the etcd issue will fix the problem. Check on the status of etcd by running:
journalctl -u etcd.service
on each of the nodes; that should give you the information you need to repair etcd in this case.
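A hedged sketch of what that inspection could look like on each node (the curl line assumes etcd is still listening on its historical default client port 4001; adjust if your cloud-config overrides it):
systemctl status etcd.service                  # is etcd running on this node at all?
journalctl -u etcd.service -n 100 --no-pager   # recent log lines; look for leader-election or peer errors
curl -s http://127.0.0.1:4001/v2/machines      # which members does this node currently know about?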