I am running an OKD 4.5 cluster with 3 master nodes on AWS, installed using openshift-install.
In attempting to update the cluster to 4.5.0-0.okd-2020-09-04-180756 I have run into numerous issues.
The current issue is that the console and apiserver pods on one master node are in CrashLoopBackOff due to what appears to be an internal networking issue.
The logs of the apiserver pod are as follows:
Copying system trust bundle
I0911 15:59:15.763716 1 dynamic_serving_content.go:111] Loaded a new cert/key pair for "serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
F0911 15:59:19.556715 1 cmd.go:72] unable to load configmap based request-header-client-ca-file: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: no route to host
I have tried deleting the pods, but the replacements crash-loop as well.
Update
I removed the troubled master and added a new machine to build a replacement master. The apiserver and console pods are no longer failing, but now etcd is.
#### attempt 9
member={name="ip-172-99-6-251.ec2.internal", peerURLs=[https://172.99.6.251:2380}, clientURLs=[https://172.99.6.251:2379]
member={name="ip-172-99-6-200.ec2.internal", peerURLs=[https://172.99.6.200:2380}, clientURLs=[https://172.99.6.200:2379]
member={name="ip-172-99-6-249.ec2.internal", peerURLs=[https://172.99.6.249:2380}, clientURLs=[https://172.99.6.249:2379]
target=nil, err=<nil>
#### sleeping...
Note: 172.99.6.251 is the IP of the master node that this one replaced.
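The member list above still includes the address of the replaced master. One possible explanation (an assumption, not something confirmed in this thread) is that the old member was never removed from etcd. As a minimal sketch, the following Python helper compares the member lines from that log against the IPs of the nodes that currently exist and reports any member that no longer has a matching node. The function name, the regular expression, and the replacement master's address are hypothetical and only for illustration.

```python
import re

# Matches member lines in the format shown in the log above, e.g.
#   member={name="ip-172-99-6-251.ec2.internal", peerURLs=[https://172.99.6.251:2380}, ...
MEMBER_RE = re.compile(
    r'name="(?P<name>[^"]+)",\s*peerURLs=\[https://(?P<ip>[\d.]+):\d+'
)


def find_stale_members(log_lines, live_node_ips):
    """Return (name, ip) pairs for members whose peer IP is not a live node."""
    live = set(live_node_ips)
    stale = []
    for line in log_lines:
        match = MEMBER_RE.search(line)
        if match and match.group("ip") not in live:
            stale.append((match.group("name"), match.group("ip")))
    return stale


if __name__ == "__main__":
    log = [
        'member={name="ip-172-99-6-251.ec2.internal", peerURLs=[https://172.99.6.251:2380}, clientURLs=[https://172.99.6.251:2379]',
        'member={name="ip-172-99-6-200.ec2.internal", peerURLs=[https://172.99.6.200:2380}, clientURLs=[https://172.99.6.200:2379]',
        'member={name="ip-172-99-6-249.ec2.internal", peerURLs=[https://172.99.6.249:2380}, clientURLs=[https://172.99.6.249:2379]',
    ]
    # 172.99.6.10 is a made-up address standing in for the replacement master.
    print(find_stale_members(log, ["172.99.6.200", "172.99.6.249", "172.99.6.10"]))
```

If a stale member does show up, removing it from the etcd member list (for example with etcdctl member remove) may be worth investigating, but check the recovery documentation for your OKD version first.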
I am attempting to deploy CockroachDB v2.1.6 to a new AWS EKS cluster. Everything deploys successfully: the StatefulSet, Services, PVs, and PVCs are created. The AWS EBS volumes are created successfully too.
The issue is the pods never get to a READY state.
pod/cockroachdb-0 0/1 Running 0 14m
pod/cockroachdb-1 0/1 Running 0 14m
pod/cockroachdb-2 0/1 Running 0 14m
If I 'describe' the pods I get the following:
Normal Pulled 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Container image "cockroachdb/cockroach:v2.1.6" already present on machine
Normal Created 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Created container cockroachdb
Normal Started 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Started container cockroachdb
Warning Unhealthy 1s (x8 over 36s) kubelet, ip-10-5-109-70.eu-central-1.compute.internal Readiness probe failed: HTTP probe failed with statuscode: 503
If I examine the logs of a pod I see this:
I200409 11:45:18.073666 14 server/server.go:1403 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W200409 11:45:18.076826 87 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {cockroachdb-0.cockroachdb:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host". Reconnecting...
W200409 11:45:18.076942 21 gossip/client.go:123 [n?] failed to start gossip client to cockroachdb-0.cockroachdb:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host"
I came across this comment from the CockroachDB forum (https://forum.cockroachlabs.com/t/http-probe-failed-with-statuscode-503/2043/6)
Both the cockroach_out.log and cockroach_output1.log files you sent me (corresponding to mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2) print out no stores bootstrapped during startup and prefix all their log lines with n?, indicating that they haven’t been allocated a node ID. I’d say that they may have never been properly initialized as part of the cluster.
I have deleted everything, including the PVs, PVCs, and AWS EBS volumes, with kubectl delete and reapplied, but I hit the same issue.
Any thoughts would be very much appreciated. Thank you
I was not aware that you had to initialize the CockroachDB cluster after creating it. I did the following to resolve my issue:
kubectl exec -it cockroachdb-0 -n <namespace> -- /bin/sh
/cockroach/cockroach init
See here for more details - https://www.cockroachlabs.com/docs/v19.2/cockroach-init.html
After this the pods started running correctly.
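The two symptoms quoted above (the "awaiting init command" message and the [n?] log tag) can be checked mechanically. The following is only a minimal sketch that scans log lines for those two markers, which are taken verbatim from the logs in the question; the function names are hypothetical.

```python
def awaiting_init(log_lines):
    """Return True if any log line shows a node waiting for `cockroach init`."""
    return any("awaiting init command" in line for line in log_lines)


def has_node_id(log_line):
    """Return False when the line carries the [n?] tag (no node ID allocated yet)."""
    return "[n?]" not in log_line


if __name__ == "__main__":
    sample = (
        "I200409 11:45:18.073666 14 server/server.go:1403 [n?] no stores "
        "bootstrapped and --join flag specified, awaiting init command."
    )
    print(awaiting_init([sample]), has_node_id(sample))
```

If awaiting_init returns True for the output of kubectl logs, the cluster has probably not been initialized yet.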
I am trying to run Cassandra on Google Cloud using the external IP of the VM, but I am getting the error Failed to bind port 9042 on 34.89.109.98. As far as I can see, I have set up the firewall rules correctly, but I am still not able to resolve the issue. I have attached pictures of my configuration for your reference.
1) The firewall rule is
2) The list of all the rules is
3) The VM is
More Information
I followed the steps in https://linuxize.com/post/how-to-install-apache-cassandra-on-debian-9/ to install Cassandra. This automatically started cassandra. Then I killed cassandra, changed the ip address to external IP in cassandra.yaml file and started it again. It didn't work. Then I started working around with VPN settings.
Part of the output after I start Cassandra with /usr/sbin/cassandra -f:
INFO [main] 2019-12-18 16:09:40,755 StorageService.java:1521 - JOINING: Finish joining ring
INFO [main] 2019-12-18 16:09:40,826 StorageService.java:2442 - Node localhost/127.0.0.1 state jump to NORMAL
INFO [main] 2019-12-18 16:09:41,027 NativeTransportService.java:68 - Netty using native Epoll event loop
INFO [main] 2019-12-18 16:09:41,071 Server.java:158 - Using Netty Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a, netty-codec=netty-codec-4.0.44.Final.452812a, netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a, netty-codec-http=netty-codec-http-4.0.44.Final.452812a, netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a, netty-common=netty-common-4.0.44.Final.452812a, netty-handler=netty-handler-4.0.44.Final.452812a, netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb, netty-transport=netty-transport-4.0.44.Final.452812a, netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a, netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a, netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a, netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
INFO [main] 2019-12-18 16:09:41,071 Server.java:159 - Starting listening for CQL clients on /35.197.238.136:9042 (unencrypted)...
Exception (java.lang.IllegalStateException) encountered during startup: Failed to bind port 9042 on 35.197.238.136.
java.lang.IllegalStateException: Failed to bind port 9042 on 35.197.238.136.
at org.apache.cassandra.transport.Server.start(Server.java:163)
at java.util.Collections$SingletonSet.forEach(Collections.java:4769)
at org.apache.cassandra.service.NativeTransportService.start(NativeTransportService.java:124)
at org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:696)
at org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:546)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:635)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:742)
ERROR [main] 2019-12-18 16:09:41,100 CassandraDaemon.java:759 - Exception encountered during startup
java.lang.IllegalStateException: Failed to bind port 9042 on 35.197.238.136.
at org.apache.cassandra.transport.Server.start(Server.java:163) ~[apache-cassandra-3.11.5.jar:3.11.5]
at java.util.Collections$SingletonSet.forEach(Collections.java:4769) ~[na:1.8.0_232]
at org.apache.cassandra.service.NativeTransportService.start(NativeTransportService.java:124) ~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:696) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:546) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:635) [apache-cassandra-3.11.5.jar:3.11.5]
In cassandra.yaml you set the IP address that your Cassandra server binds to and listens on. The default is 127.0.0.1 (localhost), which is not suitable for external connections.
The addresses you can use are those assigned to the network interfaces of the Compute Engine instance. You can list them with:
ip addr
It is important to realize that a Compute Engine instance may appear to have a public IP address in the GCP Console, but that address is not assigned to any network interface on the instance itself. In the example in your original question, the instance's IP address would be 10.154.0.4. This is the address you want to set in your configuration file.
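To see why the external address fails, you can try to bind a socket to it yourself: the operating system only lets a process bind to an address that is assigned to one of its own interfaces. The following Python snippet is a minimal sketch of that check, assuming IPv4 and default kernel settings; the function name is hypothetical and the snippet is not part of Cassandra.

```python
import socket


def can_bind(address, port=0):
    """Return True if a TCP socket can be bound to address:port on this host.

    Port 0 asks the OS for any free port, so only the address is being tested.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((address, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # Replace these with the addresses you want to test on the VM.
    for addr in ("127.0.0.1", "10.154.0.4", "35.197.238.136"):
        print(addr, can_bind(addr))
```

On the VM, the internal address reported by ip addr should be bindable, while the external address shown in the GCP Console normally is not.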
See also this document which describes setting up Cassandra on GCP:
Spinning up a Cassandra Cluster on Google Cloud (for free) with just a browser
OpenShift AWS installer fails waiting for Kubernetes API to be available with fatal error "waiting for Kubernetes API: context deadline exceeded":
$ openshift-install create cluster --dir=$HOME/openshift --log-level debug
...
DEBUG Still waiting for the Kubernetes API: Get https://api.cluster-name.'IP_ADDRESS'.nip.io:6443/version?timeout=32s: dial tcp 'IP_ADDRESS':6443: i/o timeout
DEBUG Still waiting for the Kubernetes API: Get https://api.cluster-name.'IP_ADDRESS'.nip.io:6443/version?timeout=32s: dial tcp 'IP_ADDRESS':6443: i/o timeout
DEBUG Still waiting for the Kubernetes API: Get https://api.cluster-name.'IP_ADDRESS'.nip.io:6443/version?timeout=32s: dial tcp 'IP_ADDRESS':6443: i/o timeout
DEBUG Still waiting for the Kubernetes API: Get https://api.cluster-name.'IP_ADDRESS'.nip.io:6443/version?timeout=32s: dial tcp 'IP_ADDRESS':6443: i/o timeout
DEBUG Fetching "Install Config"...
DEBUG Loading "Install Config"...
DEBUG Loading "SSH Key"...
DEBUG Using "SSH Key" loaded from state file
DEBUG Loading "Base Domain"...
DEBUG Loading "Platform"...
DEBUG Using "Platform" loaded from state file
DEBUG Using "Base Domain" loaded from state file
DEBUG Loading "Cluster Name"...
DEBUG Loading "Base Domain"...
DEBUG Using "Cluster Name" loaded from state file
DEBUG Loading "Pull Secret"...
DEBUG Using "Pull Secret" loaded from state file
DEBUG Loading "Platform"...
DEBUG Using "Install Config" loaded from state file
DEBUG Reusing previously-fetched "Install Config"
INFO Use the following commands to gather logs from the cluster
...
FATAL waiting for Kubernetes API: context deadline exceeded
The problem is also described here
In my case the installer tried to connect to a Kubernetes API address that pointed to a non-existent endpoint. One indication of this is that the oc client hangs when you run a simple command such as oc whoami: it tries to connect to the same endpoint (provided that KUBECONFIG is set).
It turned out to be related to the Route 53 hosted zone, and in particular to the use of a subdomain. When OpenShift is installed against a subdomain (as in my case), a record set that delegates to the subdomain must be created in the parent domain. So, for openshift.example.com, do the following in the AWS console:
Go to Route 53 -> Hosted zones -> click openshift.example.com. (if it's not there - create a hosted zone) -> copy NS records, e.g.:
ns-711.awsdns-24.net.
ns-126.awsdns-15.com.
ns-1274.awsdns-31.org.
ns-1556.awsdns-02.co.uk.
Back to Hosted Zones -> example.com. -> Create Record Set:
create a record set for openshift.example.com, type: NS - Name server, Value: paste copied NS records.
After this change the installation went through successfully.
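The same delegation can also be scripted instead of clicked through. As a rough sketch, the Python function below builds the change batch for an NS record set in the parent zone. The function name and the TTL of 300 seconds are assumptions, and the name servers are the example values from above; use the ones from your own hosted zone.

```python
import json


def ns_delegation_change_batch(subdomain, name_servers, ttl=300):
    """Build a Route 53 change batch that delegates `subdomain` via an NS record."""
    return {
        "Comment": f"Delegate {subdomain} to its own hosted zone",
        "Changes": [
            {
                "Action": "CREATE",
                "ResourceRecordSet": {
                    "Name": subdomain,
                    "Type": "NS",
                    "TTL": ttl,  # assumed value, adjust as needed
                    "ResourceRecords": [{"Value": ns} for ns in name_servers],
                },
            }
        ],
    }


if __name__ == "__main__":
    batch = ns_delegation_change_batch(
        "openshift.example.com.",
        [
            "ns-711.awsdns-24.net.",
            "ns-126.awsdns-15.com.",
            "ns-1274.awsdns-31.org.",
            "ns-1556.awsdns-02.co.uk.",
        ],
    )
    print(json.dumps(batch, indent=2))
```

The printed JSON could be saved to a file and applied to the parent zone (example.com.) with aws route53 change-resource-record-sets, passing the parent zone's ID as --hosted-zone-id and the file as --change-batch.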
I've been having issues reaching containers from within CodeBuild. I have an exposed GraphQL service with a downstream auth service and a PostgreSQL database, all started through Docker Compose. Running and testing them works fine locally, but I cannot get the right combination of host names in CodeBuild.
It looks like my test is able to run if I hit the GraphQL endpoint at 0.0.0.0:8000, but once my GraphQL container attempts to reach the downstream service I get a connection refused error. I've tried reaching the auth service from inside the GraphQL service at auth:8001, at 0.0.0.0:8001, with port 8001 exposed, and by setting up a bridged network. I always get a connection refused error.
I've attached part of my codebuild logs.
Any ideas what I might be missing?
Container 2018/08/28 05:37:17 Running command docker ps
CONTAINER ID  IMAGE                   COMMAND                 CREATED        STATUS                 PORTS                   NAMES
6c4ab1fdc980  docker-compose_graphql  "app"                   1 second ago   Up Less than a second  0.0.0.0:8000->8000/tcp  docker-compose_graphql_1
5c665f5f812d  docker-compose_auth     "/bin/sh -c app"        2 seconds ago  Up Less than a second  0.0.0.0:8001->8001/tcp  docker-compose_auth_1
b28148784c04  postgres:10.4           "docker-entrypoint..."  2 seconds ago  Up 1 second            0.0.0.0:5432->5432/tcp  docker-compose_psql_1
Container 2018/08/28 05:37:17 Running command go test ; cd ../..
Register panic: [{"message":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 0.0.0.0:8001: connect: connection refused\"","path":
From the "host" machine my exposed GraphQL service could only be reached using the IP address 0.0.0.0. The internal networking was set up correctly and each service could be reached at <NAME>:<PORT> as expected, however, upon error the IP address would be shown (172.27.0.1) instead of the host name.
My problem was that the internal services were not yet ready to accept connections, which led to the "connection refused" error. Running sleep 5 after docker-compose up gave my services time to fully initialize before testing.
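A fixed sleep works but is either too long or occasionally too short. A possible alternative, sketched below in Python under the assumption that an open TCP port is a good enough readiness signal for your services, is to poll each port until it accepts a connection or a timeout expires. The function name and default values are hypothetical.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass.

    Returns True as soon as the port accepts a connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the service is not up yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Ports taken from the docker ps output in the question.
    for name, port in (("graphql", 8000), ("auth", 8001), ("psql", 5432)):
        print(name, wait_for_port("127.0.0.1", port, timeout=30.0))
```

Note that an open port does not guarantee the application behind it has finished initializing, so a service-specific health check may still be the more robust choice.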
I have started to receive a 402 error when accessing my CoreOS cluster. It had been working fine until a day ago. Does anybody have any idea why I'm receiving this error? I am using the stable channel on EC2.
$ fleetctl list-machines
E0929 09:43:14.823081 00979 fleetctl.go:151] error attempting to check latest fleet version in Registry: 402: Standby Internal Error () [0]
Error retrieving list of active machines: 402: Standby Internal Error () [0]
In this case etcd does not currently have quorum. The "Standby Internal Error" signifies that the node is attempting to act as a standby but is failing to redirect you to the active node. Repairing the etcd issue will fix the problem. Check the status of etcd by running journalctl -u etcd.service on each of the nodes; the output should give you the information you need to repair etcd.
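For reference, quorum in etcd means that a strict majority of the cluster's members is available. The arithmetic is sketched below in Python; the function names are hypothetical and the snippet only illustrates the majority rule, not how etcd detects it.

```python
def quorum_size(cluster_size):
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(healthy_members, cluster_size):
    """Return True if enough members are healthy to form a majority."""
    return healthy_members >= quorum_size(cluster_size)


if __name__ == "__main__":
    for size in (1, 3, 5):
        print(size, "members need", quorum_size(size), "healthy for quorum")
```

A three-member cluster therefore tolerates the loss of one member, and loses quorum once two are down.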