I created a new cluster on Amazon AWS with 3 instances, following the instructions on the DataStax installation page for Cassandra.
I also configured everything in the cassandra.yaml file on each node as described in the DataStax instructions for adding a new node, and the settings all match.
Now, when I run nodetool status, I get 3 nodes in my cluster, which is great! However, 2 of the nodes are down, and the ones reported down are always the two nodes other than the one I am running nodetool status from.
For example: if I run nodetool status from node1, it shows node0 and node2 as down. If I run nodetool status from node0, it shows node1 and node2 as down.
Does anyone know why this is happening? I set up the security groups as described in the documentation.
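For reference, these are the kinds of cassandra.yaml fields the DataStax guide has you align across nodes (the addresses below are placeholders, not my real values):

cluster_name: 'MyCluster'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      # placeholder private IPs of the seed nodes
      - seeds: "10.0.0.1,10.0.0.2"
# this node's private IP; each node uses its own
listen_address: 10.0.0.1
rpc_address: 10.0.0.1
# snitch recommended for EC2 deployments
endpoint_snitch: Ec2Snitch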
Thanks
Env info and setup:
AWS
EKS
Auto Scaling Group (ASG)
Action steps:
scale out the nodes from 3 to 4 (let's name the nodes node1-node4); multiple pods are running on node1-node3
wait until the new node (node4) reports Ready (checked with kubectl get node -n my-ns)
in the AWS console, enable scale-in protection on node2-node4
scale in the nodes from 4 to 3; node1 starts to terminate
pods running on node1 start to be evicted
pods evicted from node1 start to be re-created on node2-node4
Now the tricky part happens:
pods re-created on node2-node3 (the old nodes) become READY quickly; this is expected and normal
pods re-created on node4 (newly scaled out in step 1) keep restarting; eventually some became READY after 12 minutes and the last one after 57 minutes, which is quite abnormal
DaemonSet pods re-created on node4 become READY within 1 minute.
Investigation:
checking the logs for the pods that keep restarting, I found:
"Startup probe failed ..... http://x.x.x.x:8080/actuator/health connection refused"
once a pod becomes READY after multiple restarts, manually deleting it gets it READY again quickly without restarting
manually deleting a pod that keeps restarting does not help
Any hints on why this could happen?
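In case it helps, this is roughly how I am inspecting the failing pods (pod and node names are placeholders):

# events and probe failures for a failing pod
kubectl describe pod <pod-name> -n my-ns
# logs from the previous (crashed) container instance
kubectl logs <pod-name> -n my-ns --previous
# recent events in the namespace, oldest first
kubectl get events -n my-ns --sort-by=.metadata.creationTimestamp
# conditions and resource pressure on the new node
kubectl describe node <node4-name>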
I use Rancher to create an EC2 cluster on AWS, and I get stuck at "Waiting to register with Kubernetes" every time, as shown in the figure below.
You can see the error message "Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s)" on the Nodes page of the Rancher UI. Does anyone know how to solve it?
This is the screenshot with the error
Follow the installation guide carefully.
When you are ready to create the cluster, you have to add a node with the etcd role.
Each node role (i.e. etcd, Control Plane, and Worker) should be assigned to a distinct node pool. Although it is possible to assign multiple node roles to a node pool, this should not be done for production clusters.
The recommended setup is to have a node pool with the etcd node role and a count of three, a node pool with the Control Plane node role and a count of at least two, and a node pool with the Worker node role and a count of at least two.
Only after that does Rancher set up the cluster.
You can check the exact error (either DNS or certificate) by logging into the host nodes and looking at the logs of the container (docker logs).
Download the keys and try to SSH into the nodes to see more concrete error messages.
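For example, a rough sketch of what that looks like on one of the nodes (key path and container name may differ on your setup):

# SSH into the node with the key pair used for the EC2 instances (placeholder paths)
ssh -i my-key.pem ubuntu@<node-public-ip>
# list the containers Rancher/RKE started on the host
docker ps -a
# check the etcd container logs for DNS or certificate errors
docker logs etcd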
I'm working in an HDP-3.1.0.0 environment; the HDFS version I'm using is 3.1.1.3.1, and the cluster is composed of 2 NameNodes and 4 DataNodes.
After a restart of the HDP services (stop all and start all), the cluster seems to be working well, but I see the following alert:
How can I investigate this problem?
The services in my cluster don't have problems, except for the HBase Region Servers (0/4 live) and the Ambari Metrics Collector. I'm not using HBase, so I didn't pay attention to it; could it be the root cause? I have tried to start the Ambari Metrics Collector, but it always fails.
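If it helps to suggest where to look, these are the kinds of command-line checks I can run (the log path is an assumption based on a default install):

# overall HDFS health: live/dead DataNodes, capacity
hdfs dfsadmin -report
# HA state of each NameNode (service IDs are placeholders)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Ambari Metrics Collector log, assuming the default log location
less /var/log/ambari-metrics-collector/ambari-metrics-collector.log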
I am using an EMR 5.26 cluster in AWS, which supports having multiple master nodes (3 master nodes). This removes the single point of failure of the cluster: when a master node gets terminated, another node takes its place as master and keeps the EMR cluster and its steps running.
The issue here is that I am trying to track the exact time when a master node runs into a problem (termination), and also the time taken by another node to take its place and become the new master node.
I couldn't find any detailed documentation on tracking the failure of a master node in an AWS multi-master cluster, hence posting it here.
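To illustrate what I mean, I can poll the current master instances from the AWS CLI (the cluster ID below is a placeholder), but that only shows the current state, not the exact failure/failover timestamps:

# list the instances currently serving in the MASTER instance group
aws emr list-instances --cluster-id j-XXXXXXXXXXXX --instance-group-types MASTER
# cluster-level status and state-change reason
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX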
I have the Redisson cluster configuration below in a YAML file:
subscriptionConnectionMinimumIdleSize: 1
subscriptionConnectionPoolSize: 50
slaveConnectionMinimumIdleSize: 32
slaveConnectionPoolSize: 64
masterConnectionMinimumIdleSize: 32
masterConnectionPoolSize: 64
readMode: "SLAVE"
subscriptionMode: "SLAVE"
nodeAddresses:
- "redis://X.X.X.X:6379"
- "redis://Y.Y.Y.Y:6379"
- "redis://Z.Z.Z.Z:6379"
I understand it is enough to give the IP address of one master node in the configuration and Redisson automatically identifies all the nodes in the cluster, but my questions are below:
1. Are all nodes identified at application boot and used for future connections?
2. What if one of the master nodes goes down while the application is running? Will the requests to that particular master fail and the Redisson API automatically try contacting the other master nodes, or does it try to connect to the same master node repeatedly and fail?
3. Is it a best practice to give DNS names instead of server IPs?
Answering your questions:
1. That's correct, all nodes are identified during the boot process. If you use Config.readMode = MASTER_SLAVE or SLAVE (which is the default), then all nodes will be used. If you use Config.readMode = MASTER, then only the master nodes are used.
2. Redisson keeps trying to reach the master node until the Redis topology is updated; until that moment it doesn't have information about the newly elected master node.
3. Cloud services like AWS ElastiCache and Azure Cache provide a single hostname bound to multiple IP addresses.
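As a side note, a minimal Java sketch of loading such a YAML configuration with Redisson (the file name is a placeholder):

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

import java.io.File;

public class RedissonBootstrap {
    public static void main(String[] args) throws Exception {
        // load the cluster settings from the YAML file shown in the question
        Config config = Config.fromYAML(new File("redisson-cluster.yaml"));
        // on create(), Redisson connects to the listed addresses and discovers the rest of the cluster topology
        RedissonClient redisson = Redisson.create(config);
        // ... use the client ...
        redisson.shutdown();
    }
}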