How to disable node auto-repair - google-cloud-platform

How do I disable GKE cluster node maintenance auto-repair using Terraform? It seems I can't stop the nodes or change the settings of the GKE nodes from the GCP console, so I guess I'll have to try it using Terraform, even if it recreates the cluster.
How does the maintenance happen? I think it migrates all the pods to the secondary node and then restarts the first node, correct? But what if there aren't enough resources available on the secondary node to handle all the pods from the primary node? Will GCP create a new node? For example: the primary node has around 110 pods and the secondary node has 110 pods. How does the maintenance happen if the nodes need to be restarted?

You can disable node auto-repair by running the following command in the GCP shell:
gcloud container node-pools update <pool-name> --cluster <cluster-name> \
--zone <compute-zone> \
--no-enable-autorepair
The GKE documentation also explains how to do this from the GCP console.
If you want to disable node auto-repair using Terraform, you have to set it in the management argument of the node pool resource; you can find further details in Terraform's documentation. A minimal sketch is shown below.
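The resource names, location, node count and machine type in this sketch are placeholders; the management block is the relevant part:
# Minimal sketch; everything except the management block is a placeholder.
resource "google_container_node_pool" "primary_nodes" {
  name       = "my-node-pool"
  cluster    = google_container_cluster.primary.name
  location   = "us-central1-a"
  node_count = 2

  management {
    auto_repair  = false
    auto_upgrade = false
  }

  node_config {
    machine_type = "e2-small"
  }
}
As far as I know, changing only the management block of an existing google_container_node_pool is applied in place and should not force the cluster to be recreated, but check the plan output before applying.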
This is how the node repair process works:
If GKE detects that a node requires repair, the node is drained and re-created. GKE waits one hour for the drain to complete. If the drain doesn't complete, the node is shut down and a new node is created.
If multiple nodes require repair, GKE might repair nodes in parallel. GKE balances the number of repairs depending on the size of the cluster and the number of broken nodes. GKE will repair more nodes in parallel on a larger cluster, but fewer nodes as the number of unhealthy nodes grows.
If you disable node auto-repair at any time during the repair process, in-progress repairs are not cancelled and continue for any node currently under repair.
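To confirm the setting took effect, you can describe the node pool afterwards (same placeholders as in the command above):
# Shows the management block of the node pool, including autoRepair
gcloud container node-pools describe <pool-name> \
--cluster <cluster-name> \
--zone <compute-zone> \
--format="yaml(management)"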

Related

Nodegroups are recreated (as the replica count needs to match)

How do I safely delete a node from the cluster?
First, I drained and deleted the node.
However, after a few seconds Kubernetes created it again. I believe it's because of the cluster service, where the number of replicas is defined.
Should I update my cluster service and then delete the node?
Or is there any other way to safely delete it?
To delete a node and stop another one from being created automatically, follow the steps below:
First drain the node
kubectl drain <node-name>
Edit the instance group for the nodes and reduce its minSize/maxSize (using kops)
kops edit ig nodes
Finally delete the node
kubectl delete node <node-name>
Update the cluster (using kops)
kops update cluster --yes
Note: if you are using a pod autoscaler, disable it or edit the replica count before deleting the node.

Google cloud kubernetes switching off a node

I'm currently testing out Google Cloud for a home project. I only require the node to run during a certain time slot. When I switch the node off, it automatically switches itself on again. Not sure if I am missing something, as I did not enable autoscaling and it's also a General Purpose e2-small instance.
Kubernetes nodes are managed by a node pool, which you probably created during GKE cluster creation if that is what you are using.
The node pool manages the number of available nodes, so a new node may be created again or the existing node may be started back up.
If you are on GKE and want to scale down to zero, you can reduce the node count of the node pool from the GKE console.
Check your node pool : https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools#console_1
Resize your node pool from here : https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools#resizing_a_node_pool
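If you prefer the command line over the console, the same resize can be done with gcloud; the names below are placeholders:
# Scale the node pool down to zero nodes (resize it back up when you need it again)
gcloud container clusters resize <cluster-name> \
--node-pool <pool-name> \
--num-nodes 0 \
--zone <compute-zone>
Resizing to zero only removes the nodes; the node pool itself and the cluster control plane remain.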

rancher stuck Waiting to register with Kubernetes

I use Rancher to create an EC2 cluster on AWS, and I get stuck at "Waiting to register with Kubernetes" every time.
The Nodes page of the Rancher UI shows the error "Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s)". Does anyone know how to solve it?
Follow the installation guide carefully.
When you are ready to create the cluster, you have to add a node with the etcd role.
Each node role (i.e. etcd, Control Plane, and Worker) should be
assigned to a distinct node pool. Although it is possible to assign
multiple node roles to a node pool, this should not be done for
production clusters.
The recommended setup is to have a node pool with the etcd node role
and a count of three, a node pool with the Control Plane node role and
a count of at least two, and a node pool with the Worker node role and
a count of at least two.
Only after that does Rancher set up the cluster.
You can check the exact error (either DNS or certificate related) by logging into the host nodes and looking at the container logs (docker logs).
Download the keys and SSH to the nodes to see more concrete error messages.
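For example (the key file, SSH user and container name are placeholders and depend on how the nodes were provisioned):
# SSH to the node with the key pair used when it was created
ssh -i <node-key>.pem ubuntu@<node-public-ip>
# Find the etcd container and inspect its recent logs
docker ps -a | grep etcd
docker logs --tail 100 etcd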

Google Kubernetes is not auto scaling to 0

I am testing Google Kubernetes Engine autoscaling.
I have created a cluster with 1 master node.
Then I used
gcloud container node-pools create node-pool-test \
--machine-type g1-small --cluster test-master \
--num-nodes 1 --min-nodes 0 --max-nodes 3 \
--enable-autoscaling --zone us-central1-a
to create a node pool with autoscaling and a minimum node count of 0.
Now, the problem is that it's been 30 minutes since the node pool was created (and I haven't run any pods) but the node pool is not scaling down to 0. It was supposed to scale down in 10 minutes.
Some system pods are running on this node pool but the master node is also running them.
What am I missing?
Have a look at the documentation:
If you specify a minimum of zero nodes, an idle node pool can scale
down completely. However, at least one node must always be available
in the cluster to run system Pods.
and also check the limitations described in the cluster autoscaler documentation:
Occasionally, cluster autoscaler cannot scale down completely and an
extra node exists after scaling down. This can occur when required
system Pods are scheduled onto different nodes, because there is no
trigger for any of those Pods to be moved to a different node
along with a possible workaround.
You can find more information in the Cluster Autoscaler FAQ.
Also, as a solution, you could create one node pool with a small machine type for the system pods and an additional node pool with a bigger machine type for your workload. That way the second node pool can scale down to 0 while you still have room to run the system pods. An example of this setup is sketched below.
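The pool names, machine type and taint key/value below are placeholders; the cluster name and zone are taken from the question, and the taint keeps system pods off the workload pool so it can actually reach zero:
# Small pool that always keeps one node around for system pods
gcloud container node-pools create system-pool \
--cluster test-master --zone us-central1-a \
--machine-type g1-small --num-nodes 1

# Workload pool that can scale down to zero
gcloud container node-pools create workload-pool \
--cluster test-master --zone us-central1-a \
--machine-type n1-standard-2 --num-nodes 1 \
--enable-autoscaling --min-nodes 0 --max-nodes 3 \
--node-taints dedicated=workload:NoSchedule
Your workload pods then need a matching toleration (and, ideally, a node selector) so they are scheduled onto the workload pool only.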

Taint eks node-group

I have a cluster with 2 node groups: real-time and general. I would like only pods that tolerate the real-time taint to be able to run on nodes from the real-time group.
My approach was to taint the relevant nodes and add a toleration to the pods that I want to schedule onto those nodes. I hit a dead end when trying to taint the node group itself. In my case I have an EKS node group that is elastic, i.e. nodes are constantly being added and removed. How can I configure the group so that its nodes are tainted upon creation?
I assume you're creating your node group via CloudFormation?
If that is the case, you can add --kubelet-extra-args --register-with-taints={key}={value}:NoSchedule to your ${BootstrapArguments} for your launch configuration:
/etc/eks/bootstrap.sh ${clusterName} ${BootstrapArguments}
That way, whenever you scale up or down your cluster, a Node will be spawned with the appropriate taint.
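For example, the user data of the launch configuration could look roughly like this; the cluster name and the taint key/value are placeholders that have to match the toleration on your real-time pods:
#!/bin/bash
# Bootstrap the node into the cluster and register the kubelet with a taint
/etc/eks/bootstrap.sh my-eks-cluster \
--kubelet-extra-args '--register-with-taints=dedicated=realtime:NoSchedule'
If you are using EKS managed node groups rather than CloudFormation, they also let you specify taints directly in the node group configuration, which avoids editing the bootstrap arguments.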