Node groups are recreated (as the number of replicas needs to match) - amazon-web-services

How do I safely delete a node from the cluster?
First, I drained and deleted the node.
However, after a few seconds Kubernetes created it again. I believe this is because of the cluster service, where the number of replicas is defined.
Should I update my cluster service and then delete the node?
Or is there another way to delete it safely?

To delete a node and stop a replacement from being created automatically, follow these steps:
First, drain the node:
kubectl drain <node-name>
Edit the instance group for the nodes (using kops) and lower its node count (the minSize and maxSize fields), otherwise the group will bring the count back up:
kops edit ig nodes
Then delete the node:
kubectl delete node <node-name>
Finally, update the cluster (using kops):
kops update cluster --yes
Note: If you are using a pod autoscaler, disable it or edit the replica count before deleting the node.
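
The steps above can be sketched as a dry-run plan that only prints the commands for review before anything is run. This is a minimal sketch, not a definitive procedure: the function names and the example node name are hypothetical, and it assumes the kops cluster name and state store are already configured in the environment.

```python
import shlex


def node_removal_plan(node_name, instance_group="nodes"):
    """Return the commands from the steps above, in order, as argv lists."""
    return [
        ["kubectl", "drain", node_name],
        # Interactive step: this is where the instance group size is lowered.
        ["kops", "edit", "ig", instance_group],
        ["kubectl", "delete", "node", node_name],
        ["kops", "update", "cluster", "--yes"],
    ]


def render(plan):
    """Format each argv list as a shell-safe command line."""
    return [shlex.join(argv) for argv in plan]


if __name__ == "__main__":
    # Hypothetical node name, used only for illustration.
    for line in render(node_removal_plan("ip-10-0-1-23.ec2.internal")):
        print(line)
```

Printing the plan first makes it easy to check that the instance group edit happens before the cluster update, so the group does not restore the old node count.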

Related

How can I create a node in an existing EKS cluster, or how do I fix my error?

I'm facing this error in Kubernetes: 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate. My application server is down.
First, I added one file to a DaemonSet. Because of memory allocation (we have only one node), all pods failed to be scheduled and are stuck in the Pending state. If I delete all deployments and create a new one, it also stays Pending. Please help me sort out this issue. I also tried the taint commands, but they didn't work.
As far as I understand, can I add a node to the existing cluster, or should I revoke the instance? Thanks in advance.
You need to configure autoscaling for the cluster (it is not enabled by default):
https://docs.aws.amazon.com/eks/latest/userguide/create-managed-node-group.html
Or you can manually change the desired size of the node group.
Also, make sure that your deployment's resource requests fit what your nodes can actually provide.
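
To illustrate why pods stay Pending on a single node, here is a rough sketch that compares memory requests with one node's allocatable memory. It is a deliberate simplification and not the real scheduler: it only understands Mi and Gi suffixes, places pods first-come on one node, and all names and numbers are hypothetical.

```python
def parse_mebibytes(quantity):
    """Convert a memory quantity with a Mi or Gi suffix to MiB."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported quantity: {quantity!r}")


def unschedulable(requests, allocatable):
    """Return the pods that do not fit when placed in order on one node."""
    free = parse_mebibytes(allocatable)
    pending = []
    for name, request in requests.items():
        need = parse_mebibytes(request)
        if need <= free:
            free -= need          # pod fits, reserve its memory
        else:
            pending.append(name)  # pod would stay Pending
    return pending


if __name__ == "__main__":
    # Hypothetical workload: a DaemonSet pod plus two application pods.
    pods = {"daemon": "512Mi", "app": "1Gi", "worker": "1Gi"}
    print(unschedulable(pods, "2Gi"))
```

If the list is not empty, either the requests need to shrink or the node group needs more (or larger) nodes.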

How to disable node auto-repair

How do I disable GKE cluster node maintenance auto-repair using Terraform? It seems I can't stop the nodes or change the settings of the GKE nodes from the GCP console. So I guess I'll have to try it using Terraform, even if it recreates the cluster.
How does the maintenance happen? I think it migrates all the pods to the secondary node and then restarts the first node, correct? But what if the secondary node doesn't have enough resources to handle all the pods from the primary node? Will GCP create a new node? For example, the primary node has around 110 pods and the secondary node has 110 pods. How does the maintenance happen if the nodes need to be restarted?
You can disable node auto-repair by running the following command in the GCP shell:
gcloud container node-pools update <pool-name> --cluster <cluster-name> \
    --zone <compute-zone> \
    --no-enable-autorepair
This link also shows how to do it from the GCP console.
If you still face issues and want to disable node auto-repair using Terraform, set auto_repair = false in the management block of the node pool resource. You can find further details in Terraform's documentation.
Here you can also find how the node repair process works:
If GKE detects that a node requires repair, the node is drained and re-created. GKE waits one hour for the drain to complete. If the drain doesn't complete, the node is shut down and a new node is created.
If multiple nodes require repair, GKE might repair nodes in parallel. GKE balances the number of repairs depending on the size of the cluster and the number of broken nodes. GKE will repair more nodes in parallel on a larger cluster, but fewer nodes as the number of unhealthy nodes grows.
If you disable node auto-repair at any time during the repair process, in-progress repairs are not cancelled and continue for any node currently under repair.

Taint eks node-group

I have a cluster with 2 node groups: real time and general. I would like only pods that tolerate the real time taint to be able to run on nodes from the real time node group.
My approach was to taint the relevant nodes and add a toleration to the pods that I want to schedule onto those nodes. I hit a dead end when I tried to taint a whole node group. In my case I have an EKS node group that is elastic, i.e. the number of nodes constantly increases and decreases. How can I configure the group so that its nodes are tainted upon creation?
I assume you're creating your node group via CloudFormation?
If that is the case, you can pass --kubelet-extra-args --register-with-taints={key}={value}:NoSchedule as the ${BootstrapArguments} of your launch configuration:
/etc/eks/bootstrap.sh ${clusterName} ${BootstrapArguments}
That way, whenever the group scales up, each new node registers itself with the appropriate taint.
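
As a small illustration of how the two halves fit together, the sketch below builds the bootstrap argument string from the answer and the toleration a pod would need to match it. The helper names and the example key and value ("workload", "realtime") are hypothetical; the toleration fields are the standard key, operator, value and effect.

```python
def bootstrap_arguments(key, value, effect="NoSchedule"):
    """Build the BootstrapArguments string that registers a node with a taint."""
    return f"--kubelet-extra-args --register-with-taints={key}={value}:{effect}"


def matching_toleration(key, value, effect="NoSchedule"):
    """Build the pod-spec toleration that matches the taint above."""
    return {"key": key, "operator": "Equal", "value": value, "effect": effect}


if __name__ == "__main__":
    # Hypothetical taint for the real time node group.
    print(bootstrap_arguments("workload", "realtime"))
    print(matching_toleration("workload", "realtime"))
```

Generating both from the same key and value avoids the common mistake of a toleration that differs slightly from the taint and therefore never matches.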

AWS Elasticsearch created 2 instances but says it has 3 nodes

At first I had a cluster with one node. I added one instance (node), so it should now show 2 nodes, but instead it says I have 3. Why is this?
That is temporary, since AWS follows a blue/green deployment model. Please see this link.
When you have a cluster with 1 node and add 1 more node, AWS ES creates a new cluster with 2 nodes and then copies the entire data set from the old cluster to the new one. While the copy/migration is in progress, you'll see 3 nodes: 2 from the new cluster and 1 from the old cluster. Once the migration is complete, the node belonging to the old cluster is deleted.
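
The node count described above is simple arithmetic. A tiny sketch, with a hypothetical function name, of what the domain reports during and after the blue/green migration:

```python
def visible_nodes(old_count, new_count, migrating):
    """Nodes reported: both fleets while data is copied, only the new one after."""
    return old_count + new_count if migrating else new_count


if __name__ == "__main__":
    print(visible_nodes(1, 2, migrating=True))   # during the migration
    print(visible_nodes(1, 2, migrating=False))  # after the old node is deleted
```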

Disaster Recovery Kops Kubernetes Master Node on AWS

I currently have an HA cluster (with three masters, one per AZ) deployed on AWS through kops. Kops deploys a K8S cluster with one pod for etcd-events and one pod for etcd-server on every master node. Each of these pods uses a mounted volume.
Everything works well: for example, when a master dies, the autoscaling group creates another master node in the same AZ, which recovers its volume and rejoins the cluster. The problem I have concerns a disaster, i.e. the failure of an AZ.
What happens if an AZ has problems? I periodically take EBS volume snapshots, but if I create a new volume from a snapshot (with the right tags to be discovered and attached to the new instance), the new instance mounts the new volumes but is then unable to join the old cluster. My plan was to create a Lambda function, triggered by a CloudWatch event, that creates a new master instance in one of the two healthy AZs with a volume restored from a snapshot of the old EBS volume. But this plan fails, apparently because I am overlooking something about Raft, etcd, and their behavior. (I say that because I get errors from the other master nodes, and the new node isn't able to join the cluster.)
Suggestions?
In theory, how do you recover from a single-AZ disaster, and from a situation where all the masters have died? I have the EBS snapshots. Are they sufficient?
I'm not sure exactly how you are restoring the failed node, but technically the first thing you want to recover is your etcd node, because that's where all the Kubernetes state is stored.
Since your cluster is up and running, you don't need to restore from scratch; you just need to remove the old member from etcd and add the new node. You can find out more on how to do it here. You don't really need to restore any old volume to this node, since it will sync up with the other existing nodes.
After this, you can start the other services, such as kube-apiserver, kube-controller-manager, etc.
Having said that, if you keep the same IP address and the exact same physical configs, you should be able to recover without removing the etcd node and adding a new one.
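
The two scenarios in the question differ in whether etcd still has a Raft majority. A minimal sketch of that arithmetic, assuming one etcd member per master as described in the question (the function names are hypothetical):

```python
def quorum(members):
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def has_quorum(members, failed):
    """True if the surviving members still form a majority."""
    return members - failed >= quorum(members)


if __name__ == "__main__":
    print(has_quorum(3, 1))  # one AZ lost
    print(has_quorum(3, 3))  # all masters lost
```

With three masters, losing one AZ leaves two members, which is still a majority, so the member can be replaced while the cluster keeps running as the answer describes. If all masters are lost there is no majority left to accept a membership change, and the state has to come from a backup instead.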