Google Kubernetes is not auto scaling to 0 - google-cloud-platform

I am testing Google Kubernetes Engine autoscaling.
I have created a cluster with 1 master node.
Then I used
gcloud container node-pools create node-pool-test \
--machine-type g1-small --cluster test-master \
--num-nodes 1 --min-nodes 0 --max-nodes 3 \
--enable-autoscaling --zone us-central1-a
to create a node pool with autoscaling enabled and a minimum node count of 0.
Now, the problem is that it has been 30 minutes since the node pool was created (and I haven't run any pods), but the node pool has not scaled down to 0. I expected it to scale down after about 10 minutes of being idle.
Some system pods are running on this node pool, but the master node is running them as well.
What am I missing?

Have a look at the documentation:
If you specify a minimum of zero nodes, an idle node pool can scale
down completely. However, at least one node must always be available
in the cluster to run system Pods.
and also check the limitations here and here:
Occasionally, cluster autoscaler cannot scale down completely and an
extra node exists after scaling down. This can occur when required
system Pods are scheduled onto different nodes, because there is no
trigger for any of those Pods to be moved to a different node.
and a possible workaround.
You can find more information in the Autoscaler FAQ.
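To see which pods are actually keeping the node busy, you can list everything scheduled on it (a quick check; the node name is a placeholder):
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>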
Also, as a solution, you could create one node pool with a small machine for the system pods, and an additional node pool with a big machine where you would run your workload. This way the second node pool can scale down to 0 while you still have room to run the system pods. Here you can find an example.
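A rough sketch of that two-pool setup with gcloud, reusing the cluster and zone from the question (the pool names and the larger machine type are just examples):
# Small pool that stays at one node and hosts the system pods.
gcloud container node-pools create system-pool \
  --machine-type g1-small --cluster test-master \
  --num-nodes 1 --zone us-central1-a
# Larger pool for your workload; it can scale all the way down to 0 when idle.
gcloud container node-pools create workload-pool \
  --machine-type n1-standard-4 --cluster test-master \
  --num-nodes 1 --min-nodes 0 --max-nodes 3 \
  --enable-autoscaling --zone us-central1-a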

Related

How to disable node auto-repair

How do I disable GKE cluster node maintenance auto-repair using Terraform? It seems I can't stop the nodes or change the settings of the GKE nodes from the GCP console, so I guess I'll have to try it using Terraform even if it recreates the cluster.
How does the maintenance happen? I think it migrates all the pods to the secondary node and then restarts the first node, correct? But what if there aren't enough resources available on the secondary node to handle all the pods from the primary node? Will GCP create a new node? For example: the primary node has around 110 pods and the secondary node has 110 pods. How does the maintenance happen if the nodes need to be restarted?
You can disable node auto-repair by running the following command in the GCP shell:
gcloud container node-pools update <pool-name> --cluster <cluster-name> \
--zone <compute-zone> \
--no-enable-autorepair
You will find how to do it using the GCP console in this link as well.
If you are still facing issues and want to disable node auto-repair using Terraform, you have to specify whether auto-repair is enabled in the management argument. You can find further details in Terraform's documentation.
Here you can also find how the node repair process works:
If GKE detects that a node requires repair, the node is drained and re-created. GKE waits one hour for the drain to complete. If the drain doesn't complete, the node is shut down and a new node is created.
If multiple nodes require repair, GKE might repair nodes in parallel. GKE balances the number of repairs depending on the size of the cluster and the number of broken nodes. GKE will repair more nodes in parallel on a larger cluster, but fewer nodes as the number of unhealthy nodes grows.
If you disable node auto-repair at any time during the repair process, in-progress repairs are not cancelled and continue for any node currently under repair.
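Once auto-repair is disabled, you can confirm the node pool's setting with a describe call (a sketch; the --format path is my assumption of the API field name):
gcloud container node-pools describe <pool-name> --cluster <cluster-name> \
  --zone <compute-zone> \
  --format="value(management.autoRepair)"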

Queries on GKE Autoscaling

I have both Node Auto Provisioning and Autoscaling enabled on a GKE cluster. A few queries on autoscaling.
For autoscaling, the minimum number of nodes is 1 and the maximum number of nodes is 2. A few queries based on this setup.
I set the number of nodes to 0 using the gcloud command:
gcloud container clusters resize cluster-gke-test --node-pool pool-1 --num-nodes 0 --zone us-central1-c
Now I can see the message that Pods are unschedulable:
Your cluster has one or more unschedulable pods.
Following are my queries:
Since autoscaling is enabled, the nodes should have been automatically spawned in order to run these Pods, but I don't see this happening. Is this not the expected behavior?
Does autoscaling not work when we reduce the number of nodes manually?
Does autoscaling work based on load only, i.e., does it launch new nodes only when there are requests that the existing nodes cannot handle? Should the minimum number of nodes for node autoscaling to work always be greater than zero?
It's a documented limitation. If your node pool is manually set to 0, there is no autoscaling from 0.
Yes, it works as long as you don't manually scale to 0.
It's also documented. The node pool scales according to the requests: if a Pod is unschedulable because of a lack of resources, and the max-nodes limit isn't reached, a new node is provisioned and the Pod is deployed.
You can set min-nodes to 0, but you must have at least 1 node active in the cluster, in another node pool:
If you specify a minimum of zero nodes, an idle node pool can scale down completely. However, at least one node must always be available in the cluster to run system Pods.
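If you have already resized the pool to 0 by hand, one way to hand control back to the autoscaler is simply to resize it to a non-zero count again (same command as in the question, just with --num-nodes 1):
gcloud container clusters resize cluster-gke-test --node-pool pool-1 --num-nodes 1 --zone us-central1-c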

Taint eks node-group

I have a cluster with 2 node groups: real time and general. I would like only pods that tolerate the real time taint to be able to run on nodes from the real time node group.
My approach was to taint the relevant nodes and add a toleration to the pods that I want to schedule on them. I hit a dead end when trying to taint a node group. In my case I have an EKS node group that is elastic, i.e. the number of nodes increases and decreases constantly. How can I configure the group so that its nodes are tainted upon creation?
I assume you're creating your nodeGroup via CloudFormation?
If that is the case, you can add --kubelet-extra-args --register-with-taints={key}={value}:NoSchedule as your ${BootstrapArguments} for your LaunchConfig:
/etc/eks/bootstrap.sh ${clusterName} ${BootstrapArguments}
That way, whenever you scale up or down your cluster, a Node will be spawned with the appropriate taint.
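On the workload side, the pods then need a matching toleration, and usually a node selector so they land only on that group; a minimal sketch, where the dedicated=realtime taint key/value and the nodegroup: realtime label are placeholder names:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: realtime-worker
spec:
  # Allows scheduling onto nodes tainted with dedicated=realtime:NoSchedule
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "realtime"
    effect: "NoSchedule"
  # Additionally pins the pod to the real time node group (example label)
  nodeSelector:
    nodegroup: realtime
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
EOF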

Pod limit on Node - AWS EKS

On AWS EKS
I'm adding a deployment with 17 replicas (each requesting, and limited to, 64Mi of memory) to a small cluster with 2 nodes of type t3.small.
Counting the kube-system pods, the total number of running pods per node is 11, and 1 is left pending, i.e.:
Node #1:
aws-node-1
coredns-5-1as3
coredns-5-2das
kube-proxy-1
+7 app pod replicas
Node #2:
aws-node-1
kube-proxy-1
+9 app pod replicas
I understand that t3.small is a very small instance. I'm only trying to understand what is limiting me here. The memory request is not it; I'm way below the available resources.
I found that there is an IP address limit per node depending on the instance type.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html?shortFooter=true#AvailableIpPerENI .
I didn't find any other documentation saying explicitly that this limits pod creation, but I'm assuming it does.
Based on the table, t3.small can have 12 IPv4 addresses. If this is the case and this is the limiting factor, since I have 11 pods, where did the 1 missing IPv4 address go?
The real maximum number of pods per EKS instance is actually listed in this document.
For t3.small instances, it is 11 pods per instance. That is, you can have a maximum of 22 pods in your cluster. 6 of these pods are system pods, so there remains a maximum of 16 workload pods.
You're trying to run 17 workload pods, so it's one too many. I guess 16 of these pods have been scheduled and 1 is left pending.
The formula for defining the maximum number of pods per instance is as follows:
N * (M-1) + 2
Where:
N is the number of Elastic Network Interfaces (ENI) of the instance type
M is the number of IP addresses of a single ENI
So, for t3.small, this calculation is 3 * (4-1) + 2 = 11.
Values for N and M for each instance type are listed in this document.
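As a quick sanity check, here is the same formula applied to a couple of instance types (the N and M values are taken from the ENI table linked above):
# max_pods = N * (M - 1) + 2
max_pods() { echo $(( $1 * ($2 - 1) + 2 )); }
max_pods 3 4   # t3.small:  3 ENIs x 4 IPs each -> 11
max_pods 3 6   # t3.medium: 3 ENIs x 6 IPs each -> 17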
For anyone who runs across this when searching Google: be advised that as of August 2021 it's now possible to increase the max pods on a node using the latest AWS CNI plugin, as described here.
Using the basic configuration explained there, a t3.medium node went from a max of 17 pods to a max of 110, which is more than adequate for what I was trying to do.
This is why we stopped using EKS in favor of a KOPS deployed self-managed cluster.
IMO EKS, which employs the aws-cni, causes too many constraints; it actually goes against one of the major benefits of using Kubernetes, namely efficient use of available resources.
EKS moves the system constraint away from CPU/memory usage into the realm of network IP limitations.
Kubernetes was designed to provide high density and manage resources efficiently. Not quite so with EKS's version: a node could be idle with almost its entire memory available, and yet the cluster will be unable to schedule pods on that otherwise lightly utilized node if pods > N * (M-1) + 2.
One could be tempted to employ another CNI such as Calico; however, you would be limited to worker nodes, since access to the master nodes is forbidden.
This causes the cluster to have two networks, and problems will arise when trying to access the K8s API or when working with Admission Controllers.
It really does depend on workflow requirements; for us, high pod density, efficient use of resources, and having complete control of the cluster are paramount.
Connect to your EKS node and run this:
/etc/eks/bootstrap.sh clusterName --use-max-pods false --kubelet-extra-args '--max-pods=50'
Ignore the "nvidia-smi not found" output.
The whole script is located at https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh
EKS allows you to increase the max number of pods per node, but this can be done only with Nitro instances. Check the list here.
Make sure you have VPC CNI 1.9+
Enable prefix delegation for the VPC CNI plugin:
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
If you are using a self-managed node group, make sure to pass the following in BootstrapArguments:
--use-max-pods false --kubelet-extra-args '--max-pods=110'
Or you could create the node group using eksctl:
eksctl create nodegroup --cluster my-cluster --managed=false --max-pods-per-node 110
If you are using a managed node group with a specified AMI, it has bootstrap.sh, so you could modify user_data to do something like this:
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'
Or simply use eksctl by running:
eksctl create nodegroup --cluster my-cluster --max-pods-per-node 110
For more details, check AWS documentation https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
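Whichever route you take, you can verify the per-node pod capacity the kubelet actually registered with something like:
kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods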

Deploy to a node-pool with a minimum node size of 0?

I have a node pool with a minimum pool size of 0 and a max pool size of 3. Normally, nothing is happening on this node pool, so GKE correctly scales it down to zero. However, if I try to submit a job to this pool via kubectl, the pod fails as Unschedulable.
I can update the pool with --enable-autoscaling --min-nodes 1 --max-nodes 3, wait 10 seconds, then deploy, then wait for completion before changing min-nodes back to 0, but this doesn't seem ideal.
Is there a better way to get the pool to start a node when a pod is pending?
Even with something like taints or nodeAffinity, I don't think you can tell Kubernetes to spin up nodes in order to schedule workloads. The scheduler requires a node to be available already.
(Out of curiosity, how were you scheduling jobs to a specific nodepool via kubectl?)
As the autoscaler scales up and down based on Pod resource requests, you need at least one request value that it can use as the basis of its calculation to decide whether the pool needs an additional node.
Here is more information about how the cluster autoscaler works [1].
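To make that concrete, here is a minimal sketch of a Job that carries explicit resource requests and targets the scale-to-zero pool via the standard GKE node-pool label (the pool name, image, and request sizes are placeholders):
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: burst-job
spec:
  template:
    spec:
      restartPolicy: Never
      # GKE labels every node with the name of its node pool.
      nodeSelector:
        cloud.google.com/gke-nodepool: my-burst-pool
      containers:
      - name: worker
        image: busybox
        command: ["sleep", "60"]
        # The requests give the autoscaler the numbers it needs to decide a new node is required.
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
EOF
The autoscaler sees the pending Pod and, provided the pool still has autoscaling enabled with min-nodes 0 (rather than having been manually pinned to 0), provisions a node for it.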