I'm trying to start a new Kubernetes cluster on AWS with the following settings:
export KUBERNETES_PROVIDER=aws
export KUBE_AWS_INSTANCE_PREFIX="k8-update-test"
export KUBE_AWS_ZONE="eu-west-1a"
export AWS_S3_REGION="eu-west-1"
export ENABLE_NODE_AUTOSCALER=true
export NON_MASQUERADE_CIDR="10.140.0.0/20"
export SERVICE_CLUSTER_IP_RANGE="10.140.1.0/24"
export DNS_SERVER_IP="10.140.1.10"
export MASTER_IP_RANGE="10.140.2.0/24"
export CLUSTER_IP_RANGE="10.140.3.0/24"
After running $KUBE_ROOT/cluster/kube-up.sh the master appears and 4 (default) minions are started. Unfortunately only one minion gets ready. The result of kubectl get nodes is:
NAME STATUS AGE
ip-172-20-0-105.eu-west-1.compute.internal NotReady 19h
ip-172-20-0-106.eu-west-1.compute.internal NotReady 19h
ip-172-20-0-107.eu-west-1.compute.internal Ready 19h
ip-172-20-0-108.eu-west-1.compute.internal NotReady 19h
Please note that one node is Ready while three are not. If I look at the details of a NotReady node I get the following error:
ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0
right now.
If I try to start the cluster without the settings NON_MASQUERADE_CIDR, SERVICE_CLUSTER_IP_RANGE, DNS_SERVER_IP, MASTER_IP_RANGE and CLUSTER_IP_RANGE, everything works fine. All minions get ready as soon as they are started.
Does anyone have an idea why the PodCIDR was set on only one node but not on the others?
One more thing: The same settings worked fine on kubernetes 1.2.4.
Your cluster IP range is too small. You've allocated a /24 for your entire cluster (256 addresses), and Kubernetes by default will give a /24 to each node. This means that the first node is allocated 10.140.3.0/24, and then you don't have any further /24 ranges to allocate to the other nodes in your cluster.
The fact that this worked in 1.2.4 was a bug, because the CIDR allocator wasn't checking that it didn't allocate ranges beyond the cluster IP range (which it now does). Try using a larger range for your cluster (GCE uses a /14 by default, which allows you to scale to 1000 nodes, but you should be fine with a /20 for a small cluster).
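For example, a minimal sketch of a roomier layout (the values are only an illustration and assume nothing else lives in 10.140.0.0/16; adapt them to your own network plan):
export NON_MASQUERADE_CIDR="10.140.0.0/16"      # must contain all of the ranges below
export SERVICE_CLUSTER_IP_RANGE="10.140.1.0/24"
export DNS_SERVER_IP="10.140.1.10"
export MASTER_IP_RANGE="10.140.2.0/24"
export CLUSTER_IP_RANGE="10.140.16.0/20"        # room for 16 node /24s instead of 1
With a /20 pod range the CIDR allocator can hand a /24 to each of the 4 minions and still leave headroom for growth.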
Related
I have a cluster that loses kube worker nodes every so often (I'm moving away from this service provider for this reason), but I'd still like to harden Istio so it doesn't go down when we lose a kube node. The problem seems to be that if the node that Istio has created the ingress gateway pod on dies, the services go down until that node comes back up. Is there a way to scale the ingress gateway to multiple pods and give it an affinity so the replicas don't get scheduled on the same node? That way, if we lose a kube worker node, we don't lose all our services on that gateway.
I've also thought about adding two gateways, but then they'd have different IPs and I'd have to deal with that upstream (not the end of the world I guess), but was hoping Istio had a solution to this.
Version
$ istio-1.13.1/bin/istioctl version
client version: 1.13.1
control plane version: 1.13.1
data plane version: 1.13.1 (24 proxies)
$ kubectl version --short
Client Version: v1.22.3
Server Version: v1.23.8
Possible Solution
OK, I finally came across this; now I'm looking for confirmation that it is the right thing to do.
Adding the following under spec.components.ingressGateways[0].k8s in the operator seems to scale the gateway. And when I delete the original pod, I don't lose a single packet.
hpaSpec:
  minReplicas: 2
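For the scheduling half of the question, a sketch of how the same overlay could also carry a pod anti-affinity rule (untested against this particular setup; the app: istio-ingressgateway label assumes the default gateway labels):
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        hpaSpec:
          minReplicas: 2           # keep at least two gateway pods
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: istio-ingressgateway
              topologyKey: kubernetes.io/hostname   # spread replicas across nodes
With requiredDuringScheduling the second replica will stay Pending if only one node is available, so the preferred (soft) variant of pod anti-affinity is the gentler alternative.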
I am trying to follow this tutorial:
https://www.digitalocean.com/community/tutorials/how-to-create-a-kubernetes-cluster-using-kubeadm-on-ubuntu-18-04
Important difference:
I need to run the master on a specific node, and the worker nodes are from different regions on AWS.
So it all went well until I wanted to join the nodes (step 5). The command succeeded but kubectl get nodes still only showed the master node.
I looked at the join command and it contained the master's private IP address:
join 10.1.1.40. I guess that cannot work if the workers are in a different region (note: later we will probably need to add nodes from different providers as well, so as long as there is no serious security concern, it should work via public IPs).
So while kubeadm init --pod-network-cidr=10.244.0.0/16 initialized the cluster, it did so with this internal IP, so I then tried
kubeadm init --apiserver-advertise-address <Public-IP-Addr> --apiserver-bind-port 16443 --pod-network-cidr=10.244.0.0/16
But then it always hangs, and init does not complete. The kubelet log prints lots of
E0610 19:24:24.188347 1051920 kubelet.go:2267] node "ip-x-x-x-x" not found
where "ip-x-x-x-x" seems to be the master's node hostname on AWS.
I think what made it work is that I set the master's hostname to its public DNS name, and then used that as the --control-plane-endpoint argument, without --apiserver-advertise-address (but with --apiserver-bind-port, as I need to run the API server on another port).
Need to have it run longer to confirm but so far looks good.
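For reference, the init invocation that seems to work looks roughly like this (a sketch; the DNS name is a placeholder for the master's public hostname, and --apiserver-advertise-address is deliberately omitted):
sudo kubeadm init \
  --control-plane-endpoint "<master-public-dns>:16443" \
  --apiserver-bind-port 16443 \
  --pod-network-cidr=10.244.0.0/16
The join command printed at the end should then reference the public DNS name instead of the private 10.1.1.40 address.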
On AWS EKS
I'm adding a deployment with 17 replicas (requesting and limiting 64Mi of memory) to a small cluster with 2 nodes of type t3.small.
Counting kube-system pods, the total number of running pods per node is 11, and 1 pod is left pending, i.e.:
Node #1:
aws-node-1
coredns-5-1as3
coredns-5-2das
kube-proxy-1
+7 app pod replicas
Node #2:
aws-node-1
kube-proxy-1
+9 app pod replicas
I understand that t3.small is a very small instance. I'm only trying to understand what is limiting me here. Memory request is not it, I'm way below the available resources.
I found that there is an IP address limit per node depending on the instance type.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html?shortFooter=true#AvailableIpPerENI .
I didn't find any other documentation saying explicitly that this is limiting pod creation, but I'm assuming it does.
Based on the table, t3.small can have 12 IPv4 addresses. If this is the case and this is limiting factor, since I have 11 pods, where did 1 missing IPv4 address go?
The real maximum number of pods per EKS instance is actually listed in this document.
For t3.small instances, it is 11 pods per instance. That is, you can have a maximum number of 22 pods in your cluster. 6 of these pods are system pods, so there remains a maximum of 16 workload pods.
You're trying to run 17 workload pods, so that's one too many. I guess 16 of these pods have been scheduled and 1 is left pending.
The formula for defining the maximum number of pods per instance is as follows:
N * (M-1) + 2
Where:
N is the number of Elastic Network Interfaces (ENI) of the instance type
M is the number of IP addresses of a single ENI
So, for t3.small, this calculation is 3 * (4-1) + 2 = 11.
Values for N and M for each instance type are listed in this document.
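If you want to compute this yourself, a quick sketch using the AWS CLI (assumes the CLI is configured; the instance type is just an example):
# Look up N (ENIs) and M (IPv4 addresses per ENI) for an instance type,
# then apply the max-pods formula N * (M - 1) + 2.
read N M < <(aws ec2 describe-instance-types \
  --instance-types t3.small \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]' \
  --output text)
echo "max pods: $(( N * (M - 1) + 2 ))"   # prints 11 for t3.small (3 * (4 - 1) + 2)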
For anyone who runs across this when searching Google: be advised that as of August 2021 it's now possible to increase the max pods on a node using the latest AWS CNI plugin, as described here.
Using the basic configuration explained there, a t3.medium node went from a max of 17 pods to a max of 110, which is more than adequate for what I was trying to do.
This is why we stopped using EKS in favor of a KOPS deployed self-managed cluster.
IMO EKS which employs the aws-cni causes too many constraints, it actually goes against one of the major benefits of using Kubernetes, efficient use of available resources.
EKS moves the system constraint away from CPU / memory usage into the realm of network IP limitations.
Kubernetes was designed to provide high density and to manage resources efficiently. Not quite so with EKS’s version: a node can be sitting idle, with almost its entire memory available, and yet the cluster will be unable to schedule pods on that otherwise under-utilized node if pods > N * (M-1) + 2.
One could be tempted to employ another CNI such as Calico; however, you would be limited to the worker nodes, since access to the master nodes is forbidden.
This causes the cluster to have two networks and problems will arise when trying to access K8s API, or working with Admissions Controllers.
It really does depend on workflow requirements, for us, high pod density, efficient use of resources, and having complete control of the cluster is paramount.
Connect to your EKS node
and run this:
/etc/eks/bootstrap.sh clusterName --use-max-pods false --kubelet-extra-args '--max-pods=50'
Ignore the "nvidia-smi not found" output.
The whole script is located at https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh
EKS allows you to increase the maximum number of pods per node, but this can be done only with Nitro instances; check the list here.
Make sure you have VPC CNI 1.9+
Enable prefix delegation for the VPC CNI plugin:
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
If you are using a self-managed node group, make sure to pass the following in BootstrapArguments:
--use-max-pods false --kubelet-extra-args '--max-pods=110'
or you could create the node group with eksctl:
eksctl create nodegroup --cluster my-cluster --managed=false --max-pods-per-node 110
If you are using a managed node group with a specified AMI, it has bootstrap.sh, so you can modify user_data to do something like this:
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'
Or simply create it with eksctl by running:
eksctl create nodegroup --cluster my-cluster --max-pods-per-node 110
For more details, check the AWS documentation: https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
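As a quick sanity check afterwards (an informal check, not an official procedure), you can confirm prefix delegation is on and see what pod capacity each node now reports:
# Verify the setting on the aws-node daemonset
kubectl describe daemonset aws-node -n kube-system | grep ENABLE_PREFIX_DELEGATION
# Show the max-pods value each node advertises to the scheduler
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.pods}{"\n"}{end}'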
I am new to Kubernetes and I am facing a problem that I do not understand. I created a 4-node cluster in AWS, 1 master node (t2.medium) and 3 worker nodes (c4.xlarge), and they were successfully joined together using kubeadm.
Then I tried to deploy three Cassandra replicas using this yaml, but the pods never leave the Pending state; when I do:
kubectl describe pods cassandra-0
I get the message
0/4 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 3 Insufficient memory.
And I do not understand why, as the machines should be powerful enough to cope with these pods, and I haven't deployed any other pods. I am not sure if this means anything, but when I execute:
kubectl describe nodes
I see this message:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Therefore my question is why this is happening and how I can fix it.
Thank you for your attention
Each node tracks the total amount of requested RAM (resources.requests.memory) for all pods assigned to it. That cannot exceed the total capacity of the machine. I would triple-check that you have no other pods; you should see them in the output of kubectl describe node.
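A quick way to see what the scheduler thinks is already promised on each node (the node name is a placeholder):
# How much memory/CPU has already been requested on the node
kubectl describe node <node-name> | grep -A 8 'Allocated resources'
# Which pods (including kube-system) are sitting on that node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
If the Cassandra StatefulSet's resources.requests.memory is larger than what a c4.xlarge has left after the system pods, the pods will stay Pending even though actual usage is low; lowering the request in the yaml (or using bigger nodes) is the usual fix.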
Consider a StatefulSet (Cassandra, using the official K8S example) across 3 Availability Zones:
cassandra-0 -> zone a
cassandra-1 -> zone b
cassandra-2 -> zone c
Each Cassandra pod uses an EBS volume. So there is automatically an affinity. For instance, cassandra-0 cannot move to "zone-b" because its volume is in "zone-a". All good.
If some Kubernetes nodes/workers fail, they will be replaced. The pods will start again on the new nodes and be re-attached to their EBS volumes, looking as if nothing happened.
Now, if the entire AZ "zone-a" goes down and is unavailable for some time, cassandra-0 cannot start anymore because of its affinity to the EBS volume in that zone. You are left with:
cassandra-1 -> zone b
cassandra-2 -> zone c
Kubernetes will never be able to start cassandra-0 for as long as "zone-a" is unavailable. That's all good because cassandra-1 and cassandra-2 can serve requests.
Now if, on top of that, another K8S node goes down or you have set up auto-scaling of your infrastructure, you could end up with cassandra-1 or cassandra-2 needing to move to another K8S node.
It shouldn't be a problem.
However, from my testing, K8S will not do that, because the pod cassandra-0 is offline. It will never self-heal cassandra-1 or cassandra-2 (or any cassandra-X), because it wants cassandra-0 back first. And cassandra-0 cannot start because its volume is in a zone which is down and not recovering.
So if you use Statefulset + VolumeClaim + across zones
AND you experience an entire AZ failure
AND you experience an EC2 failure in another AZ or have auto-scaling of your infrastructure
=> then you will lose all your Cassandra pods until zone-a is back online.
This seems like a dangerous situation. Is there a way for a StatefulSet to not care about the order and still self-heal, or to start more pods (cassandra-3, 4, 5, X)?
Starting with Kubernetes 1.7 you can tell Kubernetes to relax the StatefulSet ordering guarantees using the podManagementPolicy option (documentation). By setting that option to Parallel, Kubernetes will no longer guarantee any ordering when starting or stopping pods and will start pods in parallel. This can have an impact on your service discovery, but should resolve the issue you're talking about.
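A minimal sketch of where the field goes (abbreviated, not the full Cassandra manifest; image and names are placeholders):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  podManagementPolicy: Parallel   # default is OrderedReady
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3.11     # placeholder image
        ports:
        - containerPort: 9042
Note that podManagementPolicy only affects how pods are created and deleted; the stable network identities (cassandra-0, cassandra-1, ...) are unchanged.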
Two options:
Option 1: use podManagementPolicy and set it to Parallel.
Pods 1 and 2 will crash a few times until the seed node (pod-0) is available. This happens when creating the statefulset for the first time.
Also note that the Cassandra documentation used to recommend NOT adding multiple nodes in parallel, but it seems recent updates make this no longer true: multiple nodes can be added to the cluster at the same time.
Issue found: if using 2 seed nodes, you will get a split-brain scenario. Each seed node will be created at the same time and create 2 separate logical Cassandra clusters.
Option 1b: use podManagementPolicy set to Parallel together with an init container.
Same as option 1, but use an initContainer: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/.
The init container is a short-lived container whose role is to check that the seed node is available before starting the actual container. This is not required if we are happy for the pod to crash until the seed node is available again.
The problem is that the init container will always run, which is not required. We want to ensure the Cassandra cluster was well formed the first time it was created; after that it does not matter.
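For completeness, a rough sketch of such an init container (the image, service name and namespace wiring are assumptions, not taken from the manifests above); it simply blocks until the seed pod's DNS entry resolves:
      initContainers:
      - name: wait-for-seed
        image: busybox:1.36            # any small image with nslookup works
        command:
        - sh
        - -c
        - |
          # Block until cassandra-0 is resolvable through the headless service
          # (assumes the governing service is named "cassandra").
          until nslookup cassandra-0.cassandra.$POD_NAMESPACE.svc.cluster.local; do
            echo "waiting for seed"; sleep 5
          done
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace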
Option 2: create 3 different statefulsets.
1 statefulset per AZ/rack. Each statefulset has constraints so it can run only on nodes in its specific AZ. I've also got 3 storage classes (again constrained to a particular zone), to make sure the statefulset does not provision EBS in the wrong zone (statefulsets do not handle that dynamically yet).
In each statefulset I've got a Cassandra seed node (defined as environment variable CASSANDRA_SEEDS which populates SEED_PROVIDER at run time). That makes 3 seeds which is plenty.
My setup can survive a complete zone outage thanks to replication-factor=3
Tips:
the list of seed nodes contains all 3 nodes, separated by commas:
"cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"
Wait until the first seed (cassandra-a-0) is ready before creating the other 2 statefulsets, otherwise you get a split brain. This is only an issue when you create the cluster; after that, you can lose one or two seed nodes without impact, as the third one is aware of all the others.
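A fragment of what one of the per-zone statefulsets (say, cassandra-a) could look like; the zone label key, zone name, image and namespace are assumptions about your cluster, and CASSANDRA_SEEDS mirrors the seed list above:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone   # older clusters use failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - eu-west-1a                       # pin this statefulset to zone a
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13 # image from the official example, adjust as needed
        env:
        - name: CASSANDRA_SEEDS
          value: "cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local,cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local,cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"
The matching storage class for this statefulset would carry a topology restriction to the same zone, so its PVCs can only be provisioned there.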
I think that if you can control the deployment of each pod (cassandra-0, cassandra-1, cassandra-2 with three different yaml deployment files), you can use podAffinity set to a specific zone for each pod.
Once a node in a zone fails and the pod running on that node has to be rescheduled, the affinity will force Kubernetes to deploy the pod on a different node in the same zone; if no nodes are available in that zone, Kubernetes will keep that pod down indefinitely.
For example, you may create a Kubernetes cluster with three different managed node groups, one for each zone (labeled "zone": "a", "b", "c" respectively), with at least two nodes per group, and use podAffinity.
Note: Do not use x1.32xlarge machines for the nodes :-)