How to resize K8s cluster with kops, cluster-autoscaler to dynamically increase Masters - amazon-web-services

We have configured Kubernetes cluster on EC2 machines in our AWS account using kops tool (https://github.com/kubernetes/kops) and based on AWS posts (https://aws.amazon.com/blogs/compute/kubernetes-clusters-aws-kops/) as well as other resources.
We want to setup a K8s cluster of master and slaves such that:
It will automatically resize (both masters as well as nodes/slaves) based on system load.
Runs in Multi-AZ mode i.e. at least one master and one slave in every AZ (availability zone) in the same region for e.g. us-east-1a, us-east-1b, us-east-1c and so on.
We tried to configure the cluster in the following ways to achieve the above.
Created K8s cluster on AWS EC2 machines using kops this below configuration: node count=3, master count=3, zones=us-east-1c, us-east-1b, us-east-1a. We observed that a K8s cluster was created with 3 Master & 3 Slave Nodes. Each of the master and slave server was in each of the 3 AZ’s.
Then we tried to resize the Nodes/slaves in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). We set the node_asg_min to 3 and node_asg_max to 5. When we increased the workload on the slaves such that auto scale policy was triggered, we saw that additional (after the default 3 created during setup) slave nodes were spawned, and they did join the cluster in various AZ’s. This worked as expected. There is no question here.
We also wanted to set up the cluster such that the number of masters increases based on system load. Is there some way to achieve this? We tried a couple of approaches and results are shared below:
A) We were not sure if the cluster-auto scaler helps here, but nevertheless tried to resize the Masters in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). This is useful while creating a new cluster but was not useful to resize the number of masters in an existing cluster. We did not find a parameter to specify node_asg_min, node_asg_max for Master the way it is present for slave Nodes. Is there some way to achieve this?
B) We increased the count MIN from 1 to 3 in ASG (auto-scaling group), associated with one the three IG (instance group) for each master. We found that new instances were created. However, they did not join the master cluster. Is there some way to achieve this?
Could you please point us to steps, resources on how to do this correctly so that we could configure the number of masters to automatically resize based on system load and is in Multi-AZ mode?
Kind regards,
Shashi

There is no need to scale Master nodes.
Master components provide the cluster’s control plane. Master components make global decisions about the cluster (for example, scheduling), and detecting and responding to cluster events (starting up a new pod when a replication controller’s ‘replicas’ field is unsatisfied).
Master components can be run on any machine in the cluster. However, for simplicity, set up scripts typically start all master components on the same machine, and do not run user containers on this machine. See Building High-Availability Clusters for an example multi-master-VM setup.
Master node consists of the following components:
kube-apiserver
Component on the master that exposes the Kubernetes API. It is the front-end for the Kubernetes control plane.
etcd
Consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data.
kube-scheduler
Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on.
kube-controller-manager
Component on the master that runs controllers.
cloud-controller-manager
runs controllers that interact with the underlying cloud providers. The cloud-controller-manager binary is an alpha feature introduced in Kubernetes release 1.6.
For more detailed explanation please read the Kubernetes Components docs.
Also if You are thinking about HA, you can read about Creating Highly Available Clusters with kubeadm

I think your assumption is that similar to kubernetes nodes, masters devide the work between eachother. That is not the case, because the main tasks of masters is to have consensus between each other. This is done with etcd which is a distributed key value store. The problem maintaining such a store is easy for 1 machine but gets harder the more machines you add.
The advantage of adding masters is being able to survive more master failures at the cost of having to make all masters fatter (more CPU/RAM....) so that they perform well enough.

Related

Is it possible to use Kubernetes Cluster Autoscaler to scale nodes if number of nodes hit a threshold?

I created an EKS cluster but while deploying pods, I found out that the native AWS CNI only supports a set number of pods because of the IP restrictions on its instances. I don't want to use any third-party plugins because AWS doesn't support them and we won't be able to get their tech support. What happens right now is that as soon as the IP limit is hit for that instance, the scheduler is not able to schedule the pods and the pods go into pending state.
I see there is a cluster autoscaler which can do horizontal scaling.
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
Using a larger instance type with more available IPs is an option but that is not scalable since we will run out of IPs eventually.
Is it possible to set a pod limit for each node in cluster-autoscaler and if that limit is reached, a new instance is spawned. Since each pod uses one secondary IP of the node so that would solve our issue of not having to worry about scaling. Is this a viable option? and also if anybody has faced this and would like to share how they overcame this limitation.
EKS's node group is using auto scaling group for nodes scaling.
You can follow this workshop as a dedicated example.

Restore Etcd Quorum

I have a Kubernetes cluster distributed on AWS via Kops consisting of 3 master nodes, each in a different AZ. As is well known, Kops realizes the deployment of a cluster where Etcd is executed on each master node through two pods, each of which mounts an EBS volume for saving the state. If you lose the volumes of 2 of the 3 masters, you automatically lose consensus among the masters.
Is there a way to use information about the only master who still has the status of the cluster, and retrieve the Quorum between the three masters on that state? I recreated this scenario, but the cluster becomes unavailable, and I can no longer access the Etcd pods of any of the 3 masters, because those pods fail with an error. Moreover, Etcd itself becomes read-only and it is impossible to add or remove members of the cluster, to try to perform manual interventions.
Tips? Thanks to all of you
This is documented here. There's also another guide here
You basically have to backup your cluster and create a brand new one.

Automated setup for multi-server RethinkDB cluster via an ECS service

I'm attempting to set up a RethinkDB cluster with 3 servers total spread evenly across 3 private subnets, each in different AZ's in a single region.
Ideally, I'd like to deploy the DB software via ECS and provision the EC2 instances with auto scaling, but I'm having trouble trying to figure out how to instruct the RethinkDB instances to join a RethinkDB cluster.
To create/join a cluster in RethinkDB, when you start up a new instance of RethinkDB, you specify host:port combination of one of the other machines in the cluster. This is where I'm running into problems. The Auto Scaling service is creating new primary ENI's for my EC2 instances and using a random IP in my subnet's range, so I can't know the IP of the EC2 instance ahead of time. On top of that, I'm using awsvpc task networking, so ECS is creating new secondary ENI's dedicated to each docker container and attaching them to the instances when it deploys them and those are also getting new IP's, which I don't know ahead of time.
So far I've worked out one possible solution, which is to not use an autoscaling group, but instead to manually deploy 3 EC2 instances across the private subnets, which would let me assign my own, predetermined, private IP. As I understand it, this still doesn't help me if I'm using awsvpc task networking though because each container running on my instances will get its own dedicated secondary ENI and I wont know the IP of that secondary ENI ahead of time. I think I can switch my task networking to bridge mode, to get around this. That way I can use the predetermined IP of the EC2 instances (the primary ENI) in the RethinkDB join command.
So In conclusion, the only way to achieve this, that I can figure out, is to not use Auto Scaling, or awsvpc task networking, both of which would otherwise be very desirable features. Can anyone think of a better way to do this?
As mentioned in the comments, this is more of an issue around the fact you need to start a single RethinkDB instance one time to bootstrap the cluster and then handle discovery of the existing cluster members when joining new members to the cluster.
I would have thought RethinkDB would have published a good pattern in their docs for this because it's going to be pretty common when setting up clusters but I couldn't see anything useful in their docs. If someone does know of an official recommendation then you should definitely use this rather than what I'm about to propose especially as I have no experience with running RethinkDB.
This is more just spit-balling here and will be completely untested (at least for now) but the principle is going to be you need to start a single, one off instance of RethinkDB to bootstrap the cluster, then have more cluster members join and then ditch the special case bootstrap member that didn't attempt to join a cluster and leave the remaining cluster members to work.
The bootstrap instance is easy enough to consider. You just need a RethinkDB container image and an ECS task that just runs it in stand-alone mode with the ECS service only running one instance of the task. To enable the second set of cluster members to easily discover cluster members including this bootstrapping instance it's probably easiest to use a service discovery mechanism such as the one offered by ECS which uses Route53 records under the covers. The ECS service should register the service in the RethinkDB namespace.
Then you should create another ECS service that's basically the same as the first but in an entrypoint script should list the services in the RethinkDB namespace and then resolve them, discarding the container's own IP address and then uses the discovered host to join to with --join when starting RethinkDB in the container.
I'd then set the non bootstrap ECS service to just 1 task at first to allow it to discover the bootstrap version and then you should be able to keep adding tasks to the service one at a time until you're happy with the size of the non bootstrapped cluster leaving you with n + 1 instances in the cluster including the original bootstrap instance.
After that I'd remove the bootstrap ECS service entirely.
If an ECS task dies in the non bootstrap ECS service dies for whatever reason it should be able to auto rejoin without any issue as it will just find a running RethinkDB task and start that.
You could probably expand the checks for which cluster member to join to by checking that the RethinkDB port is open and running before using that as a member to join so it will handle multiple tasks being started at the same time (with my original suggestion it could potentially find another task that is looking to join the cluster and try to join to that first, with them all potentially deadlocking if they all failed to randomly pick the existing cluster members by chance).
As mentioned, this answer comes with a big caveat that I haven't got any experience running RethinkDB and I've only played with the service discovery mechanism that was recently released for ECS so might be missing something here but the general principles should hold fine.

Kubernetes Stateful set, AZ and Volume claims: what happens when an AZ fails

Consider a Statefulset (Cassandra using offical K8S example) across 3 Availability zones:
cassandra-0 -> zone a
cassandra-1 -> zone b
cassandra-2 -> zone c
Each Cassandra pod uses an EBS volume. So there is automatically an affinity. For instance, cassandra-0 cannot move to "zone-b" because its volume is in "zone-a". All good.
If some Kubernetes nodes/workers fail, they will be replaced. The pods will start again on the new node and be re-attached their EBS volume. Looking like nothing happened.
Now if the entire AZ "zone-a" goes down and is unavailable for some time (meaning cassandra-0 cannot start anymore due to affinity for EBS in the same zone). You are left with:
cassandra-1 -> zone b
cassandra-2 -> zone c
Kubernetes will never be able to start cassandra-0 for as long as "zone-a" is unavailable. That's all good because cassandra-1 and cassandra-2 can serve requests.
Now if on top of that, another K8S node goes down or you have setup auto-scaling of your infrastructure, you could end up with cassandra-1 or cassandra-2 needed to move to another K8S node.
It shouldn't be a problem.
However from my testing, K8S will not do that because the pod cassandra-0 is offline. It will never self-heal cassandra-1 or cassandra-2 (or any cassandra-X) because it wants cassandra-0 back first. And cassandra-0 cannot start because it's volume is in a zone which is down and not recovering.
So if you use Statefulset + VolumeClaim + across zones
AND you experience an entire AZ failure
AND you experience an EC2 failure in another AZ or have auto-scaling of your infrastructure
=> then you will loose all your Cassandra pods. Up until zone-a is back online
This seems like a dangerous situation. Is there a way for a stateful set to not care about the order and still self-heal or start more pod on cassandra-3, 4, 5, X?
Starting with Kubernetes 1.7 you can tell Kubernetes to relax the StatefulSet ordering guarantees using the podManagementPolicy option (documentation). By setting that option to Parallel Kubernetes will no longer guarantee any ordering when starting or stopping pods and start pods in parallel. This can have an impact on your service discovery, but should resolve the issue you're talking about.
Two options:
Option 1: use podManagementPolicy and set it to Parallel.
The pod-1 and pod-2 will crash a few times until the seed node (pod-0) is available. This happens when creating the statefulset the first time.
Also note that Cassandra documentation used to recommend NOT creating multiple nodes in parallel but it seems recent updates makes this not true. Multiple nodes can be added to the cluster at the same time
Issue found: if using 2 seed nodes, you will get a split brain scenario. Each seed node will be created at the same time and create 2 separate logical Cassandra clusters
Option 1 b: use podManagementPolicy and set it to Parallel and use ContainerInit.
Same as option 1 but use an initContainer https://kubernetes.io/docs/concepts/workloads/pods/init-containers/.
The init container is a short lived container which has for role to check that the seed node is available before starting the actual container. This is not required if we are happy for the pod to crash until the seed node is available again
The problem is that Init Container will always run which is not required. We want to ensure the Cassandra cluster was well formed the first time it was created. After that it does not matter
Option 2: create 3 different statefulets.
1 statefulset per AZ/Rack. Each statefulset has constraints so it can run only on nodes in the specific AZ. I've also got 3 storage classes (again constraint to a particular zone), to make sure the statefulset does not provision EBS in the wrong zone (statefulset does not handle that dynamically yet)
In each statefulset I've got a Cassandra seed node (defined as environment variable CASSANDRA_SEEDS which populates SEED_PROVIDER at run time). That makes 3 seeds which is plenty.
My setup can survive a complete zone outage thanks to replication-factor=3
Tips:
the list of seed node contains all 3 nodes separated by commas:
"cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"
Wait until the first seed (cassandra-a-0) is ready before creating the other 2 statefulsets. Otherwise you get a split brain. This is only an issue when you create the cluster. After that, you can loose one or two seed nodes without impact as the third one is aware of all the others.
I think that if you can control the deployment of each pod (cassandra-0, cassandra-1, cassandra-2 with three different yaml deployment files), you can use podAffinity set to a specific zone for each pod.
Once a node on a zone fails and the pod running inside that server has to be rescheduled, the affinity will force Kubernetes to deploy the pod on a different node of the same Zone, and if no nodes are available on the same zone, Kubernetes should keep that pod down indefinitely.
For example, you may create a Kubernetes cluster with three different managedNodeGroup, one for each zone (label "zone": "a", "b", "c" for each group), with at least two nodes for each group, and use the podAffinity.
Note: Do not use x1.32xlarge machines for the nodes :-)

Kubernetes - adding more nodes

I have a basic cluster, which has a master and 2 nodes. The 2 nodes are part of an aws autoscaling group - asg1. These 2 nodes are running application1.
I need to be able to have further nodes, that are running application2 be added to the cluster.
Ideally, I'm looking to maybe have a multi-region setup, whereby aplication2 can be run in multiple regions, but be part of the same cluster (not sure if that is possible).
So my question is, how do I add nodes to a cluster, more specifically in AWS?
I've seen a couple of articles whereby people have spun up the instances and then manually logged in to install the kubeltet and various other things, but I was wondering if it could be done in more of an automatic way?
Thanks
If you followed this instructions, you should have an autoscaling group for your minions.
Go to AWS panel, and scale up the autoscaling group. That should do it.
If you did it somehow manually, you can clone a machine selecting an existing minion/slave, and choosing "launch more like this".
As Pablo said, you should be able to add new nodes (in the same availability zone) by scaling up your existing ASG. This will provision new nodes that will be available for you to run application2. Unless your applications can't share the same nodes, you may also be able to run application2 on your existing nodes without provisioning new nodes if your nodes are big enough. In some cases this can be more cost effective than adding additional small nodes to your cluster.
To your other question, Kubernetes isn't designed to be run across regions. You can run a multi-zone configuration (in the same region) for higher availability applications (which is called Ubernetes Lite). Support for cross-region application deployments (Ubernetes) is currently being designed.