Autoscaling a Cassandra cluster on AWS

I have been trying to auto-scale a 3-node Cassandra cluster with Replication Factor 3 and Consistency Level 1 on Amazon EC2 instances. Despite the load balancer, one of the autoscaled nodes has zero CPU utilization while the other autoscaled node carries considerable traffic.
I have run this experiment more than four times, auto-scaling a 3-node cluster with RF 3 and CL 1, and the CPU utilization on one of the autoscaled nodes is still zero. Overall CPU utilization drops, but one of the autoscaled nodes stays idle from the moment of auto-scaling.
Note that the two nodes launched at the point of autoscaling are started from the same launch configuration, so they are identical in every respect. There is an alarm that triggers the scale-out, and the scaling policy is set per that alarm.
Is there a bash script that can be run in the user data, for example to alter the keyspaces?
Can someone let me know what could be the reason behind this behavior?

AWS auto scaling and load balancing are not a good fit for Cassandra. Cassandra has its own built-in clustering with seed nodes to discover the other members of the cluster, so there is no need for an ELB. And auto scaling can screw you up because the data has to be re-balanced between the nodes.
https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf

Yes, you don't need an ELB for Cassandra.
So you created a single-node Cassandra cluster and created some keyspace. Then you scaled Cassandra to three nodes and found one new node was idle when accessing the existing keyspace. Is this understanding correct? Did you alter the existing keyspace's replication factor to 3? If not, the existing keyspace's data still has only one replica.
When adding the new nodes, Cassandra automatically rebalances some tokens onto them. This is probably why you are seeing load on one of the new nodes: it happens to receive tokens that hold keyspace data.
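If the keyspace is still at RF 1, a minimal sketch of raising it from cqlsh follows; the host, keyspace name, and single-DC SimpleStrategy are placeholders, and the same commands could be run from a user-data script once the node is up:

    # Assumptions: keyspace name, contact host and SimpleStrategy are placeholders
    # for a single-datacenter cluster; adjust to your topology.
    cqlsh 10.0.0.11 -e "ALTER KEYSPACE mykeyspace
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"

    # Then run on every node so the extra replicas are actually streamed in.
    nodetool repair --full mykeyspace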

Related

How to reduce the downtime while launching a EKS cluster?

I am trying to launch a Kubernetes cluster on EKS which will have multiple pods in it. Once a worker node is running its maximum number of pods, a new node launches and the extra pods are scheduled on the new node. Launching a new node takes time and creates downtime which I want to reduce. A pod disruption budget is one option, but I am not sure how to use it when scaling up nodes.
A simpler way to approach this would be to pre-define your scaling policies to scale up at a reasonably low threshold. That way, if a node reaches, say, 60% of capacity and triggers a scale-up, you have enough grace time to avoid downtime (the existing node can keep handling requests while the new one bootstraps) and allow the new node to come up.
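As a rough sketch (the ASG name, policy name and threshold are placeholders, not anything from the question), a target-tracking policy on the worker ASG that reacts at 60% average CPU could look something like this:

    # Hedged sketch: ASG name, policy name and threshold are placeholders.
    # Target-tracking keeps average CPU around 60%, so a new node starts
    # bootstrapping well before the existing ones saturate.
    cat > scale-at-60.json <<'EOF'
    {
      "TargetValue": 60.0,
      "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ASGAverageCPUUtilization"
      }
    }
    EOF

    aws autoscaling put-scaling-policy \
      --auto-scaling-group-name my-eks-workers \
      --policy-name scale-up-early \
      --policy-type TargetTrackingScaling \
      --target-tracking-configuration file://scale-at-60.json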

Is it possible to use Kubernetes Cluster Autoscaler to scale nodes if number of nodes hit a threshold?

I created an EKS cluster but while deploying pods, I found out that the native AWS CNI only supports a set number of pods because of the IP restrictions on its instances. I don't want to use any third-party plugins because AWS doesn't support them and we won't be able to get their tech support. What happens right now is that as soon as the IP limit is hit for that instance, the scheduler is not able to schedule the pods and the pods go into pending state.
I see there is a cluster autoscaler which can do horizontal scaling.
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
Using a larger instance type with more available IPs is an option but that is not scalable since we will run out of IPs eventually.
Is it possible to set a pod limit for each node in cluster-autoscaler so that, once the limit is reached, a new instance is spawned? Since each pod uses one secondary IP of the node, that would solve our issue of having to worry about scaling. Is this a viable option? Also, if anybody has faced this, I would like to hear how you overcame this limitation.
EKS node groups use an Auto Scaling group for node scaling.
You can follow this workshop as a dedicated example.
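As a hedged sketch (cluster name, node group name, instance type and sizes are placeholders): with the AWS VPC CNI each node already advertises a max-pods value derived from its ENI/IP limits, so once pods go Pending the Cluster Autoscaler can grow the node group's ASG.

    # Hedged sketch: names, instance type and sizes are placeholders.
    # --asg-access attaches the IAM policy the Cluster Autoscaler needs to resize the ASG.
    eksctl create nodegroup \
      --cluster my-cluster \
      --name workers \
      --node-type m5.large \
      --nodes 3 --nodes-min 3 --nodes-max 10 \
      --asg-access

    # Check the per-node pod capacity that the CNI-derived max-pods setting produces.
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.pods}{"\n"}{end}'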

Kubernetes cluster autoscaling using Kubeadm

I am using Kubernetes v1.11.1 configured using kubeadm, consisting of five nodes with hundreds of pods running. How can I enable or configure cluster autoscaling based on the total memory utilization of the cluster?
A K8s cluster can be scaled with the help of the Cluster Autoscaler (CA); see the cluster autoscaler GitHub page, where you can find info on the AWS CA.
It does not scale the cluster based on “total memory utilization” but on “pending pods”: pods that cannot be scheduled because there are not enough available cluster resources to meet their CPU and memory requests.
Basically, the Cluster Autoscaler checks for pending (unschedulable) pods every 10 seconds and, if it finds any, requests the AWS Auto Scaling Group (ASG) API to increase the number of instances in the ASG. When a node is added to the ASG, it joins the cluster and becomes ready to serve pods. After that, the K8s scheduler allocates the pending pods to the new node.
Scale-down works the same way: the CA checks every 10 seconds which nodes are unneeded, and a node is considered for removal if the sum of CPU and memory requests of all its pods is smaller than 50% of the node’s capacity, its pods can be moved to other nodes, and it has no scale-down-disabled annotation.
If the K8s cluster on AWS is administered with kubeadm, all of the above holds true. So, in a nutshell (intricate details omitted; refer to the docs on the CA):
Create an Auto Scaling Group (ASG); see the AWS ASG docs.
Add tags to the ASG such as k8s.io/cluster-autoscaler/enabled (mandatory) and
k8s.io/cluster-autoscaler/<cluster-name> (optional); see the tagging sketch below.
Launch the CA in the cluster following the official doc.
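A hedged sketch of the tagging step; the ASG name, cluster name and tag values are placeholders, and the exact tag keys should be checked against the CA docs for your version:

    # Hedged sketch: ASG name and cluster name are placeholders.
    # These tags let the Cluster Autoscaler auto-discover which ASG it may resize.
    aws autoscaling create-or-update-tags --tags \
      "ResourceId=k8s-worker-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
      "ResourceId=k8s-worker-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=true"

    # Then deploy the CA manifest from the cluster-autoscaler repo and point its
    # --node-group-auto-discovery flag at the tag keys used above.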

Kubernetes Stateful set, AZ and Volume claims: what happens when an AZ fails

Consider a StatefulSet (Cassandra using the official K8S example) across 3 Availability Zones:
cassandra-0 -> zone a
cassandra-1 -> zone b
cassandra-2 -> zone c
Each Cassandra pod uses an EBS volume. So there is automatically an affinity. For instance, cassandra-0 cannot move to "zone-b" because its volume is in "zone-a". All good.
If some Kubernetes nodes/workers fail, they will be replaced. The pods will start again on the new node and their EBS volumes will be re-attached, looking like nothing happened.
Now suppose the entire AZ "zone-a" goes down and is unavailable for some time (meaning cassandra-0 cannot start anymore due to its affinity for the EBS volume in the same zone). You are left with:
cassandra-1 -> zone b
cassandra-2 -> zone c
Kubernetes will never be able to start cassandra-0 for as long as "zone-a" is unavailable. That's all good because cassandra-1 and cassandra-2 can serve requests.
Now if, on top of that, another K8S node goes down or you have set up auto-scaling of your infrastructure, you could end up with cassandra-1 or cassandra-2 needing to move to another K8S node.
It shouldn't be a problem.
However, from my testing, K8S will not do that because the pod cassandra-0 is offline. It will never self-heal cassandra-1 or cassandra-2 (or any cassandra-X) because it wants cassandra-0 back first. And cassandra-0 cannot start because its volume is in a zone which is down and not recovering.
So if you use StatefulSet + VolumeClaims across zones
AND you experience an entire AZ failure
AND you experience an EC2 failure in another AZ or have auto-scaling of your infrastructure
=> then you will lose all your Cassandra pods until zone-a is back online.
This seems like a dangerous situation. Is there a way for a StatefulSet to not care about the order and still self-heal or start more pods (cassandra-3, 4, 5, X)?
Starting with Kubernetes 1.7 you can tell Kubernetes to relax the StatefulSet ordering guarantees using the podManagementPolicy option (documentation). By setting that option to Parallel, Kubernetes will no longer guarantee any ordering when starting or stopping pods and will start pods in parallel. This can have an impact on your service discovery, but should resolve the issue you're talking about.
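A minimal sketch of where the option goes, assuming a stripped-down manifest (the image, env, ports and volumeClaimTemplates of the real Cassandra example are omitted); note that podManagementPolicy is immutable, so it has to be set when the StatefulSet is created:

    # Hedged sketch: names and image are placeholders; the real Cassandra StatefulSet
    # also needs its env, ports and volumeClaimTemplates.
    kubectl apply -f - <<'EOF'
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: cassandra
    spec:
      serviceName: cassandra
      replicas: 3
      podManagementPolicy: Parallel   # start/replace pods without waiting for lower ordinals
      selector:
        matchLabels:
          app: cassandra
      template:
        metadata:
          labels:
            app: cassandra
        spec:
          containers:
          - name: cassandra
            image: cassandra:3.11
    EOF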
Two options:
Option 1: use podManagementPolicy and set it to Parallel.
Pod-1 and pod-2 will crash a few times until the seed node (pod-0) is available. This only happens when creating the StatefulSet for the first time.
Also note that the Cassandra documentation used to recommend NOT adding multiple nodes in parallel, but recent updates suggest this is no longer true: multiple nodes can be added to the cluster at the same time.
Issue found: if using 2 seed nodes, you will get a split-brain scenario. Each seed node is created at the same time and forms its own separate logical Cassandra cluster.
Option 1b: use podManagementPolicy set to Parallel together with an init container.
Same as option 1 but use an initContainer https://kubernetes.io/docs/concepts/workloads/pods/init-containers/.
The init container is a short-lived container whose role is to check that the seed node is available before starting the actual container. It is not required if we are happy for the pod to crash-loop until the seed node is available again.
The drawback is that the init container will always run, which is not required: we only want to ensure the Cassandra cluster was well formed the first time it was created. After that it does not matter.
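A minimal sketch of such an init container (the service name, namespace and image are placeholders, and patching the pod template will trigger a rolling update):

    # Hedged sketch: DNS name, namespace and image are placeholders.
    # The init container blocks until the seed's DNS entry resolves, so non-seed
    # pods do not start Cassandra before the seed pod exists.
    kubectl patch statefulset cassandra --type=strategic --patch '
    spec:
      template:
        spec:
          initContainers:
          - name: wait-for-seed
            image: busybox:1.36
            command:
            - sh
            - -c
            - until nslookup cassandra-0.cassandra.default.svc.cluster.local; do echo waiting for seed; sleep 5; done
    '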
Option 2: create 3 different statefulsets.
1 statefulset per AZ/rack. Each statefulset has constraints so it can run only on nodes in the specific AZ. I've also got 3 storage classes (again constrained to a particular zone), to make sure the statefulset does not provision EBS in the wrong zone (the statefulset does not handle that dynamically yet).
In each statefulset I've got a Cassandra seed node (defined as environment variable CASSANDRA_SEEDS which populates SEED_PROVIDER at run time). That makes 3 seeds which is plenty.
My setup can survive a complete zone outage thanks to replication-factor=3
Tips:
the list of seed node contains all 3 nodes separated by commas:
"cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"
Wait until the first seed (cassandra-a-0) is ready before creating the other 2 statefulsets. Otherwise you get a split brain. This is only an issue when you create the cluster. After that, you can lose one or two seed nodes without impact as the third one is aware of all the others.
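A hedged sketch of that bootstrap order (the manifest file names, namespace and timeout are placeholders for the three per-zone statefulsets described above):

    # Hedged sketch: file names, namespace and timeout are placeholders.
    # Each of the three manifests carries the same seed list, e.g.:
    #   - name: CASSANDRA_SEEDS
    #     value: "cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local,cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local,cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"

    # Create zone a first and wait for its seed to be Ready, then the other zones,
    # to avoid the split brain described above.
    kubectl apply -f cassandra-a.yaml
    kubectl rollout status statefulset/cassandra-a --timeout=10m
    kubectl apply -f cassandra-b.yaml -f cassandra-c.yaml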
I think that if you can control the deployment of each pod (cassandra-0, cassandra-1, cassandra-2 with three different yaml deployment files), you can use node affinity to pin each pod to a specific zone.
Once a node in a zone fails and the pod running on that server has to be rescheduled, the affinity will force Kubernetes to deploy the pod on a different node in the same zone, and if no nodes are available in that zone, Kubernetes will keep that pod down indefinitely.
For example, you may create a Kubernetes cluster with three different managed node groups, one for each zone (label "zone": "a", "b", "c" for each group), with at least two nodes in each group, and use the node affinity as sketched below.
Note: Do not use x1.32xlarge machines for the nodes :-)
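A minimal sketch of the zone pinning, assuming a stripped-down manifest (names, image and the zone label value are placeholders; older clusters may use the failure-domain.beta.kubernetes.io/zone label instead):

    # Hedged sketch: names, image and zone value are placeholders.
    # Node affinity pins this statefulset's pods to nodes in zone "a" only.
    kubectl apply -f - <<'EOF'
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: cassandra-a
    spec:
      serviceName: cassandra
      replicas: 1
      selector:
        matchLabels: {app: cassandra, zone: a}
      template:
        metadata:
          labels: {app: cassandra, zone: a}
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: ["eu-west-1a"]
          containers:
          - name: cassandra
            image: cassandra:3.11
    EOF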

Is it possible to use Auto Scaling with Elastic MapReduce?

I would like to know if I can use Auto Scaling to automatically scale Amazon EC2 capacity up or down according to CPU utilization with Elastic MapReduce.
For example, I start a MapReduce job with only 1 instance, but if this instance reaches 50% utilization I want the Auto Scaling group I created to start a new instance. Is this possible?
Do you know if it is possible? Or, since Elastic MapReduce is "elastic", does it automatically start more instances if it needs them, without any configuration?
You need Qubole: http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/
We have never seen any of our users/customers use vanilla auto-scaling successfully with Hadoop. Hadoop is stateful. Nodes hold HDFS data and intermediate outputs. Deleting nodes based on cpu/memory just doesn't work. Adding nodes needs sophistication - this isn't a web site. One needs to look at the sizes of jobs submitted and the speed at which they are completing.
We run the largest Hadoop clusters, easily, on AWS (for our customers). And they auto-scale all the time. And they use spot instances. And it costs the same as EMR.
No, Auto Scaling cannot be used with Amazon Elastic MapReduce (EMR).
It is possible to scale EMR via API or Command-Line calls, adding and removing Task Nodes (which do not host HDFS storage). Note that it is not possible to remove Core Nodes (because they host HDFS storage, and removing nodes could lead to lost data). In fact, this is the only difference between Core and Task nodes.
It is also possible to change the number of nodes from within an EMR "Step". Steps are executed sequentially, so the cluster could be made larger prior to a step requiring heavy processing, and could be reduced in size in a subsequent step.
From the EMR Developer Guide:
You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.
CPU would not be a good metric on which to base scaling of an EMR cluster, since Hadoop will keep all nodes as busy as possible when a job is running. A better metric would be the number of jobs waiting, so that they can finish quicker.
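A hedged sketch of that resize-between-steps approach via the CLI (the cluster and instance-group IDs are placeholders):

    # Hedged sketch: cluster and instance-group IDs are placeholders.
    # Find the TASK instance group, then grow or shrink it between steps.
    aws emr list-instance-groups --cluster-id j-XXXXXXXXXXXXX \
      --query "InstanceGroups[?InstanceGroupType=='TASK'].[Id,RunningInstanceCount]"

    aws emr modify-instance-groups \
      --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=8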
See also:
Stackoverflow: Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?
Stackoverflow: Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?