How to reduce the downtime while launching an EKS cluster?

I am trying to launch a Kubernetes cluster on EKS which would have multiple pods in it. Once a worker node is running its maximum number of pods, a new node launches and the extra pod is scheduled on the new node. Launching a new node takes time and causes downtime, which I want to reduce. A pod disruption budget is one option, but I am not sure how to use it together with node scale-up.

A simpler way to approach this is to pre-define your scaling policies to scale up at reasonably low thresholds. For example, if your nodes reach 60% of capacity and that triggers a scale-up, you have enough grace time to avoid downtime (the existing node can keep handling requests while the new one bootstraps) before the new node comes up.
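A hedged CloudFormation fragment of that idea, assuming the worker nodes run in an Auto Scaling group (the resource name WorkerNodeGroup is a placeholder): a target-tracking policy that keeps average CPU around 60%, so new nodes are requested well before the existing ones are saturated.

# CloudFormation fragment (sketch): scale the worker-node ASG at ~60% CPU
# so a new node starts joining while the current nodes still have headroom.
ScaleAtSixtyPercent:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref WorkerNodeGroup   # placeholder ASG resource
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60.0

As for the pod disruption budget mentioned in the question: a PDB only limits voluntary evictions (for example during node drains and scale-downs); it does not make new nodes launch faster, so on its own it will not remove this kind of downtime.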

Related

AWS EKS - use spot nodes with failover to on-demand

Could you please kindly advise?
I run a GPU-based workload, and my instances are g4dn.xlarge. I would like to use spot instances, since on-demand instances cost a lot :-)
But it is quite common for spot GPU instances to be unavailable for long periods. I initially configured the original AWS Cluster Autoscaler with priority-based scaling, so I had two node groups, one spot and one on-demand; the autoscaler scaled the spot group first and, if capacity was not available, scaled the on-demand group.
But after some time all instances became on-demand, due to the lack of spot capacity.
The autoscaler tries to scale up the spot node group; when there is no capacity, it scales the on-demand group instead, the pod runs, and all is well.
But there is no logic to retry spot capacity later and rebalance the pods, so that my pod gets rescheduled onto a spot instance once one becomes available. Yes, I can delete the on-demand node after some time, and if spot capacity can be fulfilled it will create a spot instance.
I have tried Karpenter, and it seems to do some of the work, but not the way I would like. It is possible to configure node expiration in Karpenter, so it will, for example, expire a node every 5 minutes. But the expiration logic doesn't care whether the node is spot or on-demand. So if we have a spot instance, it will be expired. And if there is no spot capacity at the moment and we get an on-demand node, it will also expire that node after 5 minutes and try to get spot capacity again.
Can you please kindly suggest how I can achieve a setup where my EKS cluster has GPU instances, and if there is no spot capacity it creates an on-demand instance but keeps trying to obtain spot capacity, and when it succeeds it reschedules the pod to spot and terminates the on-demand instance?
Any help will be extremely appreciated!
These should help:
https://github.com/weaveworks/eksctl/blob/main/examples/08-spot-instances.yaml
https://eksctl.io/usage/spot-instances/
https://eksctl.io/usage/schema/#nodeGroups-instancesDistribution
You can modify onDemandBaseCapacity or onDemandPercentageAboveBaseCapacity based on how you need the failover to on-demand nodes to behave.
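A hedged eksctl config sketch along those lines (cluster name, region, sizes and instance types are placeholders): with both on-demand settings at 0, the node group is all spot whenever capacity exists; raising either value guarantees a baseline of on-demand nodes.

# eksctl ClusterConfig sketch: GPU nodegroup mixing spot and on-demand
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gpu-cluster          # placeholder
  region: us-east-1          # placeholder
nodeGroups:
  - name: gpu-mixed
    minSize: 1
    maxSize: 5
    instancesDistribution:
      instanceTypes: ["g4dn.xlarge", "g4dn.2xlarge"]   # placeholders
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized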

Is it possible to use Kubernetes Cluster Autoscaler to scale nodes if number of nodes hit a threshold?

I created an EKS cluster, but while deploying pods I found out that the native AWS CNI only supports a set number of pods per node because of the IP limits on each instance type. I don't want to use any third-party plugins because AWS doesn't support them and we wouldn't be able to get their tech support. What happens right now is that as soon as the IP limit is hit for an instance, the scheduler cannot schedule any more pods on it and the pods go into the Pending state.
I see there is a cluster autoscaler which can do horizontal scaling.
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
Using a larger instance type with more available IPs is an option but that is not scalable since we will run out of IPs eventually.
Is it possible to set a pod limit for each node in cluster-autoscaler so that, if that limit is reached, a new instance is spawned? Since each pod uses one secondary IP of the node, that would solve our issue of having to worry about scaling. Is this a viable option? Also, if anybody has faced this, I would like to hear how they overcame this limitation.
EKS node groups use an Auto Scaling group for node scaling.
You can follow this workshop as a dedicated example.
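To answer the question more directly: the Cluster Autoscaler does not scale on a pod-count threshold; it scales whenever pods are Pending and unschedulable, which is exactly what happens once a node's IP-based max-pods limit is hit. A hedged sketch of the relevant container args in its Deployment (the cluster name my-cluster is a placeholder), assuming the node-group ASGs carry the standard auto-discovery tags:

# cluster-autoscaler container args (sketch): with ASG auto-discovery, any
# tagged node group is scaled up as soon as pods cannot be scheduled,
# e.g. because the node's max-pods limit was reached.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --expander=least-waste
  - --balance-similar-node-groups
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster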

ECS unable to place task despite increasing instance count

I'm facing the following problem when creating new instances and increasing the container desired count at the same time. Since the instances are not yet running when I increase the desired count, I get a "service XXX was unable to place a task because no container instance met all of its requirements." A few seconds later the new instances are up; however, the cluster still shows "Desired count: 30, Pending count: 0, Running count: 3". In other words, the cluster does not "know" that there are new instances, and no new containers are created.
How can I avoid this situation? Is there a parameter that instructs the cluster to keep monitoring the instance count, rather than checking only immediately after an increase in desired count?
In this case it's expected behavior of ECS; the reason is that the ECS service scheduler includes circuit-breaker logic that throttles how often tasks are placed if they repeatedly fail to launch.
When a new container instance is spun up, it takes some time for it to register with the cluster, and it looks like the service is getting throttled because of the time between the increase in desired count and the registration of the container instances with the cluster.
Having said that, if you wait ~15 minutes after scaling the number of instances in the cluster, the service scheduler will start placing tasks on the new container instances.
To avoid this situation, the ECS cluster should be autoscaled based on the cluster reservation metrics; that way the ECS cluster will have additional capacity available beforehand to accommodate the new task count.
And here is a tutorial on scaling an ECS cluster.
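A hedged CloudFormation fragment of that idea, scaling the container-instance Auto Scaling group when the cluster's CPUReservation gets high (the cluster name, ASG resource and threshold are placeholders):

# CloudFormation fragment (sketch): scale out the ECS container-instance ASG
# before the cluster runs out of reserved capacity, so the service scheduler
# always has room to place new tasks.
CpuReservationHighAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/ECS
    MetricName: CPUReservation
    Dimensions:
      - Name: ClusterName
        Value: my-ecs-cluster            # placeholder
    Statistic: Average
    Period: 60
    EvaluationPeriods: 3
    Threshold: 75
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref ScaleOutPolicy

ScaleOutPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref EcsInstanceAsg   # placeholder ASG resource
    AdjustmentType: ChangeInCapacity
    ScalingAdjustment: 1
    Cooldown: "300"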

Autoscaling a Cassandra cluster on AWS

I have been trying to auto-scale a 3-node Cassandra cluster with replication factor 3 and consistency level 1 on Amazon EC2 instances. Despite the load balancer, one of the autoscaled nodes has zero CPU utilization while the other autoscaled node has considerable traffic on it.
I have experimented more than 4 times with auto-scaling a 3-node cluster with RF 3 / CL 1, and the CPU utilization on one of the autoscaled nodes is still zero. The overall CPU utilization drops, but one of the autoscaled nodes is consistently idle from the point of auto-scaling.
Note that the two nodes which are launched at the point of autoscaling are started by the same launch configuration. The two nodes have the same configuration in every respect. There is an alarm that triggers the scale-out, and the scaling policy is set according to that alarm.
Could a bash script be run from the user data?
For example, to alter the keyspaces?
Can someone let me know what could be the reason behind this behavior?
AWS auto scaling and load balancing are not a good fit for Cassandra. Cassandra has its own built-in clustering, with seed nodes to discover the other members of the cluster, so there is no need for an ELB. And auto scaling can screw you up because the data has to be rebalanced between the nodes.
https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
Yes, you don't need an ELB for Cassandra.
So you created a single-node Cassandra cluster and created some keyspace, then scaled Cassandra to three nodes and found one new node was idle when accessing the existing keyspace. Is this understanding correct? Did you alter the existing keyspace's replication factor to 3 (e.g. ALTER KEYSPACE your_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}; followed by a nodetool repair)? If not, the existing keyspace's data will still have only 1 replica.
When adding the new nodes, Cassandra will automatically move some token ranges to them. This is probably why you are seeing load on one of the new nodes: it happens to have received token ranges that hold data for the keyspace.

Updating an AWS ECS Service

I have a service running on AWS EC2 Container Service (ECS). My setup is a relatively simple one. It operates with a single task definition and the following details:
Desired capacity set at 2
Minimum healthy set at 50%
Maximum available set at 200%
Tasks run with 80% CPU and memory reservations
Initially, I am able to get the necessary EC2 instances registered to the cluster that holds the service without a problem. The associated task then starts running on the two instances. As expected – given the CPU and memory reservations – the tasks take up almost the entirety of the EC2 instances' resources.
Sometimes, I want the task to use a new version of the application it is running. In order to make this happen, I create a revision of the task, de-register the previous revision, and then update the service. Note that I have set the minimum healthy percentage to require 2 * 0.50 = 1 instance running at all times and the maximum healthy percentage to permit up to 2 * 2.00 = 4 instances running.
Accordingly, I expected 1 of the de-registered task instances to be drained and taken offline so that 1 instance of the new revision of the task could be brought online. Then the process would repeat itself, bringing the deployment to a successful state.
Unfortunately, the cluster does nothing. In the events log, it tells me that it cannot place the new tasks, even though the process I have described above would permit it to do so.
How can I get the cluster to perform the behavior that I am expecting? I have only been able to get it to do so when I manually register another EC2 instance to the cluster and then tear it down after the update is complete (which is not desirable).
I have faced the same issue, where the tasks would get stuck because there was no space to place them. The snippet below from the AWS documentation on updating a service helped me make the decision described below.
If your service has a desired number of four tasks and a maximum percent value of 200%, the scheduler may start four new tasks before stopping the four older tasks (provided that the cluster resources required to do this are available). The default value for maximum percent is 200%.
We need cluster resources / container instances available so that the new tasks can start and the older ones can drain.
These are the things I do:
Before doing a service update, add roughly 20% capacity to your cluster. You can use the ASG (Auto Scaling group) command line and raise the desired capacity by 20%. This way you will have some additional instances during the deployment.
Once you have those instances, the new tasks will start spinning up quickly and the older ones will start draining.
But does this mean I will have extra container instances?
Yes, during the deployment you will add some instances, and they will hang around after the older tasks have drained. The way to remove them is to create a MemoryReservation low alarm (a ~70% threshold in your case) with a long evaluation period of around 25 minutes (the longer duration makes sure the cluster really is over-provisioned). Once those extra servers are no longer being used, the reservation will go low and they can be removed.
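A hedged CloudFormation fragment of that scale-in alarm (cluster name, threshold and the scale-in policy it triggers are placeholders; 25 one-minute periods gives the long evaluation window described above):

# CloudFormation fragment (sketch): scale the extra container instances back
# in once MemoryReservation stays low for a sustained period after deployment.
MemoryReservationLowAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/ECS
    MetricName: MemoryReservation
    Dimensions:
      - Name: ClusterName
        Value: my-ecs-cluster            # placeholder
    Statistic: Average
    Period: 60
    EvaluationPeriods: 25                # ~25 minutes below threshold
    Threshold: 70
    ComparisonOperator: LessThanThreshold
    AlarmActions:
      - !Ref ScaleInPolicy               # hypothetical "-1 capacity" policy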
I have seen this before. If your port mapping is attempting to map a static host port to the container within the task, you need more cluster instances.
Also this could be because there is not enough available memory to meet the memory (soft or hard) limit requested by the container within the task.
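If a static host port is the culprit, a hedged sketch of a dynamic port mapping in a CloudFormation task definition (container name, image and ports are placeholders): with bridge networking, a HostPort of 0 lets ECS pick a free host port, so several copies of the task can share one container instance.

# Fragment of an AWS::ECS::TaskDefinition: dynamic host port mapping
ContainerDefinitions:
  - Name: web                        # placeholder
    Image: my-registry/web:latest    # placeholder
    Memory: 512
    PortMappings:
      - ContainerPort: 8080          # placeholder
        HostPort: 0                  # 0 = let ECS assign an ephemeral host port
        Protocol: tcp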