Autoscaling Kubernetes based on number of Jobs on AWS EKS

My cluster sometimes gets a "burst" of work and generates a large number of Kubernetes Jobs at once, while at other times it has close to zero active Jobs.
I'm wondering how I can make the cluster autoscale its number of nodes so that it can keep processing all of these Jobs within a reasonable time frame.
I specifically use AWS EKS, and each Job takes a few minutes to complete.

EKS allows you to deploy the Cluster Autoscaler, so when a new Job cannot be scheduled due to a lack of available CPU/memory, an extra node will be added to the cluster.
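For the autoscaler to make that decision, each Job's pods need explicit resource requests. Below is a minimal sketch using the official kubernetes Python client; the Job name, image, and request sizes are illustrative assumptions, not taken from the question.

```python
# Sketch: submit a Job with explicit resource requests so the Cluster
# Autoscaler can tell when it is unschedulable and add a node.
# Assumes the `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(generate_name="burst-job-"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="worker",
                    image="busybox",  # hypothetical workload image
                    command=["sh", "-c", "sleep 120"],
                    # The autoscaler uses these requests to decide whether
                    # a pending pod fits on the existing nodes.
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "512Mi"}
                    ),
                )],
            )
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

When a burst of such Jobs exceeds the cluster's free capacity, the pods go Pending and the autoscaler grows the node group; once the Jobs finish and nodes sit underutilized, it scales the group back down.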

Related

How to Automate Redshift Cluster Start/Stop for night time?

I have an AWS Redshift cluster (dc2.8xlarge), and I am currently paying a huge bill each month for running it 24/7.
Is there a way I can automate the cluster's uptime so that it runs during the day, stops at 8 PM in the evening, and starts again at 8 AM in the morning?
Update: Stop/Start is now available. See: Amazon Redshift launches pause and resume
Amazon Redshift does not have a Start/Stop concept. However, there are a few options...
You could resize the cluster so that it costs less. A Redshift cluster is sized for Compute and for Storage. You could reduce the number of nodes as long as you retain enough nodes for your Storage needs.
Also, Amazon Redshift has introduced RA3 nodes with managed storage, enabling independent compute and storage scaling, which means you might be able to scale down to a single node. (This is a new node type, so I'm not sure how it works.)
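As a rough illustration of the resize option, here is a hedged boto3 sketch; the cluster identifier and target configuration are assumptions for the example, not a recommendation.

```python
# Sketch: shrink a Redshift cluster to fewer, cheaper nodes.
# Assumes boto3; the identifier and sizes are illustrative.
import boto3

redshift = boto3.client("redshift")
redshift.resize_cluster(
    ClusterIdentifier="my-cluster",  # hypothetical identifier
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,  # must still be enough for your storage needs
)
```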
Another option is to take a Snapshot and Shutdown the cluster. This will result in no costs for the cluster (but the Snapshot will be charged). Then, create a new cluster from the Snapshot when you want the cluster again.
Scheduling the above can be done in Amazon CloudWatch Events, which can trigger an AWS Lambda function. Within the function, you can make the necessary API calls to the Amazon Redshift service.
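As a sketch of what such a Lambda function might look like (boto3; the cluster and snapshot identifiers, and the event shape, are assumptions for illustration; real code would want unique, timestamped snapshot names):

```python
# Sketch of a scheduled Lambda: snapshot-and-delete the cluster at night,
# restore it from the snapshot in the morning.
import boto3

redshift = boto3.client("redshift")
CLUSTER_ID = "my-cluster"            # hypothetical identifier
SNAPSHOT_ID = "my-cluster-nightly"   # hypothetical identifier

def handler(event, context):
    if event.get("action") == "stop":
        # Deleting with a final snapshot stops all cluster charges;
        # only snapshot storage is billed.
        redshift.delete_cluster(
            ClusterIdentifier=CLUSTER_ID,
            SkipFinalClusterSnapshot=False,
            FinalClusterSnapshotIdentifier=SNAPSHOT_ID,
        )
    else:
        # Recreate the cluster from the snapshot.
        redshift.restore_from_cluster_snapshot(
            ClusterIdentifier=CLUSTER_ID,
            SnapshotIdentifier=SNAPSHOT_ID,
        )
```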
If you are concerned with the general cost of your cluster, you might want to downsize from the dc2.8xlarge. You could either use multiple dc2.large nodes, or even consider a move to ds2.xlarge, which has a lower cost per TB of data stored.
Good news :)
We can now pause and resume the Redshift cluster (from both the Console and the CLI).
Check out the link:
https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
We can now pause and resume an AWS Redshift cluster.
We can also schedule the pause and the resume, which is a very useful feature for keeping costs in check.
Link: https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
This helps you automate the cluster's uptime and downtime so that it runs during the day, is paused automatically at a specific time in the evening, and starts again automatically in the morning.
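The same actions are also exposed through the API, so they can be scripted; a minimal boto3 sketch (the cluster identifier is illustrative):

```python
# Sketch: pause the cluster in the evening, resume it in the morning.
# Assumes boto3; "my-cluster" is a hypothetical identifier.
import boto3

redshift = boto3.client("redshift")

redshift.pause_cluster(ClusterIdentifier="my-cluster")   # evening
redshift.resume_cluster(ClusterIdentifier="my-cluster")  # morning
```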
It's pretty easy to use the open-source https://cloudcustodian.io to automate nighttime/weekend off-hours for Redshift and other AWS resources.

Kubernetes cluster autoscaling using Kubeadm

I am using Kubernetes v1.11.1 configured using kubeadm, consisting of five nodes with hundreds of pods running. How can I enable or configure cluster autoscaling based on the total memory utilization of the cluster?
A K8s cluster can be scaled with the help of the Cluster Autoscaler (CA); see the Cluster Autoscaler GitHub page, where you can also find info on the AWS CA.
It does not scale the cluster based on “total memory utilization” but based on “pending pods”: pods that cannot be scheduled because the cluster does not have enough available resources to meet their CPU and memory requests.
Basically, the CA checks for pending (unschedulable) pods every 10 seconds, and if it finds any, it calls the AWS Auto Scaling Group (ASG) API to increase the number of instances in the ASG. When a node is added to the ASG, it joins the cluster and becomes ready to serve pods. The K8s scheduler then allocates the pending pods to the new node.
Scale-down works the same way: the CA checks every 10 seconds for unneeded nodes, and a node is considered for removal if the sum of the CPU and memory requests of all its pods is smaller than 50% of the node's capacity, its pods can be moved to other nodes, and it has no scale-down-disabled annotation.
If a K8s cluster on AWS is administered with kubeadm, all of the above holds true. So, in a nutshell (intricate details omitted; refer to the CA docs):
1. Create an Auto Scaling Group (ASG); see the AWS ASG docs.
2. Add tags to the ASG, such as k8s.io/cluster-autoscaler/enabled (mandatory) and k8s.io/cluster-autoscaler/<cluster-name> (optional); a tagging sketch follows below.
3. Launch the CA in the cluster following the official doc.
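As a hedged illustration of step 2, here is a boto3 sketch that applies the discovery tags; the ASG and cluster names are assumptions:

```python
# Sketch: tag an existing ASG so the Cluster Autoscaler can discover it.
# Assumes boto3; "my-node-asg" and "my-cluster" are hypothetical names.
import boto3

asg = boto3.client("autoscaling")
asg.create_or_update_tags(
    Tags=[
        {
            "ResourceId": "my-node-asg",
            "ResourceType": "auto-scaling-group",
            "Key": "k8s.io/cluster-autoscaler/enabled",
            "Value": "true",
            "PropagateAtLaunch": True,
        },
        {
            "ResourceId": "my-node-asg",
            "ResourceType": "auto-scaling-group",
            "Key": "k8s.io/cluster-autoscaler/my-cluster",
            "Value": "owned",
            "PropagateAtLaunch": True,
        },
    ]
)
```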

How to scale EC2 instances in/out based on ECS cluster resource availability?

I have multiple services running in my ECS cluster. Each service contains one or more tasks, based on CPU utilization or the number of users.
I have deployed these containers with EC2 launch type.
Now, I want to increase/decrease the number of EC2 instances based on available resources in the cluster.
Let's say there are four ECS tasks running in two m5.large instances.
Now, if an ECS service increases the number of tasks and there aren't enough resources available in the cluster, how can I spin up an instance and add to the cluster?
And the same goes vice versa: if an instance is running with no ECS task on it, how can I destroy it automatically?
PS - I was using Fargate, but since its cost is very high, I moved to EC2 instances.
First, you need to set up your ECS cluster instances in an ASG, as @Nitesh says. Second, you need to set up a CloudWatch alarm based on a key metric. With ECS this is complex because you need two autoscaling policies: one for the service, and another to scale up your instances. For the EC2 side, the metrics you could use are cluster CPU reservation and/or cluster memory reservation.
The scheme works like this: your service increases its desired container count through an autoscaling rule based on a key metric for the service, such as CPU usage or the number of requests on a load balancer. As a consequence, the cluster CPU reservation increases, which triggers the CloudWatch alarm, and your ASG increases the number of instances.
Some tips: scale up fast and scale down slow; this can be handled by tuning the alarm periods.
For the containers, use Service Auto Scaling and target-tracking policies. For more info, see:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html#cluster_reservation
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
https://aws.amazon.com/blogs/compute/automatic-scaling-with-amazon-ecs/
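To make the EC2 side concrete, here is a hedged boto3 sketch of the alarm-plus-policy wiring described above; the cluster name, ASG name, and threshold are illustrative assumptions:

```python
# Sketch: grow the ECS container-instance ASG when the cluster's CPU
# reservation runs high. Assumes boto3; names and values are illustrative.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Simple scaling policy that adds one instance when triggered.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="ecs-instance-asg",
    PolicyName="scale-out-on-cpu-reservation",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# Alarm on the ECS CPUReservation metric; a short period scales up fast.
cloudwatch.put_metric_alarm(
    AlarmName="ecs-cpu-reservation-high",
    Namespace="AWS/ECS",
    MetricName="CPUReservation",
    Dimensions=[{"Name": "ClusterName", "Value": "my-cluster"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```

A mirror-image alarm with a longer period and a negative ScalingAdjustment would handle the slow scale-down.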
I hope this helps.
Regards

Scaling ECS EC2 instances when a task cannot be placed

I am using an ECS cluster for Jenkins agents/slaves with the Jenkins ECS plugin.
The plugin places an ECS Task when a job requests a build node. Now I want to scale the EC2 instances in an Auto Scaling Group associated with the ECS cluster according to demand.
Jenkins is often idle. In this case, I do not want there to be any instances in the Auto Scaling Group.
If a node (and therefore an ECS task) is requested and cannot be placed, I want to add an EC2 instance to the autoscaling group.
If an instance is idle shortly before the end of a billing hour, I want that instance to be removed.
The third point can be accomplished by a cron job on the EC2 instances that regularly checks whether the conditions are met and removes the instance.
But how can I accomplish the second point? I am unable to create a CloudWatch alarm that triggers if a task cannot be placed.
How can I accomplish this?
A rather hacky way to achieve this: You could use a Lambda function to detect when a service has runningCount + pendingCount < desiredCount for more than X seconds. (I have not tested this yet.)
Similar solutions are proposed here.
There does not seem to be a proper solution to scale only when tasks cannot be placed. Maybe AWS wants us to over-provision our clusters, which might be good practice for high availability, but not always the best or cheapest solution.
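For what it's worth, here is a hedged sketch of that detection logic (untested, in keeping with the caveat above; the cluster and ASG names are assumptions):

```python
# Sketch of a periodic Lambda: if any service has fewer running+pending
# tasks than desired, assume placement failed and add an instance.
# Assumes boto3; "jenkins-agents" and "jenkins-agent-asg" are hypothetical.
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

def handler(event, context):
    arns = ecs.list_services(cluster="jenkins-agents")["serviceArns"]
    if not arns:
        return
    described = ecs.describe_services(cluster="jenkins-agents", services=arns)
    for svc in described["services"]:
        if svc["runningCount"] + svc["pendingCount"] < svc["desiredCount"]:
            # A task apparently cannot be placed; grow the ASG by one.
            group = autoscaling.describe_auto_scaling_groups(
                AutoScalingGroupNames=["jenkins-agent-asg"]
            )["AutoScalingGroups"][0]
            autoscaling.set_desired_capacity(
                AutoScalingGroupName="jenkins-agent-asg",
                DesiredCapacity=group["DesiredCapacity"] + 1,
            )
            break
```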
When a task cannot be placed it means that placing that task in your ECS cluster would exceed either your MemoryReservation or CPUReservation. You could set up Cloudwatch alarms for one or both of these ECS metrics and an auto scaling policy that will add and remove EC2 instances in your ECS cluster.
This, in combination with an auto scaling policy that scales your ECS services on the ecs:service:DesiredCount dimension should be enough to get you adding the underlying EC2 instances your ECS cluster requires.
For example, your scaling policy for an ECS service might be "when we're using 70% of our allotted memory for this service, add 2 to the DesiredCount". After a service task is added, your ECS cluster's MemoryReservation metric might rise past an 80% threshold, triggering a CloudWatch alarm whose auto scaling policy adds another EC2 node, on which the second task could then be placed.
For those arriving after January 2020, the way to handle this now is probably Cluster Auto Scaling, as documented in "Amazon ECS cluster auto scaling", with more info in "Deep Dive on Amazon ECS Cluster Auto Scaling".
Essentially, ECS now handles most of the heavy lifting. Not all of it, or I wouldn't be here looking for an answer ;)
For point 2, one way to solve this would be to autoscale when there are not enough CPU units for placing a new Jenkins slave.
You should use the CPU reservation metric on the cluster to scale.
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html#cluster_reservation

Is it possible to use Auto Scaling with Elastic MapReduce?

I would like to know if I can use Auto Scaling to automatically scale Amazon EC2 capacity up or down according to CPU utilization with Elastic MapReduce.
For example, say I start a MapReduce job with only one instance; if that instance reaches 50% utilization, I want the Auto Scaling group to start a new instance. Is this possible?
Or, because Elastic MapReduce is "elastic", does it automatically start more instances as needed, without any configuration?
You need Qubole: http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/
We have never seen any of our users/customers use vanilla Auto Scaling successfully with Hadoop. Hadoop is stateful: nodes hold HDFS data and intermediate outputs. Deleting nodes based on CPU/memory just doesn't work. Adding nodes needs sophistication; this isn't a website. One needs to look at the sizes of the jobs submitted and the speed at which they are completing.
We run the largest Hadoop clusters on AWS, easily (for our customers). And they auto-scale all the time. And they use spot instances. And it costs the same as EMR.
No, Auto Scaling cannot be used with Amazon Elastic MapReduce (EMR).
It is possible to scale EMR via API or Command-Line calls, adding and removing Task Nodes (which do not host HDFS storage). Note that it is not possible to remove Core Nodes (because they host HDFS storage, and removing nodes could lead to lost data). In fact, this is the only difference between Core and Task nodes.
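A hedged boto3 sketch of that kind of API call; the cluster ID below is a placeholder, and in practice you would look up the Task group's ID first, as shown:

```python
# Sketch: resize an EMR cluster's Task instance group (Task nodes hold
# no HDFS data, so they can be added and removed safely).
# Assumes boto3; "j-EXAMPLE" is a hypothetical cluster ID.
import boto3

emr = boto3.client("emr")

# Find the Task instance group's ID for the cluster.
groups = emr.list_instance_groups(ClusterId="j-EXAMPLE")["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Grow before a heavy step, shrink again afterwards.
emr.modify_instance_groups(
    ClusterId="j-EXAMPLE",
    InstanceGroups=[{
        "InstanceGroupId": task_group["Id"],
        "InstanceCount": 10,
    }],
)
```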
It is also possible to change the number of nodes from within an EMR "Step". Steps are executed sequentially, so the cluster could be made larger prior to a step requiring heavy processing, and could be reduced in size in a subsequent step.
From the EMR Developer Guide:
You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.
CPU would not be a good metric on which to base scaling of an EMR cluster, since Hadoop will keep all nodes as busy as possible when a job is running. A better metric would be the number of jobs waiting, so that they can finish quicker.
See also:
Stackoverflow: Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?
Stackoverflow: Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?