EKS Vertical Scaling With Karpenter

I'm evaluating Karpenter (https://karpenter.sh/) and I wanted to know if there's a way to vertically scale down a large node with few pods. The only scaling actions seem to be triggered by either unschedulable pods or empty nodes.
Scenario: I scheduled 5 pods and the scheduler gave me one c5d.2xlarge instance, which resulted in 65% utilization (not bad). I killed 3 pods and utilization dropped, as expected, to 25%. I waited to see if an optimization process would kick in, but nothing happened (over 20 hours). The feature is not well documented; in fact, the only reference to it is in this independent article: https://blog.sivamuthukumar.com/karpenter-scaling-nodes-seamlessly-in-aws-eks
How does it work?
Observes the pod resource requests of unscheduled pods
Direct provisioning of just-in-time capacity for the node (groupless node autoscaling)
Terminating nodes if outdated
Reallocating the pods in nodes for better resource utilization
Am I missing something? Is there a way to do this, using Karpenter or another solution? TIA

So there's a feature request on Karpenter's GitHub project addressing this specific issue: https://github.com/aws/karpenter/issues/1091. I'll update this answer once a solution is available.
The workaround suggested by the project team was to set a short TTL on the nodes (for example, one day), forcing Karpenter to re-evaluate and replace them daily.
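As a rough sketch of that workaround, the snippet below creates a Karpenter Provisioner with node-expiry TTLs using the official kubernetes Python client. It assumes the v1alpha5 Provisioner API that was current at the time, and the TTL values are only illustrative; applying the equivalent YAML manifest with kubectl is the more common route.

    # A minimal sketch of the TTL workaround, assuming the v1alpha5 Provisioner
    # API and the official kubernetes Python client.
    from kubernetes import client, config

    config.load_kube_config()

    provisioner = {
        "apiVersion": "karpenter.sh/v1alpha5",
        "kind": "Provisioner",
        "metadata": {"name": "default"},
        "spec": {
            # Remove nodes that sit completely empty for 30 seconds.
            "ttlSecondsAfterEmpty": 30,
            # Expire every node after 24 hours so Karpenter has to re-provision
            # (and right-size) capacity for the pods still running on it.
            "ttlSecondsUntilExpired": 86400,
        },
    }

    client.CustomObjectsApi().create_cluster_custom_object(
        group="karpenter.sh",
        version="v1alpha5",
        plural="provisioners",
        body=provisioner,
    )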

Related

AWS ECS does not drain connections or remove tasks from Target Group before stopping them

I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly, deregistering the old tasks from the target group and draining all HTTP connections to them before eventually stopping them. However, lately ECS is going straight to stopping the old tasks before draining connections or removing them from the target group. This is resulting in 8-12 seconds of API downtime for us while new HTTP requests continue to be routed to the now-stopped tasks that are still in the target group. This happens now whether we trigger the service update via the CLI or the console - same behaviour. Shown here are a screenshot of a sample sequence of events from ECS demonstrating the issue as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments. Ultimately this turned out to be caused by the high startup overhead of the new task versions: as they spun up they briefly pegged the CPU of the instances in the cluster at 100%, and when there was also enough traffic during the deployment, some requests got dropped.
A good-enough-for-now solution was to raise our minimum healthy deployment percentage to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces the deployments to slow down, only launching 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads the high task startup overhead over two smaller CPU spikes rather than one large one and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
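For reference, a hedged sketch of those deployment settings applied with boto3; the cluster and service names are placeholders, and the same values can of course be set in the console or in your IaC tool.

    import boto3

    ecs = boto3.client("ecs", region_name="us-east-1")

    ecs.update_service(
        cluster="my-cluster",        # placeholder cluster name
        service="my-api-service",    # placeholder service name
        deploymentConfiguration={
            # Never take running tasks away during a deployment...
            "minimumHealthyPercent": 100,
            # ...and only allow 50% extra capacity at a time, so new tasks
            # start in two smaller waves instead of one big CPU spike.
            "maximumPercent": 150,
        },
    )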

In GKE, how to minimize connect time with Load balancer

In GKE, for cost saving, I usually scale the node count down to zero. When I scale the nodes back up (or add them) and run the pods, it takes more than 6-7 minutes for the load balancer to connect and the URL to come up; until then the health checks stay in a waiting state. Is there any way to reduce this time? Thanks
If Cloud Functions is not an option, you might want to look at Cloud Run (which supports containers and scales to zero) or GKE Autopilot (which does not scale to zero, but you can scale down to low resources and it will autoscale up and down as needed).
In short, not really. The spin-up time of nodes is not easily controlled: it is basically the time it takes for the VM to be allocated, turned on, and boot the OS, plus some Kubernetes-related work (configuration, joining the node pool, etc.), and that takes time. On top of that comes the pods' own spin-up time, which depends on the Docker image (size, dependencies, etc.).
Scaling your application down to zero nodes is not recommended. It is always recommended to have some nodes up (don't you have other apps running on the GKE cluster? Kubernetes clusters are recommended to have at least 3 nodes running).
Have you considered using Cloud Functions? Is it possible in your case? It is the best option I know of for quick scale-up and scale-to-zero.
And in general you can keep some kind of "ping" to the function to keep it "hot" for a relatively cheap price.
If none of the options above is possible (I'd say keeping your node pool with at least 3 nodes operating is best, as it takes time for the Kubernetes control plane to boot), I suggest starting by reducing the spin-up time of your pods by improving the Docker image - reducing its size, etc.
Here are some articles on how to reduce Docker image size
https://phoenixnap.com/kb/docker-image-size
https://www.ardanlabs.com/blog/2020/02/docker-images-part1-reducing-image-size.html
After that, I would experiment with different machine types for the nodes to check which one spins up fastest - that could be an interesting thing to do in any case.
Here is an interesting comparison of VM spin-up times:
https://www.google.com/amp/s/blog.cloud66.com/part-2-comparing-the-speed-of-vm-creation-and-ssh-access-on-aws-digitalocean-linode-vexxhost-google-cloud-rackspace-packet-cloud-a-and-microsoft-azure/amp/

How to spin up all nodes in my EMR cluster before running my spark job

I have an EMR cluster that can scale up to a maximum of 10 SPOT nodes. When not being used it defaults to 1 CORE node (and 1 MASTER) to save costs, obviously. So in total it can scale up to a maximum of 11 nodes: 1 CORE + 10 SPOT.
When I run my spark job it takes a while to spin up the 10 SPOT nodes and my job ends up taking about 4hrs to complete.
I tried waiting until all the nodes were spun up, then canceled my job and immediately restarted it so that it can start using the max resources immediately, and my job took only around 3hrs to complete.
I have 2 questions:
1. Is there a way to make YARN spin up all the necessary resources before starting my job? I already specify the spark-submit parameters such as num-executors, executor-memory, executor-cores etc. during job submit.
2. I haven't done the cost analysis yet, but is it even worthwhile to do number 1 mentioned above? Does AWS charge for spin-up time, even when a job is not being run?
Would love to know your insights and suggestions.
Thank You
I am assuming you are using AWS managed scaling for this. If you can switch to custom scaling, you can set more aggressive scaling rules; you can also set the number of nodes to add or remove on each upscale and downscale, which will help you converge faster on the required number of nodes.
The only downside to custom scaling is that it will take 5 minutes to trigger.
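As a rough, hedged sketch of what such a custom scaling rule can look like with boto3 (the cluster and instance-group IDs are placeholders, and the metric, threshold, and adjustment values are only illustrative):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.put_auto_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",         # placeholder cluster ID
        InstanceGroupId="ig-XXXXXXXXXXXXX",  # placeholder task instance group
        AutoScalingPolicy={
            "Constraints": {"MinCapacity": 0, "MaxCapacity": 10},
            "Rules": [
                {
                    "Name": "ScaleOutOnLowYarnMemory",
                    "Action": {
                        "SimpleScalingPolicyConfiguration": {
                            "AdjustmentType": "CHANGE_IN_CAPACITY",
                            # Add 5 nodes per trigger instead of the default 1,
                            # so the cluster reaches 10 nodes in two steps.
                            "ScalingAdjustment": 5,
                            "CoolDown": 300,
                        }
                    },
                    "Trigger": {
                        "CloudWatchAlarmDefinition": {
                            "MetricName": "YARNMemoryAvailablePercentage",
                            "Namespace": "AWS/ElasticMapReduce",
                            "ComparisonOperator": "LESS_THAN",
                            "Threshold": 15.0,
                            "EvaluationPeriods": 1,
                            "Period": 300,
                            "Statistic": "AVERAGE",
                            "Unit": "PERCENT",
                        }
                    },
                }
            ],
        },
    )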
Is there a way to make YARN spin up all the necessary resources before starting my job?
I do not know how to achieve this. But in my opinion it is not worth doing; Spark is intelligent enough to do this for us.
It knows how to redistribute tasks as instances come up or go away in the cluster. There is a certain Spark configuration you should be aware of to achieve this:
You should set spark.dynamicAllocation.enabled to true. There are some other related configurations that you can change or leave as they are.
For more detail, refer to the documentation on spark.dynamicAllocation.enabled.
Please see the documentation for your Spark version; this link is for Spark version 2.4.0.
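A minimal PySpark sketch of those settings (the executor bounds are illustrative, and passing the same values as spark-submit --conf flags works just as well):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-example")
        # Let Spark grow and shrink the executor count as nodes join or leave.
        .config("spark.dynamicAllocation.enabled", "true")
        # Needed by dynamic allocation so shuffle data survives executor removal
        # (EMR ships with the external shuffle service enabled).
        .config("spark.shuffle.service.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "40")
        .getOrCreate()
    )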
Does AWS charge for spin up time, even when a job is not being run?
You get charged for every second of the instance that you use, with a one-minute minimum. It does not matter whether your job is running or not; even if the instances sit idle in the cluster, you will have to pay for them.
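As a back-of-the-envelope illustration of that billing model (the hourly rates below are made-up placeholders; look up the real EC2 and EMR rates for your instance type and region):

    # Rough cost illustration of per-second billing with a one-minute minimum.
    EC2_HOURLY = 0.192   # placeholder on-demand EC2 rate, USD/hour
    EMR_HOURLY = 0.048   # placeholder EMR surcharge, USD/hour

    def emr_cost(seconds: int, nodes: int) -> float:
        """Cost of `nodes` instances billed per second, 60-second minimum."""
        billable = max(seconds, 60)
        return nodes * billable / 3600 * (EC2_HOURLY + EMR_HOURLY)

    # 10 task nodes sitting idle for 15 minutes still cost money:
    print(f"${emr_cost(15 * 60, nodes=10):.2f}")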
Refer to these links for more detail:
EMR FAQ
EMR PRICING
Hope this gives you some idea about EMR pricing and the Spark configuration related to dynamic allocation.

EMR cluster stuck on resizing

I have an EMR cluster that I created back in April 2020 with 1 master (on-demand), 1 core (spot) and multiple task nodes (spot). I have been using it actively and things had been going well until a few days ago. For some reason, the cluster has gone into "Waiting" mode while it tries to find spot capacity for the core node. I have the provisioning timeout set to "After 300 minutes, switch to on-demand instances". I see the status "resizing" for the core node.
I don't know what to do next. I am on Basic Support with AWS. I would really like not to terminate this cluster and rebuild it as I spent a lot of time putting my personal configuration touches on it. What could I have done better to prevent this in the future?
Resizing an EMR cluster has lots of issues. It is not advisable to take all your spot capacity from a single instance type. I would suggest a few workarounds.
Stop the resizing, wait to see what changes, and read the error message. Then issue the resize request again.
If you don't use HDFS (say, you do everything from S3), kill the cluster and get a new one.
If you do use it, I suggest you go through https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/ .
When creating a cluster that relies only on spot capacity, use 2 or 3 instance types, for example:
a. 10% On-Demand Value - 1 instance
b. 20% On-Demand Value - 1 instance
I suggest you keep 1 on-demand instance if budget permits, and then try; a rough sketch of such a fleet follows below.
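As a hedged illustration of mixing on-demand and spot capacity, here is what a core instance fleet along those lines might look like in boto3 terms (all names, instance types and capacities are placeholders; the TimeoutAction mirrors the 300-minute switch-to-on-demand setting mentioned in the question):

    # Sketch of an EMR core instance fleet: one on-demand instance plus spot
    # capacity spread over two instance types, falling back to on-demand if
    # spot capacity cannot be found within the timeout.
    core_fleet = {
        "Name": "core-fleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 1,   # keep one on-demand core if budget permits
        "TargetSpotCapacity": 2,       # the rest from the spot market
        "InstanceTypeConfigs": [
            # Two instance types, so one spot pool running dry does not leave
            # the cluster stuck in "resizing".
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "m5a.xlarge", "WeightedCapacity": 1},
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 300,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }

    # Passed as part of Instances={"InstanceFleets": [master_fleet, core_fleet, ...]}
    # when calling boto3's emr.run_job_flow(...) to create a new cluster.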

Flask application scaling on Kubernetes and Gunicorn

We have a Flask application that is served via gunicorn, using the eventlet worker. We're deploying the application in a kubernetes pod, with the idea of scaling the number of pods depending on workload.
The recommended setting for the number of workers in gunicorn is 2-4 x $NUM_CPUS. See docs. I've previously deployed services on dedicated physical hardware where such calculations made sense. On a 4-core machine, having 16 workers sounds OK, and we eventually bumped it to 32 workers.
Does this calculation still apply in a kubernetes pod using an async worker particularly as:
There could be multiple pods on a single node.
The same service will be run in multiple pods.
How should I set the number of gunicorn workers?
Set it to -w 1 and let kubernetes handle the scaling via pods?
Set it to 2-4 x $NUM_CPU on the kubernetes nodes. On one pod or multiple?
Something else entirely?
Update
We decided to go with the 1st option, which is our current approach: set the number of gunicorn workers to 1, and scale horizontally by increasing the number of pods. Otherwise there are too many moving parts, plus we wouldn't be leveraging Kubernetes to its full potential.
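A minimal gunicorn.conf.py sketch of that approach (one eventlet worker per pod, with Kubernetes handling the replica count); the bind address is just an example:

    # gunicorn.conf.py
    workers = 1                 # one gunicorn worker per pod; the HPA adds pods
    worker_class = "eventlet"   # async worker, so a single process still
                                # serves many concurrent connections
    bind = "0.0.0.0:8000"       # example port exposed by the pod

Started with something like gunicorn -c gunicorn.conf.py app:app, while a HorizontalPodAutoscaler scales the Deployment's replicas on CPU.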
For better visibility, the final solution chosen by the original author of this question, as of 2019, was:
Set the number of gunicorn workers to 1 (-w 1), and scale horizontally by increasing the number of pods (using the Kubernetes HPA).
Given the fast growth of workload-related features in the Kubernetes platform (some distributions offer Vertical Pod Autoscaling (VPA) and Multidimensional Pod Autoscaling (MPA) alongside HPA), that answer may not stay applicable for long, so I propose to continue this thread in the form of a community wiki post.
I'm not a developer and this does not seem a simple task, but for your consideration, please follow the best practices in Better performance by optimizing Gunicorn config.
In addition, Kubernetes offers different mechanisms to scale your deployment, like the HPA based on CPU utilization (see How is Python scaling with Gunicorn and Kubernetes?).
You can also use resource requests and limits on Pods and Containers.
As per Gunicorn documentation
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
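For completeness, a tiny illustration of the (2 x $num_cores) + 1 rule of thumb quoted above, as it is commonly computed in a gunicorn config outside Kubernetes (with the pod-per-worker approach above, workers simply stays at 1):

    import multiprocessing

    # Rule of thumb from the Gunicorn docs: per core, one worker reads/writes
    # the socket while another processes a request, plus one spare.
    workers = 2 * multiprocessing.cpu_count() + 1
    print(f"Suggested gunicorn workers: {workers}")  # e.g. 9 on a 4-core machine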
Update:
Depending on your approach you can choose a different solution (Deployment, DaemonSet); all of the above can be achieved in Kubernetes by assigning CPU resources to Containers and Pods.
Using a Deployment with resources (limits, requests) gives you the possibility to run your app as multiple pods on a single node, within your hardware limits, but depending on your app's load it may not be a good enough solution.
CPU requests and limits are associated with Containers, but it is useful to think of a Pod as having a CPU request and limit. The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.
Note:
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to:
e.g. 1 GCP Core.
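As a hedged sketch of those requests and limits with the official kubernetes Python client (the name, image and values are placeholders; the same fields normally live in the Deployment's YAML):

    from kubernetes import client

    container = client.V1Container(
        name="flask-app",                               # placeholder name
        image="registry.example.com/flask-app:latest",  # placeholder image
        resources=client.V1ResourceRequirements(
            # What the scheduler reserves for the container...
            requests={"cpu": "250m", "memory": "256Mi"},
            # ...and the ceiling it is allowed to use.
            limits={"cpu": "500m", "memory": "512Mi"},
        ),
    )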
As mentioned in the post, the second approach (scaling your app across multiple nodes) is also a good choice. In this case you can consider using, for example, a StatefulSet or a Deployment; in addition, on GKE, the cluster autoscaler gives you a more extensible solution: when you create new pods that don't have enough capacity to run inside the cluster, the cluster autoscaler automatically adds additional resources.
On the other hand, you can consider other solutions like Cerebral, which gives you the possibility to create user-defined policies for increasing or decreasing the size of node pools inside your cluster.
GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, GKE automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, GKE can delete the node.
Please keep in mind that the question is very general and there is no single good answer for this topic. You should weigh all the pros and cons based on your requirements, load, activity, capacity, costs, and so on.
Hope this helps.