Flask application scaling on Kubernetes and Gunicorn

We have a Flask application that is served via gunicorn, using the eventlet worker. We're deploying the application in a kubernetes pod, with the idea of scaling the number of pods depending on workload.
The recommended setting for the number of workers in gunicorn is 2-4 x $NUM_CPUS (see the docs). I've previously deployed services on dedicated physical hardware where such calculations made sense. On a 4-core machine, having 16 workers sounded OK, and we eventually bumped it to 32 workers.
Does this calculation still apply in a kubernetes pod using an async worker particularly as:
There could be multiple pods on a single node.
The same service will be run in multiple pods.
How should I set the number of gunicorn workers?
Set it to -w 1 and let kubernetes handle the scaling via pods?
Set it to 2-4 x $NUM_CPU on the kubernetes nodes. On one pod or multiple?
Something else entirely?
Update
We decided to go with the 1st option, which is our current approach: set the number of gunicorn workers to 1, and scale horizontally by increasing the number of pods. Otherwise there would be too many moving parts, and we wouldn't be leveraging Kubernetes to its full potential.
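The chosen setup could be captured in a Gunicorn config file roughly like the sketch below. Gunicorn config files are plain Python, and `workers`, `worker_class` and `bind` are standard Gunicorn settings; the `GUNICORN_WORKERS` variable, the `worker_count` helper and the port are assumptions made up for illustration, not something from the original post.

```python
# gunicorn.conf.py (illustrative values, adjust to your app)
import os


def worker_count(env=os.environ):
    """Return the worker count, defaulting to one worker per pod.

    GUNICORN_WORKERS is an assumed, illustrative variable name.
    """
    return int(env.get("GUNICORN_WORKERS", "1"))


workers = worker_count()   # one process; Kubernetes adds pods for more capacity
worker_class = "eventlet"  # async worker, as described in the question
bind = "0.0.0.0:8000"      # assumed port
```

It would be started with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a placeholder for the real module and Flask object.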

For better visibility, here is the final solution chosen by the original author of this question, as of 2019:
Set the number of gunicorn workers to 1 (-w 1), and scale horizontally
by increasing the number of pods (using Kubernetes HPA).
This may stop being applicable in the near future, given how quickly workload-related features are growing in Kubernetes. For example, some Kubernetes distributions offer Vertical Pod Autoscaling (VPA) and Multidimensional Pod Autoscaling (MPA) alongside HPA, so I propose continuing this thread as a community wiki post.

I'm not a developer and this does not seem like a simple task, but for your consideration, please follow the best practices in "Better performance by optimizing Gunicorn config".
In addition, Kubernetes has different mechanisms for scaling your deployment, such as HPA based on CPU utilization (see "How is Python scaling with Gunicorn and Kubernetes?").
You can also use resource requests and limits for Pods and Containers.
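To make the HPA idea concrete, here is a small Python sketch of the scaling rule given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The function name and the sample numbers are illustrative assumptions, and the real controller additionally applies a tolerance and the configured min/max replica bounds, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Replica count suggested by the documented HPA formula.

    Utilization values are percentages of the CPU request.
    """
    return math.ceil(current_replicas * current_utilization / target_utilization)


# 3 pods averaging 90% CPU against a 60% target
print(desired_replicas(3, 90, 60))  # 5
```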
As per Gunicorn documentation
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
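As a rough illustration of the quoted starting formula, the sketch below computes (2 x cores) + 1 in Python. The helper name is an assumption. Note that inside a container `multiprocessing.cpu_count()` typically reports the node's CPUs rather than the pod's CPU limit, so in Kubernetes it is usually safer to pass the core count explicitly.

```python
import multiprocessing


def recommended_workers(num_cores=None):
    """Gunicorn's suggested starting point: (2 x cores) + 1."""
    if num_cores is None:
        # On Kubernetes this usually reflects the node, not the pod limit.
        num_cores = multiprocessing.cpu_count()
    return 2 * num_cores + 1


print(recommended_workers(4))  # 9
```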
Update:
Depending on your approach, you can choose a different solution (Deployment, DaemonSet). All of the above can be achieved in Kubernetes by following "Assigning CPU Resources to Containers and Pods".
Using a Deployment with resources (limits, requests) lets you scale your app into multiple pods on a single node within your hardware limits, but depending on your app's load it may not be a good enough solution.
CPU requests and limits are associated with Containers, but it is useful to think of a Pod as having a CPU request and limit. The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.
Note:
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to, for example, 1 GCP Core.
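The summing rule quoted above can be illustrated with a short Python sketch. It converts Kubernetes CPU quantities such as "500m" or "0.5" to millicores and adds them up per Pod; the function names and sample values are assumptions for illustration only.

```python
def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity ("500m", "1", "0.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def pod_cpu_millicores(container_quantities):
    """A Pod's CPU request (or limit) is the sum over its containers."""
    return sum(cpu_to_millicores(q) for q in container_quantities)


# A Pod with an app container and a sidecar
print(pod_cpu_millicores(["500m", "250m"]))  # 750
```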
As mentioned in the post, the second approach (scaling your app across multiple nodes) is also a good choice. In this case you can consider using, for example, a StatefulSet or a Deployment. In addition, on GKE the cluster autoscaler gives you a more extensible solution: when you try to create new pods that don't have enough capacity to run inside the cluster, the cluster autoscaler automatically adds resources.
On the other hand, you can consider other solutions such as Cerebral, which lets you create user-defined policies for increasing or decreasing the size of node pools inside your cluster.
GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, GKE automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, GKE can delete the node.
Please keep in mind that the question is very general and there is no single right answer for this topic. You should weigh all the pros and cons based on your requirements, load, activity, capacity, costs ...
Hope this helps.

Related

In GKE, how to minimize connect time with Load balancer

In GKE, to save costs, I usually set the node count to zero. When I scale the nodes back up (i.e. add nodes) and run the pods, it takes more than 6-7 minutes to connect to the load balancer and bring up the URL. That's why the health checks stay in the waiting state. Is there any way to reduce this time? Thanks

If Cloud Functions is not an option, you might want to look at Cloud Run (which supports containers and scales to zero) or GKE Autopilot (which does not scale to zero, but you can scale down to low resource and it will autoscale up and down as needed)

In short, not really. Node spin-up time is not easily controlled: it is the time it takes for the VM to be allocated, turned on, boot the OS and complete the Kubernetes-related steps (configuration, joining the node pool, etc.), and all of this takes time. On top of that comes the Pod spin-up time, which depends on the Docker image (size, dependencies, etc.).
Scaling your application down to zero nodes is not recommended. It is better to always have some nodes up (don't you have other apps running on the GKE cluster? Kubernetes clusters are recommended to have at least 3 nodes running).
Have you considered using Cloud Functions? Is that possible in your case? This is the best option I know of for quick scale-up and scale-down to zero.
And in general you can keep some kind of “ping” to the function to keep it “hot” for a relatively cheap price.
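A keep-warm "ping" like the one mentioned above might look roughly like this Python sketch. The function names, URL and interval are hypothetical, and in practice a cron-style scheduler would usually trigger the request instead of a long-running loop.

```python
import time
import urllib.request


def ping(url, timeout=5):
    """Issue a GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def keep_warm(url, rounds, interval_seconds, fetch=ping, sleep=time.sleep):
    """Call fetch(url) `rounds` times, pausing between calls; return the results."""
    statuses = []
    for i in range(rounds):
        statuses.append(fetch(url))
        if i < rounds - 1:
            sleep(interval_seconds)
    return statuses
```

For example, `keep_warm("https://example.invalid/my-function", rounds=3, interval_seconds=300)` would hit the (placeholder) function URL three times, five minutes apart.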
If none of the options above is possible (I'd say keeping your node pool at a minimum of 3 running nodes is best, as it takes time for the Kubernetes control plane to boot), I suggest starting by reducing the spin-up time of your Pods by improving the Docker image: reducing its size, etc.
Here are some articles on how to reduce Docker image size
https://phoenixnap.com/kb/docker-image-size
https://www.ardanlabs.com/blog/2020/02/docker-images-part1-reducing-image-size.html
After that, I would experiment with different machine types for the nodes to check which one spins up the fastest; that could be an interesting thing to do in any case.
Here is an interesting comparison on VM spinning up times
https://www.google.com/amp/s/blog.cloud66.com/part-2-comparing-the-speed-of-vm-creation-and-ssh-access-on-aws-digitalocean-linode-vexxhost-google-cloud-rackspace-packet-cloud-a-and-microsoft-azure/amp/

ECS starting tasks sequentially though resources are available

In our ECS cluster setup with an ASG Capacity Provider, we have 5 EC2 instances and each instance can take around 20 tasks, so overall there are resources available to run 100 tasks. Now if we submit a service with 100 tasks, not all tasks are started in parallel even though there are enough resources. I see tasks coming up in batches of 20 with a gap of 10 seconds between batches. I observed this in the ECS Service Event logs. Is there any configuration we can tweak to achieve complete parallelism?

This behavior is due to an artificially controlled throughput (expressed in Tasks per Second, TPS) that the ECS service control plane imposes. There is a bursting concept in there, which is why you see a batch of tasks being launched followed by a gap of some seconds. These limits exist to avoid being throttled in other parts of the service surface. They can be lifted if there is a strong need, but the engineering team will need to validate the use case and expectations (see the point about potentially hitting other limits). The best way to pursue this is to open a ticket with AWS Support and explore your alternatives (based on your requirements).
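To put numbers on the batching observed in the question, the following Python sketch estimates how long the staggered launch takes. It only models the reported pattern (batches of 20, 10 seconds apart); the function name is an assumption, and the real ECS rate limits are not published in this thread.

```python
import math


def staggered_launch(total_tasks, batch_size, gap_seconds):
    """Return (number of batches, seconds until the last batch starts)."""
    batches = math.ceil(total_tasks / batch_size)
    return batches, (batches - 1) * gap_seconds


print(staggered_launch(100, 20, 10))  # (5, 40)
```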

k8s high availability configuration edge cases for prod

We have an app in production which needs to be highly available (100%), so we did the following:
We configured 3 instances for HA, but then the node died.
We configured anti-affinity (to run on different nodes), but some updates were applied to the nodes and we were unavailable (evicted) for some minutes.
Now we are considering adding a Pod Disruption Budget:
https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
My questions are:
How does affinity work with a Pod Disruption Budget? Could there be any collision, or are these redundant configs?
Is there any other configuration which I need to add to make sure that my pods always run (as much as possible)?

How does affinity work with a Pod Disruption Budget? Could there be any collision, or are these redundant configs?
Affinity and anti-affinity are about where your Pod is scheduled, e.g. so that two replicas of the same app are not scheduled to the same node. Pod Disruption Budgets are about increasing availability during voluntary disruptions, e.g. maintenance. Both help improve the availability of your app, but they are not related to each other.
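As an illustration of the Pod Disruption Budget side, here is a Python sketch that builds a minimal `policy/v1` PodDisruptionBudget manifest and prints it as JSON. The resource name, the `app` label and the `minAvailable` value are assumptions; anti-affinity would be configured separately on the Pod template.

```python
import json


def pdb_manifest(name, app_label, min_available):
    """Build a minimal PodDisruptionBudget manifest as a dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Voluntary evictions are refused if they would drop below this.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# Keep at least 2 of the 3 replicas running during node maintenance
manifest = pdb_manifest("my-app-pdb", "my-app", 2)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -`.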
Is there any other configuration which I need to add to make sure that my pods run always (as much as possible)
Things will fail. What you need to do is embrace distributed systems and make all your workloads distributed, e.g. with multiple instances to remove single points of failure. This is done differently for stateless (e.g. Deployment) and stateful (e.g. StatefulSet) workloads. What's important for you is that your app is available as much as possible, while individual instances (e.g. Pods) can fail almost without any user noticing.
We configured 3 instances for HA, but then the node died.
Things will always fail, e.g. a physical node may crash. You need to design your app so that it can tolerate some failures.
If you use a cloud provider, you should use regional clusters that use three independent Availability Zones, and you need to spread your workload so that it runs in more than one Availability Zone. That way, your app can tolerate a whole Availability Zone going down without affecting your users.

AWS Load Balancer + NginX + EC2 AutoScaling Group

Currently, I don't have an AWS Load Balancer set up yet.
Requests come to a single EC2 instance: they first hit nginx, which then forwards them to node/express.
Now, I want to create an autoscaling group and attach an AWS load balancer to distribute the incoming requests. I am wondering if this is a good setup:
Request -> AWS Load Balancer -> Nginx A + EC2 A
                             -> Nginx B + EC2 B
                             -> ... C + ... C
Nginx is installed on the same EC2 that has node.js running on it. Nginx config has logic to detect user's location using the geoip module, as well as gzip compression configs and ssl handling.
I will also move the ssl handling to the load balancer.

Ideally (if Nginx can be decoupled from specific Node tasks) you'd want an auto scaling group dedicated to each service, and I'd suggest using containerization for this because this is exactly what it's meant for, though all of this will obviously require some non-trivial changes to your program...
This will enable...
Efficient Resource Allocation
Select instance types with the ideal mix of CPU/RAM/Network/Storage per service (Node or Nginx)
Maintain granular control over the number of tasks running relative to their actual demand.
Intelligent Scaling
Thresholds set to initialize scaling actions need to reflect the resources they're running. You may not want to say, double your more compute intensive Node capacity, when there are spikes in simple read operations to your program. By segmenting the services by resources the thresholds can be tied to the resource your service demands the most of. You may want to scale...
Nginx based on maximum inbound requests over 1 minute period
Node based on average CPU Utilization over 5 minute period
The 'chunks' that your instances are broken into, relative to the size of your tasks, also make a big difference in how efficiently they will scale. Here's an exaggerated example, on just one service...
1 EC2 t3.large running 5 Node tasks at 50% RAM Utilization.
AS group hits 70% or whatever threshold you assigned, scales out 1 instance
2 instances now running say 6 Node tasks at 30% RAM Utilization
This causes 2 problems...
You're now wasting a lot of money
Possibly more importantly... what is your scale-in threshold? 20% Utilization?
The tighter the gap between your upper and lower scaling bounds, the more efficient you'll be. When the tasks you're running are all homogeneous, you can add and remove resources in smaller and more precise 'chunks'.
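The utilization drop in the t3.large example can be reproduced with a few lines of Python. A t3.large has 8 GiB of memory; the per-task footprint of 0.8 GiB is an assumption chosen so that 5 tasks fill half of one instance, and the function name is illustrative.

```python
T3_LARGE_GIB = 8   # memory of one t3.large
TASK_GIB = 0.8     # assumed footprint of one Node task


def memory_utilization(instances, instance_gib, tasks, task_gib):
    """Fraction of the fleet's memory used by the running tasks."""
    return tasks * task_gib / (instances * instance_gib)


before = memory_utilization(1, T3_LARGE_GIB, 5, TASK_GIB)
after = memory_utilization(2, T3_LARGE_GIB, 6, TASK_GIB)
print(f"{before:.0%} -> {after:.0%}")  # 50% -> 30%
```

Adding one large instance for a single extra task nearly halves utilization, which is the waste the example points at.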
In the scenario above you'd ideally want something like...
3 t3.small instances running 5 Node tasks
AS group hits 70% Utilization, scales-out 1 instance
Now you have 6 tasks on 4 instances at 50% utilization
Utilization drops to 40% scale-in 1 instance.
You can obviously still do all of this running Node and Nginx on the same underlying resources, but the mathematics of it all gets pretty crazy and makes your system brittle.
(I've simplified the above to target Memory Utilization on the AS group, but in practice you'd have ECS adding tasks based on Utilization, which then adds to the memory Reservation of the cluster, which would then initiate the AS actions.)
Simplified & Efficient Deployment
You don't want to be redeploying your whole Node code base for every update to Nginx configuration.
Simplifies Blue/Green deployments testing and rollbacks.
Minimize the resources you have to spin up for the 'Blue' portion of your deployments.
Use customized AMIs with pre-installed binaries, if needed, for only the service that depends on them
Whether you want to do it immediately or not (and you will) this configuration will allow you to move to Spot Instances to handle more of your variable workloads. Like all of this, you can still use Spot Instances with the configuration you've laid out, but handling termination procedures efficiently and without disruptions is a whole other mess and when you get to that you want the rest of this very organized and working smoothly.
ECS
NLB
I don't know what you're using for deployment, but AWS CodeDeploy will work beautifully with ECS to manage your container clusters as well.

AWS ECS running a task that requires many cores

I am conceptually trying to understand how to use AWS ECS to run my "cluster" jobs.
I have some scientific software inside a Docker container, that natively takes advantage of as many cores as the underlying instance has to offer.
My question in this case is, can I use AWS ECS to "increase" the number of "visible" cores to the task running inside my Docker container. For instance, is my "cluster" limited to only a single instance? Or is a "cluster" expandable to multiple instances?
I haven't been able to find any answers by looking through the AWS docs.

Cluster is just some EC2 instances that are ECS-enabled (are running special agent software) and grouped together. Tasks that you run on this cluster are spread across these instances. Each task can involve multiple containers. However, each container stays within its instance ‘boundaries’, hardware-wise. It is allocated a number of “CPU units” and shares them with other containers running on the same instance.
From my understanding, running a process that spans multiple cores in a container does not quite fit the ECS architecture; it seems like trying to do part of the ECS scheduler's job.
I found these resources useful when I was reading about it:
My notes on Amazon's ECS post by Jérôme Petazzoni
Application Architecture in ECS docs
Task Definition Parameters in ECS docs

I had a similar situation moving a Python app that used a script to spawn copies of itself based on the number of cores. The answer to this isn't so much an ECS problem as it is a Docker best practice... you should strive to use 1 process per container. (see https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/)
I ended up implementing this by using a Dockerfile to run each process, and then used essential ECS tasks so that it reloads itself if the task dies.
Your cluster is a collection of EC2 instances with the ECS service running. Each instance has a certain number of CPU 'units' (typically 1024 units === 1 core) and RAM. I profiled my app at peak load and tweaked the mix until I got it where I liked it. If your app can use more CPU than that, try giving it 2048 CPU or some other amount and see how it performs. I used Meros (https://meros.io/) to profile my app.
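The CPU-unit arithmetic above can be sketched in Python as follows. It uses the 1024 units per core figure from the answer; the helper names are assumptions, and the task-count estimate ignores memory and any capacity reserved for the ECS agent.

```python
ECS_UNITS_PER_CORE = 1024  # typical mapping mentioned in the answer


def cores_to_cpu_units(cores):
    """Translate a core count (can be fractional) into ECS CPU units."""
    return int(cores * ECS_UNITS_PER_CORE)


def max_tasks_per_instance(instance_cores, task_cpu_units):
    """How many tasks of a given CPU reservation fit on one instance."""
    return cores_to_cpu_units(instance_cores) // task_cpu_units


print(cores_to_cpu_units(2))            # 2048
print(max_tasks_per_instance(4, 2048))  # 2
```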
Hope this helps!

"increase" the number of "visible" cores to the task running inside my Docker container
Containers and clusters are different things: you may run many containers on one instance, but you can't run one container across multiple instances.
A cluster is a set of Docker containers.
is my "cluster" limited to only a single instance?
No, you may choose the number of instances in the cluster.
