I'm running an application with the following setup:
Trained an ML model in TensorFlow.
Created an API using FastAPI and wrapped it around the ML model for inference.
Created a Dockerfile and containerized the whole application.
Pushed the image to ECR.
Created an EKS environment (2 nodes: 1 GPU, 1 CPU).
Deployed to EKS as a Kubernetes pod running the above image as a container inside the pod.
Enabled HPA (Horizontal Pod Autoscaler) to achieve scaling.
We're able to achieve ~15 QPS (queries per second) with the above architecture, but we need to scale it to 50 QPS for a client.
One approach is to add more nodes and scale the application using AWS node autoscaling.
I was also looking into AWS Lambda functions; although intriguing, I can't really use them, since my pod (the API endpoint) needs a GPU for fast inference.
I was wondering if there is a better approach to dealing with this.
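For reference, the HPA in the setup above was enabled with something along these lines (a minimal sketch; the deployment name and thresholds are assumptions, not the actual values used):
# Hypothetical example: scale the inference deployment on CPU utilization
kubectl autoscale deployment inference-api --cpu-percent=70 --min=1 --max=10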
We are using GKE to host our apps with Anthos. Our default node pool is set to autoscale, but I noticed that out of 5 running pods, only 2 are hosting our actual services.
All the others are running internal GKE/Anthos services.
The issue with that is that there isn't enough room left for running our own services. I guess these internal services are vital for the cluster; otherwise the cluster would autoscale and the nodes would get removed.
What would be the best approach to solve this issue? I thought of upgrading the nodes' machine type to allow more resources per node and have more room within them, and thus have fewer running nodes, but I wanted to make sure I was not simply missing something about how GKE works.
I've been digging for quite some time now, but it seems that would be my only option.
GKE itself requires several add-on resources which are deployed as part of your cluster. You can fine-tune the resource usage of some of the GKE add-ons for smaller clusters. Additionally, each Anthos capability you enable typically deploys a set of controllers as well. GKE and Anthos try to minimize the compute resources used by these services/controllers, but you do need to account for them when calculating the right size(s) for your nodes. A good rule of thumb is to assume that system services/controllers will use ~1 vCPU when using GKE/Anthos (it's typically lower than that, but it makes things easier). So if your workloads all request >= 1 vCPU, you'll likely need to use nodes that have a minimum of 4 vCPUs. You'll also want to enable the cluster autoscaler for your node pools if you don't want to pre-provision everything.
A better option would be to use node auto-provisioning. In this case you don't need to create or manage your own node pools, as GKE will automatically add and remove nodes/node pools based on the resources requested by your deployments.
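As a rough sketch, node auto-provisioning can be enabled on an existing cluster with something like the following (the cluster name, zone, and resource limits are placeholders):
# Hypothetical example: enable node auto-provisioning with CPU/memory limits
gcloud container clusters update my-cluster \
    --zone us-central1-a \
    --enable-autoprovisioning \
    --min-cpu 1 --max-cpu 32 \
    --min-memory 1 --max-memory 128
GKE then creates and deletes node pools within those limits based on the resource requests of pending pods.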
I'm working on a video rendering server in Node.js. It spawns multiple headless Chrome instances and uses the Puppeteer library to capture screenshots and feed them to FFmpeg. Later it concatenates all the parts with some post-processing.
Now I want to move it to production, but it's not performing efficiently.
I tried a serverless architecture, Cloud Run, etc., but still couldn't achieve it; they also clearly state that these are not meant for heavy, long-running tasks. The video takes too long to render, even longer than my laptop takes.
I tried using GCE and the results are satisfactory, but now I'm having a hard time scaling it. The server can only handle one request at a time efficiently. How do I scale horizontally and make sure that each instance gets only one request at a time?
Thanks in advance.
To scale up the number of identical instances you can use Managed Instance Groups. Have a look at the autoscaling documentation to get a better understanding of how it works, but basically it says:
You can autoscale based on one or more of the following metrics that reflect the load of the instance group:
Average CPU utilization.
HTTP load balancing serving capacity, which can be based on either utilization or requests per second.
Cloud Monitoring metrics.
If you will be autoscaling based on CPU usage, then just enable autoscaling and set it up when creating a new group of instances.
Here's an example gcloud command to do this:
gcloud compute instance-groups managed set-autoscaling example-managed-instance-group \
--max-num-replicas 20 \
--target-cpu-utilization 0.60 \
--cool-down-period 90
You can also use any available metric to scale your group, or even create a new custom metric that will trigger scaling up your group.
You can create custom metrics using Cloud Monitoring and write your own monitoring data to the Monitoring service. This gives you side-by-side access to standard Google Cloud data and your custom monitoring data, with a familiar data structure and consistent query syntax. If you have a custom metric, you can choose to scale based on the data from these metrics.
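As an illustrative sketch (the metric name and target values below are placeholders, not a recommendation), scaling a managed instance group on a custom Cloud Monitoring metric looks roughly like this:
# Hypothetical example: scale on a custom per-instance gauge metric
gcloud compute instance-groups managed set-autoscaling example-managed-instance-group \
    --max-num-replicas 20 \
    --update-stackdriver-metric custom.googleapis.com/example_queue_depth \
    --stackdriver-metric-utilization-target 10 \
    --stackdriver-metric-utilization-target-type gauge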
And last, I've found this example use case that scales a group of VMs based on a Pub/Sub queue, which might be the solution you're looking for.
When I select "scheduled job" while initiating resources, how is the process handled internally?
Can I verify the container in ECS? I guess it will use batch jobs for this option.
# copilot init
Note: It's best to run this command in the root of your Git repository.
Welcome to the Copilot CLI! We're going to walk you through some questions
to help you get set up with an application on ECS. An application is a collection of
containerized services that operate together.
Which workload type best represents your architecture? [Use arrows to move, type to filter, ? for more help]
> Load Balanced Web Service
Backend Service
Scheduled Job
What will be the charges if I select backend service or scheduled job?
Copilot uses Fargate containers under the hood; therefore, your charges for a backend service are based on the number of containers you have running and the CPU/memory size of those containers. The minimum container size is 0.25 vCPU and 512 MB of reserved memory.
For other service types, your pricing depends on a few more things.
Load Balanced Web Service
Fargate containers based on size and number (~$9/month for the smallest possible container)
Application Load Balancer (about $20/month depending on traffic)
Backend Service
Fargate containers based on size and number (~$9/month for the smallest possible container)
Scheduled Job
Fargate containers based on size, number, and invocation frequency and duration (i.e., you only pay for the minutes you use)
State Machine transitions: the first 4,000 transitions in a month are free, which corresponds to an invocation frequency of about once every 21 minutes, assuming there are no retry transitions. Transitions beyond that limit are billed at a low rate.
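For context on invocation frequency, the schedule is simply whatever you pass when initializing the job; a hedged sketch (the job name, Dockerfile path, and schedule are placeholders):
# Hypothetical example: a job that runs every 30 minutes
copilot job init --name report-gen --dockerfile ./Dockerfile --schedule "@every 30m"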
Other notes
All Copilot-deployed resources are grouped with a number of resource tags. You can use those tags to understand billing activity, and even add your own tags via the --resource-tags flag in copilot svc deploy or copilot job deploy.
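For example (the tag keys and values here are just placeholders):
# Hypothetical example: attach custom cost-allocation tags at deploy time
copilot svc deploy --resource-tags department=data-science,team=inference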
The tags we use to logically group resources are the following:
copilot-application: name of the application this resource belongs to
copilot-environment: name of the environment this resource belongs to
copilot-service: name of the service or job this resource belongs to
The copilot-service tag is used for both jobs and services for legacy reasons.
Copilot refers to these entities as common cloud architectures [1].
I could not find an official document which outlines how these architectures are composed in detail. I guess it might be an implementation detail from the creators' perspective when you look at point one of the AWS Copilot CLI charter [2]:
Users think in terms of architecture, not of infrastructure. Developers creating a new microservice shouldn't have to specify VPCs, load balancer settings, or complex pipeline configuration. They may not know anything about other AWS services. They should be able to specify what "kind" of application it is and how it fits into their overall architecture; the infrastructure should be generated from that.
I have to agree that more sophisticated users always ask themselves what the costs of a specific architecture will look like, and I completely endorse the idea of having a dedicated Copilot command such as copilot estimate costs service-xy which could be executed before creating the service.
There is some high-level documentation on the architecture types Load Balanced Web Service and Backend Service. [3]
It mentions the command copilot svc show in conjunction with the --resources flag [4]:
You can also provide an optional --resources flag to see all AWS resources associated with your service.
I think this gives you the ability to estimate costs right after bringing the services up and running.
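A quick sketch of how that might look (the service name is a placeholder):
# Hypothetical example: list the CloudFormation resources behind a service
copilot svc show --name api --resources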
A somewhat more complicated approach, which I frequently apply to understand complex constructs in the AWS CDK, is to look at the source code. For example, you could open the corresponding Go file for the Load Balanced Web Service architecture: [5]. Digging into the code, you'll notice that they make it pretty clear that they are using Fargate containers instead of EC2 instances.
That is also what they tell us in the high-level service docs [4]:
You can select a Load Balanced Web Service and Copilot will provision an application load balancer, security groups, an ECS Service and run your service on Fargate.
More on Fargate: [6]
Btw, there is a really interesting comment in the issue section which outlines why they decided against supporting EC2 in the first place [7]:
What features have the team currently explicitly decided against?
EC2 comes to mind. I think we could have built a really nice experience ontop of EC2 instances - but I think there's a difference between building an "abstraction" around ECS / Fargate and building an "illusion" around ECS / EC2. What I mean by that is that if we created the illusion of a fully hands off EC2 experience, customers might be surprised that they are expected to be in charge of patching and security maintenance of those instances. This isn't something that Copilot, a CLI can really automate for people (realistically). We're still trying to figure out a good way to expose EC2 to folks - but we definitely see Fargate as the future.
I am trying to create a certain kind of networking infrastructure, and have been looking at Amazon ECS and Kubernetes. However, I am not quite sure if these systems do what I am actually seeking, or if I am contorting them into something else. If I could describe my task at hand, could someone please verify whether Amazon ECS or Kubernetes will actually aid me in this effort, and whether this is the right way to think about it?
What I am trying to do is on-demand single-task processing on an AWS instance. What I mean by this is, I have a resource-heavy application which I want to run in the cloud to process a chunk of data submitted by a user. I want to submit this data to be processed by the application, have an EC2 instance spin up, process the data, upload the results to S3, and then shut down the EC2 instance.
I have already put together a functioning solution for this using Simple Queue Service, EC2 and Lambda. But I am wondering whether ECS or Kubernetes would make this simpler? I have been going through the ECS documentation and it seems like it is not very concerned with starting up and shutting down instances. It seems like it wants to have an instance that is constantly running, then Docker images are fed to it as tasks to run. Can Amazon ECS be configured so that if there are no tasks running it automatically shuts down all instances?
Also I am not understanding how exactly I would submit a specific chunk of data to be processed. It seems like "Tasks" as defined in Amazon ECS really correspond to a single Docker container, not so much what kind of data that Docker container will process. Is that correct? So would I still need to feed the data-to-be-processed into the instances via simple queue service, or other? Then use Lambda to poll those queues to see if they should submit tasks to ECS?
This is my naive understanding right now; if anyone could help me understand the things I've described better, or point me to better ways of thinking about this, it would be appreciated.
This is a complex subject and many details for a good answer depend on the exact requirements of your domain / system. So the following information is based on the very high level description you gave.
A lot of the features of ECS, Kubernetes etc. are geared towards allowing a distributed application that acts as a single service and is horizontally scalable, upgradeable and maintainable. This means it helps with unifying service interfacing, load balancing, service reliability, zero-downtime maintenance, scaling the number of worker nodes up/down based on demand (or other metrics), etc.
The following describes a high level idea for a solution for your use case with kubernetes (which is a bit more versatile than AWS ECS).
So for your use case you could set up a kubernetes cluster that runs a distributed event queue, for example an Apache Pulsar cluster, as well as an application cluster that is being sent queue events for processing. Your application cluster size could scale automatically with the number of unprocessed events in the queue (custom pod autoscaler). The cluster infrastructure would be configured to scale automatically based on the number of scheduled pods (pods reserve capacity on the infrastructure).
You would have to make sure your application can run in a stateless form in a container.
The main benefit I see over your current solution would be cloud provider independence, as well as some general benefits from running a containerized system: 1. not having to worry about the exact setup of your EC2 instances in terms of operating system dependencies of your workload; 2. being able to address the processing application as a single service; 3. potentially increased reliability, for example in case of errors.
Regarding your exact questions:
Can Amazon ECS be configured so that if there are no tasks running it
automatically shuts down all instances?
The keyword here is autoscaling. Note that there are two levels of scaling: 1. infrastructure scaling (the number of EC2 instances) and 2. application service scaling (the number of application containers/tasks deployed). ECS infrastructure scaling works based on EC2 Auto Scaling groups. For more info see this link. For application service scaling and serverless ECS (Fargate) see this link.
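As a hedged sketch of the application-service level (cluster name, service name, and capacities are placeholders), ECS service auto scaling is configured through Application Auto Scaling:
# Hypothetical example: allow the ECS service to scale between 0 and 10 tasks
aws application-autoscaling register-scalable-target \
    --service-namespace ecs \
    --scalable-dimension ecs:service:DesiredCount \
    --resource-id service/my-cluster/my-service \
    --min-capacity 0 \
    --max-capacity 10
A scaling policy (for example, target tracking on CPU, or a step policy driven by a queue-depth alarm) would then adjust the desired task count within those bounds.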
Also I am not understanding how exactly I would submit a specific
chunk of data to be processed. It seems like "Tasks" as defined in
Amazon ECS really correspond to a single Docker container, not so much
what kind of data that Docker container will process. Is that correct?
A "Task Definition" in ECS describes how one or multiple Docker containers should be deployed for a purpose and what their environment and limits should be. A "Task" is a single running instance of a task definition, and a "Service" can deploy and maintain one or multiple tasks. Similar concepts in Kubernetes are the Pod and the Service/Deployment.
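One common pattern for handing a specific chunk of data to a task is to launch it with per-run overrides that point at the input, for example an S3 key passed as an environment variable. A hedged sketch (cluster, task definition, container name, and key are all placeholders):
# Hypothetical example: run one task for one chunk of data
aws ecs run-task \
    --cluster my-cluster \
    --task-definition data-processor \
    --overrides '{"containerOverrides":[{"name":"processor","environment":[{"name":"INPUT_S3_KEY","value":"input/chunk-001.json"}]}]}'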
So would I still need to feed the data-to-be-processed into the
instances via simple queue service, or other? Then use Lambda to poll
those queues to see if they should submit tasks to ECS?
A queue is always helpful in decoupling the service requests from processing and to make sure you don't lose requests. It is not required if your application service cluster can offer a service interface and process incoming requests directly in a reliable fashion. But if your application cluster has to scale up/down frequently, that may impact its ability to process requests reliably.
Currently, I don't have an AWS load balancer set up yet.
Requests come to a single EC2 instance: they first hit Nginx, which then forwards them to Node/Express.
Now, I want to create an autoscaling group and attach an AWS load balancer to distribute the requests that come in. I am wondering if this is a good setup:
Request -> AWS Load Balancer -> Nginx A + EC2 A
                             -> Nginx B + EC2 B
                             -> ... C + ... C
Nginx is installed on the same EC2 instance that has Node.js running on it. The Nginx config has logic to detect the user's location using the geoip module, as well as gzip compression settings and SSL handling.
I will also move the SSL handling to the load balancer.
Ideally (if Nginx can be decoupled from specific Node tasks) you'd want an auto scaling group dedicated to each service, and I'd suggest using containerization for this because this is exactly what it's meant for, though all this will obviously require some non-trivial changes to your program...
This will enable...
Efficient Resource Allocation
Select instance types with the ideal mix of CPU/RAM/Network/Storage per service (Node or Nginx)
Maintain granular control over the number of tasks running relative to their actual demand.
Intelligent Scaling
Thresholds set to trigger scaling actions need to reflect the resources they're running on. You may not want to, say, double your more compute-intensive Node capacity when there are spikes in simple read operations against your application. By segmenting the services by resource, the thresholds can be tied to the resource each service demands the most of. You may want to scale...
Nginx based on maximum inbound requests over a 1-minute period
Node based on average CPU utilization over a 5-minute period
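For instance, the Node CPU target above could be expressed as a target-tracking policy on its auto scaling group; a hedged sketch (group name, policy name, and target value are placeholders):
# Hypothetical example: keep average CPU around 60% for the Node ASG
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name node-asg \
    --policy-name node-cpu-target \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":60.0}'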
The 'chunks' that your instances are broken into, relative to the size of your tasks, also make a big difference in how efficiently they will scale. Here's an exaggerated example, on just one service...
1 EC2 t3.large running 5 Node tasks at 50% RAM utilization.
AS group hits 70% or whatever threshold you assigned, scales out 1 instance
2 instances now running, say, 6 Node tasks at 30% RAM utilization
This causes 2 problems...
You're now wasting a lot of money
Possibly more importantly... what is your scale-in threshold? 20% utilization?
The tighter the gap between your upper and lower scaling bounds, the more efficient you'll be. When the tasks you're running are all homogeneous, you can add and remove smaller and more precise 'chunks' of resources.
In the scenario above you'd ideally want something like...
3 t3.small instances running 5 Node tasks
AS group hits 70% utilization, scales out 1 instance
Now you have 6 tasks on 4 instances at 50% utilization
Utilization drops to 40%, scale in 1 instance.
You can obviously still do all of this running Node and Nginx on the same underlying resources, but the mathematics of it all gets pretty crazy and makes your system brittle.
(I've simplified the above to target memory utilization on the AS group, but in practice you'd have ECS adding tasks based on utilization, which then adds to the memory reservation of the cluster, which would then initiate the AS actions.)
Simplified & Efficient Deployment
You don't want to be redeploying your whole Node code base for every update to Nginx configuration.
Simplifies Blue/Green deployments, testing and rollbacks.
Minimize the resources you have to spin up for the 'Blue' portion of your deployments.
Use customized AMIs with pre-installed binaries, if needed, for only the service dependent on them.
Whether you want to do it immediately or not (and you will) this configuration will allow you to move to Spot Instances to handle more of your variable workloads. Like all of this, you can still use Spot Instances with the configuration you've laid out, but handling termination procedures efficiently and without disruptions is a whole other mess and when you get to that you want the rest of this very organized and working smoothly.
See also: ECS, NLB
I don't know what you're using for deployment, but AWS CodeDeploy will work beautifully with ECS to manage your container clusters as well.