Kubernetes (GKE/AWS/Azure) Scaling for Large Jobs

I am looking for some advice, and I would be eternally grateful if anyone would be able to point me in the right direction.
I have a Docker container that I use to do machine-learning-based object detection/tracking across sets of video frames. Currently, I start up an EC2 instance with this Docker container, and then send batches of approximately 30 frames in serial fashion. Of course, this is prohibitively slow.
I would like to set up a Kubernetes system that can go from zero running containers to 50+, then immediately back down to the minimum required. Each container requires about 8 GB of RAM due to the model size but can run on CPU. I would need these to run for about one minute to process the incoming images in parallel and then terminate, scaling down to zero active containers after the video processing is complete. In summary: send small batches of 30 frames to the cluster, have it scale up massively, and then scale down immediately when done.
I was able to set up a Kubernetes cluster on Google Cloud, but I cannot figure out how to make it scale all the way down to zero quickly after the job terminates. Having so many containers running after the job is done would be very expensive.
Would anybody be able to point me in the right direction? Can I do this with GKE? Is there a different service I should try?
Many thanks in advance for your help.
N

If I've understood your task correctly, what you're looking for is parallel processing with Kubernetes Jobs. With this feature of K8s, you can run a job as multiple pods in parallel, and those pods are terminated when the job is done.
You can read more in the following documentation:
https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/
https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/
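
As a rough illustration of what such a parallel Job could look like, here is a minimal sketch that builds a `batch/v1` Job manifest as a Python dict and prints it as JSON (which `kubectl apply -f` also accepts). The job name, image name, and the 60-second cleanup value are assumptions made up for the example; the 8Gi memory request and the parallelism of 50 come from the question. It assumes each pod pulls its own batch of frames from a work queue, as described in the second link.

```python
import json


def build_job_manifest(name, image, parallelism, memory="8Gi"):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict.

    All names passed in are illustrative; nothing here talks to a cluster.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Run this many pods at the same time.
            "parallelism": parallelism,
            # The Job is complete once this many pods have succeeded.
            "completions": parallelism,
            # Delete the finished Job (and its pods) after this many seconds.
            # The value 60 is an assumption for the example.
            "ttlSecondsAfterFinished": 60,
            "template": {
                "spec": {
                    # Job pods must not restart forever once they exit.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            # Each container needs about 8 GB for the model.
                            "resources": {"requests": {"memory": memory}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job and image names.
    manifest = build_job_manifest("frame-batch", "example.com/detector:latest", 50)
    print(json.dumps(manifest, indent=2))
```

The Job only controls the pods. Removing the now-idle nodes is a separate matter that depends on the cluster's node autoscaling, which this sketch does not cover.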

Related

AWS Step Functions vs Luigi for orchestration

My team has a monolithic service for a small-scale project. As part of a re-architecture to scale it, we are planning to move to AWS, and for orchestration we are evaluating whether to run Luigi as a container task or to use AWS Step Functions instead. I don't have experience with either of them, especially Luigi.
Can anyone point out any issues they have seen with Luigi, or how it might prove better than AWS Step Functions, if at all? Any other suggestions are welcome.
Thanks in advance.

I don't know about how AWS does orchestration, but if you are planning to at any time scale to at least thousands of jobs, I would not recommend investing in Luigi. Luigi is extremely useful for small to medium(ish) projects. It provides a fantastic interface for defining jobs and ensuring job completion through atomic filesystem actions. However, the problem when it comes to Luigi is the framework for running jobs. Luigi requires constant communication to workers for them to run, which in my own experience destroyed network bandwidth when I tried to scale.
For my research, I will generate a network of 10,000 tasks on a light to medium workflow, using my university's cluster computing grid, which runs SLURM. None of my tasks takes long to complete, maybe 5 minutes at most each. I have tried the following three methods to use Luigi efficiently.
The first method used SciLuigi's SLURM task to submit jobs to SLURM from a central Luigi worker (not using the central scheduler). This method works well if your jobs will be accepted quickly and run. However, it uses an unreasonable amount of resources on the scheduling node, as each worker is a new process. Further, it destroys any priority you would have in the system. A better method would be to first allocate many workers and then have them continually work on jobs.
The second method I attempted was just that. I started the Luigi central scheduler on my home server (because otherwise I could not monitor the state of work, just like in the above workflow) and started up workers on the SLURM cluster that all had the same configuration, so each of them could run any part of the experiment. The problem was, even with 500Mbps internet, past ~50 workers Luigi would stop functioning and so would my internet connection to my server. So, I began running jobs with only 50 workers, which drastically slowed my workflow. In addition, each worker had to register each job with the central scheduler (another huge pain point), which could take hours with only 50 workers.
To mitigate this startup time I decided to partition the root-task subtrees by their parameters and submit each to SLURM. So now the startup time is reasonably low, but I lost the ability for any worker to run any job, which is still pretty important. Also, I can still only work with ~50 workers. When I completed the subtrees, I ran one last job to finish the experiment.
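
The partitioning step in the third approach might look something like the following minimal sketch. It is not the actual code used: the task representation (a dict of parameters), the parameter names, and the function name are all hypothetical, and the submission of each group to SLURM is left out.

```python
from collections import defaultdict


def partition_by_parameter(tasks, key):
    """Group task parameter dicts by the value of one parameter.

    Each resulting group stands in for one subtree that would be
    submitted to the cluster as its own independent run.
    """
    groups = defaultdict(list)
    for task in tasks:
        groups[task[key]].append(task)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical experiment: 2 datasets x 3 seeds.
    tasks = [{"dataset": d, "seed": s} for d in ("a", "b") for s in range(3)]
    for value, group in partition_by_parameter(tasks, "dataset").items():
        print(value, len(group))
```

Each group only registers its own tasks with a scheduler, which is where the shorter startup time comes from; the cost, as noted, is that a worker in one group cannot pick up work from another.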
In conclusion, Luigi is great for small to medium-small workflows, but once you start hitting 1,000+ tasks and workers, the framework quickly fails to keep up. I hope that my experiences provide some insight into the framework.

How many threads/processes to create in an ECS task

A c5.2xlarge instance has 8 vCPU. If I run os.cpu_count() (Python) or std::thread::hardware_concurrency() (C++) they each report 8 on this instance. I assume the underlying hardware is probably a much bigger machine, but they are telling me what I have available to me, and that seems useful and correct.
However, if my ECS task requests only 2048 CPU (2 vCPU), then it will still get 8 from the above queries on a c5.2xlarge machine. My understanding is Docker is going to limit my task to only using "2 vCPU worth" of CPU, if other busy tasks are running. But it's letting me see the whole instance.
It seems like this would lead to tasks creating too many threads/processes.
For example, if I'm running 2048 CPU tasks on a c5.18xlarge instance, each task will think it has 72 cores available. They will all create way too many threads/processes overall; it will work but be inefficient.
What is the best practice here? Should programs somehow know their ECS task reservation? And create threads/processes according to that? That seems good except then you might be under-using an instance if it's not full of busy tasks. So I'm just not sure what's optimal there.
I guess the root issue is Docker is going to throttle the total amount of CPU used. But it cannot adjust the number of threads/processes you are using. And using too many or too few threads/processes is inefficient.
See discussion of cpu usage in ECS docs.
See also this long blog post: https://goldmann.pl/blog/2014/09/11/resource-management-in-docker/

There is a huge difference between virtualization technologies and containers, and having a clear understanding of both will help. That being said, an application should be configurable if you want to deploy it in different environments.
I would suggest creating an optional config value that tells the application it can only use a certain number of CPU cores. If that value is not provided, the application falls back to auto-detection.
Once you have this option, you can set it when defining the ECS task, which will fix the problem you are facing.
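
A minimal sketch of that idea in Python, assuming the limit is passed as an environment variable. The variable name `APP_MAX_WORKERS` is hypothetical; you would set it in the ECS task definition to match the task's CPU reservation (for example 2 for a 2048-unit task).

```python
import os


def worker_count(env_var="APP_MAX_WORKERS"):
    """Return how many worker threads/processes to start.

    Uses the optional override from the environment when it is a
    positive integer; otherwise falls back to auto-detection.
    """
    value = os.environ.get(env_var)
    if value:
        try:
            limit = int(value)
        except ValueError:
            limit = 0  # Ignore malformed values and auto-detect instead.
        if limit > 0:
            return limit
    # os.cpu_count() reports the whole instance and may return None.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("starting", worker_count(), "workers")
```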

speed up boot time of compute engine instances

I've written a simple batch file that starts Apache and sends a curl request to my server at start time. I am using Windows Server 2016 and an n-4 Compute Engine instance.
I've noticed that two identical machines require vastly different startup times. One sends a message in just 40s; the other takes almost 80s. While in the console both seem to start at the same time, the reality is different, since the second one is inaccessible for 80s via RD tools.
The second machine is made from a disk image of the first one. What factors contribute to the start time? Where should I trim the fat?

The delay could occur if the instances are in different regions and also if the second instance has some additional memory intensive applications or additional customizations done. The boot disk type for the instance also contributes to the booting time. Are you getting any information from the logs about this delay during the startup time? You could also compare traceroute results on both instances to see if there is a delay at some point in the network.

How to minimize Google Cloud launch latency

I have a persistent server that unpredictably receives new data from users, needing about 10 GPU instances to crank at the problem for about 5 minutes, and I send the answer back to the users. The server itself is a cheap always-persistent single CPU Google Cloud instance. When a user request comes in, my code launches my 10 created but stopped Google Cloud GPU instances with
gcloud compute instances start (instance list)
In the rare case that the stopped instances don't exist (sometimes they get wiped), that's detected and they're recreated with
gcloud beta compute instances create (...)
This system all works fine. My only complaint is that even with created but stopped instances, the launch time before my GPU code finally starts to run is about 5 minutes. Most of this is just the time for the instance itself to launch its Ubuntu host and call my code; once Ubuntu is running, the delay to start the GPU is only about 10 seconds.
How can I reduce this 5 minute delay? I imagine most of it comes from Google having to copy over the 4GB of instance data to the target machine, but the startup time of (vanilla) Ubuntu adds probably 1 more minute. I'm not even sure if I could quantify these two numbers independently, I only can measure the combined 3-7 minutes delay from the launch until my code starts responding.
I don't think Ubuntu OS startup time is the major startup latency contributor since I timed an actual machine with the same Ubuntu and same GPU on my desk from poweron boot up and it began running my GPU code in 46 seconds.
My goal is to get results back to my users as soon as possible, and that 5 minute startup delay is a bottleneck.
Would making a smaller instance SIZE of say 2GB help? What else can I do to reduce the latency?

2GB is large. That's a heckuva big image. You should be able to cut that down to 100MB, perhaps using Alpine instead of Ubuntu.
Copying 4GB of data is also less than ideal. Given that, I suspect the solution will be more of an architecture change than a code change.
But if you want to take a whack at everything which is NOT about your 4GB of data, there is a capability to prepare a custom image for your VMs. If you can build a slim custom image that will help.
There are good resources for learning more; the two I would start with are:
- Improve GCE Boot Times with Custom Images
- Three steps to Compute Engine startup-time bliss: Google Cloud Performance Atlas

AWS EC2 ECS - How many tasks should I place on a single instance?

At the moment, I have a single c4.large (3.75GB RAM, 2 vCPU) instance in my workers cluster, currently running 21 tasks for 16 services. These tasks range from image processing to data transformation, with most sending HTTP requests too. As you can see, the instance is quite well utilised.
My question is, how do I know how many tasks to place on an instance? I am placing up to 8 tasks for a service, but I'm unsure as to whether this results in a speed increase, given they are using the same underlying instance. How do I find the optimal placement?
Should I put many chefs in my kitchen, or will just two get the food out to customers faster?

We typically run lots of smaller servers in our clusters, such as 4-6 t2.small instances for our workers, and place 6-7 tasks on each. The main reason for this is not to speed up processing but to reduce the blast radius of servers going down.
We've seen it quite often for a server to simply fail an instance health check and for AWS to take it down. Having the workers spread out reduces the effect on the system.

I agree with the other people’s 80% rule. But you never want a single host for any kind of critical application. If that goes down you’re screwed. I also think it’s better to use larger servers because of their increased network performance. You should look into a host with enhanced networking, especially because you say you have a lot of HTTP work.
Another thing to consider is disk I/O. If you are piling too many tasks on a host and there is a failure, it’s going to try to schedule those all somewhere else. I have had servers crash because of too many tasks being scheduled and burning through disk credits.
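
For a rough upper bound on task count, one option is to work it out from the reservations. This is only a sketch under assumptions: it reads the "80% rule" as "keep reserved CPU and memory under 80% of the host", uses the ECS convention of 1024 CPU units per vCPU, and the per-task reservations (256 CPU units, 128 MiB) are made up for the example. It says nothing about whether more tasks actually run faster, and it ignores memory taken by the OS and the ECS agent.

```python
def max_tasks(host_cpu_units, host_mem_mib, task_cpu_units, task_mem_mib,
              headroom_pct=80):
    """Upper bound on tasks per host by CPU and memory reservation.

    headroom_pct is the share of the host we allow tasks to reserve;
    whichever resource runs out first sets the limit.
    """
    usable_cpu = host_cpu_units * headroom_pct // 100
    usable_mem = host_mem_mib * headroom_pct // 100
    return min(usable_cpu // task_cpu_units, usable_mem // task_mem_mib)


if __name__ == "__main__":
    # c4.large from the question: 2 vCPU = 2048 CPU units, 3.75 GB = 3840 MiB.
    # The task reservations below are hypothetical.
    print(max_tasks(2048, 3840, task_cpu_units=256, task_mem_mib=128))
```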
