Running a multi-node parallel job in AWS Batch using R

I am trying to build a multi-node parallel job in AWS Batch running an R script. My R script independently runs multiple statistical models for multiple users. Hence, I want to split this job and distribute it to run in parallel on a cluster of several servers for faster execution. My understanding is that at some point I have to prepare a containerized version of my R application code using a Dockerfile pushed to ECR. My question is:
Should the parallel logic be placed inside the R code, while using a single Dockerfile? If so, how does Batch know how to split my job (into how many chunks)? Is a for-loop in the R code enough?
Or should I define the parallel logic somewhere in the Dockerfile, e.g. container1 runs the models for users 1-5, container2 runs the models for users 6-10, and so on?
Could you please share some ideas or code on that topic for better understanding? Much appreciated.

AWS Batch does not inspect or change anything in your container; it just runs it. So you need to handle the distribution of the work within the container itself.
Since these are independent processes (they don't communicate with each other over MPI, etc.), you can leverage AWS Batch array jobs. Batch multi-node parallel (MNP) jobs are for tightly coupled workloads that need inter-instance or inter-GPU communication, for example over Elastic Fabric Adapter.
Your application code in the container can leverage the AWS_BATCH_JOB_ARRAY_INDEX environment variable to process a subset of users. You can see an example in the AWS Batch docs of how to use the index.
Note that AWS_BATCH_JOB_ARRAY_INDEX is zero-based, so you will need to account for that if your user numbering / naming scheme is different.
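As a minimal sketch of that chunking logic (shown here in Python; an R entrypoint would read the same variable with Sys.getenv), where the user list, chunk size, and run_models stub are placeholders:

import os

def run_models(user: str) -> None:
    # Placeholder for the real per-user model fitting.
    print(f"fitting models for {user}")

# Each child of an array job receives its own index (0, 1, 2, ...).
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))

users = [f"user{i}" for i in range(1, 101)]  # placeholder user list
chunk_size = 5  # users handled per array child

# Index 0 handles users 1-5, index 1 handles users 6-10, and so on.
for user in users[index * chunk_size : (index + 1) * chunk_size]:
    run_models(user)

Submitting the job with an array size of 20 would then cover all 100 users.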

Related

How to schedule multiple tasks on AWS ECS using different env vars?

Let's say I have a task definition on AWS ECS and want to schedule it to run as multiple instances with different env variables (~20 parallel tasks). I have some ideas about how to do that, but I'm not sure which one is correct.
Create multiple task definitions with different env variables, which sounds ineffective and silly.
Create multiple targets for the scheduled task, but the number of targets is limited to 5, which doesn't work for me.
Use 20 container overrides, but I haven't found a way to do that through the user interface, and I'm not sure it's correct either (a sketch of this option via the API follows below).
Could you please give me an idea? Thanks
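For reference, the container-overrides option from the list above can at least be exercised through the API rather than the console; here is a rough boto3 sketch (the cluster, task definition, and container names are placeholders):

import boto3

ecs = boto3.client("ecs")

# Run 20 copies of one task definition, each with a different env variable.
for i in range(20):
    ecs.run_task(
        cluster="my-cluster",          # placeholder cluster name
        taskDefinition="my-task-def",  # placeholder task definition
        overrides={
            "containerOverrides": [{
                "name": "my-container",  # placeholder container name
                "environment": [{"name": "WORKER_INDEX", "value": str(i)}],
            }]
        },
    )

Each of the 20 tasks then sees its own WORKER_INDEX without needing 20 task definitions.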

How to run dependency Jobs on PCF?

I have two batch applications, e.g. batchapp1 and batchapp2. I want to run batchapp2 after completion of batchapp1. Can we achieve this using the PCF Scheduler, or can we achieve it using the Spring Cloud Data Flow server?
Right now we are doing it using Control-M, running on a VM's JVM (not in the cloud).
What you need is Spring Cloud Data Flow's Composed Task functionality. With this, you'd be able to orchestrate a series of Tasks/Batch jobs as a directed acyclic graph. A graph can include sequential steps, parallel steps, or both, where each step is a Task/Batch job.
For your example, in SCDF, the DSL representation would look like:
task create foo --definition "batchapp1 && batchapp2"
Upon launching the Task definition foo in SCDF on PCF, it would launch batchapp1 first and, when it completes successfully, run batchapp2 next. You can also add transitions that run cleanup/error-handling steps based on the exit code of each step.
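For instance, a transition that routes a failure to a cleanup step could be sketched in the same DSL (cleanupapp is a hypothetical task, and exact transition syntax may vary by SCDF version):
task create foo --definition "batchapp1 'FAILED' -> cleanupapp && batchapp2"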
As an alternative, you could do all this on the interactive drag & drop visual canvas, too.
Also note that in PCF all the steps would be launched as short-lived CF Tasks, which run for a finite time: each runs only as long as the app needs to, and is then cleanly shut down to free up resources.

Can I run 1000 AWS micro instances simultaneously?

I have a simple pure C program, that takes an integer as input, runs for a while (let's say an hour) and then returns me a text file. I want to run this program 1000 times with input integers from 1 to 1000.
Currently I'm running this program in parallel (4 processors), which takes 250 hours. The program is such that it fits in an AWS micro instance (I've tested it). Would it be possible to use 1000 micro instances in AWS to do the whole job in one hour (at a cost of ~$20, i.e. $0.02 per instance-hour)?
If it is possible, does anybody have some guidelines on how to do that?
If it is not, does anybody have a similarly low-budget alternative?
Thanks a lot!
In order to achieve this, you will need to:
Create an S3 bucket to store your bootstrapping scripts, application, input data, and output data.
Custom lightweight AMI: you might want to create a custom lightweight AMI which knows how to download and run the bootstrapping script.
Bootstrapping script: downloads your software and any additional data from the S3 bucket and parses a custom instance tag which contains the integer [1..1000] (a launch sketch showing such a tag follows below).
Your application: does the actual processing.
End-of-processing script: uploads the result to another S3 results bucket and terminates the instance; you might also want to send an SNS notification to communicate the end-of-processing status.
If you need result consolidation, you might want to create another instance and use it as a coordinator, waiting for all "end of processing" notifications in order to finish the run. In that case you might also consider Amazon's Hadoop MapReduce service (Elastic MapReduce) in the future, since it will do almost all of this heavy lifting for you.
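To make the launch step concrete, here is a rough sketch in Python with boto3 (the AMI ID, instance type, and tag key are placeholders) that starts one tagged instance per input integer:

import boto3

ec2 = boto3.client("ec2")

# Launch one micro instance per input integer, tagging each one with its
# number so the bootstrapping script can discover which work item is its own.
for i in range(1, 1001):
    ec2.run_instances(
        ImageId="ami-12345678",   # placeholder: your custom AMI
        InstanceType="t2.micro",  # placeholder instance type
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "JobNumber", "Value": str(i)}],
        }],
    )

On boot, the script can recover its own instance ID from the instance metadata service and look up its JobNumber tag with the DescribeTags API.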
You didn't specify what language you'd like to do this in, and since the app you use to deploy your program doesn't have to be in the same language, I would recommend C#. There are a number of examples around the web of how to programmatically spawn new Amazon instances using the SDK. For example:
How to start an Amazon EC2 instance programmatically in .NET
You can create your own AMIs with the program already present, but that might end up being a pain if you want to make adjustments to it, since it will require you to recreate the entire AMI. I'd recommend creating an extra instance, or simply hosting the program in a location accessible from a public URL. Then I would create some kind of service, installed on the AMI ahead of time, that allows me to specify the URL to download the app from along with whatever command-line parameters I want for that particular instance.
Hope this helps.
Be careful: by default you can only spawn 20 instances per region. You need to ask Amazon in order to use more instances.
If you want low cost and don't care about delay, you should use spot instances.
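A spot-based variant of the launch sketch above might look roughly like this (again boto3; the bid price, AMI, and instance type are placeholders, and not every instance type is offered on the spot market):

import boto3

ec2 = boto3.client("ec2")

# Request all 1000 instances as spot capacity: cheaper, but they start
# only when spot capacity is available at or below the bid price.
ec2.request_spot_instances(
    SpotPrice="0.01",               # placeholder bid in USD per hour
    InstanceCount=1000,
    LaunchSpecification={
        "ImageId": "ami-12345678",  # placeholder: your custom AMI
        "InstanceType": "t3.micro", # placeholder instance type
    },
)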

How can I modify the Load Balancing behavior Jenkins uses to control slaves?

We use Jenkins for our CI build system. We also use 'concurrent builds' so that Jenkins will build each change independently. This means we often have 5 or 6 builds of the same job running simultaneously. To accommodate this, we have 4 slaves each with 12 executors.
The problem is that Jenkins doesn't really 'load balance' among its slaves. It tries to build a job on the same slave that it previously built on (presumably to reduce the time syncing from source control). This is a problem because Jenkins will build all 6 instances of our build on the same slave (or more likely between 2 slaves). One build machine gets bogged down and runs very slowly while the rest of them sit idle.
How do I configure the load balancing behavior of Jenkins, and how it controls its slaves?
We were facing a similar issue, so I've put together a plugin that changes the load balancer in Jenkins to select the node that currently has the least load: https://plugins.jenkins.io/leastload/
Any feedback is appreciated.
If you do not find a plugin that does it automatically, here's an idea of what you can do:
Install the Node Label Parameter plugin
Add a SLAVE parameter to your jobs
Restrict the jobs to run on ${SLAVE}
Add a trigger job that will do the following:
Analyze the load distribution via a System Groovy Script and decide on which node to start the next build.
Dispatch the build on that node with the Parameterized Trigger plugin, assigning the appropriate value to the SLAVE parameter.
In order to analyze the load distribution you need to install the Groovy plugin and familiarize yourself with the Jenkins Main Module API. Here are some useful initial pointers.
If your build machines cannot comfortably handle more than 1 build, why configure them with 12 executors? If that is indeed the case, you should reduce the number of executors to 1. My Jenkins has 30 slaves, each with 1 executor.
You may also use the Throttle Concurrent Builds plugin to restrict how many instances of a job can run in parallel on the same node.
I have two labels -- one for small tasks and one for big tasks. I have one executor for the big tasks and 4 for the small tasks. This does balance things a little.

What pattern do you use for executing generic steps on a pool of machines

I'm trying to figure out how to model my build process in Hudson. At present most of our Hudson builds are somewhat hard-coded, in that the build process is a series of steps and we have one process per branch.
I have another build system that has many active branches, and each build has a series of integration tests which require a suite of machines to execute. As I migrate from the home-grown system to Hudson, I'm not quite sure of the right way to model this so as to keep maintenance costs and build times to a minimum.
Here's my basic build:
create workspace
compile, link, package
transfer artifacts to test systems
invoke test harness on multiple systems to handle installation and acceptance tests
collect results
publish results
I'd like the integration part to be a group of generic machines (perhaps an elastic group) which can handle integration tests for any branch. I want to run as much in parallel as possible to keep my build times low. It looks like the best way to execute in parallel on Hudson is to break the steps up into jobs and use the Parameterized Trigger Plugin to customize the generic jobs.
So I'd have two main jobs: build and test.
I could have one build job per branch and a generic test job. The build job would use the Parameterized Trigger Plugin to call the test job and provide the location of the build artifacts. The test job would call a series of jobs in parallel, passing down parameters for the branch and artifact.
test
test-client-install (params: artifact location, branch)
test-server-install (params: artifact location, branch)
test-run (params: client machine, server machine)
join - collect results (params: client machine, server machine)
Each of the test-* jobs would pull a slave out of the pool and execute. I'm not quite sure how to inform the slaves running the client and server jobs how to find each other, nor am I sure how to reserve them from the pool and release them back into it.
I guess I could write properties to a common share and have the sub-jobs use that for inter-job communication.
Has anyone created this kind of complex setup in Hudson, or is this usually done in another system with which Hudson interacts (Hudson + STAF, with STAF managing resources)?
A few thoughts. The fewer jobs you have, the easier it is to maintain them; the more jobs you have, the more flexible you are and the more you can run in parallel. Since you emphasized fast build times, have a look at the Join plugin, which lets you run a few jobs in parallel and, when all of them are finished, go on to another job in the chain.
If your server is big enough, you can experiment a little with the Clone Workspace plugin. It will help you reduce the need for manually copying files between jobs, and the need for the Parameterized Trigger plugin.
Reserving a slave is easy. You can group the slaves with labels. In your job you define what label a node has to have in order to execute the job. A node can have more than one label, and your job can be bound to more than one label; this way, Hudson decides where to put your job depending on availability. If your slaves have more than one executor, they can run two jobs in parallel. I haven't used the Locks and Latches plugin for synchronizing across nodes, so I don't know if locks are only per node or for the whole Hudson installation; latches are not supported yet. If you need to ensure that two jobs run on the same slave, try to combine them; otherwise you will lose Hudson's ability to distribute your jobs freely over the available nodes.