I am implementing an MR ETL job and I have only mapper tasks and no reducer tasks.
I always see only one mapper running inside a container.
Is it possible to run multiple mappers inside a container, or can only one map/reduce task run inside a container?
A container runs only one task (map, reduce, or anything else). However, multiple containers can run on the same machine at the same time if there are enough resources available.
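As a rough illustration of the "enough resources" part (the values below are made up), the number of map containers that fit on one node is roughly the memory the NodeManager offers divided by the memory each map task requests:

    <!-- yarn-site.xml: total memory the NodeManager offers to containers -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>16384</value>  <!-- 16 GB per node -->
    </property>

    <!-- mapred-site.xml: memory requested for each map task's container -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>   <!-- 2 GB per mapper, so up to ~8 map containers on this node -->
    </property>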
I am trying to build a multi-node parallel job in AWS Batch running an R script. My R script independently runs multiple statistical models for multiple users. Hence, I want to split this job and distribute it to run in parallel on a cluster of several servers for faster execution. My understanding is that at some point I have to prepare a containerized version of my R application code using a Dockerfile pushed to ECR. My questions are:
Should the parallel logic be placed inside the R code, while using a single Dockerfile? If yes, how does Batch know how to split my job (into how many chunks)? Is a for-loop in the R code enough?
Or should I define the parallel logic somewhere in the Dockerfile, saying: container1 runs the models for users 1-5, container2 runs the models for users 6-10, etc.?
Could you please share some ideas or code on that topic for better understanding? Much appreciated.
AWS Batch does not inspect or change anything in your container; it just runs it. So you would need to handle the distribution of the work within the container itself.
Since these are independent processes (they don't communicate with each other over MPI, etc.), you can leverage AWS Batch array jobs. Batch multi-node parallel (MNP) jobs are for tightly coupled workloads that need inter-instance or inter-GPU communication using Elastic Fabric Adapter.
Your application code in the container can read the AWS_BATCH_JOB_ARRAY_INDEX environment variable to decide which subset of users to process. Note that the index is zero-based, so you will need to account for that if your user numbering or naming scheme starts at 1.
You can see an example in the AWS Batch docs of how to use the index.
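A minimal sketch of what that could look like in R (assumptions: the array job is submitted with, say, --array-properties size=N, users are numbered consecutively from 1, and run_models_for_user() stands in for your existing model code):

    # Zero-based index AWS Batch assigns to this child job of the array.
    idx <- as.integer(Sys.getenv("AWS_BATCH_JOB_ARRAY_INDEX", unset = "0"))

    users_per_job <- 5                                   # illustrative chunk size
    user_ids <- seq(idx * users_per_job + 1, (idx + 1) * users_per_job)

    for (u in user_ids) {
      run_models_for_user(u)   # placeholder for your per-user model fitting
    }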
At the moment I have a load balancer in front of a Compute Engine instance group which has a minimum of 1 server and a maximum of 5 servers.
The group is autoscaled and uses a pre-built Ubuntu template with all the base software needed.
When an instance boots up, it registers a runner with the GitLab project and then triggers the job that updates the instance to the latest copy of the code.
This is fine and works well.
The issue comes when I make a change to the git branch and push the changes: the resulting job only seems to get picked up by one of the 5 instances, seemingly at random.
I was under the impression that GitLab would push the job out to all the registered runners, but this doesn't seem to be the case.
I have seen answers on here covering multiple runners on a single server, but I haven't come across my particular situation.
Has anyone come across this before? I would assume that this is a pretty normal situation, and weird that it doesn't just work.
For each job that runs in GitLab, only 1 runner receives the job. The mechanism is pull-based -- the runners constantly ask GitLab if there are any jobs available to run. GitLab never initiates communication with the runners.
Therefore, your load balancer rules do nothing to affect which runner receives a job, and there is no "fairness" in distributing jobs across servers. Runners will keep asking for jobs every few seconds as long as they are able to take them (according to the concurrency settings in the config.toml), and GitLab will hand them out on a first-come, first-served basis.
If you set the concurrency to 1 and start multiple jobs, you should see multiple servers pick up the jobs.
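For illustration, the relevant knobs live in each instance's runner config.toml; the values, URL, and token below are placeholders:

    # /etc/gitlab-runner/config.toml
    concurrent = 1                  # this runner process takes at most one job at a time

    [[runners]]
      name     = "gce-autoscaled-instance"
      url      = "https://gitlab.example.com/"
      token    = "REDACTED"
      executor = "shell"
      limit    = 1                  # cap jobs for this specific runner entry

With concurrent = 1, an instance that is already busy simply stops asking for work, so queued jobs get picked up by the other instances.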
I have two batch applications, e.g. batchapp1 and batchapp2. I want to run batchapp2 after batchapp1 completes. Can we achieve this with the PCF Scheduler, or can we achieve it using Spring Cloud Data Flow server?
Right now we are doing this using Control-M, running on a VM's JVM (not in the cloud).
What you need is Spring Cloud Data Flow's Composed Task functionality. With this, you'd be able to orchestrate a series of Tasks/batch jobs as a directed acyclic graph. A graph can include sequential steps, parallel steps, or both, where each step is a Task/batch job.
For your example, in SCDF, the DSL representation would look like:
task create foo --definition "batchapp1 && batchapp2"
Upon launching the task definition foo in SCDF on PCF, it would launch batchapp1 first and, upon its successful completion, run batchapp2 next. You can also define transitions to run cleanup or error-handling steps based on the exit code of each step.
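For example, a transition could be sketched like this (cleanupapp is an illustrative task name; if batchapp1 exits with FAILED, cleanupapp runs, otherwise execution continues to batchapp2):
task create foo --definition "batchapp1 'FAILED' -> cleanupapp && batchapp2"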
As an alternative, you could do all this on the interactive drag & drop visual canvas, too.
Also note that on PCF, all the steps would be launched as short-lived CF Tasks, which run for a finite time: each runs only as long as the app needs to and is then cleanly shut down to free up resources.
We use Jenkins for our CI build system. We also use 'concurrent builds' so that Jenkins will build each change independently. This means we often have 5 or 6 builds of the same job running simultaneously. To accommodate this, we have 4 slaves each with 12 executors.
The problem is that Jenkins doesn't really 'load balance' among its slaves. It tries to build a job on the same slave that it previously built on (presumably to reduce the time syncing from source control). This is a problem because Jenkins will build all 6 instances of our build on the same slave (or more likely between 2 slaves). One build machine gets bogged down and runs very slowly while the rest of them sit idle.
How do I configure the load balancing behavior of Jenkins, and how it controls its slaves?
We were facing a similar issue, so I put together a plugin that changes the load balancer in Jenkins to select the node that currently has the least load: https://plugins.jenkins.io/leastload/
Any feedback is appreciated.
If you do not find a plugin that does it automatically, here's an idea of what you can do:
Install Node Label Parameter plugin
Add SLAVE parameter to your jobs
Restrict jobs to run on ${SLAVE}
Add a trigger job that will do the following:
Analyze the load distribution via a System Groovy script and decide which node to start the next build on (see the sketch below).
Dispatch the build on that node with the Parameterized Trigger plugin by assigning the appropriate value to the SLAVE parameter.
In order to analyze the load distribution you need to install the Groovy plugin and familiarize yourself with the Jenkins main module API. Here are some useful initial pointers.
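As a rough sketch only (assuming the Groovy plugin's "Execute system Groovy script" build step in the trigger job, and the SLAVE parameter set up above), the script could pick the node with the most idle executors and hand it to the Parameterized Trigger step via a properties file:

    // System Groovy script -- runs inside the Jenkins master JVM.
    import jenkins.model.Jenkins

    // Consider only online nodes that have executors configured.
    def candidates = Jenkins.instance.computers.findAll {
        it.online && it.countExecutors() > 0
    }

    // Pick the node with the most idle executors right now.
    def leastLoaded = candidates.max { it.countIdle() }

    // Write the choice where the Parameterized Trigger step
    // ("parameters from properties file") can read it.
    build.workspace.child('slave.properties')
         .write("SLAVE=${leastLoaded?.name}\n", 'UTF-8')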
If your build machines cannot comfortably handle more than 1 build, why configure them with 12 executors? If that is indeed the case, you should reduce the number of executors to 1. My Jenkins has 30 slaves, each with 1 executor.
You may also use the Throttle Concurrent Builds plugin to restrict how many instances of a job can run in parallel on the same node.
I have two labels -- one for small tasks and one for big tasks. I have one executor for the big task and 4 for the small tasks. This does balance things a little.
I'm trying to figure out how to model my build process in Hudson. At present most of our Hudson builds are somewhat hard-coded, in that the build process is a series of steps and we have one process per branch.
I have another build system that has many active branches, and each build has a series of integration tests which require a suite of machines to execute. As I migrate from the home-grown system to Hudson, I'm not quite sure what the right way is to model this while keeping maintenance costs and build times to a minimum.
Here's my basic build:
create workspace
compile, link, package
transfer artifacts to test systems
invoke test harness on multiple systems to handle installation and acceptance tests
collect results
publish results
I'd like the integration part to use a group of generic machines (perhaps an elastic group) which can handle integration tests for any branch. I want to run as much in parallel as possible to keep my build times low. It looks like the best way to execute in parallel on Hudson is to break the steps up into jobs and use the Parameterized Trigger Plugin to customize the generic jobs.
So, I'd have two main jobs: build and test.
I could have one build job per branch and a generic test job. The build job would use the Parameterized Trigger Plugin to call the test job and provide the location of the build artifacts. The test job would call a series of jobs in parallel, passing down parameters for branch and artifact:
test
test-client-install (params: artifact location, branch)
test-server-install (params: artifact location, branch)
test-run (params: client machine, server machine)
join - collect results (params: client machine, server machine)
Each of the test-* jobs would pull a slave out of the group of slaves and execute. I'm not quite sure how to inform the slaves running the client and server jobs how to find each other, nor am I sure how to reserve them from the pool and release them back into it.
I guess I could write a properties file to a common share and have the sub-jobs use that for inter-job communication.
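For instance, the file dropped on the common share could carry just the handful of values the downstream jobs need; all names and values below are illustrative:

    # handoff.properties on the common share
    BRANCH=release-1.2
    ARTIFACT_LOCATION=//buildshare/artifacts/release-1.2/build-42
    CLIENT_MACHINE=test-client-07
    SERVER_MACHINE=test-server-03

The test-run and join jobs could then read this file (for example via the Parameterized Trigger plugin's option to take parameters from a properties file) instead of relying on hard-coded machine names.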
Has anyone created this kind of complex setup in Hudson, or is this usually done in another system with which Hudson interacts (Hudson + STAF, with STAF managing resources)?
A few thoughts: the fewer jobs you have, the easier it is to maintain them; the more jobs you have, the more flexible you are and the more you can run in parallel. Since you emphasized fast build times, have a look at the Join plugin to run a few jobs in parallel and, when all of them are finished, move on to another job in the chain.
If your server is big enough, you can experiment a little with the Clone Workspace plugin. It will help you reduce the need for manually copying files between jobs and the need for the Parameterized Trigger plugin.
Reserving a slave is easy. You can group the slaves with labels. In your job you define what label your node has to have in order to execute the job. A node can have more than one label, and your job can be bound to more than one label. This way, Hudson decides where to put your job depending on availability. If your slaves have more than one executor, they can run two jobs in parallel. I haven't used the Locks and Latches plugin for synchronizing across nodes, so I don't know if locks are only per node or for the whole Hudson installation. Latches are not supported yet. If you need to ensure that two jobs run on the same slave, try to combine them; otherwise you will lose Hudson's advantage of distributing your jobs freely over the available nodes.