Hadoop version 2 (YARN) execution scenario outcome?

Suppose we have 5 containers in our YARN system and two jobs to run. Job1 has 8 map tasks and 2 reduce tasks. Job2 has 4 map tasks and 1 reduce task.
How will the YARN system decide which tasks to run first?
And how many mappers and reducers will start concurrently?

How will the YARN system decide which "tasks" to run first?
It is a MapReduce job, so map tasks are executed first. Now the order of execution of the jobs (I guess this is your real question) depends on the scheduler used. The FIFO scheduler runs jobs first-in, first-out; it is not used any more in production environments, since we have options such as the Capacity and Fair schedulers. This is a broad topic again:
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
Execution also depends on the cluster resources available at the time the jobs are submitted.
How many mapper and reducers will start concurrently?
Reducers (at least the reduce method) will be executed only after all the map tasks are completed. You have mentioned the number of containers but not the number of nodes.
Concurrent execution depends on the memory you allocate to the map and reduce tasks. Take a look at these properties: yarn.scheduler.minimum-allocation-mb, yarn.scheduler.maximum-allocation-mb, yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.reduce.memory.mb.
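As a rough illustration of how these properties bound concurrency, here is a back-of-the-envelope calculation; every number below is hypothetical, so substitute your own cluster's settings:

    # Hypothetical per-node and per-task memory settings.
    node_memory_mb = 8192      # yarn.nodemanager.resource.memory-mb
    map_memory_mb = 2048       # mapreduce.map.memory.mb
    reduce_memory_mb = 4096    # mapreduce.reduce.memory.mb
    num_nodes = 5              # the question gives containers, not nodes

    # Containers that fit per node, scaled to the whole cluster.
    concurrent_maps = (node_memory_mb // map_memory_mb) * num_nodes
    concurrent_reduces = (node_memory_mb // reduce_memory_mb) * num_nodes
    print(concurrent_maps, concurrent_reduces)   # 20 10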
Update 1: Shuffle and sort will start the moment one of the map tasks completes. This means that while the other map tasks are still executing, the partitioned (and, if a combiner is run, combined) mapper output will be transferred to the reducer. But the reduce method will be called only after this transfer is done for all completed map tasks. Yes, container allocation would have happened by then.

Related

Determining the number of parallel processes in a multi-instance subprocess

I am modelling a process which at times will require a very large number of parallel sub-processes (tens of thousands) to be launched. Obviously it's not possible for all of these to run simultaneously - how will the Camunda process engine handle this? Is it possible to control how many sub-processes will run at a time?
Camunda 7 uses a job executor thread pool. This determines the concurrency level of jobs, such as an asynchronously started call activity.
The number of sub-processes you mentioned is very high, though. What history level did you have in mind? It is likely better to handle this differently.
Camunda 8 was released two days ago. It has a fundamentally different architecture: no relational database, event-streaming concepts, and a design aimed at massive volumes. It may be more suitable for your use case.

Can you guarantee order with a large number of jobs in AWS Batch?

My problem is running a job after thousands of jobs have finished running on AWS Batch.
I have tried running the job in a job queue with lower priority, and running it in the same queue but submitting it after all the others (the documentation says that jobs are executed in approximately the order in which they are submitted). My question is whether either of these (or some other approach) guarantees that it will run after the others.
I wouldn't rely on a guarantee using the above methods. Execution order is explicitly not guaranteed to match submission order. Priority "should" work, but at large scale it's likely at some point something will delay your high priority execution and cause the scheduler to decide it has resources to spare for the low priority queue.
You can rely on job dependencies. They allow you to specify that one job depends on another N jobs, and therefore must wait until they all finish to begin running. This can even be chained - A depends on B, B depends on C, guarantees order C -> B -> A. Unfortunately, N <= 20.
The best way to depend on a lot of jobs (more than 20) is to depend on a single array job, with all those jobs inside it. On a related note, an array job can also be configured to make all its jobs serially dependent (jobs execute in array order). The only caveat is you have to put all your jobs into an array. On the off-chance your thousands of jobs you want to depend on aren't already in an array, there are ways of manipulating them into one - for example, if you're processing 1000 files, you can put the files in a list, and have each array job index into the list using its job index.
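A minimal boto3 sketch of that pattern (the queue name, job definition, and array size below are hypothetical):

    import boto3

    batch = boto3.client("batch")

    # Submit the bulk workload as one array job; each child can read its
    # AWS_BATCH_JOB_ARRAY_INDEX environment variable to pick its input file.
    array_job = batch.submit_job(
        jobName="process-files",
        jobQueue="my-queue",
        jobDefinition="my-job-def",
        arrayProperties={"size": 1000},
        # To make the children run one after another instead, add:
        # dependsOn=[{"type": "SEQUENTIAL"}],
    )

    # The follow-up job names the whole array as its single dependency,
    # so it starts only after every child job has finished.
    batch.submit_job(
        jobName="aggregate-results",
        jobQueue="my-queue",
        jobDefinition="my-job-def",
        dependsOn=[{"jobId": array_job["jobId"]}],
    )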

Remove multiple tasks from scheduling queue

I want to schedule and run tasks in parallel in a multi-node distributed memory cluster.
I have many different tasks which have dependencies on other tasks, but I also want to run the same tasks in parallel with different parameters.
For example, say I have a task which has a desired optimal parameter of 10 but there is a chance that this might fail (due to certain complexities within the specific task - which cannot be known beforehand). To hedge the risk of failure I also want to run the same task with the next best parameter. To continue hedging risk, I want to run even more tasks with decreasing next best parameters.
There is an implied hierarchy amongst the tasks based upon the optimal parameter. All tasks with a lesser optimal parameter can be considered hedges to the better value.
My main question is:
I'm running the same task in parallel with different parameters, and the instances are prioritized by the optimality of their parameter. As soon as any instance of the task completes successfully, I want to kill all instances of the task which have lower priority (i.e. worse parameters) and remove them from the process queue.
In other words, I want to kill off the hedge processes when I know that a better parameter task has completed successfully.
Does DAGuE allow this, i.e. the removal of tasks from the queue? If not, can another C++ scheduler be suggested?
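For what it's worth, the desired behaviour can be sketched independently of any particular scheduler. This simplified Python sketch (the run_hedged helper and its task signature are inventions for illustration) cancels every remaining hedge once any instance succeeds:

    import threading
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_hedged(task, params_best_first, workers=4):
        # task(param, cancelled) is assumed to return a result on success,
        # None on failure, and to poll cancelled.is_set() so that a running
        # hedge can abort early.
        cancelled = threading.Event()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(task, p, cancelled) for p in params_best_first]
            for fut in as_completed(futures):
                if fut.result() is not None:   # some instance succeeded
                    cancelled.set()            # signal running hedges to stop
                    for other in futures:
                        other.cancel()         # drop hedges still in the queue
                    return fut.result()
        return None                            # every parameter failed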

task delegation scheduler

I implemented a task delegation scheduler instead of a task-stealing scheduler. The basic idea is that each thread has its own private local queue. Whenever a task is produced, before it is enqueued, a search is done among the queues and the minimum-size queue is found by comparing the sizes of all the queues. The task is then enqueued to that minimum-size queue. This diverts pressure away from a busy thread's queue and delegates the work to the least busy thread's queue.
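A minimal Python sketch of the delegation policy just described (the library in question is C++, but the structure is the same):

    import queue, threading

    class DelegatingScheduler:
        def __init__(self, n_workers):
            # One private queue per worker thread.
            self.queues = [queue.Queue() for _ in range(n_workers)]
            for q in self.queues:
                threading.Thread(target=self._run, args=(q,), daemon=True).start()

        def submit(self, task):
            # Delegation step: enqueue to the queue with the fewest pending
            # tasks. Queue length is the only load signal used, which is
            # exactly the weakness raised below.
            min(self.queues, key=lambda q: q.qsize()).put(task)

        def _run(self, q):
            while True:
                q.get()()   # take the next task from this queue and run it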
The problem with this scheduling technique is that we don't know how much time each task takes to complete, i.e. a queue may have a minimal count while its current task is still running, and on the other hand a queue may have a higher count while its tasks will be completed very soon. Any ideas on how to solve this problem?
I am working on Linux, in C++, in our own multithreading library, which implements a multi-rate synchronous data flow paradigm.
It seems that your scheduling policy doesn't fit the job at hand. Usually this type of naive scheduling, which ignores task completion times, is only appropriate when tasks are relatively equal in execution time.
I'd recommend doing some research. A good place to start would be Wikipedia's Scheduling article but that is of course just the tip of the iceberg.
I'd also give a second (and third) thought to the task-delegation requirement, since timeslicing task operations allows you to fine-tune queue management by considering each task's "history". However, if clients are designed so that each one consistently sends the same "type" of task, then you can achieve similar results with that knowledge.
As far as I remember from my Queueing Theory class, the fairest (of them all ;) system is one with a single queue and multiple servers. Using such a system ensures the lowest expected average execution time for all tasks and the largest utilization factor (the % of time the system is doing work; I'm not sure the term is correct).
In other words, unless you have some priority tasks, please reconsider your task delegation scheduler implementation.
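A minimal sketch of that single-queue, multiple-servers arrangement (Python for brevity; the C++ equivalent is a std::queue guarded by a mutex and condition variable):

    import queue, threading

    tasks = queue.Queue()              # the single shared queue

    def server():
        while True:
            task = tasks.get()         # blocks until work is available
            if task is None:           # sentinel value: shut this server down
                break
            task()

    servers = [threading.Thread(target=server) for _ in range(4)]
    for s in servers:
        s.start()

    for i in range(10):                # submit some work
        tasks.put(lambda i=i: print("task", i))
    for s in servers:                  # one sentinel per server
        tasks.put(None)
    for s in servers:
        s.join()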

Is it safe to use hundreds of threads if they're only created once?

Basically I have a Task and a Thread class, and I create threads equal to the number of physical cores (or logical cores, which on Intel CPUs are double the count).
Threads take tasks from a list of tasks and execute them. However, I have to make sure everything is safe and that multiple threads don't try to take the same task at once, and of course this introduces extra overhead (and headaches).
What if I put the task functionality inside the threads? I mean, instead of 4 threads grabbing tasks from a pool of 200 tasks, why not 200 threads that execute in groups of 4 at a time? That way I wouldn't need to synchronize anything: no locking, no nothing. Of course I wouldn't be creating the threads throughout the run, just at initialization.
What pros and cons would such a method have? One problem I can think of is that since I only create the threads at initialization, their count is fixed, while with tasks I can keep dumping more tasks into the task pool.
Threads have cost - each one requires space for thread-local storage and for a stack, at a minimum.
Keeping your Task and Thread classes separate would be a cleaner and more manageable approach in the long run, and it keeps overhead down by allowing you to limit how many Threads are created and running at any given time (also, a Task is likely to take up less memory than a Thread, and to be faster to create and free when needed). A Task controls what gets done; a Thread controls when a Task is run. Yes, you would need to store the Task objects in a thread-safe list, but that is very simple to implement using a critical section, mutex, semaphore, etc. On Windows specifically, you could alternatively use an I/O Completion Port to submit Tasks to Threads, and let the OS handle the synchronization and scheduling for you.
It will definitely take longer to have 200 threads running at once than to run 4 threads through 200 "tasks". You can test this with a simple program that does some simple math (e.g. count the primes among the first 20000000 numbers, either by asking each of 4 threads to do 100000 numbers at a time and then grab the next lot, or by making 200 threads with 100000 numbers each).
How much slower? Don't know, depends on so many things.
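A scaled-down sketch of that experiment (Python rather than C++; note that CPython's GIL serialises CPU-bound threads, so this mostly measures thread-management overhead, which is the cost being compared anyway; the sizes are reduced from 20000000/100000 so it finishes quickly):

    import queue, threading, time

    N, CHUNK = 200_000, 1_000     # scaled down from 20,000,000 and 100,000

    def count_primes(start):
        # Naive primality count over one chunk of numbers.
        count = 0
        for n in range(start, start + CHUNK):
            if n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)):
                count += 1
        return count

    def run(threads):
        t0 = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(f"{len(threads):4d} threads: {time.perf_counter() - t0:.3f}s")

    # Variant 1: 4 workers repeatedly grab the next chunk from a shared queue.
    chunks = queue.Queue()
    for s in range(0, N, CHUNK):
        chunks.put(s)

    def worker():
        while True:
            try:
                count_primes(chunks.get_nowait())
            except queue.Empty:
                return

    run([threading.Thread(target=worker) for _ in range(4)])

    # Variant 2: one thread per chunk - no sharing, but 200 threads to manage.
    run([threading.Thread(target=count_primes, args=(s,))
         for s in range(0, N, CHUNK)])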