Offline scheduling of ordered tasks across multiple unique agents

I am trying to find an easy-to-implement algorithm for the offline scheduling of parallel jobs, each comprising ordered tasks, among workers so as to minimize makespan. My case is the special one where the workers are unique in what they can do (rather than the typical case where any worker can do any task but may take a different amount of time), subject to the constraint that a worker must finish a task before it can move on to another.
I am more concerned with ease of implementation than computational complexity, as the numbers of workers, jobs, and tasks per job are all small (on the order of ~10, <10, and 10-30 respectively).
The specific property of the agents being distinct in what they can do, rather than in how long they take to perform a task, has made it hard for me to find an algorithm (or a near-algorithm to start from). While searching, I have tried recasting this as a tiling problem (since it is similar to stacking Gantt charts on top of each other) and have looked into how I would cast it as a graph problem, to no avail.
The closest I've found so far are dos Santos 2019, Spegal 2019, and Schulz & Skutella 2002, but these require that I cast the problem as some machines taking infinite time for mismatched operations and account for other scheduling properties that are not applicable here, and I do not know enough about these algorithms to tell whether setting those properties to bypass values would break them.

I think what you're describing is known as the (inflexible) job-shop scheduling problem (with precedence ordering and non-preemptive scheduling). If you are looking for a low barrier to implementation, I would recommend an existing module like ortools. It's even defined in a manner similar to the example you provided.
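For instance, a job shop with fixed per-task worker assignments can be posed to OR-Tools' CP-SAT solver roughly like this (a minimal sketch; the job data below is invented and only the structure matters):

    import collections
    from ortools.sat.python import cp_model

    # Each job is an ordered list of (worker, duration) tasks; invented data.
    jobs = [[(0, 3), (1, 2), (2, 2)],
            [(0, 2), (2, 1), (1, 4)],
            [(1, 4), (2, 3)]]
    horizon = sum(d for job in jobs for _, d in job)

    model = cp_model.CpModel()
    worker_intervals = collections.defaultdict(list)
    job_ends = []
    for j, job in enumerate(jobs):
        prev_end = None
        for t, (worker, dur) in enumerate(job):
            start = model.NewIntVar(0, horizon, f'start_{j}_{t}')
            end = model.NewIntVar(0, horizon, f'end_{j}_{t}')
            iv = model.NewIntervalVar(start, dur, end, f'iv_{j}_{t}')
            worker_intervals[worker].append(iv)
            if prev_end is not None:
                model.Add(start >= prev_end)  # tasks within a job are ordered
            prev_end = end
        job_ends.append(prev_end)

    for ivs in worker_intervals.values():
        model.AddNoOverlap(ivs)  # a worker finishes one task before the next

    makespan = model.NewIntVar(0, horizon, 'makespan')
    model.AddMaxEquality(makespan, job_ends)
    model.Minimize(makespan)
    status = cp_model.CpSolver().Solve(model)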

Related

Is openMp dynamic scheduling the same as LPT scheduling when tasks are sorted by processing time?

I am confused about dynamic scheduling and LPT scheduling (I think the latter is static).
What I have learnt is that dynamic scheduling picks tasks based on chunk sizes, and when a thread has finished its tasks, it picks another. LPT scheduling picks tasks based on the longest processing time required for each task.
So, if I sort the tasks by processing time and then apply dynamic scheduling with chunk size 1, is that the same as LPT scheduling or not?
For example, suppose there is a loop with 15 iterations. In each iteration, the Cartesian product of several vectors is calculated. But in each iteration the sizes of the vectors are different, which means the load is unbalanced. If I calculated the resulting size of each iteration, sorted the iterations in descending order, and then used schedule(dynamic,1), is that the same as LPT in theory?
Firstly, OpenMP schedule clauses apply to loops, not tasks, so it is confusing to talk about tasks in this context since OpenMP also has tasks (and even taskloop). For the avoidance of confusion, I will call the loop-scheduled entity a "chunk", since it is a contiguous chunk of iterations.
Assuming you want to discuss the schedule clause on loops, then:
There is a fundamental difference between a scheduling algorithm like LPT, which assumes prior knowledge of chunk execution time, and any of the algorithms permitted by OpenMP, which do not require such knowledge.
As of OpenMP 4.5 there is a schedule modifier (monotonic or nonmonotonic) which can be applied to the dynamic schedule (as well as others), and which affects the sequences the schedule can generate.
In OpenMP 5.0 the default, undecorated schedule(dynamic) is equivalent to schedule(nonmonotonic:dynamic), which allows sequences that would not be possible with schedule(monotonic:dynamic) and would likely break your mapping (though you can use schedule(monotonic:dynamic), of course!).
Since all of the OpenMP scheduling behaviour is described in terms of the execution state of the machine, it is entirely possible that a sequence will be produced that is not the one you expect: the machine state is ground truth, reflecting issues like interference from other machine load, whereas a scheme like LPT is based on assumed prior knowledge of execution times which may not be reflected in reality.
You can see a discussion of schedule(nonmonotonic:dynamic) at https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Cownie.pdf
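For what it's worth, here is a toy model of LPT with known durations. In this idealized world, handing a descending-sorted list out one chunk at a time to whichever worker frees up first is exactly LPT; per the points above, a real OpenMP runtime gives no such dispatch-order guarantee:

    import heapq

    def lpt_makespan(durations, num_workers):
        # LPT: longest task first, always to the worker that is free earliest.
        free_at = [0.0] * num_workers
        heapq.heapify(free_at)
        for d in sorted(durations, reverse=True):
            t = heapq.heappop(free_at)      # earliest-free worker
            heapq.heappush(free_at, t + d)
        return max(free_at)

    print(lpt_makespan([7, 5, 4, 3, 3, 2], num_workers=2))  # prints 12.0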

Resource-constrained project scheduling

I need your help to solve this problem. I have a set of tasks, each with its own execution time, and two types of constraints. The first type is precedence relationships between tasks. The second type specifies which sets of tasks are allowed to be in execution at the same time. For example: I have a graph G with 6 tasks and the edges (T1,T2), (T2,T3), (T4,T3), (T4,T5) and (T6,T5). Suppose that T1,T4 are able to execute together, and also T1,T6, but not T4,T6. Taking into account the execution time of each task, how do I find the schedule that satisfies the precedence relationships between tasks and minimizes the length of the schedule, taking into consideration the parallel execution of some tasks?
If the exclusion constraints ("... but not T4,T6") weren't there (and no other constraints were added), you could just start every task at the maximum finish time of all of its preceding tasks. That would be optimal and would scale well: you would automatically get the shortest makespan. It would not be NP-complete/hard, and not job shop scheduling.
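That forward pass is only a few lines; for example, in Python (the edges mirror the question, the durations are invented):

    from graphlib import TopologicalSorter

    durations = {"T1": 3, "T2": 2, "T3": 4, "T4": 1, "T5": 2, "T6": 5}  # invented
    preds = {"T1": [], "T2": ["T1"], "T3": ["T2", "T4"],
             "T4": [], "T5": ["T4", "T6"], "T6": []}

    start, finish = {}, {}
    for t in TopologicalSorter(preds).static_order():
        start[t] = max((finish[p] for p in preds[t]), default=0)  # wait for preds
        finish[t] = start[t] + durations[t]
    makespan = max(finish.values())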
Unfortunately, the exclusion constraint (and potentially any other you add in the future) turns it into job shop scheduling (as mentioned by Lars), which is NP-complete/hard. See this video of a job shop scheduling variant of an open source Java implementation, which demonstrates why some tasks start later than when their preceding tasks have finished. To solve that, look into heuristics, metaheuristics (Tabu Search, ...) or other related techniques.
To keep it simple, you can use a constructive heuristic based on priority rules along with a schedule generation scheme, also called an SGS; see this for further reference. The heuristic generates an ordered list of activities according to some criterion, and the SGS takes this list as input and generates the schedule. In your implementation of the SGS, you decide whether two tasks may or may not be executed in parallel, based on your second constraint.
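A serial SGS can be quite short. Here is a rough Python sketch, assuming the activity list is already precedence-feasible and the "may not run together" pairs are given explicitly (all names here are illustrative):

    def serial_sgs(order, durations, preds, exclusions):
        # order: activities sorted by a priority rule (precedence-feasible).
        # exclusions: set of frozensets naming pairs that must not overlap.
        start, finish = {}, {}
        for a in order:
            t = max((finish[p] for p in preds[a]), default=0)
            moved = True
            while moved:  # push the start past any conflicting scheduled task
                moved = False
                for b in finish:
                    if (frozenset((a, b)) in exclusions
                            and t < finish[b] and start[b] < t + durations[a]):
                        t, moved = finish[b], True
            start[a], finish[a] = t, t + durations[a]
        return start, finish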
If you want something more robust, you can use a metaheuristic, where basically you generate a solution (a list of tasks) and modify it using local search techniques, exploring the search space of solutions. You could generate solutions based on priority rules and evaluate them with your SGS implementation. This is just a simplified explanation of how a metaheuristic works; there are several variants. A good example of a metaheuristic is Simulated Annealing, applied here to the RCPSP: http://www.sciencedirect.com/science/article/pii/S0377221702007610.

How to apply parallel programming to graph problems?

Problem Description:
There are n tasks, and some of these tasks depend on others: if A depends on B, then B must be finished before A can start.
1. Find a way to finish these tasks as quickly as possible.
2. If we take parallelism into account, how do we design the program to finish these tasks?
Question:
Apparently, the answer to the first question is: topologically sort the tasks, then finish them in that order.
But how is the job done when parallelism is taken into consideration?
My answer was: first topologically sort the tasks, then pick those tasks which are independent and finish them first, then pick and finish the independent ones among the rest...
Am I right?
Topological sort algorithms may give you various different result orders, so you cannot just take the first few elements and assume them to be the independent ones.
Instead of topological sorting, I'd suggest sorting your tasks by the number of incoming dependency edges. So, for example, if your graph has A --> B, A --> C, B --> C, D --> C, you would sort it as A[0], D[0], B[1], C[3], where [i] is the number of incoming edges.
With topological sorting, you could also have gotten A, B, D, C. In that case, it wouldn't be easy to see that you can execute A and D in parallel.
Note that after a task was completely processed you then have to update the remaining tasks, in particular, the ones that were dependent on the finished task. However, if the number of dependencies going into a task is limited to a relatively small number (say a few hundreds), you can easily rely on something like radix/bucket-sort and keep the sort structure updated in constant time.
With this approach, you can also easily start new tasks, once a single parallel task has finished. Simply update the dependency counts, and start all tasks that now have 0 incoming dependencies.
Note that this approach assumes you have enough processing power to process all tasks that have no dependencies at the same time. If you have limited resources and care for an optimal solution in terms of processing time, then you'd have to invest more effort, as the problem becomes NP-hard (as arne already mentioned).
So, to answer your original question: yes, you are basically right; however, you neglected to explain how to determine those independent tasks efficiently (see my example above).
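A compact way to implement that counting approach (a sketch; the edges mirror the example above):

    from collections import defaultdict

    def parallel_waves(tasks, edges):
        # Group tasks into waves that can run in parallel: repeatedly take
        # every task whose incoming-dependency count has dropped to zero.
        indeg = {t: 0 for t in tasks}
        out = defaultdict(list)
        for a, b in edges:          # a must finish before b starts
            out[a].append(b)
            indeg[b] += 1
        ready = [t for t in tasks if indeg[t] == 0]
        waves = []
        while ready:
            waves.append(ready)
            nxt = []
            for t in ready:
                for b in out[t]:
                    indeg[b] -= 1
                    if indeg[b] == 0:
                        nxt.append(b)
            ready = nxt
        return waves

    print(parallel_waves("ABCD", [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]))
    # [['A', 'D'], ['B'], ['C']]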
I would try sorting them into a directed forest structure with task execution times as edge weights. Order the arborescences from heaviest to lightest and start with the heaviest. Using this approach you can, at the same time, check for circular dependencies.
With parallelism, you get a bin-packing problem, which is NP-hard. Try looking up approximation algorithms for that problem.
Have a look at the Critical Path Method, taken from the area of project management. It basically does what you need: given tasks with dependencies and durations, it produces how much time the whole job will take, and when to activate each task.
(*) Note that this technique assumes an infinite number of resources for the optimal solution. For limited resources there are heuristics for greedy algorithms, such as GPRW [current + following tasks' time] or MSLK [minimum total slack time].
(*) Also note that it requires knowing (or at least estimating) how long each task will take.
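A minimal CPM pass in Python, for reference (the task data is invented; slack is latest start minus earliest start, and zero-slack tasks form the critical path):

    from graphlib import TopologicalSorter

    durations = {"A": 3, "B": 2, "C": 4, "D": 1}                  # invented
    preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

    order = list(TopologicalSorter(preds).static_order())
    es, ef = {}, {}                               # earliest start / finish
    for t in order:
        es[t] = max((ef[p] for p in preds[t]), default=0)
        ef[t] = es[t] + durations[t]

    end = max(ef.values())                        # total project time
    succs = {t: [s for s in preds if t in preds[s]] for t in preds}
    lf, ls = {}, {}                               # latest finish / start
    for t in reversed(order):
        lf[t] = min((ls[s] for s in succs[t]), default=end)
        ls[t] = lf[t] - durations[t]

    slack = {t: ls[t] - es[t] for t in order}     # 0 => on the critical path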

mapreduce vs other parallel processing solutions

So, the questions are:
1. Is mapreduce overhead too high for the following problem? Does anyone have an idea of how long each map/reduce cycle (in Disco for example) takes for a very light job?
2. Is there a better alternative to mapreduce for this problem?
In map-reduce terms, my program consists of 60 map phases and 60 reduce phases, all of which together need to be completed in 1 second. One of the problems I need to solve this way is a minimum search with about 64000 variables. The Hessian matrix for the search is a block matrix: 1000 blocks of size 64x64 along the diagonal, and one row of blocks on the extreme right and bottom. The last section of the block matrix inversion algorithm shows how this is done. Each of the Schur complements S_A and S_D can be computed in one mapreduce step, and the computation of the inverse takes one more step.
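For reference, the 2x2 block inversion via the Schur complement looks like this as a numpy sketch (it assumes A and the complement S are invertible; for the real problem you would exploit the block-diagonal structure rather than calling inv directly):

    import numpy as np

    def block_inverse(A, B, C, D):
        # Inverse of [[A, B], [C, D]] using the Schur complement of A.
        Ainv = np.linalg.inv(A)
        S = D - C @ Ainv @ B                     # Schur complement S_A
        Sinv = np.linalg.inv(S)
        return np.block([
            [Ainv + Ainv @ B @ Sinv @ C @ Ainv, -Ainv @ B @ Sinv],
            [-Sinv @ C @ Ainv,                   Sinv],
        ])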
From my research so far, mpi4py seems like a good bet. Each process can do a compute step and report back to the client after each step, and the client can report back with new state variables for the cycle to continue. This way the process state is not lost, and the computation can be continued with any updates.
http://mpi4py.scipy.org/docs/usrman/index.html
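The step-and-report loop described above might look like this in mpi4py (a toy: the per-rank arithmetic is a stand-in for the real compute step):

    # run with: mpiexec -n 4 python steps.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    state = float(comm.Get_rank())                   # toy per-process state
    for step in range(60):                           # one pass per cycle
        partial = state * state                      # stand-in local compute
        total = comm.allreduce(partial, op=MPI.SUM)  # combine, reduce-style
        state = total / comm.Get_size()              # new state everywhere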
This wiki holds some suggestions, but does anyone have a direction on the most developed solution:
http://wiki.python.org/moin/ParallelProcessing
Thanks!
MPI is a communication protocol that allows for the implementation of parallel processing by passing messages between cluster nodes. The parallel processing model that is implemented with MPI depends upon the programmer.
I haven't had any experience with MapReduce but it seems to me that it is a specific parallel processing model and is designed to be simple to implement. This kind of abstraction should save you programming time and may or may not provide a suitable solution to your problem. It all depends on the nature of what you are trying to do.
The trick with parallel processing is that the most suitable solution is often problem specific and without knowing more specifics about your problem it is hard to make recommendations.
If you can tell us more about the environment that you are running your job on and where your program fits into Flynn's taxonomy, I might be able to provide some more helpful suggestions.

Organizing a task-based scientific computation

I have a computational algebra task I need to code up. The problem is broken into well-defined individual tasks that naturally form a tree - the task is combinatorial in nature, so there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large.
I had previously coded this up in a functional fashion, calling the functions as needed and storing everything in RAM. This was a terrible approach, but I was more concerned about the theory then.
I'm planning to rewrite the code in C++ for a variety of reasons. I have a few requirements:
Checkpointing: The calculation takes a long time, so I need to be able to stop at any point and resume later.
Separate individual tasks as objects: This helps me keep a good handle on where I am in the computations, and offers a clean way to do checkpointing via serialization.
Multi-threading: The task is clearly embarrassingly parallel, so it'd be neat to exploit that. I'd probably want to use Boost threads for this.
I would like suggestions on how to actually implement such a system. Ways I've thought of doing it:
Implement tasks as a simple stack. When you hit a task that needs subcalculations done, it checks if it has all the subcalculations it requires. If not, it creates the subtasks and throws them onto the stack. If it does, it calculates its result and pops itself from the stack. (See the sketch after this list.)
Store the tasks as a tree and do something like a depth-first visitor pattern. This would create all the tasks at the start and then computation would just traverse the tree.
These don't seem quite right because of the problem of the lower levels requiring a vast number of subtasks. I could approach it in an iterator fashion at that level, I guess.
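For what it's worth, the stack idea from the list above fits in a dozen lines; here is a Python sketch (subtasks and compute are placeholders for the real calculation), in which the result cache doubles as the checkpointable state:

    def evaluate(root, subtasks, compute, cache):
        # subtasks(t) -> child tasks; compute(t, child_results) -> result.
        # cache maps finished tasks to results; serialize it to checkpoint.
        stack = [root]
        while stack:
            task = stack[-1]
            missing = [c for c in subtasks(task) if c not in cache]
            if missing:
                stack.extend(missing)       # defer: do the children first
            else:
                kids = [cache[c] for c in subtasks(task)]
                cache[task] = compute(task, kids)
                stack.pop()
        return cache[root]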
I feel like I'm over-thinking it and there's already a simple, well-established way to do something like this. Is there one?
Technical details in case they matter:
The task tree has 5 levels.
Branching factor of the tree is really small (say, between 2 and 5) for all levels except the lowest which is on the order of a few million.
Each individual task would only need to store a result tens of bytes large. I don't mind using the disk as much as possible, so long as it doesn't kill performance.
For debugging, I'd have to be able to recall/recalculate any individual task.
All the calculations are discrete mathematics: calculations with integers, polynomials, and groups. No floating point at all.
there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large... blah blah resuming, multi-threading, etc.
Correct me if I'm wrong, but it seems to me that you are exactly describing a map-reduce algorithm.
Just read what Wikipedia says about map-reduce:
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
Using an existing mapreduce framework could save you a huge amount of time.
I just googled "map reduce C++" and started to get results, notably one in Boost: http://www.craighenderson.co.uk/mapreduce/
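Stripped to essentials, the computation is map-then-combine, which in Python terms is just this (toy data standing in for the real sub-calculations):

    from multiprocessing import Pool

    def solve_leaf(subproblem):
        return sum(subproblem)                     # stand-in leaf calculation

    if __name__ == "__main__":
        subproblems = [[1, 2], [3, 4], [5, 6]]     # toy partition of the input
        with Pool() as pool:
            answers = pool.map(solve_leaf, subproblems)   # "map" step
        print(sum(answers))                               # "reduce" step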
These don't seem quite right because of the problems of the lower levels requiring a vast number of subtasks. I could approach it in a iterator fashion at this level, I guess.
You definitely do not want millions of CPU-bound threads. You want at most N CPU-bound threads, where N is the product of the number of CPUs and the number of cores per CPU on your machine. Exceed N by a little bit and you are slowing things down a bit. Exceed N by a lot and you are slowing things down a whole lot. The machine will spend almost all its time swapping threads in and out of context, spending very little time executing the threads themselves. Exceed N by a whole lot and you will most likely crash your machine (or hit some limit on threads). If you want to farm lots and lots (and lots and lots) of parallel tasks out at once, you either need to use multiple machines or use your graphics card.
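In practice that means sizing the worker pool to the core count; for example, in Python (a process pool, since the tasks here are CPU-bound):

    import os
    from concurrent.futures import ProcessPoolExecutor

    def crunch(x):
        return x * x                       # stand-in for a CPU-bound task

    if __name__ == "__main__":
        n = os.cpu_count() or 1            # cap workers at the core count
        with ProcessPoolExecutor(max_workers=n) as pool:
            results = list(pool.map(crunch, range(10_000)))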