mapreduce vs other parallel processing solutions - mapreduce

So, the questions are:
1. Is mapreduce overhead too high for the following problem? Does anyone have an idea of how long each map/reduce cycle (in Disco for example) takes for a very light job?
2. Is there a better alternative to mapreduce for this problem?
In map reduce terms my program consists of 60 map phases and 60 reduce phases all of which together need to be completed in 1 second. One of the problems I need to solve this way is a minimum search with about 64000 variables. The hessian matrix for the search is a block matrix, 1000 blocks of size 64x64 along a diagonal, and one row of blocks on the extreme right and bottom. The last section of : block matrix inversion algorithm shows how this is done. Each of the Schur complements S_A and S_D can be computed in one mapreduce step. The computation of the inverse takes one more step.
From my research so far, mpi4py seems like a good bet. Each process can do a compute step and report back to the client after each step, and the client can report back with new state variables for the cycle to continue. This way the process state is not lost computation can be continued with any updates.
http://mpi4py.scipy.org/docs/usrman/index.html
This wiki holds some suggestions, but does anyone have a direction on the most developed solution:
http://wiki.python.org/moin/ParallelProcessing
Thanks !

MPI is a communication protocol that allows for the implementation of parallel processing by passing messages between cluster nodes. The parallel processing model that is implemented with MPI depends upon the programmer.
I haven't had any experience with MapReduce but it seems to me that it is a specific parallel processing model and is designed to be simple to implement. This kind of abstraction should save you programming time and may or may not provide a suitable solution to your problem. It all depends on the nature of what you are trying to do.
The trick with parallel processing is that the most suitable solution is often problem specific and without knowing more specifics about your problem it is hard to make recommendations.
If you can tell us more about the environment that you are running your job on and where your program fits into Flynn's taxonomy, I might be able to provide some more helpful suggestions.

Related

Offline scheduling of ordered tasks across multiple unique agents

I am trying to find any easy-to-implement algorithms for the offline scheduling of parallel Jobs comprising ordered tasks among workers to minimize makespan in the special case where the workers are unique in what they can do (rather than the typical case where workers can do any task but may take different times) subject to the constraint that a worker must finish a task before it can move onto another.
I am more concerned with ease of implementation than computational complexity as the number of workers, jobs, and tasks per job are pretty small (orders: ~10, <10, and 10-30 respectively).
The specific property of the agents being distinct in what they can do rather than how long they take to perform a task has made it hard for me to find an algorithm (or near algorithm for me to start from). When searching for algorithm , I have tried recasting this as a tiling problem when (as it's similar to stacking Gantt charts on top of each other) and have looked into how I would cast it into a graph problem to no avail.
The closest I've found so far have been dos Santos 2019, Spegal 2019, Schulz & Skutella 2002, but these require that I cast the problem as some machines taking infinite time for mismatched operations and account for other scheduling properties that are not applicable to this problem--and I do not know enough about these algorithms to know if setting them to bypassed values would break them.
I think what you're describing is known as the (inflexible) job-shop scheduling problem (with precedence ordering and non-preemptive scheduling). If you are looking for a low barrier to implementation, I would recommend an existing module like ortools. It's even defined in a manner similar to the example you provided.

How do I optimize the parallelization of Monte Carlo data generation with MPI?

I am currently building a Monte Carlo application in C++ and I have a question regarding parallelization with MPI.
The process I want to parallelize is the MC generation of data. To have good precision in my final results, I specify the goal number of data points. Each data point is generated independently, but might require vastly differing amounts of time.
How do I organize the parallelization and workload distribution of the data generation most efficiently?
What I have done so far
So far I have come up with three possible ways of organizing the MPI part of the code:
The simplest way, but most likely inefficient way: I divide the desired sample size by the number of workers and let every worker generate that amount of data in isolation. However, when the slowest worker finishes, all other workers have been idling for a potentially long time. They could have been "supporting" the slowest worker by sharing its workload.
Use a master: A master communicates with the workers who work continuously until the master process registers that we have enough data and tells everybody to stop what they are doing. The disadvantage I see is that the master process might not be necessary and could be generating data instead (especially when I don't have a lot of workers).
A "ring communication" algorithm I came up with myself: A message is continuously sent and updated in a circle (1->2, 2->3, ... , N ->1). This message contains the global number of generated data point. Once the desired goal is met, the message is tagged, circles one more time and thereby tells everybody to stop working. Important here is I use non-blocking communication (with MPI_Iprobe before receiving via MPI_Recv, and sending via MPI_Isend). This way, everybody works, and no one ever idles.
No matter, which solution is chosen, in the end I reduce all data sets to one big set and continue to process the data.
The concrete questions:
Is there an "optimal" way of parallelizing such a fairly simple process? Would you prefer any of the proposed solutions for some reason?
What do you think of this "ring communication" solution?
I'm sure I'm not the first one to come up with e.g. the ring communication algorithm. I have tried to google this problem, but it seems to me that I do not know the right terminology in this context. I'm sure there must be a lot of material and literature on such simple algorithms, but I never had a formal course on MPI/parallelization. What are the "keywords" to look for?
Any advice and tips are much appreciated.

how to apply parallelism-programming in graph problems?

Problem Description:
there is n tasks, and in these tasks, one might be dependent on the others, which means if A is dependent on B, then B must be finished before A gets finished.
1.find a way to finish these tasks as quickly as possible?
2.if take parallelism into account, how to design the program to finish these tasks?
Question:
Apparently, the answer to the first question is, topological-sort these tasks, then finish them in that order.
But how to do the job if parallelism taken into consideration?
My answer was,first topological-sort these tasks, then pick those tasks which are independent and finish them first, then pick and finish those independent ones in the rest...
Am I right?
Topological sort algorithms may give you various different result orders, so you cannot just take the first few elements and assume them to be the independent ones.
Instead of topological sorting I'd suggest to sort your tasks by the number of incoming dependency edges. So, for example if your graph has A --> B, A --> C, B --> C, D-->C you would sort it as A[0], D[0], B[1], C[3] where [i] is the number of incoming edges.
With topological sorting, you could also have gotting A,B,D,C. In that case, it wouldn't be easy to find out that you can execute A and D in parallel.
Note that after a task was completely processed you then have to update the remaining tasks, in particular, the ones that were dependent on the finished task. However, if the number of dependencies going into a task is limited to a relatively small number (say a few hundreds), you can easily rely on something like radix/bucket-sort and keep the sort structure updated in constant time.
With this approach, you can also easily start new tasks, once a single parallel task has finished. Simply update the dependency counts, and start all tasks that now have 0 incoming dependencies.
Note that this approach assumes you have enough processing power to process all tasks that have no dependencies at the same time. If you have limited resources and care for an optimal solution in terms of processing time, then you'd have to invest more effort, as the problem becomes NP-hard (as arne already mentioned).
So to answer your original question: Yes, you are basically right, however, you lacked to explain how to determine those independent tasks efficiently (see my example above).
I would try sorting them in a directed forest structure with task execution time as edge weigths. Order the arborescences from heaviest to lightest and start with the heaviest. Using this approach you can, at the same time, check for circular dependencies.
Using parallelism, you get the bin problem, which is NP-hard. Try looking up approximative solutions for that problem.
Have a look at the Critical Path Method, taken from the are of project management. It basically do what you need: given tasks with dependecies and durations, it produces how much time it will take, and when to activate each task.
(*)Note that this technique is assuming infinite number of resources for optimal solution. For limited resources there are heuristics for greedy algorithms such as: GPRW [current+following tasks time] or MSLK [minimum total slack time].
(*)Also note, it requires knowing [or at least estimating] how long will each task take.

Organizing a task-based scientific computation

I have a computational algebra task I need to code up. The problem is broken into well-defined individuals tasks that naturally form a tree - the task is combinatorial in nature, so there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large.
I had previously coded this up in a functional fashion, calling the functions as needed and storing everything in RAM. This was a terrible approach, but I was more concerned about the theory then.
I'm planning to rewrite the code in C++ for a variety of reasons. I have a few requirements:
Checkpointing: The calculation takes a long time, so I need to be able to stop at any point and resume later.
Separate individual tasks as objects: This helps me keep a good handle of where I am in the computations, and offers a clean way to do checkpointing via serialization.
Multi-threading: The task is clearly embarrassingly parallel, so it'd be neat to exploit that. I'd probably want to use Boost threads for this.
I would like suggestions on how to actually implement such a system. Ways I've thought of doing it:
Implement tasks as a simple stack. When you hit a task that needs subcalculations done, it checks if it has all the subcalculations it requires. If not, it creates the subtasks and throws them onto the stack. If it does, then it calculates its result and pops itself from the stack.
Store the tasks as a tree and do something like a depth-first visitor pattern. This would create all the tasks at the start and then computation would just traverse the tree.
These don't seem quite right because of the problems of the lower levels requiring a vast number of subtasks. I could approach it in a iterator fashion at this level, I guess.
I feel like I'm over-thinking it and there's already a simple, well-established way to do something like this. Is there one?
Technical details in case they matter:
The task tree has 5 levels.
Branching factor of the tree is really small (say, between 2 and 5) for all levels except the lowest which is on the order of a few million.
Each individual task would only need to store a result tens of bytes large. I don't mind using the disk as much as possible, so long as it doesn't kill performance.
For debugging, I'd have to be able to recall/recalculate any individual task.
All the calculations are discrete mathematics: calculations with integers, polynomials, and groups. No floating point at all.
there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large... blah blah resuming, multi-threading, etc.
Correct me if I'm wrong, but it seems to me that you are exactly describing a map-reduce algorithm.
Just read what wikipedia says about map-reduce :
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
Using an existing mapreduce framework could save you a huge amount of time.
I just google "map reduce C++" and I start to get results, notably one in boost http://www.craighenderson.co.uk/mapreduce/
These don't seem quite right because of the problems of the lower levels requiring a vast number of subtasks. I could approach it in a iterator fashion at this level, I guess.
You definitely do not want millions of CPU-bound threads. You want at most N CPU-bound threads, where N is the product of the number of CPUs and the number of cores per CPU on your machine. Exceed N by a little bit and you are slowing things down a bit. Exceed N by a lot and you are slowing things down a whole lot. The machine will spend almost all its time swapping threads in and out of context, spending very little time executing the threads themselves. Exceed N by a whole lot and you will most likely crash your machine (or hit some limit on threads). If you want to farm lots and lots (and lots and lots) of parallel tasks out at once, you either need to use multiple machines or use your graphics card.

Possible to distribute or parallel process a sequential program?

In C++, I've written a mathematical program (for diffusion limited aggregation) where each new point calculated is dependent on all of the preceding points.
Is it possible to have such a program work in a parallel or distributed manner to increase computing speed?
If so, what type of modifications to the code would I need to look into?
EDIT: My source code is available at...
http://www.bitbucket.org/damigu78/brownian-motion/downloads/
filename is DLA_full3D.cpp
I don't mind significant re-writes if that's what it would take. After all, I want to learn how to do it.
If your algorithm is fundamentally sequential, you can't make it fundamentally not that.
What is the algorithm you are using?
EDIT: Googling "diffusion limited aggregation algorithm parallel" lead me here, with the following quote:
DLA, on the other hand, has been shown
[9,10] to belong to the class of
inherently sequential or, more
formally, P-complete problems.
Therefore, it is unlikely that DLA
clusters can be sampled in parallel in
polylog time when restricted to a
number of processors polynomial in the
system size.
So the answer to your question is "all signs point to no".
Probably. There are parallel versions of most sequential algorithms, and for those sequential algorithms which are not immediately parallelisable there are usually parallel substitutes. This looks like be one of those cases where you need to consider parallelisation or parallelisability before you choose an algorithm. But unless you tell us a bit (a lot ?) more about your algorithm we can't provide much specific guidance. If it amuses you to watch SOers argue in the absence of hard data sit back and watch, but if you want answers, edit your question.
The toxiclibs website gives some useful insight into how one DLA implementation is done
There is cilk, which is an enhancement to the C language (unfortunately not C++ (yet)) that allows you to add some extra information to your code. With just a few minor hints, the compiler can automatically parallelize parts of your code, such as running multiple iterations of a for loop in parallel instead of in series.
Without knowing more about your problem, I'll just say that this looks like a good candidate to implement as a parallel prefix scan (http://en.wikipedia.org/wiki/Prefix_sum). The simplest example of this is an array that you want to make a running sum out of:
1 5 3 2 5 6
becomes
1 6 9 11 16 22
This looks inherently serial (as all the points depend on the ones previous), but it can be done in parallel.
You mention that each step depends on the results of all preceding steps, which makes it hard to parallelize such a program.
I don't know which algorithm you are using, but you could use multithreading for speedup. Each thread would process one step, but must wait for results that haven't yet been calculated (though it can work with the already calculated results if they don't change values over time). That essentially means you would have to use a locking/waiting mechanism in order to wait for results that haven't yet been calculated but are currently needed by a certain worker thread to go on.