Linear programming, SPM - linear-programming

As you know, spatial specialization of SPM more to a task can reduce its execution time. Suppose we want to statically divide the SPM space of a processor between a series of tasks. So that each task can finish before the deadline and the idle time of the processor is maximized (so that we can turn off the processor). Suppose the timing is of EDF type. Write the linear relations necessary to solve this problem by the linear programming method. If there are variables in the multiplication relations, there is no need for linearization.
Does anyone know a solution to this problem?
As you know, spatial specialization of SPM more to a task can reduce its execution time. Suppose we want to statically divide the SPM space of a processor between a series of tasks. So that each task can finish before the deadline and the idle time of the processor is maximized (so that we can turn off the processor). Suppose the timing is of EDF type. Write the linear relations necessary to solve this problem by the linear programming method. If there are variables in the multiplication relations, there is no need for linearization.
Does anyone know a solution to this problem?

Related

How do I optimize the parallelization of Monte Carlo data generation with MPI?

I am currently building a Monte Carlo application in C++ and I have a question regarding parallelization with MPI.
The process I want to parallelize is the MC generation of data. To have good precision in my final results, I specify the goal number of data points. Each data point is generated independently, but might require vastly differing amounts of time.
How do I organize the parallelization and workload distribution of the data generation most efficiently?
What I have done so far
So far I have come up with three possible ways of organizing the MPI part of the code:
The simplest way, but most likely inefficient way: I divide the desired sample size by the number of workers and let every worker generate that amount of data in isolation. However, when the slowest worker finishes, all other workers have been idling for a potentially long time. They could have been "supporting" the slowest worker by sharing its workload.
Use a master: A master communicates with the workers who work continuously until the master process registers that we have enough data and tells everybody to stop what they are doing. The disadvantage I see is that the master process might not be necessary and could be generating data instead (especially when I don't have a lot of workers).
A "ring communication" algorithm I came up with myself: A message is continuously sent and updated in a circle (1->2, 2->3, ... , N ->1). This message contains the global number of generated data point. Once the desired goal is met, the message is tagged, circles one more time and thereby tells everybody to stop working. Important here is I use non-blocking communication (with MPI_Iprobe before receiving via MPI_Recv, and sending via MPI_Isend). This way, everybody works, and no one ever idles.
No matter, which solution is chosen, in the end I reduce all data sets to one big set and continue to process the data.
The concrete questions:
Is there an "optimal" way of parallelizing such a fairly simple process? Would you prefer any of the proposed solutions for some reason?
What do you think of this "ring communication" solution?
I'm sure I'm not the first one to come up with e.g. the ring communication algorithm. I have tried to google this problem, but it seems to me that I do not know the right terminology in this context. I'm sure there must be a lot of material and literature on such simple algorithms, but I never had a formal course on MPI/parallelization. What are the "keywords" to look for?
Any advice and tips are much appreciated.

Neural Networks training on multiple cores

Straight to the facts.
My Neural network is a classic feedforward backpropagation.
I have a historical dataset that consists of:
time, temperature, humidity, pressure
I need to predict next values basing on historical data.
This dataset is about 10MB large therefore training it on one core takes ages. I want to go multicore with the training, but i can't understand what happens with the training data for each core, and what exactly happens after cores finish working.
According to: http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation
The training data is broken up into equally large batches for each of
the threads. Each thread executes the forward and backward
propagations. The weight and threshold deltas are summed for each of
the threads. At the end of each iteration all threads must pause
briefly for the weight and threshold deltas to be summed and applied
to the neural network.
'Each thread executes forward and backward propagations' - this means, each thread just trains itself with it's part of the dataset, right? How many iterations of the training per core ?
'At the en dof each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to neural network' - What exactly does that mean? When cores finish training with their datasets, wha does the main program do?
Thanks for any input into this!
Complete training by backpropagation is often not the thing one is really looking for, the reason being overfitting. In order to obtain a better generalization performance, approaches such as weight decay or early stopping are commonly used.
On this background, consider the following heuristic approach: Split the data in parts corresponding to the number of cores and set up a network for each core (each having the same topology). Train each network completely separated of the others (I would use some common parameters for the learning rate, etc.). You end up with a number of http://www.texify.com/img/%5Cnormalsize%5C%21N_%7B%5Ctext%7B%7D%7D.gif
trained networks http://www.texify.com/img/%5Cnormalsize%5C%21f_i%28x%29.gif.
Next, you need a scheme to combine the results. Choose http://www.texify.com/img/%5Cnormalsize%5C%21F%28x%29%3D%5Csum_%7Bi%3D1%7D%5EN%5C%2C%20%5Calpha_i%20f_i%28x%29.gif, then use least squares to adapt the parameters http://www.texify.com/img/%5Cnormalsize%5C%21%5Calpha_i.gif such that http://www.texify.com/img/%5Cnormalsize%5C%21%5Csum_%7Bj%3D1%7D%5EM%20%5C%2C%20%5Cbig%28F%28x_j%29%20-%20y_j%5Cbig%29%5E2.gif is minimized. This involves a singular value decomposition which scales linearly in the number of measurements M and thus should be feasible on a single core. Note that this heuristic approach also bears some similiarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply try to average the weights, see below.
Moreover, see these answers here.
Regarding your questions:
As Kris noted it will usually be one iteration. However, in general it can be also a small number chosen by you. I would play around with choices roughly in between 1 and 20 here. Note that the above suggestion uses infinity, so to say, but then replaces the recombination step by something more appropriate.
This step simply does what it says: it sums up all weights and deltas (what exactly depends on your algoithm). Remember, what you aim for is a single trained network in the end, and one uses the splitted data for estimation of this.
To collect, often one does the following:
(i) In each thread, use your current (global) network weights for estimating the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but this works only for a single bp iteration in the threads). Now start again with (i) in which you use the same newly calculated weights in each thread. Do this until you reach convergence.
This is a form of iterative optimization. Variations of this algorithm:
Instead of using always the same split, use random splits at each iteration step (... or at each n-th iteration). Or, in the spirit of random forests, only use a subset.
Play around with the number of iterations in a single thread (as mentioned in point 1. above).
Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training-error, or some kind of least squares as above).
... plus many more choices as in each complex optimization ...
For multicore parallelization it makes no sense to think about splitting the training data over threads etc. If you implement that stuff on your own you will most likely end up with a parallelized implementation that is slower than the sequential implementation because you copy your data too often.
By the way, in the current state of the art, people usually use mini-batch stochastic gradient descent for optimization. The reason is that you can simply forward propagate and backpropagate mini-batches of samples in parallel but batch gradient descent is usually much slower than stochastic gradient descent.
So how do you parallelize the forward propagation and backpropagation? You don't have to create threads manually! You can simply write down the forward propagation with matrix operations and use a parallelized linear algebra library (e.g. Eigen) or you can do the parallelization with OpenMP in C++ (see e.g. OpenANN).
Today, leading edge libraries for ANNs don't do multicore parallelization (see here for a list). You can use GPUs to parallelize matrix operations (e.g. with CUDA) which is orders of magnitude faster.

how to apply parallelism-programming in graph problems?

Problem Description:
there is n tasks, and in these tasks, one might be dependent on the others, which means if A is dependent on B, then B must be finished before A gets finished.
1.find a way to finish these tasks as quickly as possible?
2.if take parallelism into account, how to design the program to finish these tasks?
Question:
Apparently, the answer to the first question is, topological-sort these tasks, then finish them in that order.
But how to do the job if parallelism taken into consideration?
My answer was,first topological-sort these tasks, then pick those tasks which are independent and finish them first, then pick and finish those independent ones in the rest...
Am I right?
Topological sort algorithms may give you various different result orders, so you cannot just take the first few elements and assume them to be the independent ones.
Instead of topological sorting I'd suggest to sort your tasks by the number of incoming dependency edges. So, for example if your graph has A --> B, A --> C, B --> C, D-->C you would sort it as A[0], D[0], B[1], C[3] where [i] is the number of incoming edges.
With topological sorting, you could also have gotting A,B,D,C. In that case, it wouldn't be easy to find out that you can execute A and D in parallel.
Note that after a task was completely processed you then have to update the remaining tasks, in particular, the ones that were dependent on the finished task. However, if the number of dependencies going into a task is limited to a relatively small number (say a few hundreds), you can easily rely on something like radix/bucket-sort and keep the sort structure updated in constant time.
With this approach, you can also easily start new tasks, once a single parallel task has finished. Simply update the dependency counts, and start all tasks that now have 0 incoming dependencies.
Note that this approach assumes you have enough processing power to process all tasks that have no dependencies at the same time. If you have limited resources and care for an optimal solution in terms of processing time, then you'd have to invest more effort, as the problem becomes NP-hard (as arne already mentioned).
So to answer your original question: Yes, you are basically right, however, you lacked to explain how to determine those independent tasks efficiently (see my example above).
I would try sorting them in a directed forest structure with task execution time as edge weigths. Order the arborescences from heaviest to lightest and start with the heaviest. Using this approach you can, at the same time, check for circular dependencies.
Using parallelism, you get the bin problem, which is NP-hard. Try looking up approximative solutions for that problem.
Have a look at the Critical Path Method, taken from the are of project management. It basically do what you need: given tasks with dependecies and durations, it produces how much time it will take, and when to activate each task.
(*)Note that this technique is assuming infinite number of resources for optimal solution. For limited resources there are heuristics for greedy algorithms such as: GPRW [current+following tasks time] or MSLK [minimum total slack time].
(*)Also note, it requires knowing [or at least estimating] how long will each task take.

mapreduce vs other parallel processing solutions

So, the questions are:
1. Is mapreduce overhead too high for the following problem? Does anyone have an idea of how long each map/reduce cycle (in Disco for example) takes for a very light job?
2. Is there a better alternative to mapreduce for this problem?
In map reduce terms my program consists of 60 map phases and 60 reduce phases all of which together need to be completed in 1 second. One of the problems I need to solve this way is a minimum search with about 64000 variables. The hessian matrix for the search is a block matrix, 1000 blocks of size 64x64 along a diagonal, and one row of blocks on the extreme right and bottom. The last section of : block matrix inversion algorithm shows how this is done. Each of the Schur complements S_A and S_D can be computed in one mapreduce step. The computation of the inverse takes one more step.
From my research so far, mpi4py seems like a good bet. Each process can do a compute step and report back to the client after each step, and the client can report back with new state variables for the cycle to continue. This way the process state is not lost computation can be continued with any updates.
http://mpi4py.scipy.org/docs/usrman/index.html
This wiki holds some suggestions, but does anyone have a direction on the most developed solution:
http://wiki.python.org/moin/ParallelProcessing
Thanks !
MPI is a communication protocol that allows for the implementation of parallel processing by passing messages between cluster nodes. The parallel processing model that is implemented with MPI depends upon the programmer.
I haven't had any experience with MapReduce but it seems to me that it is a specific parallel processing model and is designed to be simple to implement. This kind of abstraction should save you programming time and may or may not provide a suitable solution to your problem. It all depends on the nature of what you are trying to do.
The trick with parallel processing is that the most suitable solution is often problem specific and without knowing more specifics about your problem it is hard to make recommendations.
If you can tell us more about the environment that you are running your job on and where your program fits into Flynn's taxonomy, I might be able to provide some more helpful suggestions.

Organizing a task-based scientific computation

I have a computational algebra task I need to code up. The problem is broken into well-defined individuals tasks that naturally form a tree - the task is combinatorial in nature, so there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large.
I had previously coded this up in a functional fashion, calling the functions as needed and storing everything in RAM. This was a terrible approach, but I was more concerned about the theory then.
I'm planning to rewrite the code in C++ for a variety of reasons. I have a few requirements:
Checkpointing: The calculation takes a long time, so I need to be able to stop at any point and resume later.
Separate individual tasks as objects: This helps me keep a good handle of where I am in the computations, and offers a clean way to do checkpointing via serialization.
Multi-threading: The task is clearly embarrassingly parallel, so it'd be neat to exploit that. I'd probably want to use Boost threads for this.
I would like suggestions on how to actually implement such a system. Ways I've thought of doing it:
Implement tasks as a simple stack. When you hit a task that needs subcalculations done, it checks if it has all the subcalculations it requires. If not, it creates the subtasks and throws them onto the stack. If it does, then it calculates its result and pops itself from the stack.
Store the tasks as a tree and do something like a depth-first visitor pattern. This would create all the tasks at the start and then computation would just traverse the tree.
These don't seem quite right because of the problems of the lower levels requiring a vast number of subtasks. I could approach it in a iterator fashion at this level, I guess.
I feel like I'm over-thinking it and there's already a simple, well-established way to do something like this. Is there one?
Technical details in case they matter:
The task tree has 5 levels.
Branching factor of the tree is really small (say, between 2 and 5) for all levels except the lowest which is on the order of a few million.
Each individual task would only need to store a result tens of bytes large. I don't mind using the disk as much as possible, so long as it doesn't kill performance.
For debugging, I'd have to be able to recall/recalculate any individual task.
All the calculations are discrete mathematics: calculations with integers, polynomials, and groups. No floating point at all.
there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large... blah blah resuming, multi-threading, etc.
Correct me if I'm wrong, but it seems to me that you are exactly describing a map-reduce algorithm.
Just read what wikipedia says about map-reduce :
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
Using an existing mapreduce framework could save you a huge amount of time.
I just google "map reduce C++" and I start to get results, notably one in boost http://www.craighenderson.co.uk/mapreduce/
These don't seem quite right because of the problems of the lower levels requiring a vast number of subtasks. I could approach it in a iterator fashion at this level, I guess.
You definitely do not want millions of CPU-bound threads. You want at most N CPU-bound threads, where N is the product of the number of CPUs and the number of cores per CPU on your machine. Exceed N by a little bit and you are slowing things down a bit. Exceed N by a lot and you are slowing things down a whole lot. The machine will spend almost all its time swapping threads in and out of context, spending very little time executing the threads themselves. Exceed N by a whole lot and you will most likely crash your machine (or hit some limit on threads). If you want to farm lots and lots (and lots and lots) of parallel tasks out at once, you either need to use multiple machines or use your graphics card.