I need to parallelize Kruskal's algorithm. The serial version uses union-find to detect cycles in the undirected graph. Is there any way to parallelize this part of the code?
Well, it can be parallelized to some extent. It works as follows:
Initially, all the edges are sorted in ascending order of weight. A main thread scans the edges from the beginning and decides whether adding the current edge would form a cycle. Our main aim in parallelizing the algorithm is to run these checks in parallel.
This is where the worker threads come in. Each worker is given a certain number of edges to examine and, after every iteration (an iteration being the main thread adding a new edge), checks whether any of its edges would form a cycle with the current partial spanning forest. As the main thread keeps adding edges, the workers find that some of their edges already close a cycle.
Such edges are marked as discarded. When the main thread reaches such an edge, it simply moves on to the next one without checking it.
Thus the cycle checks are done in parallel, and the main thread has fewer union-find checks left to perform itself.
In fact, there is a nice paper that uses the same idea described above.
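A minimal sketch of the scheme described above, not the code from the paper. The names `find_root` and `kruskal_with_helpers` are made up for illustration, and the helpers use a read-only find (no path compression) so that only the main thread ever writes to `parent`. In CPython the global interpreter lock means this shows the structure rather than a real speedup.

```python
import threading

def find_root(parent, x):
    # Read-only find (no path compression), so helpers never write to parent.
    while parent[x] != x:
        x = parent[x]
    return x

def kruskal_with_helpers(num_nodes, edges, num_workers=2):
    """edges: list of (weight, u, v). Returns the chosen spanning-forest edges."""
    edges = sorted(edges)
    parent = list(range(num_nodes))
    discarded = [False] * len(edges)
    done = threading.Event()

    def worker(indices):
        # Keep re-checking this worker's edges against the growing forest.
        while not done.is_set():
            for i in indices:
                if done.is_set():
                    return
                if not discarded[i]:
                    _, u, v = edges[i]
                    if find_root(parent, u) == find_root(parent, v):
                        discarded[i] = True  # endpoints already connected

    threads = [
        threading.Thread(target=worker, args=(range(k, len(edges), num_workers),))
        for k in range(num_workers)
    ]
    for t in threads:
        t.start()

    tree = []
    try:
        for i, (w, u, v) in enumerate(edges):
            if discarded[i]:
                continue  # a helper already showed this edge closes a cycle
            ru, rv = find_root(parent, u), find_root(parent, v)
            if ru != rv:
                parent[ru] = rv  # union: only the main thread modifies parent
                tree.append((w, u, v))
    finally:
        done.set()
        for t in threads:
            t.join()
    return tree

if __name__ == "__main__":
    demo = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3), (5, 1, 3)]
    print(kruskal_with_helpers(4, demo))
```

A helper can only mark an edge once both endpoints are genuinely connected, because components only ever merge. A stale read therefore just means the main thread performs the check itself.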
EDIT:
If you are concerned about the running time of the overall algorithm, you can also use a parallel sorting algorithm for the initial sort, as #jarod42 suggested.
Can anyone suggest partitioning algorithms for a vision algorithm, i.e. ways to decompose its computations or workload into small tasks that expose opportunities for parallel execution?
You don't need a partitioning algorithm necessarily.
In any convolution task, each pixel in the output is independent of any other output pixel. Morphological operations are similarly parallelizable, as well as the Hough Transform.
Using any of these, you could have multiple threads or processes working together. A simple implementation would have a shared pointer that iterates over all pixels; when a thread is free, it takes the current item and advances the pointer (preferably atomically, but it won't break if it isn't atomic), performs the appropriate computation, and writes the result to the output. You don't need to worry about deadlocks or race conditions on the output because the computations are independent of each other.
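As a rough illustration of that independence, here is a sketch that hands out whole rows of a 3x3 box blur to a thread pool. The filter, the row granularity, and the names `box_blur_row` and `box_blur` are assumptions chosen for the example. The pool's internal work queue plays the role of the shared pointer.

```python
from concurrent.futures import ThreadPoolExecutor

def box_blur_row(image, y):
    """Compute one output row of a 3x3 box blur (borders clamped)."""
    height, width = len(image), len(image[0])
    row = []
    for x in range(width):
        total = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                yy = min(max(y + dy, 0), height - 1)
                xx = min(max(x + dx, 0), width - 1)
                total += image[yy][xx]
        row.append(total / 9)
    return row

def box_blur(image, workers=4):
    # Each output row depends only on the read-only input image,
    # so rows can be computed in any order by any thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda y: box_blur_row(image, y), range(len(image))))

if __name__ == "__main__":
    flat = [[5] * 4 for _ in range(3)]
    print(box_blur(flat))
```

For CPU-bound work in CPython a process pool or a native library would be needed for an actual speedup. The decomposition itself stays the same.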
I'm currently working on a project that requires a thread to build a queue of the 30 or so processes nearest to the player within a 3D environment.
All of these processes can move about the environment and can leave the nodes they were originally placed in. I have considered using R-trees, but because of their very high insertion times they do not seem viable.
k-d trees would not work either, since they tend to work only for static environments.
Note also that this will run asynchronously to the main update thread, so an atomic approach would work best.
Can someone suggest an approach?
Straight to the facts.
My neural network is a classic feedforward network trained with backpropagation.
I have a historical dataset that consists of:
time, temperature, humidity, pressure
I need to predict the next values based on the historical data.
This dataset is about 10 MB, so training on one core takes ages. I want to go multicore with the training, but I can't understand what happens with the training data for each core, and what exactly happens after the cores finish working.
According to: http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation
The training data is broken up into equally large batches for each of
the threads. Each thread executes the forward and backward
propagations. The weight and threshold deltas are summed for each of
the threads. At the end of each iteration all threads must pause
briefly for the weight and threshold deltas to be summed and applied
to the neural network.
'Each thread executes forward and backward propagations' - this means each thread just trains itself with its part of the dataset, right? How many iterations of the training per core?
'At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network' - What exactly does that mean? When the cores finish training with their datasets, what does the main program do?
Thanks for any input into this!
Complete training by backpropagation is often not the thing one is really looking for, the reason being overfitting. In order to obtain a better generalization performance, approaches such as weight decay or early stopping are commonly used.
Against this background, consider the following heuristic approach: split the data into as many parts as there are cores and set up one network per core (all with the same topology). Train each network completely separately from the others (I would use common parameters for the learning rate, etc.). You end up with $N$ trained networks $f_i(x)$.
Next, you need a scheme to combine the results. Choose $F(x) = \sum_{i=1}^{N} \alpha_i f_i(x)$, then use least squares to adapt the parameters $\alpha_i$ such that $\sum_{j=1}^{M} \big(F(x_j) - y_j\big)^2$ is minimized. This involves a singular value decomposition, which scales linearly in the number of measurements $M$ and thus should be feasible on a single core. Note that this heuristic approach also bears some similarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply try to average the weights, see below.
Moreover, see these answers here.
Regarding your questions:
1. As Kris noted, it will usually be one iteration. In general, however, it can also be a small number chosen by you. I would play around with choices roughly between 1 and 20 here. Note that the above suggestion uses infinity, so to say, but then replaces the recombination step with something more appropriate.
2. This step simply does what it says: it sums up all weights and deltas (what exactly depends on your algorithm). Remember, what you aim for is a single trained network in the end, and the split data is used to estimate it.
To collect, often one does the following:
(i) In each thread, use your current (global) network weights for estimating the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but this works only for a single bp iteration in the threads). Now start again with (i) in which you use the same newly calculated weights in each thread. Do this until you reach convergence.
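Steps (i) and (ii) can be sketched as follows. To keep the example short and self-contained it uses a stand-in model, a linear fit y = w*x + b with squared error, instead of a real network with backpropagation; `local_step` and `train` are hypothetical names, and the learning rate and iteration count are arbitrary choices.

```python
from concurrent.futures import ThreadPoolExecutor

def local_step(weights, shard, lr):
    """One gradient step on one shard for the stand-in model y = w*x + b."""
    w, b = weights
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return (w - lr * grad_w, b - lr * grad_b)

def train(data, num_threads=4, lr=0.1, iterations=500):
    # Fixed split of the training data, one shard per thread.
    shards = [data[k::num_threads] for k in range(num_threads)]
    weights = (0.0, 0.0)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for _ in range(iterations):
            # (i) every thread starts from the same global weights
            local = list(pool.map(lambda s: local_step(weights, s, lr), shards))
            # (ii) average the thread-local weights into new global weights
            weights = (sum(w for w, _ in local) / len(local),
                       sum(b for _, b in local) / len(local))
    return weights

if __name__ == "__main__":
    samples = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
    print(train(samples))
```

With equally sized shards and a single local step, averaging the weights is the same as applying the averaged deltas, which is why both options coincide in that case.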
This is a form of iterative optimization. Variations of this algorithm:
Instead of using always the same split, use random splits at each iteration step (... or at each n-th iteration). Or, in the spirit of random forests, only use a subset.
Play around with the number of iterations in a single thread (as mentioned in point 1. above).
Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training-error, or some kind of least squares as above).
... plus many more choices as in each complex optimization ...
For multicore parallelization it makes no sense to think about splitting the training data over threads etc. If you implement that stuff on your own you will most likely end up with a parallelized implementation that is slower than the sequential implementation because you copy your data too often.
By the way, in the current state of the art, people usually use mini-batch stochastic gradient descent for optimization. The reason is that you can simply forward propagate and backpropagate mini-batches of samples in parallel but batch gradient descent is usually much slower than stochastic gradient descent.
So how do you parallelize the forward propagation and backpropagation? You don't have to create threads manually! You can simply write down the forward propagation with matrix operations and use a parallelized linear algebra library (e.g. Eigen) or you can do the parallelization with OpenMP in C++ (see e.g. OpenANN).
Today, leading edge libraries for ANNs don't do multicore parallelization (see here for a list). You can use GPUs to parallelize matrix operations (e.g. with CUDA) which is orders of magnitude faster.
I have an implementation of Dijkstra's Algorithm, based on the code on this website. Basically, I have a number of nodes (say 10000), and each node can have 1 to 3 connections to other nodes.
The nodes are generated randomly within a 3D space. The connections are also randomly generated, but the generator always tries to find connections with a node's closest neighbors first and slowly increases the search radius. Each connection is given a distance of one. (I doubt any of this matters, but it's just background.)
In this case, then, the algorithm is just being used to find the smallest number of hops from the starting point to all the other nodes. It works well for 10,000 nodes. The problem is that as the number of nodes increases, say towards 2 million, I use up all of my computer's memory when trying to build the graph.
Does anyone know of an alternative way of implementing the algorithm to reduce the memory footprint, or is there another algorithm out there that uses less memory?
According to your comment above, you are representing the edges of the graph with a distance matrix long dist[GRAPHSIZE][GRAPHSIZE]. This takes O(n^2) memory, which is too much for large values of n: with 2 million nodes the matrix has 4 × 10^12 entries, which is tens of terabytes. It is also not a good representation in terms of execution time when you only have a small number of edges: it causes Dijkstra's algorithm to take O(n^2) time (where n is the number of nodes) when it could potentially be faster, depending on the data structures used.
Since in your case you said each node is only connected to up to 3 other nodes, you shouldn't use this matrix: Instead, for each node you should store a list of the nodes it is connected to. Then when you want to go over the neighbors of a node, you just need to iterate over this list.
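A minimal sketch of that representation, assuming nodes are numbered 0..n-1 and `adjacency[u]` holds `(neighbor, distance)` pairs; the function name and the use of a binary heap are choices made for this example. Memory grows with the number of edges rather than with n^2.

```python
import heapq

def dijkstra(adjacency, source):
    """adjacency[u] is a list of (neighbor, distance) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in adjacency[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

if __name__ == "__main__":
    # Four nodes, every connection has distance 1 as in the question.
    graph = [
        [(1, 1), (2, 1)],  # node 0
        [(0, 1), (3, 1)],  # node 1
        [(0, 1)],          # node 2
        [(1, 1)],          # node 3
    ]
    print(dijkstra(graph, 0))
```

Since every connection has distance one in your case, a plain breadth-first search over the same lists would give the same hop counts.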
In some specific cases you don't even need to store this list because it can be calculated for each node when needed. For example, when the graph is a grid and each node is connected to the adjacent grid nodes, it's easy to find a node's neighbors on the fly.
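For the grid case, computing neighbors on the fly could look like this sketch (4-connected grid assumed; `grid_neighbors` is a name made up for the example):

```python
def grid_neighbors(x, y, width, height):
    """Yield the 4-connected neighbors of (x, y) inside a width x height grid."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)

if __name__ == "__main__":
    print(list(grid_neighbors(0, 0, 3, 3)))
```

Nothing is stored per node here, so the graph itself costs no memory beyond the grid dimensions.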
If you really cannot afford the memory, even after minimizing your graph representation, you could develop a variation of Dijkstra's algorithm based on a divide-and-conquer approach.
The idea is to split the data into smaller chunks, so that you can run Dijkstra's algorithm within each chunk, for each of the points in it.
Then treat the solution computed for each chunk as a single node of another data chunk, from which you start another run of Dijkstra.
For example, consider the points below:
.B .C
.E
.A .D
.F .G
You can select the points closest to a given node, say within two hops, and then use the solution as part of an extended graph, treating those points as a single set whose distance equals the distance found by the Dijkstra solution.
Say you start from D:
select the closest points to D within a given number of hops;
use Dijkstra's algorithm upon the selected entries, commencing from D;
use the solution as a graph with the central node D and the last nodes in the shortest paths as nodes directly linked to D;
extend the graph, repeating the algorithm until all the nodes have been considered.
Although this costs extra processing, it lets you work around the memory limitation, and if you have other machines available you can even distribute the work.
Please note that this is only the idea; the process I've described is not necessarily the best way to do it. You may find something interesting by searching for distributed Dijkstra's algorithm.
I like boost::graph a lot. Its memory consumption is very decent (I've used it on road networks with 10 million nodes and 2 GB of RAM).
It has a Dijkstra implementation, but if the goal is to implement and understand it by yourself, you can still use their graph representation (I suggest adjacency list) and compare your result with theirs to be sure your result is correct.
Some people mentioned other algorithms. I don't think the choice of algorithm will play a big role in memory usage; it is more likely to affect speed. With 2M nodes, if the topology is close to a street network, the running time from one node to all others will be less than a second.
http://www.boost.org/doc/libs/1_52_0/libs/graph/doc/index.html
Problem Description:
There are n tasks, and some of them may depend on others: if A depends on B, then B must be finished before A is finished.
1. Find a way to finish these tasks as quickly as possible.
2. If parallelism is taken into account, how should the program be designed to finish these tasks?
Question:
Apparently, the answer to the first question is to topologically sort the tasks and then finish them in that order.
But how should the job be done if parallelism is taken into consideration?
My answer was: first topologically sort the tasks, then pick the tasks that are independent and finish them first, then pick and finish the independent ones among the rest, and so on.
Am I right?
Topological sort algorithms may give you various different result orders, so you cannot just take the first few elements and assume them to be the independent ones.
Instead of topological sorting, I'd suggest sorting your tasks by the number of incoming dependency edges. So, for example, if your graph has A --> B, A --> C, B --> C, D --> C, you would sort it as A[0], D[0], B[1], C[3], where [i] is the number of incoming edges.
With topological sorting, you could also have gotten A, B, D, C. In that case, it wouldn't be easy to find out that you can execute A and D in parallel.
Note that after a task was completely processed you then have to update the remaining tasks, in particular, the ones that were dependent on the finished task. However, if the number of dependencies going into a task is limited to a relatively small number (say a few hundreds), you can easily rely on something like radix/bucket-sort and keep the sort structure updated in constant time.
With this approach, you can also easily start new tasks, once a single parallel task has finished. Simply update the dependency counts, and start all tasks that now have 0 incoming dependencies.
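A minimal sketch of the dependency-count idea, using the example graph from above. It is simplified to run in waves, meaning it waits for every ready task to finish before releasing the next group instead of starting tasks as soon as any single task completes. The name `parallel_waves` and the `(before, after)` edge format are assumptions for the example.

```python
from collections import defaultdict

def parallel_waves(tasks, edges):
    """edges: (before, after) pairs. Returns groups of tasks that can run together."""
    pending = {t: 0 for t in tasks}      # number of unfinished prerequisites
    dependents = defaultdict(list)       # task -> tasks waiting on it
    for before, after in edges:
        pending[after] += 1
        dependents[before].append(after)

    ready = sorted(t for t in tasks if pending[t] == 0)
    waves = []
    finished = 0
    while ready:
        waves.append(ready)
        finished += len(ready)
        next_ready = []
        for t in ready:
            for d in dependents[t]:
                pending[d] -= 1          # one prerequisite of d is now done
                if pending[d] == 0:
                    next_ready.append(d)
        ready = sorted(next_ready)
    if finished != len(tasks):
        raise ValueError("dependency cycle detected")
    return waves

if __name__ == "__main__":
    deps = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]
    print(parallel_waves(["A", "B", "C", "D"], deps))
    # prints [['A', 'D'], ['B'], ['C']]
```

The counters are updated once per dependency edge, so the bookkeeping is linear in the size of the graph.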
Note that this approach assumes you have enough processing power to process all tasks that have no dependencies at the same time. If you have limited resources and care for an optimal solution in terms of processing time, then you'd have to invest more effort, as the problem becomes NP-hard (as arne already mentioned).
So to answer your original question: yes, you are basically right, but you did not explain how to determine those independent tasks efficiently (see my example above).
I would try arranging them in a directed forest structure with task execution times as edge weights. Order the arborescences from heaviest to lightest and start with the heaviest. With this approach you can check for circular dependencies at the same time.
With parallelism, you get the bin packing problem, which is NP-hard. Try looking up approximate solutions for that problem.
Have a look at the Critical Path Method, taken from the area of project management. It basically does what you need: given tasks with dependencies and durations, it tells you how much time the whole set will take and when to start each task.
(*)Note that this technique is assuming infinite number of resources for optimal solution. For limited resources there are heuristics for greedy algorithms such as: GPRW [current+following tasks time] or MSLK [minimum total slack time].
(*)Also note, it requires knowing [or at least estimating] how long will each task take.
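A small sketch of the forward pass of the Critical Path Method under the assumptions stated in the notes above: unlimited resources and known durations. The function name and the dictionary-based input format are made up for the example, and the recursion is only meant for small task graphs.

```python
def critical_path(durations, prerequisites):
    """durations: {task: time}; prerequisites: {task: [tasks that must finish first]}.
    Returns (earliest_start, total_duration), assuming unlimited workers."""
    earliest_start = {}

    def start_of(task, visiting=()):
        if task in visiting:
            raise ValueError("dependency cycle detected")
        if task not in earliest_start:
            # A task can start once its slowest prerequisite has finished.
            earliest_start[task] = max(
                (start_of(p, visiting + (task,)) + durations[p]
                 for p in prerequisites.get(task, [])),
                default=0,
            )
        return earliest_start[task]

    total = max((start_of(t) + durations[t] for t in durations), default=0)
    return earliest_start, total

if __name__ == "__main__":
    times = {"A": 3, "B": 2, "C": 4, "D": 1}
    needs = {"B": ["A"], "C": ["A", "B", "D"]}
    print(critical_path(times, needs))
```

In this example A and D can start at time 0, B at 3 and C at 5, so the whole set takes 9 time units along the critical path A, B, C.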