Use OpenMP to find minimum for sets in parallel, C++ - c++

I'm implementing Boruvka's algorithm in C++ to find minimum spanning tree for a graph. This algorithm finds a minimum-weight edge for each supervertex (a supervertex is a connected component, it is simply a vertex in the first iteration) and adds them into the MST. Once an edge is added, we update the connected components and repeat the find-min-edge, and merge-supervertices process, until all the vertices in the graph are in one connected component.
Since find-min-edge for each supervertex can be done in parallel, I want to use OpenMP to do this. Here is the omp for loop I would like to use for parallel find-min.
int index[NUM_VERTICES];
#pragma omp parallel private(nthreads, tid, index, min) shared(minedgeindex, setcount, forest, EV, umark)
{
#pragma omp for
for(int k = 0; k < setcount; k++){ //iterate over supervertices, omp for here
min = 9999;
std::fill_n(index, NUM_VERTICES, -1);
/* Gets minimum edge for each supervertex */
for(int i = 0; i < NUM_VERTICES; i++) {
if(forest[i]->mark == umark[k]){ //find vertices with mark k
for(int j = 0; j < NUM_EDGES; j++) {
//check min edge for each vertex in the supervertex k
if(EV[j].v1==i){
if(Find(forest[EV[j].v1])!= Find(forest[EV[j].v2])){
if(EV[j].w <= min ){
min = EV[j].w;
index[k] = j;
break; //break looping over edges for current vertex i, go to next vertex i+1
}
}
}
}
}
} //end finding min disjoint-connecting edge for the supervertex with mark k
if(index[k] != -1){
minedgeindex.insert(minedgeindex.begin(), index[k]);
}
} //omp for end
}
Since I'm new to OpenMP, I currently cannot make it work as I expected.
Let me briefly explain what I'm doing in this block of code:
setcount is the number of supervertices. EV is a vector containing all edges (Edge is a struct I defined previously, has attributes v1, v2, w which correspond to the two nodes it connects and its weight). minedgeindex is a vector, I want each thread to find min edge for each connected component, and add the index (index j in EV) of the min edge to vector minedgeindex at the same time. So I think minedgeindex should be shared. forest is a struct for each vertex, it has a set mark umark indicating which supervertex it's in. I use Union-Find to mark all supervertices, but it is not relevant in this block of omp code.
The ultimate goal I need for this block of code is to give me the correct vector minedgeindex containing all min edges for each supervertex.
To be more clear and ignore the graph background, I just have a large vector of numbers, I separate them into a bunch of sets, then I need some parallel threads to find the min for each set of numbers and give me back the indices for those mins, store them in a vector minedgeindex.
If you need more clarification just ask me. Please help me make this work, I think the main issue is the declaration of private and shared variables which I don't know if I'm doing right.
Thank you in advance!

Allocating an array outside of a parallel block and then declaring it private is not going to work.
Edit: After reading through your code again it does not appear that index should even be private. In that case you should just declare it outside the parallel block (as you did) but not declare it private. But I am not sure you even need index to be an array. I think you can just declare it as an private int.
Additionally, you can't fill minedgeindex like you did. That causes a race condition. You need to put it in a critical section. Personally I would try and use push_back and not insert from the beginning of the array since that's inefficient.
Some people prefer to explicitly declare everything shared and private. In standard C you sorta have to do this, at least for private. But for C99/C++ this is not necessary. I prefer to only declare shared/private if it's necessary. Everything outside of the parallel region is shared (unless it's an index used in a parallel loop) and everything inside is private. If you keep that in mind you rarely have to explicitly declare data shared or private.
//int index[NUM_VERTICES]; //index is shared
//std::fill_n(index, NUM_VERTICES, -1);
#pragma omp parallel
{
#pragma omp for
for(int k = 0; k < setcount; k++){ //iterate over supervertices, omp for here
int min = 9999; // min is private
int index = -1;
//iterate over supervertices
if(index != -1){
#pragma omp critical
minedgeindex.insert(minedgeindex.begin(), index);
//minedgeindex.insert(minedgeindex.begin(), index[k]);
}
}
}
Now that the code is working here are some suggestions to perhaps speed it up.
Using the critical declaration inside the loop could be very inefficient. I suggest filling a private array (std::vector) and then merging them after the parallel loop (but still in the parallel block). The loop has an implicit barrier which is not necessary. This can be removed with nowait.
Independent of the critical section the time to find each minimum can vary per iteration so you may want to consider schedule(dynamic). The following code does all this. Some variation of these suggestions, if not all, may improve your performance.
#pragma omp parallel
{
vector<int> minedgeindex_private;
#pragma omp for schedule(dynamic) nowait
for(int k = 0; k < setcount; k++){ //iterate over supervertices, omp for here
int min = 9999;
int index = -1;
//iterate over supervertices
if(index != -1){
minedgeindex_private.push_back(index);
}
}
#pragma omp critical
minedgeindex.insert(
minedgeindex.end(),
minedgeindex_private.begin(), minedgeindex_private.end());
}

This is not going to work efficiently with openMP, because omp for simply splits the work statically between all threads, i.e. each threads gets a fair share of the supervertices. However, the work per supervertex may be uneven, when the work-sharing between treads not be even.
You can try to use dynamic or guided schedule with openMP, but better is to avoid openMP altogether and use TBB, when tbb::parallel_for() avoids this issue.
OpenMP has several disadvantages:
1) it is pre-processor based
2) it has rather limited functionality (this is what I highlighted above)
3) it isn't standardised for C++ (in particular C++11)
TBB is a pure C++ library (no preprocessor hack) with full C++11 support. For more details, see my answer to this question

Related

Aspects that affects the efficiency of OpenMP parallelism

I would like to parallel a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
#pragma omp simd
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
#pragma omp atomic
result[index]++;
}
Then I found I cannot get improvements in efficiency if I use 2, 4, or 8 threads. The execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations and they are independent so I want multiple threads to execute those iterations in parallel.
I guess the reasons for performance decrease maybe include: private copies of config? or, random access of ref_table? or, expensive atomic operation? So what are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?
Private copies of config or, random access of ref_tables are not problematic, I think the workload is very small, there are 2 potential issues which prevent efficient parallelization:
atomic operation is too expensive.
overheads are bigger than workload (it simply means that it is not worth parallelizing with OpenMP)
I do not know which one is more significant in your case, so it is worth trying to get rid of atomic operation. There are 2 cases:
a) If the results array is zero initialized you have to use:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config) where N is the size of result array and delete #pragma omp atomic. Note that this works on OpenMP 4.5 or later. It is also worth removing #parama omp simd for a loop of 2-10 iterations. So, your code should look like this:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
result[index]++;
}
b) If the result array is not zero initialized the solution is very similar, but use a temporary zero initialized array in the loop and after that add it to result array.
If the speed will not increase then your code is not worth parallelizing with OpenMP on your hardware.

OpenMP: Is array reduction always needed for updating an array in parallel?

I am quite new to OpenMP. I have the following simple loop that I want to run in parallel with OpenMP:
double rij[3];
double r;
#ifdef _OPENMP
#pragma omp parallel for private(rij,r)
#endif
for (int i=0; i<n; ++i)
{
for (int j=0; j<n; ++j)
{
if (i != j)
{
distance(X,rij,r,i,j);
V[i] += ke * Q[j] / r;
for (int k=0; k<3; ++k)
{
F[3*i+k] += ke * Q[j] * rij[k] / pow(r,3);
}
}
}
}
From what I understood, variables are shared by default which is why I only declared private(rij,r). But according to these questions (first second third), I should do array reduction in this case.
It's clear to me that if many threads need to sum to the same variable, this has to be done with #pragma omp parallel for reduction(+:A[:n]) for summing to array A of size n. This is what I do in another part of my code, and it works as expected.
However, in this case workers never have to sum to the same variable: every worker performs the sum on its index i. Is is correct to do as I do in this case i.e. not doing any array reduction and not using any critical section ?
If my implementation is correct, I believe it would avoid the overhead of the critical section while being simpler code. Feel free to give your advice on how this could be better optimized.
Thank you
You don't need a reduction. It is a feature to avoid copying the same code all over again because they are re-occurring problems (Try to think off, how you would implement a sum-reduction without OpenMP).
What you do right now is working on parallel data (V[i]) which should not overlap at any iteration (as you state in the question), because you divide by i itself. Furthermore write to F[...] shouldn't overlap either, because it only depends on iand k

Parallelize Algorithm with OpenMP in C++

my problem is this:
I want to solve TSP with the Ant Colony Optimization Algorithm in C++.
Right now Ive implemented a algorithm that solve this problem iterative.
For example: I generate 500 ants - and they find their route one after the other.
Each ant starts not until the previous ant finished.
Now I want to parallelize the whole thing - and I thought about using OpenMP.
So my first question is: Can I generate a large number of threads that work
simultaneously (for the number of ants > 500)?
I already tried something out. So this is my code from my main.cpp:
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
#pragma omp ordered
if (ant->getIterations() < ITERATIONSMAX) {
ant->setNumber(currentAntNumber);
currentAntNumber++;
ant->antRoute();
}
}
And this is the code in my Ant class that is "critical" because each Ant reads and writes into the same Matrix (pheromone-Matrix):
void Ant::antRoute()
{
this->route.setCity(0, this->getStartIndex());
int nextCity = this->getNextCity(this->getStartIndex());
this->routedistance += this->data->distanceMatrix[this->getStartIndex()][nextCity];
int tempCity;
int i = 2;
this->setProbability(nextCity);
this->setVisited(nextCity);
this->route.setCity(1, nextCity);
updatePheromone(this->getStartIndex(), nextCity, routedistance, 0);
while (this->getVisitedCount() < datacitycount) {
tempCity = nextCity;
nextCity = this->getNextCity(nextCity);
this->setProbability(nextCity);
this->setVisited(nextCity);
this->route.setCity(i, nextCity);
this->routedistance += this->data->distanceMatrix[tempCity][nextCity];
updatePheromone(tempCity, nextCity, routedistance, 0);
i++;
}
this->routedistance += this->data->distanceMatrix[nextCity][this->getStartIndex()];
// updatePheromone(-1, -1, -1, 1);
ShortestDistance(this->routedistance);
this->iterationsshortestpath++;
}
void Ant::updatePheromone(int i, int j, double distance, bool reduce)
{
#pragma omp critical(pheromone)
if (reduce == 1) {
for (int x = 0; x < datacitycount; x++) {
for (int y = 0; y < datacitycount; y++) {
if (REDUCE * this->data->pheromoneMatrix[x][y] < 0)
this->data->pheromoneMatrix[x][y] = 0.0;
else
this->data->pheromoneMatrix[x][y] -= REDUCE * this->data->pheromoneMatrix[x][y];
}
}
}
else {
double currentpheromone = this->data->pheromoneMatrix[i][j];
double updatedpheromone = (1 - PHEROMONEREDUCTION)*currentpheromone + (PHEROMONEDEPOSIT / distance);
if (updatedpheromone < 0.0) {
this->data->pheromoneMatrix[i][j] = 0;
this->data->pheromoneMatrix[j][i] = 0;
}
else {
this->data->pheromoneMatrix[i][j] = updatedpheromone;
this->data->pheromoneMatrix[j][i] = updatedpheromone;
}
}
}
So for some reasons the omp parallel for loop wont work on these range-based loops. So this is my second question - if you guys have any suggestions on the code how the get the range-based loops done im happy.
Thanks for your help
So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
In OpenMP you typically shouldn't care how many threads are active, instead you make sure to expose enough parallel work through work-sharing constructs such as omp for or omp task. So while you may have a loop with 500 iterations, your program could be run with anything between one thread and 500 (or more, but they would just idle). This is a difference to other parallelization approaches such as pthreads where you have to manage all the threads and what they do.
Now your example uses ordered incorrectly. Ordered is only useful if you have a small part of your loop body that needs to be executed in-order. Even then it can be very problematic for performance. Also you need to declare a loop to be ordered if you want to use ordered inside. See also this excellent answer.
You should not use ordered. Instead make sure that the ants know there number beforehand, write the code such that they don't need a number, or at the very least that the order of numbers doesn't matter for ants. In the latter case you can use omp atomic capture.
As to the access to shared data. Try to avoid it as much as possible. Adding omp critical is a first step to get a correct parallel program, but often leads to performance problems. Measure your parallel efficiency, use parallel performance analysis tools to find out if this is the case for you. Then you can use atomic data access or reduction (each threads has their own data they work on and only after the main work is finished, data from all threads is merged).

Can a thread-local copy of select elements be created of a shared 2D array in a parallel region? (Shared, private, barrier: OPenMP)

I have a 2-D grid of nxn elements. In one iteration, I'm calculating the value of one element by averaging the values of its neighbors. That is:
for(int i=0;i<n;i++)
for(int j=0;j<n;j++)
grid[i][j] = (grid[i-1][j] + grid[i][j-1] + grid[i+1][j] + grid[i][j+1])/4.0;
And I need to run the above nested loop for iter number of iterations.
What I need is the following:
I need the threads to calculate this average, wait till all the threads have finished calculating and THEN update the grid in one go.
The loop with iter iterations will run sequentially, but during every iteration, the value of grid[i][j] for every i and j should be calculated in parallel.
In order to do that I have the following ideas and questions:
Maybe make grid shared and put a copy of the select 4 elements of the grid that is needed for calculating grid[i][j] by making only those 4 elements private to the thread. (Basically grid is shared by all threads, but there is a local copy of 4 iteration-specific elements in every thread too.) Is this possible?
Would a barrier be in fact needed for all the threads to finish and then start onto the next iteration?
I'm very new to the OpenMP way of thinking and I'm utterly lost in this simple problem. I'd be grateful if somebody could help resolve my confusion.
In practice, you'd want to have (much) fewer threads than grid points, so each thread will be calculating a whole bunch of points (for example, one row). There is a certain overhead associated with starting OpenMP (or any other kind of) threads, and you program will be memory-bound rather than CPU-bound anyway. So starting a thread per grid point will defeat the whole purpose of parallelizing the computation. Hence, your idea #1 is not recommended (I am not quite sure I understood it correctly though; maybe this is not what you were proposing).
I would recommend (also pointed out by others in OP comments) you allocate twice the memory needed to store the grid values and use two pointers that are swapped between iterations: one points to memory holding previous iteration values that are read only, the other one to new iteration values that are write-only. Note that you will only swap the pointers, not actually copy the memory. After your iteration is done, you can copy the final result into desired location.
Yes, you need to synchronize threads between iterations, however in OpenMP this is usually done implicitly simply by opening a parallel region within the iteration loop (there is an implicit barrier at the end of a parallel region):
for (int iter = 0; iter < niter; ++iter)
{
#pragma omp parallel
{
// get range of points for current thread
// loop over thread's points and apply the stencil
}
}
or, using a parallel for construct:
const int np = n*n;
for (int iter = 0; iter < niter; ++iter)
{
#pragma omp parallel for
for (int ip = 0; ip < np; ++ip)
{
const int i = ip / n;
const int j = ip % n;
// apply the stencil to [i,j]
}
}
The second version will auto-distribute the work evenly between the available threads, which is most likely what you want. In the first you have to do it manually.

OpenMP parallel code has not the same output as the serial code

I had to change and extend my algorithm for some signal analysis (using the polyfilterbank technique) and couldn't use my old OpenMP code, but in the new code the results are not as expected (the results in the beginning positions in the array are somehow incorrect in comparison with a serial run [serial code shows the expected result]).
So in the first loop tFFTin I have some FFT data, which I'm multiplicating with a window function.
The goal is that a thread runs the inner loops for each polyphase factor. To avoid locks I use the reduction pragma (no complex reduction is defined by standard, so I use my one where each thread's omp_priv variable gets initialized with the omp_orig [so with tFFTin]). The reason I'm using the ordered pragma is that the results should be added to the output vector in an ordered way.
typedef std::complex<float> TComplexType;
typedef std::vector<TComplexType> TFFTContainer;
#pragma omp declare reduction(complexMul:TFFTContainer:\
transform(omp_in.begin(), omp_in.end(),\
omp_out.begin(), omp_out.begin(),\
std::multiplies<TComplexType>()))\
initializer (omp_priv(omp_orig))
void ConcreteResynthesis::ApplyPolyphase(TFFTContainer& tFFTin, TFFTContainer& tFFTout, TWindowContainer& tWindow, *someparams*) {;
#pragma omp parallel for shared(tWindow) firstprivate(sFFTParams) reduction(complexMul: tFFTin) ordered if(iFFTRawDataLen>cMinParallelSize)
for (int p = 0; p < uPolyphase; ++p) {
int iPolyphaseOffset = p * uFFTLength;
for (int i = 0; i < uFFTLength; ++i) {
tFFTin[i] *= tWindow[iPolyphaseOffset + i]; ///< get FFT input data from raw data
}
#pragma omp ordered
{
//using the overlap and add method
for (int i = 0; i < sFFTParams.uFFTLength; ++i) {
pDataPool->GetFullSignalData(workSignal)[mSignalPos + iPolyphaseOffset + i] += tFFTin[i];
}
}
}
mSignalPos = mSignalPos + mStep;
}
Is there a race condition or something, which makes wrong outputs at the beginning? Or do I have some logic error?
Another issue is, I don't really like my solution with using the ordered pragma, is there a better approach( i tried to use for this also the reduction-model, but the compiler doesn't allow me to use a pointer type for that)?
I think your problem is that you have implemented a very cool custom reduction for tFFTin. But this reduction is applied at the end of the parallel region.
Which is after you use the data in tFFTin. Another thing is what H. Iliev mentions that the second iteration of the outer loop relies on data which is computed in the previous iteration - a classic dependency.
I think you should try parallelizing the inner loops.