#include <stdio.h>
#include <array>
#include <vector>
#include <omp.h>
std::vector<int> pNum;
std::array<int, 4> arr;
int pGen(int);
int main()
{
    pNum.push_back(2);
    pNum.push_back(3);
    pGen(10);
    for (int i = 0; i < (int)pNum.size(); i++)
    {
        printf("%d \n", pNum[i]);
    }
    printf("total count: %d", (int)pNum.size());
    getchar();
}
int pGen(int ChunkSize)
{
    if (pNum.size() == 50) return 0;
    int i, k, n, id;
    int state = 0;

    #pragma omp parallel for schedule(dynamic) private(k, n, id) num_threads(4)
    for (i = 1; i < pNum.back() * pNum.back(); i++)
    {
        id = omp_get_thread_num();
        n = pNum.back() + i * 2;
        for (k = 1; k < (int)pNum.size(); k++)
        {
            if (n % pNum[k] == 0) break;
            if (n / pNum[k] <= pNum[k])
            {
                #pragma omp critical
                {
                    if (state == 0)
                    {
                        state = 1;
                        pNum.push_back(n);
                        printf("id: %d; number: %d \n", id, n);
                        pGen(ChunkSize);
                        break;
                    }
                }
            }
        }
        if (state == 1) break;
    }
    return 0;
}
This is my code above. I am trying to find the first 50 prime numbers with each of the OpenMP scheduling kinds: dynamic, static, and guided. I started with dynamic. Somewhere along the way I realized I have to use a recursive function, since I can't use a do-while loop in parallel constructs.
When I debug the code above, the console opens up and closes down immediately; I can only see "id: 0; number: 5" and an error message.
The strange thing is that I never get to getchar() and never output the vector I use to store the prime numbers. I think this is about the recursive function. Any other theories?
edit: I happened to catch the error; the message is in the screenshot linked in the original post.
I don't know if this is significant for your algorithm, but since you add numbers in your pNum vector during the main loop, pNum.back() will change over iterations. Therefore, the boundaries of the parallelised loop will change during the loop itself: for (i = 1; i < pNum.back() * pNum.back(); i++)
This isn't supported by OpenMP. Loops can only be parallelised with OpenMP if they are in Canonical Loop Form. The link explains it in detail, but for you it boils down to this: the boundaries must be known and fixed prior to entering the loop:
lb and b: Loop invariant expressions of a type compatible with the type of var
Therefore, your code has undefined behaviour. It may or may not compile, may or may not run, and can give whatever result, if any (or just reformat your hard drive).
If it is not important that pNum.back() evolves over iterations, then you can simply evaluate it prior to the loop and use this value as upper bound in the for statement. But if it is important, then you'll have to find another method to parallelise your loop.
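A minimal sketch of that hoisting (assuming the evolving bound is not important to you):
int upper = pNum.back() * pNum.back();  // evaluated once: now a loop-invariant bound
#pragma omp parallel for schedule(dynamic) private(k, n, id) num_threads(4)
for (int i = 1; i < upper; i++)
{
    // ... loop body as before ...
}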
Finally, a side note: this algorithm uses nested parallelism (pGen() is called recursively from within a parallel region), but you didn't explicitly allow it, and since nested parallelism is disabled by default, only the outermost call to pGen() will generate OpenMP threads.
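If you did want nested regions to fork their own threads, that has to be enabled explicitly, for example:
omp_set_max_active_levels(2);  // allow 2 nested levels; omp_set_nested(1) is the older, now deprecated, call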
I am currently trying to find out whether a clique of size k exists in an undirected graph, using OpenMP to make the code run faster.
This is the code I am trying to parallelize:
bool Graf::doesCliqueSizeKExistParallel(int k) {
    if (k > n) return false;
    clique_parallel.resize(k);
    bool foundClique = false;
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        if (degree[i] >= k - 1) {
            clique_parallel[0] = i;
            #pragma omp task
            doesCliqueSizeKExistParallelRecursive(i + 1, 1, k, foundClique);
        }
    }
    return foundClique;
}

void Graf::doesCliqueSizeKExistParallelRecursive(int node, int currentLength, int k, bool& foundClique) {
    for (int j = node; j < n; j++) {
        if (degree[j] >= k - 1) {
            clique_parallel[currentLength - 1] = j;
            bool isClique = true;
            for (int i = 0; i < currentLength; i++) {
                for (int l = i + 1; l < currentLength; l++) {
                    if (!neighbors[clique_parallel[i]][clique_parallel[l]]) { isClique = false; break; }
                }
                if (!isClique) break;
            }
            if (isClique) {
                if (currentLength < k)
                    doesCliqueSizeKExistParallelRecursive(j + 1, currentLength + 1, k, foundClique);
                else {
                    foundClique = true;
                    return;
                }
            }
        }
    }
}
The problem here, I suppose, could be that the variables degree, neighbors, and clique_parallel are all shared, so when one thread is writing to one of these variables, another thread may write to it at the same time. The only solution I tried was passing these three variables by copy to the function, so that each thread has its own copy, and it didn't work. I am trying not to use #pragma omp taskwait, because that would just be the sequential algorithm and there wouldn't be any speedup. Currently I am lost: I don't know how to fix this issue (if it is an issue) and don't know what else to try, or how to avoid sharing these variables between threads.
Here is the class Graf:
class Graf {
    int n;                          // number of nodes
    vector<vector<int>> neighbors;  // adjacency matrix
    vector<int> degree;             // number of nodes each node is adjacent to
    vector<int> clique_parallel;
    bool directGraph;
    void doesCliqueSizeKExistParallelRecursive(int node, int currentLength, int k, bool& foundClique);
public:
    Graf(int n, bool directGraph = false);
    void addEdge(int i, int j);
    void printGraf();
    bool doesCliqueSizeKExistParallel(int k);
};
So my question is: is the problem here that the threads are fighting over the shared variables, or could it be something else? Any help is useful, and if you have any questions regarding the code, I'll answer.
Your observation that omp taskwait turns this into the sequential algorithm is sort of correct. It's in fact worse: it turns your depth-first search algorithm into effectively a breadth-first one, which would traverse the whole search space.
Ok. First of all, use a taskgroup, which has an implicit wait at its end for all the tasks generated in the for loop.
Next, let your tasks return either false, or the values of the clique found.
Now for the big trick: once one task has found a solution, call omp cancel taskgroup, which makes you leave the for loop, and you keep the values you found! This cancel kills all other tasks (and their recursively spawned tasks) at that level. And now the magic of recursion kicks in and all the groups get cancelled higher up the tree.
I once figured this out for another recursive search problem, but I'm sure you can translate it to your problem: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#Treetraversal
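For illustration, here is a minimal sketch of that taskgroup + cancel pattern on a toy recursive search (illustrative names, not the clique code; cancellation must be enabled with OMP_CANCELLATION=true at run time, otherwise the cancel is a no-op):

#include <omp.h>
#include <cstdio>

// Search a binary tree of the given depth for a leaf equal to target.
bool search(int depth, int value, int target) {
    if (depth == 0) return value == target;
    bool found = false;
    #pragma omp taskgroup               // implicit wait for all tasks spawned inside
    {
        for (int i = 0; i < 2; i++) {
            #pragma omp task shared(found)
            {
                if (search(depth - 1, 2 * value + i, target)) {
                    #pragma omp atomic write
                    found = true;
                    #pragma omp cancel taskgroup   // kills the sibling tasks
                }
            }
        }
    }
    return found;   // a true result propagates the cancel up the recursion
}

int main() {
    bool found = false;
    #pragma omp parallel
    #pragma omp single
    found = search(10, 1, 1234);
    printf("found: %s\n", found ? "yes" : "no");
    return 0;
}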
I am trying to make a program where you can set the number of threads you want, and it will parallelize the selection sort algorithm over the given data with that many threads. I know I should just use another algorithm in this case, but it's for educational purposes only. I run into a problem when parallelizing the inner loop of the selection sort: some close numbers are left unsorted. The whole array is sorted apart from those few pairs of numbers, and I can't find out why.
int* selectionSort(int arr[], int size, int numberOfThreads)
{
    int i, j;
    int me, n, min_idx;
    bool canSwap = false;

    #pragma omp parallel num_threads(numberOfThreads) private(i, j, me, n)
    {
        me = omp_get_thread_num();
        n = omp_get_num_threads();
        printf("Hello from %d/%d\n", me, n);

        for (i = 0; i < size - 1; i++) {
            min_idx = i;
            canSwap = true;
            #pragma omp barrier
            #pragma omp for
            for (j = i + 1; j < size; j++) {
                if (arr[j] < arr[min_idx])
                    min_idx = j;
                //printf("I am %d processing %d,%d\n", me, i, j);
            }
            printf("Min value %d ---- %d \n", arr[min_idx], min_idx);
            #pragma omp critical(swap)
            if (canSwap)
            {
                swap(&arr[min_idx], &arr[i]);
                canSwap = false;
            }
            #pragma omp barrier
        }
    }
    return arr;
}
I found out that the problem is that you can't really parallelize this algorithm (at least not the way I'm doing it), because I'm comparing arr[j] with arr[min_idx]:
min_idx can get changed at just the wrong moment: one thread finishes evaluating if (arr[j] < arr[min_idx]), and right after that another thread changes min_idx, which can make the just-completed comparison no longer true.
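One way around that race (a sketch of a fix, not necessarily the best parallel sort): let each thread compute a local minimum over its chunk of the j loop, then combine the per-thread results in a critical section, so the shared min_idx is only ever read and written inside the critical:

#include <omp.h>
#include <utility>

void selectionSortParallel(int* arr, int size, int numberOfThreads)
{
    for (int i = 0; i < size - 1; i++) {
        int min_idx = i;
        #pragma omp parallel num_threads(numberOfThreads)
        {
            int local_min = i;             // this thread's best index so far
            #pragma omp for nowait
            for (int j = i + 1; j < size; j++)
                if (arr[j] < arr[local_min])
                    local_min = j;
            #pragma omp critical           // combine the per-thread minima
            if (arr[local_min] < arr[min_idx])
                min_idx = local_min;
        }                                  // implicit barrier: min_idx is final here
        std::swap(arr[min_idx], arr[i]);
    }
}

Note that opening a parallel region on every outer iteration adds overhead, so this only pays off for fairly large arrays.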
As stated above, I have been trying to craft a simple parallel loop, but it behaves inconsistently for different numbers of threads. Here is my code (testable!):
#include <iostream>
#include <stdio.h>
#include <vector>
#include <utility>
#include <string>

using namespace std;

int row = 5, col = 5;
int token = 1;
int ar[20][20] = {0};

int main (void)
{
    unsigned short j_end = 1, k = 1;
    unsigned short mask;

    for (unsigned short i = 1; i <= (row + col - 1); i++)
    {
        #pragma omp parallel default(none) shared(ar) firstprivate(k, row, col, i, j_end, token) private(mask)
        {
            if (i > row) {
                mask = row;
            }
            else {
                mask = i;
            }

            #pragma omp for schedule(static, 2)
            for (unsigned short j = k; j <= j_end; j++)
            {
                ar[mask][j] = token;
                if (mask > 1) {
                    #pragma omp critical
                    {
                        mask--;
                    }
                }
            } // inner loop - barrier
        } // end parallel

        token++;
        if (j_end == col) {
            k++;
            j_end = col;
        }
        else {
            j_end++;
        }
    } // outer loop

    // print the array
    for (int i = 0; i < row + 2; i++)
    {
        for (int j = 0; j < col + 2; j++)
        {
            cout << ar[i][j] << " ";
        }
        cout << endl;
    }
    return 0;
} // main
I believe most of the code is self-explanatory, but to sum it up: I have two loops, and the inner one iterates through the inverse diagonals of the square matrix ar[row][col] (the row and col variables can be used to change the total size of ar).
Visual aid: desired output for 5x5 ar (serial version):
0 0 0 0 0 0 0
0 1 2 3 4 5 0
0 2 3 4 5 6 0
0 3 4 5 6 7 0
0 4 5 6 7 8 0
0 5 6 7 8 9 0
0 0 0 0 0 0 0
(Note: this is what I get when OMP_NUM_THREADS=1 too.)
But when OMP_NUM_THREADS=2 or OMP_NUM_THREADS=4, some entries end up in the wrong cells.
The serial code (and the one-thread run) is consistent, so I don't think the implementation itself is problematic. Also, given the output of the serial code, there shouldn't be any dependencies in the inner loop.
I have also tried:
Vectorizing
threadprivate counters for the inner loop
But nothing seems to work so far...
Is there a fault in my approach, or did I miss something API-wise that led to this behavior?
Thanks for your time in advance.
Analyzing the algorithm
As you noted, the algorithm itself has no dependencies in the inner or outer loop. An easy way to show this is to move the parallelism "up" to the outer loop so that you can iterate across all the different inverse diagonals simultaneously.
Right now, the main problem with the algorithm you've written is that it's presented as a serial algorithm in both the inner and outer loop. If you're going to parallelize across the inner loop, then mask needs to be handled specially. If you're going to parallelize across the outer loop, then j_end, token, and k need to be handled specially. By "handled specially," I mean they need to be computed independently of the other threads. If you try adding critical regions into your code, you will kill all performance benefits of adding OpenMP in the first place.
Fixing the problem
In the following code, I parallelize over the outer loop. i corresponds to what you call token. That is, it is both the value to be added to the inverse diagonal and the assumed starting length of this diagonal. Note that for this to parallelize correctly, length, startRow, and startCol must be calculated as a function of i independently from other iterations.
Finally note that once the algorithm is re-written this way, the actual OpenMP pragma is incredibly simple. Every variable is assumed to be shared by default because they're all read-only. The only exception is ar in which we are careful never to overwrite another thread's value of the array. All variables that must be private are only created inside the parallel loop and thus are thread-private by definition. Lastly, I've changed the schedule to dynamic to showcase that this algorithm exhibits load-imbalance. In your example if you had 9 threads (the worst case scenario), you can see how the thread assigned to i=5 has to do much more work than the thread assigned to i=1 or i=9.
Example code
#include <iostream>
#include <omp.h>

int row = 5;
int col = 5;
#define MAXSIZE 20
int ar[MAXSIZE][MAXSIZE] = {0};

int main(void)
{
    // What an easy pragma!
    #pragma omp parallel for default(shared) schedule(dynamic)
    for (unsigned short i = 1; i < (row + col); i++)
    {
        // Calculates the length of the current diagonal to consider
        // INDEPENDENTLY from other i iterations!
        unsigned short length = i;
        if (i > row) {
            length -= (i - row);
        }
        if (i > col) {
            length -= (i - col);
        }

        // Calculates the starting coordinate to start at
        // INDEPENDENTLY from other i iterations!
        unsigned short startRow = i;
        unsigned short startCol = 1;
        if (startRow > row) {
            startCol += (startRow - row);
            startRow = row;
        }

        for (unsigned short offset = 0; offset < length; offset++) {
            ar[startRow - offset][startCol + offset] = i;
        }
    } // outer loop

    // print the array
    for (int i = 0; i <= row; i++)
    {
        for (int j = 0; j <= col; j++)
        {
            std::cout << ar[i][j] << " ";
        }
        std::cout << std::endl;
    }
    return 0;
} // main
Final points
I want to leave with a few last points.
If you are only adding parallelism on a small array (row,col < 1e6), you will most likely not get any benefits from OpenMP. On a small array, the algorithm itself will take microseconds, while setting up the threads could take milliseconds... slowing down execution time considerably from your original serial code!
While I did rewrite this algorithm and change around variable names, I tried to keep the spirit of your implementation as best as I could. Thus, the inverse-diagonal scanning and nested loop pattern remains.
There is a better way to parallelize this algorithm to avoid load imbalance, though: if instead you give each thread a row and have it iterate its token value (i.e. row/thread 2 places the numbers 2->6), then each thread works on exactly the same number of cells and you can change the pragma to schedule(static).
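A sketch of that row-based variant (my reading of the suggestion, using the same globals as above): row r holds the values r through r+col-1, so every thread writes exactly col cells:

#pragma omp parallel for schedule(static)
for (int r = 1; r <= row; r++) {
    for (int c = 1; c <= col; c++) {
        ar[r][c] = r + c - 1;   // the token of the inverse diagonal through (r, c)
    }
}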
As I mentioned in the comments above, don't use firstprivate when you mean shared. A good rule of thumb is that all read-only variables should be shared.
It is erroneous to assume that getting correct output when running parallel code on 1 thread implies the implementation is correct. In fact, barring disastrous use of OpenMP, you are incredibly unlikely to get the wrong output with only 1 thread. Testing with multiple threads reveals that your previous implementation was not correct.
Hope this helps.
EDIT: The output I get is the same as yours for a 5x5 matrix.
I am designing a program that will test to see whether a valid sudoku puzzle solution is given to the program or not. I first designed it in C++ but now I want to try to make it parallel. The program compiles fine without errors.
First I had to figure out a way to deal with using a return statement inside of a structured block. I just decided to make an array of bools that are initialized to true. However, the output from this function is false, and I know for a fact that the solution I am submitting is valid. I am new to OpenMP and was wondering if anyone could help me out?
I have a feeling the issue is with my variable a getting set back to 0 and maybe also with my other variable nextSudokuNum getting set back to 1.
bool test_rows(int sudoku[9][9])
{
    int i, j, a;
    int nextSudokuNum = 1;
    bool rowReturn[9];

    #pragma omp parallel for private(i)
    for (i = 0; i < 9; i++)
    {
        rowReturn[i] = true;
    }

    #pragma omp parallel for private(i,j) \
        reduction(+: a, nextSudokuNum)
    for (i = 0; i < 9; i++)
    {
        for (j = 0; j < 9; j++)
        {
            a = 0;
            while (sudoku[i][a] != nextSudokuNum) {
                a++;
                if (a > 9) {
                    rowReturn[i] = false;
                }
            }
            nextSudokuNum++;
        }
        nextSudokuNum = 1;
    }

    for (i = 0; i < 9; i++)
    {
        if (rowReturn[i] == false) {
            cout << "Invalid Sudoku Solution(Next Valid Sudoku Number Not Found)" << endl;
            cout << "Check row " << (i+1) << endl;
            return false;
        }
    }
    cout << "Valid sudoku rows(Returning true)" << endl;
    return true;
}
Disclaimer:
First off, do not parallelize very small loops or loops which execute nearly instantaneously. The overhead of creating the threads will dominate the benefit you would get by executing the inner statements of the loop in parallel. So unless each iteration you are parallelizing performs thousands to millions of floating-point operations, the serial version of the code will run faster than the parallel version.
Therefore, a better plan is to parallelize your (probable) tasks at a higher level. That is, presumably you are calling test_rows(sudoku), test_columns(sudoku), and test_box(sudoku) from one function somewhere else. What you can do is call these three serial functions in parallel using OpenMP sections, where each of these three calls is a separate OpenMP section. This will only use 3 cores of your CPU, but presumably you are doing this on your laptop anyway, so you probably only have 2 or 4 cores.
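A sketch of that higher-level parallelism (assuming test_columns and test_box exist with the same signature as test_rows, as the question implies):

bool rows_ok, cols_ok, box_ok;
#pragma omp parallel sections
{
    #pragma omp section
    rows_ok = test_rows(sudoku);
    #pragma omp section
    cols_ok = test_columns(sudoku);
    #pragma omp section
    box_ok = test_box(sudoku);
}
bool valid = rows_ok && cols_ok && box_ok;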
Now to your actual problems:
You are not parallelizing over j, but merely over i. Therefore, you can see that your variable nextSudokuNum is not being reduced; for every i iteration, nextSudokuNum is self-contained. Thus it should be initialized inside the loop and made private in the #pragma omp parallel clause.
Likewise, you are not performing a reduction over a either. For every iteration of i, a is set, compared to, and incremented internally. Again it should be a private variable.
Therefore, your new code should look like:
#pragma omp parallel for private(i,j,a,nextSudokuNum)
for (i = 0; i < 9; i++)
{
    // all private variables must be set inside the parallel region before being used
    nextSudokuNum = 1;
    for (j = 0; j < 9; j++)
    {
        a = 0;
        while (sudoku[i][a] != nextSudokuNum) {
            a++;
            if (a >= 9) {              // guard before the next sudoku[i][a] read;
                rowReturn[i] = false;  // break so an invalid row cannot loop forever
                break;
            }
        }
        nextSudokuNum++;
    }
}
I have a 2D image where I want to count all colors and store the result in an array. I know the number of colors, so I can set the size of the array beforehand. My problem is that the counting takes too long. How can I speed up the counting with OpenMP?
My current serial code is
std::vector<int> ref_color_num_thread;
ref_color_num.resize(ref_color.size());
std::fill(ref_color_num.begin(), ref_color_num.end(), 0);
ref_color_num_thread.resize(ref_color.size());
std::fill(ref_color_num_thread.begin(), ref_color_num_thread.end(), 0);

for (int i = 0; i < image.width(); i++)
{
    for (int j = 0; j < image.height(); j++)
    {
        for (int k = 0; k < (int)ref_color.size(); k++)
        {
            if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
                ref_color_num_thread[k]++;
        }
    }
}
My first approach was to put #pragma omp parallel for at each loop (one at a time), but every time I get a program crash because of wrong memory access. Do I have to use private() for my vector?
What you're doing is filling a histogram of your colors. This is equivalent to doing an array reduction in C/C++ with OpenMP. Classic OpenMP has no built-in support for this in C/C++ (it does in Fortran, where the array size is known; in C/C++ it's only known for static arrays), although OpenMP 4.5 later added array-section reductions. In any case, it's easy to do an array reduction in C/C++ with OpenMP yourself.
#pragma omp parallel
{
    std::vector<int> ref_color_num_thread_private(ref_color.size(), 0);
    #pragma omp for
    for (int i = 0; i < image.width(); i++) {
        for (int j = 0; j < image.height(); j++) {
            for (int k = 0; k < (int)ref_color.size(); k++) {
                if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
                    ref_color_num_thread_private[k]++;
            }
        }
    }
    #pragma omp critical
    {
        for (int i = 0; i < (int)ref_color.size(); i++) {
            ref_color_num_thread[i] += ref_color_num_thread_private[i];
        }
    }
}
I went into a lot more detail about this here: Fill histograms (array reduction) in parallel with OpenMP without using a critical section.
There I showed how to do an array reduction without a critical section, but it's a lot more tricky. You should test the first case and see if it works well for you first. As long as the number of colors (ref_color.size()) is small compared to the number of pixels, it should parallelize well. Otherwise, you might need to try the second case without the critical section.
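As an aside: if your compiler supports OpenMP 4.5 or later, the manual pattern above can be written as an array-section reduction (a sketch, reducing straight into the result vector):

int* counts = ref_color_num_thread.data();
int ncolors = (int)ref_color.size();
#pragma omp parallel for reduction(+ : counts[:ncolors])
for (int i = 0; i < image.width(); i++)
    for (int j = 0; j < image.height(); j++)
        for (int k = 0; k < ncolors; k++)
            if (image(i, j, 0, 0) == ref_color[k].R &&
                image(i, j, 0, 1) == ref_color[k].G &&
                image(i, j, 0, 2) == ref_color[k].B)
                counts[k]++;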
There is a race condition if either of the two outer loops (i or j) is parallelized, because the inner loop then updates the same vector entries (k) from several threads. I think your crash is because of that.
You have to restructure your program. It is not trivial, but one idea is that each thread uses a local copy of the ref_color_num_thread vector. Once the computation is finished, you can sum up all the vectors.
If k is large enough to provide enough parallelism, you could exchange the loops. Instead of "i,j,k" you could iterate in the order "k,i,j". If I'm not mistaken, there are no violated dependencies. Then you can parallelize the outer k loop, and let the inner i and j loops execute sequentially.
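A sketch of that loop exchange (using the question's variables; each k is then written by exactly one thread, so there is no race):

#pragma omp parallel for
for (int k = 0; k < (int)ref_color.size(); k++) {
    int count = 0;   // private per-color counter
    for (int i = 0; i < image.width(); i++)
        for (int j = 0; j < image.height(); j++)
            if (image(i, j, 0, 0) == ref_color[k].R &&
                image(i, j, 0, 1) == ref_color[k].G &&
                image(i, j, 0, 2) == ref_color[k].B)
                count++;
    ref_color_num_thread[k] = count;
}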
Update:
#pragma omp for also supports reductions, for example:
#pragma omp for reduction(+ : nSum)
Here is a link to some documentation.
Maybe that can help you to restructure your program.