OpenMP C++ simple parallel region - inconsistent output

As stated above, I have been trying to craft a simple parallel loop, but it has inconsistent behaviour for different numbers of threads. Here is my code (testable!):
#include <iostream>
#include <stdio.h>
#include <vector>
#include <utility>
#include <string>
using namespace std;
int row = 5, col = 5;
int token = 1;
int ar[20][20] = {0};
int main (void)
{
    unsigned short j_end = 1, k = 1;
    unsigned short mask;
    for (unsigned short i=1; i<=(row + col -1); i++)
    {
        #pragma omp parallel default(none) shared(ar) firstprivate(k, row, col, i, j_end, token) private(mask)
        {
            if(i > row) {
                mask = row;
            }
            else {
                mask = i;
            }
            #pragma omp for schedule(static, 2)
            for(unsigned short j=k; j<=j_end; j++)
            {
                ar[mask][j] = token;
                if(mask > 1) {
                    #pragma omp critical
                    {
                        mask--;
                    }
                }
            } //inner loop - barrier
        }//end parallel
        token++;
        if(j_end == col) {
            k++;
            j_end = col;
        }
        else {
            j_end++;
        }
    } // outer loop
    // print the array
    for (int i = 0; i < row + 2; i++)
    {
        for (int j = 0; j < col + 2; j++)
        {
            cout << ar[i][j] << " ";
        }
        cout << endl;
    }
    return 0;
} // main
I believe most of the code is self-explanatory, but to sum it up: I have two loops, and the inner one iterates through the inverse diagonals of the square matrix ar[row][col] (the row and col variables can be used to change the total size of ar).
Visual aid: desired output for 5x5 ar (serial version)
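(Reconstructed from the serial code above, since the original image is not shown: the filled 5x5 region of ar, rows and columns 1 through 5, is)
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9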
(Note: this correct output is also produced when OMP_NUM_THREADS=1.)
But when OMP_NUM_THREADS=2 or OMP_NUM_THREADS=4, the output no longer matches the serial result.
The serial (and for 1 thread) code is consistent so I don't think the implementation is problematic. Also, given the output of the serial code, there shouldn't be any dependencies in the inner loop.
I have also tried:
Vectorizing
threadprivate counters for the inner loop
But nothing seems to work so far...
Is there a fault in my approach, or did I miss something API-wise that led to this behavior?
Thanks for your time in advance.

Analyzing the algorithm
As you noted, the algorithm itself has no dependencies in the inner or outer loop. An easy way to show this is to move the parallelism "up" to the outer loop so that you can iterate across all the different inverse diagonals simultaneously.
Right now, the main problem with the algorithm you've written is that it's presented as a serial algorithm in both the inner and outer loop. If you're going to parallelize across the inner loop, then mask needs to be handled specially. If you're going to parallelize across the outer loop, then j_end, token, and k need to be handled specially. By "handled specially," I mean they need to be computed independently of the other threads. If you try adding critical regions into your code, you will kill all performance benefits of adding OpenMP in the first place.
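For illustration, here is one way mask could be "handled specially" if you kept the parallelism on the inner loop: a minimal sketch using your own variable names (the answer below parallelizes the outer loop instead), in which the row index is computed from j so that no iteration depends on another thread having decremented mask.
// Sketch only: mask keeps its starting value for this diagonal; the row for
// column j is derived arithmetically, making the iterations independent.
#pragma omp for schedule(static, 2)
for (unsigned short j = k; j <= j_end; j++)
    ar[mask - (j - k)][j] = token;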
Fixing the problem
In the following code, I parallelize over the outer loop. i corresponds to what you call token. That is, it is both the value to be added to the inverse diagonal and the assumed starting length of this diagonal. Note that for this to parallelize correctly, length, startRow, and startCol must be calculated as a function of i independently from other iterations.
Finally note that once the algorithm is re-written this way, the actual OpenMP pragma is incredibly simple. Every variable is assumed to be shared by default because they're all read-only. The only exception is ar in which we are careful never to overwrite another thread's value of the array. All variables that must be private are only created inside the parallel loop and thus are thread-private by definition. Lastly, I've changed the schedule to dynamic to showcase that this algorithm exhibits load-imbalance. In your example if you had 9 threads (the worst case scenario), you can see how the thread assigned to i=5 has to do much more work than the thread assigned to i=1 or i=9.
Example code
#include <iostream>
#include <omp.h>
int row = 5;
int col = 5;
#define MAXSIZE 20
int ar[MAXSIZE][MAXSIZE] = {0};
int main(void)
{
    // What an easy pragma!
    #pragma omp parallel for default(shared) schedule(dynamic)
    for (unsigned short i = 1; i < (row + col); i++)
    {
        // Calculates the length of the current diagonal to consider
        // INDEPENDENTLY from other i iterations!
        unsigned short length = i;
        if (i > row) {
            length -= (i-row);
        }
        if (i > col) {
            length -= (i-col);
        }
        // Calculates the starting coordinate to start at
        // INDEPENDENTLY from other i iterations!
        unsigned short startRow = i;
        unsigned short startCol = 1;
        if (startRow > row) {
            startCol += (startRow-row);
            startRow = row;
        }
        for(unsigned short offset = 0; offset < length; offset++) {
            ar[startRow-offset][startCol+offset] = i;
        }
    } // outer loop
    // print the array
    for (int i = 0; i <= row; i++)
    {
        for (int j = 0; j <= col; j++)
        {
            std::cout << ar[i][j] << " ";
        }
        std::cout << std::endl;
    }
    return 0;
} // main
Final points
I want to leave with a few last points.
If you are only adding parallelism on a small array (row,col < 1e6), you will most likely not get any benefits from OpenMP. On a small array, the algorithm itself will take microseconds, while setting up the threads could take milliseconds... slowing down execution time considerably from your original serial code!
While I did rewrite this algorithm and change around variable names, I tried to keep the spirit of your implementation as best as I could. Thus, the inverse-diagonal scanning and nested loop pattern remains.
There is a better way to parallelize this algorithm to avoid load imbalance, though. If you instead give each thread a row and have it iterate its token value (i.e. row/thread 2 places the numbers 2->6), then each thread will work on exactly the same amount of numbers and you can change the pragma to schedule(static).
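For concreteness, a minimal sketch of that row-per-thread scheme, using the observation that row r of the result simply holds the values r through r+col-1:
// Sketch only: each thread owns whole rows, so every thread writes exactly col values.
#pragma omp parallel for schedule(static)
for (int r = 1; r <= row; r++)
    for (int c = 1; c <= col; c++)
        ar[r][c] = r + c - 1;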
As I mentioned in the comments above, don't use firstprivate when you mean shared. A good rule of thumb is that all read-only variables should be shared.
It is erroneous to assume that getting correct output when running parallel code on 1 thread implies the implementation is correct. In fact, barring disastrous use of OpenMP, you are incredibly unlikely to get the wrong output with only 1 thread. Testing with multiple threads reveals that your previous implementation was not correct.
Hope this helps.
EDIT: The output I get is the same as yours for a 5x5 matrix.

Related

C++ OpenMP: Writing to a matrix inside of for loop slows down the for loop significantly

I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function is a simplified example of something I am doing in a more complicated piece of code; it replicates how writing to a matrix significantly slows down the for loop, and I am trying to figure out why it does so, and whether there are any solutions to it.
#include <vector>
#include <cmath>
#include <cstdint>   // for uint64_t
#include <omp.h>
// Count the number of bits
inline int bitCount(uint64_t n){
    int count = 0;
    while(n){
        n &= (n-1);
        count++;
    }
    return count;
}
void test(){
    int nthreads = omp_get_max_threads();
    omp_set_dynamic(0);
    omp_set_num_threads(nthreads);
    // I need a priority queue per thread
    std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
    std::vector<uint64_t> vals(100,1);
    # pragma omp parallel for shared(mat,vals)
    for(int i = 0; i < 100000000; i++){
        std::vector<double> &tid_vec = mat[omp_get_thread_num()];
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
        }
    }
}
This code runs in about 11 seconds. If I comment out the following line:
tid_vec[j] = total_count;
the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?
Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.
If you comment the line:
tid_vec[j] = total_count;
The compiler will optimize away all the computations whose result is not used. Therefore:
total_count += bitCount(vals[j]);
is optimized away too. If your application's main kernel is optimized away, it makes sense that the program runs much faster.
On the other hand, I would not implement a bit count function myself but rather rely on functionality that is already provided to you. For example, GCC builtin functions include __builtin_popcountll (the unsigned long long variant, which matches your 64-bit values), which does exactly what you are trying to do.
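As a sketch (assuming GCC or Clang), the hand-written loop could simply become a wrapper around the builtin:
#include <cstdint>
// Same interface as before, but delegating to the compiler builtin;
// the "ll" variant matches the 64-bit argument.
inline int bitCount(uint64_t n){
    return __builtin_popcountll(n);
}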
As a bonus: it is way better to work on private data rather than working on different elements of a common array. It improves locality (especially important when memory access is not uniform, i.e. NUMA) and may reduce access contention.
# pragma omp parallel shared(mat,vals)
{
    std::vector<double> local_vec(1000,-INFINITY);
    #pragma omp for
    for(int i = 0; i < 100000000; i++) {
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            local_vec[j] = total_count;
        }
    }
    // Copy local_vec to mat[omp_get_thread_num()]
}

OpenMP: How to utilize recursive function in each thread?

#include <stdio.h>
#include<array>
#include<vector>
#include <omp.h>
std::vector<int> pNum;
std::array<int, 4> arr;
int pGen(int);
int main()
{
    pNum.push_back(2);
    pNum.push_back(3);
    pGen(10);
    for (int i = 0; i < pNum.size(); i++)
    {
        printf("%d \n", pNum[i]);
    }
    printf("top say: %d", pNum.size());
    getchar();
}
int pGen(int ChunkSize)
{
    //
    if (pNum.size() == 50) return 0;
    int i, k, n, id;
    int state = 0;
    //
    #pragma omp parallel for schedule(dynamic) private(k, n, id) num_threads(4)
    for (i = 1; i < pNum.back() * pNum.back(); i++)
    {
        //
        id = omp_get_thread_num();
        n = pNum.back() + i * 2;
        for (k = 1; k < pNum.size(); k++)
        {
            //
            if (n % pNum[k] == 0) break;
            if (n / pNum[k] <= pNum[k])
            {
                //
                #pragma omp critical
                {
                    //
                    if (state == 0)
                    {
                        //
                        state = 1; pNum.push_back(n); printf("id: %d; number: %d \n", id, n); pGen(ChunkSize); break;
                    }
                }
            }
        }
        if (state == 1) break;
    }
}
This is my code above. I am trying to find the first 50 prime numbers with OpenMP scheduling, for each of dynamic, static and guided. I started with dynamic. Somehow I realized I have to use a recursive function, since I can't use a do-while loop in parallel structures.
When I debug the code above, the console opens up and closes down immediately; I can only see "id: 0; number: 5" and an "error: blablabla(something)".
The strange thing is I never get to getchar() and never output the vector I use to store the prime numbers. I think this is about the recursive function. Any other theories?
edit: I happened to catch the error:
this is the error
I don't know if this is significant for your algorithm, but since you add numbers in your pNum vector during the main loop, pNum.back() will change over iterations. Therefore, the boundaries of the parallelised loop will change during the loop itself: for (i = 1; i < pNum.back() * pNum.back(); i++)
This isn't supported by OpenMP. Loops can only be parallelised with OpenMP if they are in Canonical Loop Form. The link explains it in detail, but for you it boils down to the requirement that the boundaries be known and fixed prior to entering the loop:
lb and b: Loop invariant expressions of a type compatible with the type of var
Therefore, your code has an Undefined Behaviour. It may or may not compile, may or may not run and can give whatever result if any (or just reformat your hard drive).
If it is not important that pNum.back() evolves over iterations, then you can simply evaluate it prior to the loop and use this value as upper bound in the for statement. But if it is important, then you'll have to find another method to parallelise your loop.
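A minimal sketch of that first option, hoisting the bound and leaving the loop body unchanged:
// Evaluate the (otherwise evolving) bound once, so the loop is in canonical form.
int last = pNum.back();
#pragma omp parallel for schedule(dynamic) private(k, n, id) num_threads(4)
for (i = 1; i < last * last; i++)
{
    // ... body unchanged ...
}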
Finally, a side note: this algorithm uses nested parallelism, but since you didn't explicitly enable it and nested parallelism is disabled by default, only the outermost call to pGen() will generate OpenMP threads.
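If nested parallelism were actually wanted, it would have to be enabled explicitly, for example (a sketch; newer OpenMP versions prefer omp_set_max_active_levels over the older omp_set_nested):
omp_set_nested(1);   // allow parallel regions inside parallel regions
// or set the environment variable OMP_NESTED=true before running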

How to apply openMP to a C++ function to validate all rows of a sudoku puzzle solution?

I am designing a program that will test to see whether a valid sudoku puzzle solution is given to the program or not. I first designed it in C++ but now I want to try to make it parallel. The program compiles fine without errors.
First I had to figure out a way to deal with using a return statement inside of a structured block. I just decided to make an array of bools that are initialized to true. However, the output from this function is false, and I know for a fact the solution I am submitting is true. I am new to OpenMP and was wondering if anyone could help me out?
I have a feeling the issue is with my variable a getting set back to 0 and maybe also with my other variable nextSudokuNum getting set back to 1.
bool test_rows(int sudoku[9][9])
{
    int i, j, a;
    int nextSudokuNum = 1;
    bool rowReturn[9];
    #pragma omp parallel for private(i)
    for(i = 0; i < 9; i++)
    {
        rowReturn[i] = true;
    }
    #pragma omp parallel for private(i,j) \
        reduction(+: a, nextSudokuNum)
    for(i = 0; i < 9; i++)
    {
        for(j = 0; j < 9; j++)
        {
            a = 0;
            while(sudoku[i][a] != nextSudokuNum) {
                a++;
                if(a > 9) {
                    rowReturn[i] = false;
                }
            }
            nextSudokuNum++;
        }
        nextSudokuNum = 1;
    }
    for(i = 0; i < 9; i++)
    {
        if(rowReturn[i] == false) {
            cout << "Invalid Sudoku Solution(Next Valid Sudoku Number Not Found)" << endl;
            cout << "Check row " << (i+1) << endl;
            return false;
        }
    }
    cout << "Valid sudoku rows(Returning true)" << endl;
    return true;
}
Disclaimer:
First off, do not parallelize very small loops or loops which execute nearly instantaneously. The overhead of creating the threads will dominate the benefit you would get by executing the inner statements of the loop in parallel. So unless each iteration you are parallelizing performs thousands to millions of FLOPs, the serial version of the code will run faster than the parallel version.
Therefore, a better plan for parallelizing your (probable) tasks is to parallelize at a higher level. That is, presumably you are calling test_rows(sudoku), test_columns(sudoku), and test_box(sudoku) from one function somewhere else. What you can do is call these three serial functions in parallel using OpenMP sections, where each of these three calls is a separate OpenMP section. This will only benefit from 3 cores of your CPU, but presumably you are running this on a laptop, so you probably only have 2 or 4 cores anyway.
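A minimal sketch of that higher-level approach, assuming test_columns and test_box exist with the same signature as test_rows:
bool rowsOK, colsOK, boxesOK;
#pragma omp parallel sections
{
    #pragma omp section
    rowsOK = test_rows(sudoku);      // each section runs in its own thread
    #pragma omp section
    colsOK = test_columns(sudoku);
    #pragma omp section
    boxesOK = test_box(sudoku);
}
bool validSolution = rowsOK && colsOK && boxesOK;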
Now to your actual problems:
You are not parallelizing over j, but merely over i. Therefore, you can see that your variable nextSudokuNum is not being reduced; for every i iteration, nextSudokuNum is self-contained. Thus it should be initialized inside the loop and made private in the #pragma omp parallel clause.
Likewise, you are not performing a reduction over a either. For every iteration of i, a is set, compared to, and incremented internally. Again it should be a private variable.
Therefore, your new code should look like:
#pragma omp parallel for private(i,j,a,nextSudokuNum)
for(i = 0; i < 9; i++)
{
    // all private variables must be set internal to the parallel region before being used
    nextSudokuNum = 1;
    for(j = 0; j < 9; j++)
    {
        a = 0;
        while(sudoku[i][a] != nextSudokuNum) {
            a++;
            if(a > 9) {
                rowReturn[i] = false;
            }
        }
        nextSudokuNum++;
    }
}

OpenMP even/odd decomposition of a nested loop

I have part in my code that could be done parallel, so I started to read about openMP and did these introduction examples. Now I am trying to apply it to the following problem, schematically presented here:
Grid.h
class Grid
{
public:
    // has a grid member variable
    std::vector<std::vector<int>> 2Dgrid;
    // modifies the components of the 2Dgrid; no push_back() etc. is used that could possibly disturb the use of openMP
    update_grid(int,int,int,int);
};
Test.h
class Test
{
public:
    Grid grid1;
    Grid grid2;
    update();
    repeat_update();
};
Test.cc
.
.
.
Test::repeat_update() {
    for(int i=0;i<100000;i++)
        update();
}
Test::update() {
    int colIndex = 0;
    int rowIndex = 0;
    int rowIndexPlusOne = rowIndex + 1;
    int colIndexPlusOne = colIndex + 1;
    // DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size()) are the size of the grid
    for (int i = 0; i < DIRECTION_Y; i++) {
        // periodic boundary conditions
        if (rowIndexPlusOne > DIRECTION_Y - 1)
            rowIndexPlusOne = 0;
        // The following could be done parallel!!!
        for (int j = 0; j < DIRECTION_X - 1; j++) {
            grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
            grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
            colIndexPlusOne++;
            colIndex++;
        }
        colIndex = 0;
        colIndexPlusOne = 1;
        rowIndex++;
        rowIndexPlusOne++;
    }
}
.
.
.
The thing is, the updates done in Test::update(...) could be done in a parallel manner, since Grid::update_grid(...) only depends on the nearest neighbours of the grid. So, for example, in the inner loop multiple threads could do the work for colIndex = 0,2,4,..., independently; that would be the even decomposition. After that, the odd indices colIndex = 1,3,5,... could be updated. Then the outer loop iterates one step forward and the updates in direction x could again be done in parallel. I have 16 cores at my disposal and the parallelization could be a nice time saving. But I totally don't have the perspective to see how this could be done, mainly because I don't know how to keep track of colIndex, rowIndex, etc., since #pragma omp parallel for is applied to the i,j indices. I would be grateful if somebody could show me the path out of the darkness.
Without knowing exactly what update_grid(int,int,int,int) does, it's kinda tricky to give a definitive answer. You show an embedded pair of loops of the type
for(int i = 0; i < Y; i++)
{
for(int j = 0; j < X; j++)
{
//...
}
}
and assert that the j loop can be done in parallel. This would be an example of fine grained parallelism. You could alternatively parallelize the i loop, in what would be a more coarse grained parallelization. If the amount of work of each individual thread is roughly equal, the coarse graining method has the advantage of less overhead (assuming that the parallelization of the two loops is equivalent).
There are a few things that you have to be careful of when parallelizing the loops. For starters, you increment colIndexPlusOne and colIndex in the inner loop. If you have multiple threads and a single variable for colIndexPlusOne and colIndex, then each thread will increment the variable and/or have race conditions. You can bypass that in several manners, either giving each thread a copy of the variable, or making the increment atomic or critical, or by removing the dependency of the variable altogether and calculating what it should be for each step of the loop on the fly.
I would start with parallelizing the entire update function as such:
Test::update()
{
    #pragma omp parallel
    {
        int colIndex = 0;
        int colIndexPlusOne = colIndex + 1;
        // DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size()) are the size of the grid
        #pragma omp for
        for (int i = 0; i < DIRECTION_Y; i++)
        {
            int rowIndex = i;
            int rowIndexPlusOne = rowIndex + 1;
            // periodic boundary conditions
            if (rowIndexPlusOne > DIRECTION_Y - 1)
                rowIndexPlusOne = 0;
            // The following could be done parallel!!!
            for (int j = 0; j < DIRECTION_X - 1; j++)
            {
                grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
                grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
                // The following two can be replaced by j and j+1...
                colIndexPlusOne++;
                colIndex++;
            }
            colIndex = 0;
            colIndexPlusOne = 1;
            // No longer needed:
            // rowIndex++;
            // rowIndexPlusOne++;
        }
    }
}
By placing #pragma omp parallel at the beginning, all the variables declared inside it are local to each thread. Also, at the beginning of the i loop, I assigned rowIndex = i, as at least in the code shown, that is the case. The same could be done for the j loop and colIndex.
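For example (a sketch), the inner loop could use j directly, which removes the colIndex/colIndexPlusOne counters entirely:
for (int j = 0; j < DIRECTION_X - 1; j++)
{
    // column indices follow j; the row indices were computed above from i
    grid1.update_grid(i, j, rowIndexPlusOne, j + 1);
    grid2.update_grid(i, j, rowIndexPlusOne, j + 1);
}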

What is the overhead in splitting a for-loop into multiple for-loops, if the total work inside is the same? [duplicate]

This question already has answers here:
Why are elementwise additions much faster in separate loops than in a combined loop?
Performance of breaking apart one loop into two loops
What is the overhead in splitting a for-loop like this,
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
    // some more code
    // even more code
}
into multiple for-loops like this?
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
}
for (i = 0; i < exchanges; i++)
{
    // some more code
}
for (i = 0; i < exchanges; i++)
{
    // even more code
}
The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)
I'm not exactly a low-level programming guru, so it'd be even better if someone could quantify the performance hit in comparison to basic operations, e.g. "Each additional for-loop would cost the equivalent of two int allocations." But, I understand (and wouldn't be surprised) if it's not that simple.
Many thanks in advance.
There are often way too many factors at play... And it's easy to demonstrate both ways:
For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):
for (int c = 0; c < size; c++){
    data[c] *= 10;
    data[c] += 7;
    data[c] &= 15;
}
And this is almost stating the obvious since you need to loop through 3 times instead of once and you make 3 passes over the entire array instead of 1.
On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?
for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}
The opposite is sometimes true due to memory alignment.
What to take from this?
Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.
And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.
"Each additional for-loop would cost the equivalent of two int allocations."
You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.
Not only is the performance context-dependent, it also varies across different processors.
Here's the test code:
#include <time.h>
#include <cstdlib>   // for system()
#include <iostream>
using namespace std;
int main(){
    int size = 10000;
    int *data = new int[size];
    clock_t start = clock();
    for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
        for (int c = 0; c < size; c++){
            data[c] *= 10;
            data[c] += 7;
            data[c] &= 15;
        }
#else
        for (int c = 0; c < size; c++){
            data[c] *= 10;
        }
        for (int c = 0; c < size; c++){
            data[c] += 7;
        }
        for (int c = 0; c < size; c++){
            data[c] &= 15;
        }
#endif
    }
    clock_t end = clock();
    cout << (double)(end - start) / CLOCKS_PER_SEC << endl;
    system("pause");
}
Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds
Processors prefer to have a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload.
Based on the reloading of the instruction pipeline, the first method would be faster, but not significantly. You would add at least 2 new branch instructions by splitting.
A faster optimization is to unroll the loop. Unrolling the loop tries to improve the ratio of data instructions to branch instructions by performing more instructions inside the loop before branching to the top of the loop.
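For example, a 4x unrolling of the combined loop from the test code above might look like this (a sketch, assuming size is a multiple of 4):
for (int c = 0; c < size; c += 4){
    data[c]     *= 10;  data[c]     += 7;  data[c]     &= 15;
    data[c + 1] *= 10;  data[c + 1] += 7;  data[c + 1] &= 15;
    data[c + 2] *= 10;  data[c + 2] += 7;  data[c + 2] &= 15;
    data[c + 3] *= 10;  data[c + 3] += 7;  data[c + 3] &= 15;
}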
Another significant performance optimization is to organize the data so it fits into the processor's cache. So, for example, you could have inner loops that each process one cache-sized block of data while the outer loop loads new blocks into the cache.
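As a sketch, processing the array in cache-sized blocks keeps the readability of separate loops while each block stays hot in cache (BLOCK is an assumed, tunable size):
const int BLOCK = 4096;   // elements per block; tune to your cache size
for (int b = 0; b < size; b += BLOCK){
    int end = (b + BLOCK < size) ? b + BLOCK : size;
    for (int c = b; c < end; c++){ data[c] *= 10; }
    for (int c = b; c < end; c++){ data[c] += 7; }
    for (int c = b; c < end; c++){ data[c] &= 15; }
}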
These optimizations should only be applied after the program runs correctly and robustly and the environment demands more performance. The environment is defined as observers (animation / movies), users (waiting for a response) or hardware (performing operations before a critical time event). Any other purpose is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.
This will give you a good indication of whether or not one version is faster than another.
#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>
const int iterations = 100;
namespace
{
    const int exchanges = 200;
    template<typename TTest>
    void Test(const std::string &name, TTest &&test)
    {
        typedef std::chrono::high_resolution_clock Clock;
        typedef std::chrono::duration<float, std::milli> ms;
        std::array<float, iterations> timings;
        for (auto i = 0; i != iterations; ++i)
        {
            auto t0 = Clock::now();
            test();
            timings[i] = ms(Clock::now() - t0).count();
        }
        // Use a float initial value so std::accumulate doesn't truncate to int
        auto avg = std::accumulate(timings.begin(), timings.end(), 0.0f) / iterations;
        std::cout << "Average time, " << name << ": " << avg << std::endl;
    }
}
int main()
{
    Test("single loop",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
                // some more code
                // even more code
            }
        });
    Test("separated loops",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // some more code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // even more code
            }
        });
}
The thing is quite simple. The first code is like taking a single lap on a race track, and the other is like running a full 3-lap race, so more time is required to take three laps than one. However, if the loops do things that must happen in sequence and depend on each other, then the second form is what you need: for example, if the first loop does some calculations and the second loop does some work with those results, then the loops must run in sequence; otherwise they need not.