OpenMP collapse gives wrong results - c++

I have an 3D array z, where every element has the value 1.
Now I do:
#pragma omp parallel for collapse(3) shared(z)
for (int i=0; i < SIZE; ++i) {
for (int j=0; j < SIZE; ++j) {
for (int k=0; k < SIZE; ++k) {
for (int n=0; n < ITERATIONS-1; ++n) {
z[i][j][k] += 1;
}
}
}
}
This should add ITERATIONS to each element and it does. If I then change the collapse(3) to collapse(4) (because there are 4 for-loops) I don't get the right result.
Shouldn't I be able to collapse all four loops?

The issue is that the 4th loop isn't parallelisable the same way the 3 first are. Just to convince yourself, look at it with only the last loop in mind. It would become:
int zz = z[i][j][k];
for (int n=0; n < ITERATIONS-1; ++n) {
zz += 1;
}
z[i][j][k] = zz;
In order to parallelise it, you would need to add a reduction(+:zz) directive, right?
Well, same story for your collapse(4). But adding reduction(+:z), if all possible which I'm not sure, would raise some issues:
The reduction clause for arrays in C or C++ is only supported for OpenMP 4.5 onwards, and I don't know of any compiler supporting it at the moment (although I'm sure some do).
It would probably make the code much slower anyway, due to the complex mechanism of managing the reduction aspect.
So bottom line is: just stick to collapse(3) or less as you need, or parallelise you loop differently.

Related

How to calculate Matrix efficiently in C++?

I am new to C++ and programming so I think I am making inefficient codes.
I was wondering whether there is any way I can speed up the matrix calculation process.
For example, this is the sample code I write which finds the maximum differences(in absolute value) between 3d array 'V' and 'Vnew'.
First, I take subtraction.
And then, I put the value of tempdiff[0][0][0] to 'dif'
Then, I compare 'dif' and tempdiff[i][j][k] and replace if the latter is larger than the former.
This is just a part of my code and there are lots of matrix calculations inside so that I have too many 'for' statements.
So I was wondering whether there is any way I could avoid using 'for' in the matrix calculations.
Thanks in advance.
for (int i = 0; i < Na; i++) {
for (int j = 0; j < Nd; j++) {
for (int k = 0; k < Ny; k++) {
tempdiff[i][j][k] = abs(V[i][j][k] - Vnew[i][j][k]);
}
}
}
dif = tempdiff[0][0][0];
for (int i = 0; i < Na; i++) {
for (int j = 0; j < Nd; j++) {
for (int k = 0; k < Ny; k++) {
if (tempdiff[i][j][k] > dif) {
dif = tempdiff[i][j][k];
}
else {
dif = dif;
}
}
}
}
There's not much you can do with the for loops, as the maximum difference can locate at all possible places. You have already succeeded in iterating the array in the correct, linear, order.
Compilers are generally quite efficient in optimising, but they apparently fail to flatten a contiguous array, such as float V[Na][Nd][Ny];. After you flatten it manually to float V[Na*Nd*Ny], at least clang can auto-vectorise and produce SIMD code for x64 and arm.
A further optimisation is to avoid making this in two steps, as the total memory throughput is exactly doubled with the temporary array compared to a one-pass solution.
I was assuming your matrices are of type float -- if you can select int, gcc can auto-vectorise this as well (relates to NaN handling); furthermore int16_t or int8_t types are even quicker to evaluate, as more operations can be packed to a single SIMD instruction.

open mp three for loops with reduction

i need to multiply two 10x10 matrices using open mp. I decided to split the rows of one matrice into groups of 3rows,3 rows and 4 rows. how do i fix this code for the first three rows ?
#pragma omg parallel for reduction(+:m[p][q])
{
for (p = 0; p < 3; p++)
for (q = 0; q < 10; q++)
for (k = 0; k < 10; ++k)
{
m[p][q] += l[p][k] * o[k][q];
}
}
For a start - don't split the matrix yourself, but let OpenMP take care of sharing the work in the loops, e.g.
#pragma omg parallel for
{
for (p = 0; p < 10; p++)
for (q = 0; q < 10; q++)
for (k = 0; k < 10; ++k)
{
m[p][q] += l[p][k] * o[k][q];
}
}
In this code there is no need for a reduction because all concurrent write operations happen to different elements of m. Even if you collapse(2) the first two loops, you are still fine in that regard.
That said, optimizing matrix multiplication is an immensely complex topic on modern hardware. Parallelizing it even more so. If you want to get performance, use a BLAS implementation that is optimized for your architecture. If you want to learn - I suggest you start with the serial implementation and then go on parallelizing it. There plenty of educational material available for either.

How to optimally parallelize nested loops?

I'm writing a program that should run both in serial and parallel versions. Once I get it to actually do what it is supposed to do I started trying to parallelize it with OpenMP (compulsory).
The thing is I can't find documentation or references on when to use what #pragma. So I am trying my best at guessing and testing. But testing is not going fine with nested loops.
How would you parallelize a series of nested loops like these:
for(int i = 0; i < 3; ++i){
for(int j = 0; j < HEIGHT; ++j){
for(int k = 0; k < WIDTH; ++k){
switch(i){
case 0:
matrix[j][k].a = matrix[j][k] * someValue1;
break;
case 1:
matrix[j][k].b = matrix[j][k] * someValue2;
break;
case 2:
matrix[j][k].c = matrix[j][k] * someValue3;
break;
}
}
}
}
HEIGHT and WIDTH are usually the same size in the tests I have to run. Some test examples are 32x32 and 4096x4096.
matrix is an array of custom structs with attributes a, b and c
someValue is a double
I know that OpenMP is not always good for nested loops but any help is welcome.
[UPDATE]:
So far I've tried unrolling the loops. It boosts performance but am I adding unnecesary overhead here? Am I reusing threads? I tried getting the id of the threads used in each for but didn't get it right.
#pragma omp parallel
{
#pragma omp for collapse(2)
for (int j = 0; j < HEIGHT; ++j) {
for (int k = 0; k < WIDTH; ++k) {
//my previous code here
}
}
#pragma omp for collapse(2)
for (int j = 0; j < HEIGHT; ++j) {
for (int k = 0; k < WIDTH; ++k) {
//my previous code here
}
}
#pragma omp for collapse(2)
for (int j = 0; j < HEIGHT; ++j) {
for (int k = 0; k < WIDTH; ++k) {
//my previous code here
}
}
}
[UPDATE 2]
Apart from unrolling the loop I have tried parallelizing the outer loop (worst performance boost than unrolling) and collapsing the two inner loops (more or less same performance boost as unrolling). This are the times I am getting.
Serial: ~130 milliseconds
Loop unrolling: ~49 ms
Collapsing two innermost loops: ~55 ms
Parallel outermost loop: ~83 ms
What do you think is the safest option? I mean, which should be generally the best for most systems, not only my computer?
The problem with OpenMP is that it's very high-level, meaning that you can't access low-level functionality, such as spawning the thread, and then reusing it. So let me make it clear what you can and what you can't do:
Assuming you don't need any mutex to protect against race conditions, here are your options:
You parallelize your outer-most loop, and that will use 3 threads, and that's the most peaceful solution you're gonna have
You parallelize the first inner loop with, and then you'll have a performance boost only if the overhead of spawning a new thread for every WIDTH element is much smaller the efforts required to perform the most inner loop.
Parallelizing the most inner loop, but this is the worst solution in the world, because you'll respawn the threads 3*HEIGHT times. Never do that!
Not use OpenMP, and use something low-level, such as std::thread, where you can create your own Thread Pool, and push all the operations you want to do in a queue.
Hope this helps to put things in perspective.
Here's another option, one which recognises that distributing the iterations of the outermost loop when there are only 3 of them might lead to very poor load balancing,
i=0
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
for(int k = 0; k < WIDTH; ++k){
...
}
i=1
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
for(int k = 0; k < WIDTH; ++k){
...
}
i=2
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
for(int k = 0; k < WIDTH; ++k){
...
}
Warning -- check the syntax yourself, this is no more than a sketch of manual loop unrolling.
Try combining this and collapsing the j and k loops.
Oh, and don't complain about code duplication, you've told us you're being scored partly on performance improvements.
You probably want to parallelize this example for simd so the compiler can vectorize, collapse the loops because you use j and k only in the expression matrix[j][k], and because there are no dependencies on any other element of the matrix. If nothing modifies somevalue1, etc., they should be uniform. Time your loop to be sure those really do improve your speed.

Parallelize counting for-loops with openmp

I have an 2d-image where I want to count all colors and store the result in an array. I know the number of colors, so I can set the size of the array before. My problem now is that the counting lasts too long for me. How can I speed the counting up with OpenMP?
My current serial code is
std::vector<int> ref_color_num_thread;
ref_color_num.resize(ref_color.size());
std::fill(ref_color_num.begin(), ref_color_num.end(), 0);
ref_color_num_thread.resize(ref_color.size());
std::fill(ref_color_num_thread.begin(), ref_color_num_thread.end(), 0);
for (int i = 0; i < image.width(); i++)
{
for (int j = 0; j < image.height(); j++)
{
for (int k = 0; k < (int)ref_color.size(); k++)
{
if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
ref_color_num_thread[k]++;
}
}
}
First approaches were setting #pragma omp parallel for at each loop (each try at another), but everytime I get a program crash because of wrong memory access. Do I have to use private() for my vector?
What you're doing is filling a histogram of your colors. This is equivalence to doing an array reduction in C/C++ with OpenMP. In C/C++ OpenMP does not have built in support for this (but it does in Fortran due to the fact that the array size is known in Fortran where in C/C++ it's only known for static arrays). However, it's easy to do an array reduction in C/C++ with OpenMP yourself.
#pragma omp parallel
{
std:vector<int> ref_color_num_thread_private(ref_color.size(),0);
#pragma omp for
for (int i = 0; i < image.width(); i++) {
for (int j = 0; j < image.height(); j++) {
for (int k = 0; k < (int)ref_color.size(); k++) {
if (image(i, j, 0, 0) == ref_color[k].R && image(i, j, 0, 1) == ref_color[k].G && image(i, j, 0, 2) == ref_color[k].B)
ref_color_num_thread_private[k]++;
}
}
}
#pragma omp critical
{
for(int i=0; i<(int)ref_color.size(); i++) {
ref_color_num_thread[i] += ref_color_num_thread_private[i];
}
}
}
I went into a lot more detail about his here Fill histograms (array reduction) in parallel with OpenMP without using a critical section
I showed how to an array reduction without a critical section but it's a lot more tricky. You should test the first case and see if it works well for you first. As long as the number of colors (ref_color.size()) is small compared to the number of pixels it should parallelize well. Otherwise, you might need to try the second case without a critical section.
There is a race condition if one of the outer two loops (i or j) are parallized, because the inner loop iteratates over the vector (k). I think your crash is because of that.
You have to restructure your program. It is not trivial, but one idea is that each thread uses a local copy of the ref_color_num_thread vector. Once the computation is finished, you can sum up all the vectors.
If k is large enough to provide enough parallelism, you could exchange the loops. Instead of "i,j,k" you could iterate in the order "k,i,j". If I'm not mistaken, there are no violated dependencies. Then you can parallelize the outer k loop, and let the inner i and j loops execute sequentially.
Update:
pragma omp for also supports reductions, for example:
#pragma omp for reduction(+ : nSum)
Here is a link to some documentation.
Maybe that can help you to restructure your program.

OpenMP nested for, unequal num. of iterations

I am using OpenMP to parallelize loops. In normal case, one would use:
#pragma omp for schedule(static, N_CHUNK)
for(int i = 0; i < N; i++) {
// ...
}
For nested loops, I can put pragma on the inner or outter loop
#pragma omp for schedule(static, N_CHUNK) // can be here...
for(int i = 0; i < N; i++) {
#pragma omp for schedule(static, N_CHUNK) // or here...
for(int k = 0; k < N; k++) {
// both loops have consant number of iterations
// ...
}
}
But! I have two loops, where number of iterations in 2nd loop depends on the 1st loop:
for(int i = 0; i < N; i++) {
for(int k = i; k < N; k++) {
// k starts from i, not from 0...
}
}
What is the best way to balance CPU usage for this kind of loop?
As always:
it depends
profile.
In this case: see also OMP_NESTED environment variable
The things that are going to make the difference here are not being shown:
(non)linear memory addressing (also watch the order of the loops
use of shared variables;
As to your last scenario:
for(int i = 0; i < N; i++) {
for(int k = i; k < N; k++) {
// k starts from i, not from 0...
}
}
I suggest parallelizing the outer loop for the following reasons:
all other things being equal coarse grained parallelizing usually leads to better performance due to
increased cache locality
reduced frequency of locking required
(note that this hinges on assumptions about the loop contents that I can't really make; I'm basing it on my experience of /usual/ parallelized code)
the inner loop might become so short as to be inefficient to parallelize (IOW: the outer loop's range is predictable, the inner loop less so, or doesn't lend itself to static scheduling as well)
nested parallellism rarely scales well
sehe's points -- especially "it depends" and "profile" -- are extremely to the point.
Normally, though, you wouldn't want to have the nested parallel loops as long as the outer loop is big enough to keep all cores busy. The added overhead of another parallel section inside a loop is probably more cost than the benefit from the additional small pieces of work.
The usual way to tackle this is just to schedule the outer loop dynamically, so that the fact that each loop iteration takes a different length of type doesn't cause load-balancing issues (as the i==N-1 iteration completes almost immediately while the i==0 iteration takes forever)
#pragma omp parallel for default(none) shared(N) schedule(dynamic)
for(int i = 0; i < N; i++) {
for(int k = i; k < N; k++) {
// k starts from i, not from 0...
}
}
The collapse pragma is very useful for essentially getting rid of the nesting and is particularly valuable if the outer loop is small (eg, N < num_threads):
#pragma omp parallel for default(none) shared(N) collapse(2)
for(int i = 0; i < N; i++) {
for(int k = 0 ; k < N; k++) {
}
}
This way the two loops are folded into one and there is fewer chunking which means less overhead. But that won't work in this case, because the loop ranges aren't fixed; you can't collapse a loop whose loop bounds change (eg, with i).