This is a follow-up question to this one.
Now I have this code:
#include <iostream>
#include <cmath>
#include <omp.h>
#define max(a, b) (a)>(b)?(a):(b)
const int m = 2001;
const int n = 2000;
const int p = 4;
double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];
double maxdiffA[p + 1];
int icol, jrow;
int main() {
omp_set_num_threads(p);
double h = 1.0 / (n + 1);
double start = omp_get_wtime();
#pragma omp parallel for private(icol) shared(x, y, v, _new)
for (icol = 0; icol <= n + 1; ++icol) {
x[icol] = y[icol] = icol * h;
_new[icol][0] = v[icol][0] = 6 - 2 * x[icol];
_new[n + 1][icol] = v[n + 1][icol] = 4 - 2 * y[icol];
_new[icol][n + 1] = v[icol][n + 1] = 3 - x[icol];
_new[0][icol] = v[0][icol] = 6 - 3 * y[icol];
}
const double eps = 0.01;
#pragma omp parallel private(icol, jrow) shared(_new, v, maxdiffA)
{
while (true) { //for [iters=1 to maxiters by 2]
#pragma omp single
for (int i = 0; i < p; i++) maxdiffA[i] = 0;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
_new[icol][jrow] =
(v[icol - 1][jrow] + v[icol + 1][jrow] + v[icol][jrow - 1] + v[icol][jrow + 1]) / 4;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
v[icol][jrow] = (_new[icol - 1][jrow] + _new[icol + 1][jrow] + _new[icol][jrow - 1] +
_new[icol][jrow + 1]) / 4;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
maxdiffA[omp_get_thread_num()] = max(maxdiffA[omp_get_thread_num()],
fabs(_new[icol][jrow] - v[icol][jrow]));
#pragma omp barrier
double maxdiff = 0.0;
for (int k = 0; k < p; ++k) {
maxdiff = max(maxdiff, maxdiffA[k]);
}
if (maxdiff < eps)
break;
#pragma omp barrier
//#pragma omp single
//std::cout << maxdiff << std::endl;
}
}
double end = omp_get_wtime();
printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);
return 0;
}
But why does it run 2-3 times slower (32 sec vs. 18 sec) than the serial analog:
#include <iostream>
#include <cmath>
#include <omp.h>
#define max(a,b) (a)>(b)?(a):(b)
const int m = 2001;
const int n = 2000;
double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];
int main() {
double h = 1.0 / (n + 1);
double start = omp_get_wtime();
for (int i = 0; i <= n + 1; ++i) {
x[i] = y[i] = i * h;
_new[i][0]=v[i][0] = 6 - 2 * x[i];
_new[n + 1][i]=v[n + 1][i] = 4 - 2 * y[i];
_new[i][n + 1]=v[i][n + 1] = 3 - x[i];
_new[0][i]=v[0][i] = 6 - 3 * y[i];
}
const double eps=0.01;
while(true){ //for [iters=1 to maxiters by 2]
double maxdiff=0.0;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
_new[i][j]=(v[i-1][j]+v[i+1][j]+v[i][j-1]+v[i][j+1])/4;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
v[i][j]=(_new[i-1][j]+_new[i+1][j]+_new[i][j-1]+_new[i][j+1])/4;
for (int i=1;i<=n;i++)
for (int j=1;j<=n;j++)
maxdiff=max(maxdiff, fabs(_new[i][j]-v[i][j]));
if(maxdiff<eps) break;
std::cout << maxdiff<<std::endl;
}
double end = omp_get_wtime();
printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);
return 0;
}
It's also interesting that it runs in the SAME time as a version (I can post it here if you ask) that looks like this:
while(true){ //106 iterations here!!!
#pragma omp parallel for
for(...)
#pragma omp parallel for
for(...)
#pragma omp parallel for
for(...)
}
I thought that what makes OpenMP code slow is spawning threads inside the while loop 106 times... but apparently not! Then perhaps threads write to the same array cells simultaneously, but where does that happen? I don't see it; could you show me, please?
Maybe it's because of too many barriers? But my lecturer told me to implement the code like this and "analyse it". Maybe the answer is "the Jacobi algorithm isn't meant to run well in parallel"? Or is it just my lame coding?
So the root of evil was
max(maxdiffA[w],fabs(_new[icol][jrow] - v[icol][jrow]))
because it's
#define max(a, b) (a)>(b)?(a):(b)
It's probably creating TOO much branching ('if's). Without it, the parallel version runs 8 times faster and loads the CPU at 68% instead of 99%.
The strange thing: the same "max" doesn't affect the serial version.
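For completeness, here is roughly how the difference loop could be written without the macro, as a standalone parallel for (just a sketch: it assumes the globals _new, v and n from the code above, and reduction(max:...) needs OpenMP 3.1 or newer). It also avoids the repeated writes to the shared maxdiffA[] array, whose few doubles share a cache line, so different threads keep invalidating each other's copy:
#include <cmath>   // std::fabs, std::fmax

// Sketch only: _new, v and n are the globals from the code above.
double compute_maxdiff() {
    double maxdiff = 0.0;
    // std::fmax replaces the branchy macro; the reduction replaces maxdiffA[]
    #pragma omp parallel for reduction(max:maxdiff)
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            maxdiff = std::fmax(maxdiff, std::fabs(_new[i][j] - v[i][j]));
    return maxdiff;
}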
I'd like to point out a few things. They are too long for a comment, so I decided to write them as an answer.
Every time a thread is created, that creation takes some time. If your program's single-core running time is short, creating threads can make the multi-core run even longer.
In addition, a barrier makes all of your threads wait for the others, which can slow things down: even if most threads finish their work very quickly, the slowest one determines the total run time.
Try running your program with larger arrays, where the single-threaded time is around 2 minutes, and then move on to multi-core.
Then wrap your main code in a normal loop that runs it a few times and prints the timing for each run. The first run might be slow because of loading libraries, but the following runs should be faster and show the speed-up.
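Something along these lines, as a sketch (run_jacobi() is just a hypothetical wrapper around your existing solver):
#include <cstdio>
#include <omp.h>

void run_jacobi();   // hypothetical wrapper around the solver from the question

int main() {
    // Repeat the same workload; the first pass may pay one-time costs
    // (thread creation, page faults, cold caches), later passes should not.
    for (int run = 0; run < 5; ++run) {
        double t0 = omp_get_wtime();
        run_jacobi();
        double t1 = omp_get_wtime();
        std::printf("run %d: %.3f s\n", run, t1 - t0);
    }
    return 0;
}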
If the above suggestions do not give a result, then your code probably needs more work.
EDIT:
To the downvoters: if you don't like a post, please at least be polite and leave a comment. Or better, give your own answer and be helpful to the community.
Related
I'm trying to make my code, which calculates the cumulative probability function, more efficient by parallelizing it. I have a vector<double> of radii called r, and I need to count how many elements have a radius bigger than a given one, R. In addition, I need to calculate the cumulative probability function for the volume.
The code I have is the following one:
int i, j;
double aux, contar, contar1;
vector<double> r, contador, contador1, vol;
for (i = 0; i != r.size() - 1; i++)
{
aux = r[i];
contador[i] = 0;
contador1[i] = 0;
contar = 0;
contar1 = 0;
vol[i] = 0.0;
for (j = 0; j != r.size() - 1; j++)
{
if(aux <= r[j])
{
contar++;
#pragma omp atomic write
vol[i] = vol[i] + 4.0 * 3.141592653589793 * r[j] * r[j] * r[j] / 3.0;
}
if(aux==r[j])
{
contar1++;
}
}
#pragma omp atomic write
contador[i]=contar;
#pragma omp atomic write
contador1[i]=contar1;
}
but it's not efficient at all. Any help making it more efficient with OpenMP would be appreciated.
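For reference, here is a sketch of one possible direction: parallelizing the outer loop directly, assuming contador, contador1 and vol are already resized to r.size() (the names below are only illustrative; also note the loops above stop at r.size() - 1 and skip the last element, so keep that bound if it is intentional). Each iteration i writes only its own elements, so no atomics are needed:
#include <vector>

void accumulate(const std::vector<double> &r,
                std::vector<double> &contador,
                std::vector<double> &contador1,
                std::vector<double> &vol)
{
    const long nr = static_cast<long>(r.size());
    // each iteration i touches only contador[i], contador1[i] and vol[i]
    #pragma omp parallel for
    for (long i = 0; i < nr; i++)
    {
        double contar = 0, contar1 = 0, v = 0.0;
        for (long j = 0; j < nr; j++)
        {
            if (r[i] <= r[j])
            {
                contar++;
                v += 4.0 * 3.141592653589793 * r[j] * r[j] * r[j] / 3.0;
            }
            if (r[i] == r[j])
                contar1++;
        }
        contador[i] = contar;
        contador1[i] = contar1;
        vol[i] = v;
    }
}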
I have an algorithm that I want to execute in parallel using OpenMP. I could validate the results for single-threaded execution (OMP_NUM_THREADS=1), but the results are slightly different as soon as I set the number of threads to 2 or higher. I also tried parallelizing the inner for-loops, but that doesn't yield correct results either.
I'm fairly new to OpenMP and multi-threading in general. I suspect that my implementation shares variables between threads improperly somehow, but I can't figure it out.
extern "C" double *lomb_scargle(double *x, double *y, double *f, int NX, int NF) {
double *result = (double*) malloc(2 * NF * sizeof(double));
double w, tau, SS, SC, SST1, SST2, SCT1, SCT2, Ai, Bi;
int i, j;
#pragma omp parallel for
for (i=0; i<NF; i++) {
w = 2 * M_PI * f[i];
SS = 0.;
SC = 0.;
for (j=0; j<NX; j++) {
SS += sin(2*w*x[j]);
SC += cos(2*w*x[j]);
}
tau = atan2(SS, SC) / (2 * w);
SCT1 = 0.;
SCT2 = 0.;
SST1 = 0.;
SST2 = 0.;
for (j=0; j<NX; j++) {
SCT1 += y[j] * cos(w * (x[j] - tau));
SCT2 += pow(cos(w * (x[j] - tau)),2);
SST1 += y[j] * sin(w * (x[j] - tau));
SST2 += pow(sin(w * (x[j] - tau)),2);
}
Ai = SCT1 / SCT2;
Bi = SST1 / SST2;
// result contains the amplitude first, and then the phase
result[i] = sqrt(Ai*Ai + Bi*Bi);
result[i+NF] = - (atan2(Bi, Ai) + w * tau);
}
return result;
}
EDIT: typo
By default, OpenMP shares all variables declared in an outer scope between all workers. You need to move the temporary variables into the inner block (or declare them private).
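For example, the loop could look roughly like this (a sketch: only the declarations are moved into the loop body, so they become private per thread; the arithmetic is unchanged from the question):
#pragma omp parallel for
for (int i = 0; i < NF; i++) {
    // everything declared inside the loop body is private to the thread
    double w = 2 * M_PI * f[i];
    double SS = 0., SC = 0.;
    for (int j = 0; j < NX; j++) {
        SS += sin(2*w*x[j]);
        SC += cos(2*w*x[j]);
    }
    double tau = atan2(SS, SC) / (2 * w);
    double SCT1 = 0., SCT2 = 0., SST1 = 0., SST2 = 0.;
    for (int j = 0; j < NX; j++) {
        SCT1 += y[j] * cos(w * (x[j] - tau));
        SCT2 += pow(cos(w * (x[j] - tau)), 2);
        SST1 += y[j] * sin(w * (x[j] - tau));
        SST2 += pow(sin(w * (x[j] - tau)), 2);
    }
    double Ai = SCT1 / SCT2;
    double Bi = SST1 / SST2;
    result[i] = sqrt(Ai*Ai + Bi*Bi);                // amplitude
    result[i + NF] = -(atan2(Bi, Ai) + w * tau);    // phase
}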
This code works as expected:
#include <iostream>
#include <cmath>
#include <omp.h>
//https://stackoverflow.com/questions/37970024/jacobi-relaxation-in-mpi
#define max(a, b) (a)>(b)?(a):(b)
const int m = 2001;
const int n = 1500;
const int p = 4;
double v[m + 2][m + 2];
double x[m + 2];
double y[m + 2];
double _new[m + 2][m + 2];
double maxdiffA[p + 1];
int icol, jrow;
int main() {
omp_set_num_threads(p);
double h = 1.0 / (n + 1);
double start = omp_get_wtime();
#pragma omp parallel for private(icol) shared(x, y, v, _new)
for (icol = 0; icol <= n + 1; ++icol) {
x[icol] = y[icol] = icol * h;
_new[icol][0] = v[icol][0] = 6 - 2 * x[icol];
_new[n + 1][icol] = v[n + 1][icol] = 4 - 2 * y[icol];
_new[icol][n + 1] = v[icol][n + 1] = 3 - x[icol];
_new[0][icol] = v[0][icol] = 6 - 3 * y[icol];
}
const double eps = 0.01;
#pragma omp parallel private(icol, jrow) shared(_new, v, maxdiffA)
{
while (true) { //for [iters=1 to maxiters by 2]
#pragma omp single
for (int i = 0; i < p; i++) maxdiffA[i] = 0;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
_new[icol][jrow] =
(v[icol - 1][jrow] + v[icol + 1][jrow] + v[icol][jrow - 1] + v[icol][jrow + 1]) / 4;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
v[icol][jrow] = (_new[icol - 1][jrow] + _new[icol + 1][jrow] + _new[icol][jrow - 1] +
_new[icol][jrow + 1]) / 4;
#pragma omp for
for (icol = 1; icol <= n; icol++)
for (jrow = 1; jrow <= n; jrow++)
maxdiffA[omp_get_thread_num()] = max(maxdiffA[omp_get_thread_num()],
fabs(_new[icol][jrow] - v[icol][jrow]));
#pragma omp barrier
double maxdiff = 0.0;
for (int k = 0; k < p; ++k) {
maxdiff = max(maxdiff, maxdiffA[k]);
}
if (maxdiff < eps)
break;
#pragma omp single
std::cout << maxdiff << std::endl;
}
}
double end = omp_get_wtime();
printf("start = %.16lf\nend = %.16lf\ndiff = %.16lf\n", start, end, end - start);
return 0;
}
It outputs
1.12454
<106 iterations here>
0.0100436
start = 1527366091.3069999217987061
end = 1527366110.8169999122619629
diff = 19.5099999904632568
But if I remove
#pragma omp single
std::cout << maxdiff << std::endl;
the program either seems to run forever or I get
start = 1527368219.8810000419616699
end = 1527368220.5710000991821289
diff = 0.6900000572204590
Why is that so?
You overwrite maxdiffA at the beginning of the while loop - this must be isolated from reading maxdiffA at the end to check the condition. Otherwise one thread may already reset the values before another thread gets the chance to read them. The omp single construct at the end of the loop acts as isolation because of the implicit barrier at the end of omp single constructs. However, there is no barrier at the beginning of omp single constructs, and "a whole lot of code" is not a safe barrier either. So if there is no valid implicit barrier, you must protect entry to the reset code with a #pragma omp barrier.
That said, I highly recommend restructuring the code to use a shared exit condition that is also computed in a single construct. This makes it clearer that all threads exit the while loop at the same time; otherwise the code is ill-defined.
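A rough sketch of what I mean, reusing the names from your code (done is a flag introduced here): the exit decision is computed once in a single construct, and the implicit barrier at the end of that single is what separates the read of maxdiffA from the next iteration's reset.
bool done = false;
#pragma omp parallel private(icol, jrow) shared(_new, v, maxdiffA, done)
{
    while (true) {
        #pragma omp single
        for (int i = 0; i < p; i++) maxdiffA[i] = 0;

        // ... the three '#pragma omp for' loops from the question go here ...

        #pragma omp single
        {
            double maxdiff = 0.0;
            for (int k = 0; k < p; ++k) maxdiff = max(maxdiff, maxdiffA[k]);
            done = (maxdiff < eps);
        }   // implicit barrier: every thread sees the final value of 'done'
        if (done) break;   // all threads take the same branch
    }
}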
I struggle a bit with a function. The calculation is wrong if I try to parallelize the outer loop with a
#pragma omp parallel reduction(+:det).
Can someone show me how to solve it and why it is failing?
// template<class T> using vector2D = std::vector<std::vector<T>>;
float Det(vector2DF &a, int n)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
for (int i = 0; i < n; i++)
{
int l = 0;
#pragma omp parallel for private(l)
for (int j = 1; j < n; j++)
{
l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
return det;
}
If you parallelize the outer loop, there is a race condition on this line:
m[j - 1][l] = a[j][k];
Also, you likely want a parallel for reduction instead of just a parallel reduction.
The issue is that m is shared, even though that isn't necessary given that it is completely overwritten in the inner loop. Always declare variables as locally as possible; this avoids issues with wrongly shared variables, e.g.:
float Det(vector2DF &a, int n)
{
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
#pragma omp parallel for reduction(+:det)
for (int i = 0; i < n; i++)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
for (int j = 1; j < n; j++)
{
int l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
return det;
}
Now that is correct, but since m can be expensive to allocate, performance could benefit from not allocating it in each and every iteration. This can be done by splitting the parallel and for directives like so:
float Det(vector2DF &a, int n)
{
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
#pragma omp parallel reduction(+:det)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
#pragma omp for
for (int i = 0; i < n; i++)
{
for (int j = 1; j < n; j++)
{
int l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
}
return det;
}
Now you could also just declare m as firstprivate, but that would assume that the copy constructor makes a completely independent deep-copy and thus make the code more difficult to reason about.
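For illustration, that variant would look roughly like this (a sketch; it relies on the copy constructor of vector2DF making an independent deep copy for each thread):
vector2DF m(n - 1, vector1DF(n - 1, 0));
// firstprivate(m) gives every thread its own copy-constructed m
#pragma omp parallel for reduction(+:det) firstprivate(m)
for (int i = 0; i < n; i++)
{
    // ... same body as above: fill m for this i, then accumulate into det ...
}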
Please be aware that you should always include expected output, actual output and a minimal complete and verifiable example.
I have implemented this code (taken from this) in order to vectorize it:
int c=0;
for (int j=-halfHeight; j<=halfHeight; ++j)
{
#pragma omp simd
for(int i=-halfWidth; i<=halfWidth; ++i){
wx_[c] = ofsx + j * a12 + i * a11;
wy_[c] = ofsy + j * a22 + i * a21;
x_[c] = (int) floor(wx_[c]);
y_[c] = (int) floor(wy_[c]);
++c;
}
}
std::cout<<"First size="<<size<<std::endl;
float imat_1[size];
std::cout<<"imat1"<<std::endl;
float imat_2[size];
std::cout<<"imat2"<<std::endl;
float imat_3[size];
std::cout<<"imat3"<<std::endl;
float imat_4[size];
std::cout<<"imat4"<<std::endl;
#pragma omp simd
for(int c=0; c<size; c++){
if (x_[c] >= 0 && y_[c] >= 0 && x_[c] < width && y_[c] < height){
wx_[c] -= x_[c];
wy_[c] -= y_[c];
imat_1[c] = im.at<float>(y_[c],x_[c]);
imat_2[c] = im.at<float>(y_[c],x_[c]+1);
imat_3[c] = im.at<float>(y_[c]+1,x_[c]);
imat_4[c] = im.at<float>(y_[c]+1,x_[c]+1);
}
else{
wx_[c] = 0;
wy_[c] = 0;
imat_1[c] = 0;
imat_2[c] = 0;
imat_3[c] = 0;
imat_4[c] = 0;
ret = true;
}
}
std::cout<<"Second"<<std::endl;
#pragma omp simd
for(int c=0; c<size; c++){
out[c] =
(1.0f - wy_[c]) * ((1.0f - wx_[c]) * imat_1[c] + wx_[c] * imat_2[c]) +
( wy_[c]) * ((1.0f - wx_[c]) * imat_3[c] + wx_[c] * imat_4[c]);
}
In particular, size can reach up to 275625. When I run this code, I get a segmentation fault at the line float imat_4[size];. In fact, this is solved by using float *imat_4 = (float*)malloc(sizeof(float)*size);
I think this is because of this, i.e. we run out of memory on the stack... But then, how can I solve it? I don't see many other possibilities for vectorizing this code.
Notice that performance is crucial here, so allocating on the heap (rather than the stack) is less efficient, right?
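For what it's worth, one option is to put the scratch buffers on the heap once via std::vector (a sketch: four buffers of ~275625 floats are roughly 4 MB, which can easily blow the default stack, while a one-off heap allocation costs next to nothing compared with the loops themselves, and element access inside the simd loops stays the same):
#include <vector>

// heap-backed scratch buffers; sized once, reused by the loops below
std::vector<float> imat_1(size), imat_2(size), imat_3(size), imat_4(size);

#pragma omp simd
for (int c = 0; c < size; c++) {
    // ... same body as above, still indexing imat_1[c] ... imat_4[c] ...
}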