OpenMP code C++ is slower thatn c++ - c++

i have the following part of code, i run it on sample of N=3000, the c++ sequential code is faster by 3 seconds which is not good at all.
this code is filling the array jsd[N] with calculated values and i want to locate the maximum value and its location.
so
1- is this openmp conversion correct, and is there any better suggstion to make it more profissional
2- why it is slower that the equavilant c++ code, also the more threads i create the more it get slow.
thanks in advance
double maxval = 0;
int pos = -1;
double jsd[N];
#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; i++) {
double Hl = obj.function1(sequenceVctr, i, LEFT);
double Hr = obj.function1(sequenceVctr, i, RIGHT);
jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
if (jsd[i] >= maxval) {
#pragma omp critical
{
maxval = jsd[i];
pos = i;
}
}
} // for
update:
here is the new code but still slow and get slower in more threads.
i update the code as following. but still get slower for more threads
double maxval = 0;
int pos = -1;
double jsd[N];
#pragma omp parallel num_threads(50)
for (int i = 0; i < N; i++) {
double Hl = obj.function1(sequenceVctr, i, LEFT);
double Hr = obj.function1(sequenceVctr, i, RIGHT);
jsd[i]= obj.function2(H, i + 1, N, Hl, Hr);
} // for
#pragma omp master
{
vector<double> jsd2 (jsd,jsd+N);
vector<double>::iterator jsditer;
jsditer = std::max_element(jsd2.begin(), jsd2.end());
maxval=*jsditer;
pos=std::distance(jsd2.begin(),jsditer) ;
// cout<<"pos"<<pos<<endl;
}
#pragma omp barrier

The first optimization I would suggest is to first compute all jsd values in the loop, then find the maximum element via std::max_element().
This way you are not forcing the threads to synchronise.
The second thing I would do is move over to Intel TBB instead of OpenMP and use parallel_reduce().
But the biggest question is, how complex are the objective functions you are evaluating.

Related

Parallelizing many nested for loops in openMP c++

Hi i am new to c++ and i made a code which runs but it is slow because of many nested for loops i want to speed it up by openmp anyone who can guide me. i tried to use '#pragma omp parallel' before ip loop and inside this loop i used '#pragma omp parallel for' before it loop but it does not works
#pragma omp parallel
for(int ip=0; ip !=nparticle; ip++){
inf14>>r>>xp>>yp>>zp;
zp/=sqrt(gamma2);
counter++;
double para[7]={0,0,Vz,x0-xp,y0-yp,z0-zp,0};
if(ip>=0 && ip<=43){
#pragma omp parallel for
for(int it=0;it<NT;it++){
para[6]=PosT[it];
for(int ix=0;ix<NumX;ix++){
para[3]=PosX[ix]-xp;
for(int iy=0;iy<NumY;iy++){
para[4]=PosY[iy]-yp;
for(int iz=0;iz<NumZ;iz++){
para[5]=PosZ[iz]-zp;
int position=it*NumX*NumY*NumZ+ix*NumY*NumZ+iy*NumZ+iz;
rotation(para,&Field[3*position]);
MagX[position] +=chg*Field[3*position];
MagY[position] +=chg*Field[3*position+1];
MagZ[position] +=chg*Field[3*position+2];
}
}
}
}
}
}enter code here
and my rotation function also has infinite integration for loop as given below
for(int i=1;;i++){
gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
result+=temp;
if(abs(temp/result)<ACCURACY){
break;
}
}
i am using gsl libraries as well. so how to speed up this process or how to make openmp?
If you don't have inter-loop dependences, you can use the collapse keyword to parallelize multiple loops altoghether. Example:
void scale( int N, int M, float A[N][M], float B[N][M], float alpha ) {
#pragma omp for collapse(2)
for( int i = 0; i < N; i++ ) {
for( int j = 0; j < M; j++ ) {
A[i][j] = alpha * B[i][j];
}
}
}
I suggest you to check out the OpenMP C/C++ cheat sheet (PDF), which contain all the specifications for loop parallelization.
Do not set parallel pragmas inside another parallel pragma. You might overhead the machine creating more threads than it can handle. I would establish the parallelization in the outter loop (if it is big enough):
#pragma omp parallel for
for(int ip=0; ip !=nparticle; ip++)
Also make sure you do not have any race condition between threads (e.g. RAW).
Advice: if you do not get a great speed-up, a good practice is iterating by chunks and not only by one increment. For instance:
int num_threads = 1;
#pragma omp parallel
{
#pragma omp single
{
num_threads = omp_get_num_threads();
}
}
int chunkSize = 20; //Define your own chunk here
for (int position = 0; position < total; position+=(chunkSize*num_threads)) {
int endOfChunk = position + (chunkSize*num_threads);
#pragma omp parallel for
for(int ip = position; ip < endOfChunk ; ip += chunkSize) {
//Code
}
}

Parallelizing two for loops with OpenMP in C++ does not give better performance

I have an issue with parallelizing two for loops with OpenMP in C++. I have a memberfunction CallFunction(i,j) which sets for every i and j independent member variables to a specific value and returns a weighted sum of this values. Because these functions are independent for different combinations of i and j, I want to parallelize this process. I tried it in the following way:
double optimal_value = 0;
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
if(i == j) continue;
optimal_value += CallFunction(i,j);
}
}
Above code does not have a significant effect on my runtime. I achieve almost the same runtime with and without "#pragma omp parallel for". Would it be better to write the nested loop as one loop and parallelize it? I have to idea how to make it work. Do I need further commands or settings except for activated openmp?
My system is running with a dual core cpu.
Would you please help me how I have to do it right?
Many thanks in advance!
Here is the parallelization of two loops
double optimal_value = 0;
double begin = omp_get_wtime();
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
num_tr = omp_get_num_threads();
double optimal_value_in = 0.0;
#pragma omp parallel for reduction(+:optimal_value_in)
for (int j = 0; j < n; j++)
{
if((i == j)) continue;
optimal_value_in += CallFunction(i,j);
}
optimal_value += optimal_value_in;
}
double end = omp_get_wtime();
double elapsed_secs = double(end - begin);
cout<<"############# "<<"Using #Threads "<<num_tr<<endl;
cout<<"############# "<<optimal_value<<" Time For Parallel Execution :: "<<elapsed_secs<<endl;
The thing here is (also mentioned above in comments by others) ... I am not sure if you will see some speedup with just n=25 with the body of CallFunction as
double CallFunction(int i, int j){
return i*j;
}
with n=250000 and with 8 threads, I got a speed up of 4.43 so it will strongly depend on what is done in CallFunction.

Find max element in array OpenMP and PPL versions run much slower than serial code

I'm trying to implement two versions of a function that would find the max element in the array of floats. However, my parallel functions appeared to run much slower than the serial code.
With array of 4194304 (2048 * 2048) floats, I get the following numbers (in microseconds):
serial code: 9433
PPL code: 24184 (more than two times slower)
OpenMP code: 862093 (almost 100 times slower)
Here's the code:
PPL:
float find_largest_element_in_matrix_PPL(float* m, size_t dims)
{
float max_element;
int row, col;
concurrency::combinable<float> locals([] { return (float)INT_MIN; });
concurrency::parallel_for(size_t(0), dims * dims, [&locals](int curr)
{
float &localMax = locals.local();
localMax = max<float>(localMax, curr);
});
max_element = locals.combine([](float left, float right) { return max<float>(left, right); });
return max_element;
}
OpenMP:
float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
float max_value = 0.0;
int i, row, col, index;
#pragma omp parallel for private(i) shared(max_value, index)
for (i = 0; i < dims * dims; ++i)
{
#pragma omp critical
if (m[i] > max_value)
{
max_value = m[i];
index = i;
}
}
//row = index / dims;
//col = index % dims;
return max_value;
}
What's making the code run so slowly? Am I missing something?
Could you help me find out what I'm doing wrong?
So, as Baum mit Augen noticed, the problem with OpenMP was that I had a critical section and the code didn't actually run in parallel, but synchronously.
Removing critical section did the trick.
As for PPL, I've found out that it does a lot more preparations (creating threads and stuff) than OpenMP does, hence the slowdown.
Update
So, here's the correct variant to find max element with OpenMP (the critical section is still needed but inside the if block):
float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
float max_value = 0.0;
int i, row, col, index;
#pragma omp parallel for
for (i = 0; i < dims * dims; ++i)
{
if (m[i] > max_value)
{
#pragma omp critical
max_value = m[i];
}
}
return max_value;
}
PS: not tested.

OpenMP parallel for inside do-while

I am trying to refactor a OpenMP-based program and encountered a terrible scalability issue. The following (obviously not very meaningful) OpenMP program seems to reproduce the problem. Of course, the tiny sample code can be rewritten as a nested for-loop and using collapse(2) almost perfect scalability can be achieved. However, the original program I am working on does not allow to do that.
Therefore, I am looking for a fix, the keeps the do-while structure. From my understanding, OpenMP should be smart enough to keep the threads alive between the iterations and I expected good scalability. Why is this not the case?
int main() {
const int N = 6000;
const int MAX_ITER = 2000000;
double max = DBL_MIN;
int iter = 0;
do {
#pragma omp parallel for reduction(max:max) schedule(static)
for(int i = 1; i < N; ++i) {
max = MAX(max, 3.3*i);
}
++iter;
} while(iter < MAX_ITER);
printf("max=%f\n", max);
}
I have measured the following runtimes with Cray compiler Version 8.3.4.
OMP_NUM_THREADS=1 : 0m21.535s
OMP_NUM_THREADS=2 : 0m12.191s
OMP_NUM_THREADS=4 : 0m9.610s
OMP_NUM_THREADS=8 : 0m9.767s
OMP_NUM_THREADS=16: 0m13.571s
This seems to be similar to this question. Thanks in advance. Help is appreciated! :)
Your could go for something like this:
#include <stdio.h>
#include <float.h>
#include <omp.h>
#define MAX( a, b ) ((a)>(b))?(a):(b)
int main() {
const int N = 6000;
const int MAX_ITER = 2000000;
double max = DBL_MIN;
#pragma omp parallel reduction( max : max )
{
int iter = 0;
int nbth = omp_get_num_threads();
int tid = omp_get_thread_num();
int myMaxIter = MAX_ITER / nbth;
if ( tid < MAX_ITER % nbth ) myMaxIter++;
int chunk = N / nbth;
do {
#pragma omp for schedule(dynamic,chunk) nowait
for(int i = 1; i < N; ++i) {
max = MAX(max, 3.3*i);
}
++iter;
} while(iter < myMaxIter);
}
printf("max=%f\n", max);
}
I'm pretty sure scalability should improve notoriously.
NB: I had to come back to this a few times since I realised that the number of iterations for the outer loop (the do-while one) being potentially different for the different threads, it was of crucial importance that the scheduling of the omp for loop wasn't static, otherwise, there was a potential for deadlock at the last iteration.
I did a few tests and I think that the proposed solution is both safe and effective.

Parallel program using openMP

I am trying to calculate the integral of 4/(1+x^2) from 0 to 1 in c++ with multi-threading using openMP.
I took a serial program (which is correct) and changed it.
My idea is:
Assume that X is the number of threads.
Divide the area beneath the function into X parts, first from 0 to 1/X, 1/X to 2/X...
Each thread will calculate it's area, and I will sum it all up.
This is how I implemented it:
`//N.o. of threads to do the task
cout<<"Enter num of threads"<<endl;
int num_threads;
cin>>num_threads;
int i; double x,pi,sum=0.0;
step=1.0/(double)num_steps;
int steps_for_thread=num_steps/num_threads;
cout<<"Steps for thread : "<<steps_for_thread<<endl;
//Split to threads
omp_set_num_threads(num_threads);
#pragma omp parallel
{
int thread_id = omp_get_thread_num();
thread_id++;
if (thread_id == 1)
{
double sum1=0.0;
double x1;
for(i=0;i<num_steps/num_threads;i++)
{
x1=(i+0.5)*step;
sum1 = sum1+4.0/(1.0+x1*x1);
}
sum+=sum1;
}
else
{
double sum2=0.0;
double x2;
for(i=num_steps/thread_id;i<num_steps/(num_threads-thread_id+1);i++)
{
x2=(i+0.5)*step;
sum2 = sum2+4.0/(1.0+x2*x2);
}
sum+=sum2;
}
} '
Explanation:
The i'th thread will calculate the area between i/n to (i+1)/n and add it to the sum.
The problem is that not only that the output is wrong, but also each time I run the program I get different output.
Any help will be welcomed
Thanks
You're making this problem much harder than it needs to be. One of OpenMP's goals is to not have to change your serial code. You usually only need to add some pragma statements. So you should write the serial method first.
#include <stdio.h>
double pi(int n) {
int i;
double dx, sum, x;
dx = 1.0/n;
#pragma omp parallel for reduction(+:sum) private(x)
for(i=0; i<n; i++) {
x = i*dx;
sum += 1.0/(1+x*x);
}
sum *= 4.0/n;
return sum;
}
int main(void) {
printf("%f\n",pi(100000000));
}
Output: 3.141593
Notice that in the function pi the only difference between the serial code and the parallel version is the statement
#pragma omp parallel for reduction(+:sum) private(x)
You should also not normally worry about setting the number of threads.