dot product of complex vectors with openMP - c++

I'm using a version of OpenMP that does not support the reduction clause for complex arguments. I need a fast dot-product function like this:
std::complex< double > dot_prod( std::complex< double > *v1,std::complex< double > *v2,int dim )
{
std::complex< double > sum=0.;
int i;
# pragma omp parallel shared(sum)
# pragma omp for
for (i=0; i<dim;i++ )
{
#pragma omp critical
{
sum+=std::conj<double>(v1[i])*v2[i];
}
}
return sum;
}
Obviously this code slows the computation down instead of speeding it up, because the critical section serializes every iteration. Do you have a fast solution that does not need a reduction on complex arguments?

Each thread can first accumulate a private sum, and the private sums can then be combined into the final sum. The critical section is then needed only for that final step, once per thread instead of once per element.
std::complex< double > dot_prod( std::complex< double > *v1,std::complex< double > *v2,int dim )
{
std::complex< double > sum=0.;
int i;
# pragma omp parallel shared(sum)
{
std::complex< double > priv_sum = 0.;
# pragma omp for
for (i=0; i<dim;i++ )
{
priv_sum += std::conj<double>(v1[i])*v2[i];
}
#pragma omp critical
{
sum += priv_sum;
}
}
return sum;
}
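A related variant, sketched here under the assumption that the scalar reduction clause works for plain doubles on your OpenMP version: accumulate the real and imaginary parts separately and let OpenMP manage the private copies. The name dot_prod_split is made up for this example.

```cpp
#include <cassert>
#include <complex>

// Sketch: sum conj(v1[i]) * v2[i] using two scalar reductions, so no
// reduction on std::complex is required. dot_prod_split is a hypothetical name.
std::complex<double> dot_prod_split(const std::complex<double>* v1,
                                    const std::complex<double>* v2,
                                    int dim)
{
    assert(v1 != nullptr && v2 != nullptr && dim >= 0);
    double re = 0.0;
    double im = 0.0;
    // Each thread gets private copies of re and im; OpenMP adds them up at the end.
    #pragma omp parallel for reduction(+:re, im)
    for (int i = 0; i < dim; ++i) {
        const std::complex<double> p = std::conj(v1[i]) * v2[i];
        re += p.real();
        im += p.imag();
    }
    return std::complex<double>(re, im);
}
```

Build with -fopenmp (GCC/Clang) or /openmp (MSVC). Without the flag the pragma is ignored and the function simply runs serially.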

Try doing the multiplications in parallel, then sum them serially:
template <typename T>
std::complex<T> dot_prod(std::complex<T> *a, std::complex<T> *b, size_t dim)
{
std::vector<std::complex<T> > prod(dim); // or boost::scoped_array + new[]
#pragma omp parallel for
for (size_t i=0; i<dim; i++)
// I believe you had these reversed
prod[i] = a[i] * std::conj(b[i]);
std::complex<T> sum(0);
for (size_t i=0; i<dim; i++)
sum += prod[i];
return sum;
}
This does require O(dim) working memory, of course.

Why not have the N threads compute N individual sums? At the end you only need to add up the N sums, which can be done serially because N is small. I don't know how to accomplish that with OpenMP at the moment (I have no experience with it), but I'm quite sure it is easily achievable.
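For what it's worth, here is one way the "N individual sums" idea could look in OpenMP. It is only a sketch: dot_prod_partial is a made-up name, and the _OPENMP guards are there so the code also compiles (and runs serially) when OpenMP is switched off.

```cpp
#include <cassert>
#include <complex>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sketch: one partial sum per thread, combined serially afterwards.
// dot_prod_partial is a hypothetical name.
std::complex<double> dot_prod_partial(const std::complex<double>* v1,
                                      const std::complex<double>* v2,
                                      int dim)
{
    assert(v1 != nullptr && v2 != nullptr && dim >= 0);
#ifdef _OPENMP
    const int max_threads = omp_get_max_threads();
#else
    const int max_threads = 1;
#endif
    // One slot per possible thread, value-initialized to zero.
    std::vector<std::complex<double> > partial(max_threads);

    #pragma omp parallel
    {
#ifdef _OPENMP
        const int id = omp_get_thread_num();
#else
        const int id = 0;
#endif
        // Accumulate in a thread-local variable to avoid false sharing.
        std::complex<double> local(0.0, 0.0);
        #pragma omp for
        for (int i = 0; i < dim; ++i)
            local += std::conj(v1[i]) * v2[i];
        partial[id] = local;  // a single write per thread, no locking needed
    }

    // Serial combine: at most max_threads additions.
    std::complex<double> sum(0.0, 0.0);
    for (int t = 0; t < max_threads; ++t)
        sum += partial[t];
    return sum;
}
```

No critical section is needed here because each thread writes only to its own slot.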

Related

parallel programming multiplying two arrays of numbers

I have the following C++ code that multiplies the elements of two large arrays of size count and sums the products:
double* pA1 = { large array };
double* pA2 = { large array };
for(register int r = mm; r <= count; ++r)
{
lg += *pA1-- * *pA2--;
}
Is there a way that I can implement parallelism for the code?
Here is an alternative OpenMP implementation that is simpler (and a bit faster on many-core platforms):
double dot_prod_parallel(double* v1, double* v2, int dim)
{
TimeMeasureHelper helper;
double sum = 0.;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < dim; ++i)
sum += v1[i] * v2[i];
return sum;
}
GCC and ICC are able to vectorize this loop at -O3. Clang 13.0 fails to do so, even with -ffast-math, with explicit OpenMP SIMD directives, and with loop tiling. This appears to be a bug in Clang's optimizer related to OpenMP. Note that you can pass -mavx to use the AVX instruction set, which can be up to twice as fast as SSE (the default). It is available on almost all recent x86-64 PC processors.
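For reference, an explicit SIMD request could be written as in the sketch below. It assumes a compiler with OpenMP 4.0 support for the combined parallel for simd construct; whether the loop really ends up vectorized still depends on the compiler and flags, as described above. The name dot_prod_simd is made up.

```cpp
#include <cassert>

// Sketch: ask for both thread-level and SIMD-level parallelism.
// Requires OpenMP 4.0 or later; dot_prod_simd is a hypothetical name.
double dot_prod_simd(const double* v1, const double* v2, int dim)
{
    assert(v1 != nullptr && v2 != nullptr && dim >= 0);
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < dim; ++i)
        sum += v1[i] * v2[i];
    return sum;
}
```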
I wanted to answer my own question. It looks like we can use OpenMP as follows. However, the speed gain is modest (about 2x), even though my computer has 16 cores.
// need to use compile flag /openmp
double dot_prod_parallel(double* v1, double* v2, int dim)
{
TimeMeasureHelper helper;
double sum = 0.;
int i;
# pragma omp parallel shared(sum)
{
int num = omp_get_num_threads();
int id = omp_get_thread_num();
printf("I am thread # % d of % d.\n", id, num);
double priv_sum = 0.;
# pragma omp for
for (i = 0; i < dim; i++)
{
priv_sum += v1[i] * v2[i];
}
#pragma omp critical
{
cout << "priv_sum = " << priv_sum << endl;
sum += priv_sum;
}
}
return sum;
}

Parallelizing many nested for loops in openMP c++

Hi, I am new to C++. I wrote some code that runs, but it is slow because of many nested for loops, and I want to speed it up with OpenMP. Can anyone guide me? I tried putting '#pragma omp parallel' before the ip loop and '#pragma omp parallel for' before the it loop inside it, but it does not work.
#pragma omp parallel
for(int ip=0; ip !=nparticle; ip++){
inf14>>r>>xp>>yp>>zp;
zp/=sqrt(gamma2);
counter++;
double para[7]={0,0,Vz,x0-xp,y0-yp,z0-zp,0};
if(ip>=0 && ip<=43){
#pragma omp parallel for
for(int it=0;it<NT;it++){
para[6]=PosT[it];
for(int ix=0;ix<NumX;ix++){
para[3]=PosX[ix]-xp;
for(int iy=0;iy<NumY;iy++){
para[4]=PosY[iy]-yp;
for(int iz=0;iz<NumZ;iz++){
para[5]=PosZ[iz]-zp;
int position=it*NumX*NumY*NumZ+ix*NumY*NumZ+iy*NumZ+iz;
rotation(para,&Field[3*position]);
MagX[position] +=chg*Field[3*position];
MagY[position] +=chg*Field[3*position+1];
MagZ[position] +=chg*Field[3*position+2];
}
}
}
}
}
}
My rotation function also contains an unbounded integration loop, shown below:
for(int i=1;;i++){
gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
result+=temp;
if(abs(temp/result)<ACCURACY){
break;
}
}
I am using the GSL libraries as well. How can I speed up this process, or how should I apply OpenMP?
If you don't have inter-loop dependences, you can use the collapse clause to parallelize multiple loops together. Example:
void scale( int N, int M, float A[N][M], float B[N][M], float alpha ) {
#pragma omp parallel for collapse(2)
for( int i = 0; i < N; i++ ) {
for( int j = 0; j < M; j++ ) {
A[i][j] = alpha * B[i][j];
}
}
}
I suggest checking out the OpenMP C/C++ cheat sheet (PDF), which contains all the specifications for loop parallelization.
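Applied to the grid loops from the question, collapse might look like the sketch below. Two things are assumptions on my side: field_at is a made-up stand-in for rotation() (the real one calls GSL and would need a workspace per thread), and the Vz entry of para is left at zero. The important change is that para is declared inside the innermost loop, so every iteration has its own copy instead of all threads writing to one shared array. The particle loop itself reads from a stream, so it is kept serial here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the poster's rotation(): fills 3 field components.
static void field_at(const double para[7], double out[3])
{
    out[0] = para[3];
    out[1] = para[4];
    out[2] = para[5] + para[6];
}

// Sketch: process one particle; the four grid loops are collapsed into one
// parallel iteration space. accumulate_particle is a hypothetical name.
void accumulate_particle(double xp, double yp, double zp, double chg,
                         const std::vector<double>& PosT,
                         const std::vector<double>& PosX,
                         const std::vector<double>& PosY,
                         const std::vector<double>& PosZ,
                         std::vector<double>& MagX,
                         std::vector<double>& MagY,
                         std::vector<double>& MagZ)
{
    const int NT = static_cast<int>(PosT.size());
    const int NumX = static_cast<int>(PosX.size());
    const int NumY = static_cast<int>(PosY.size());
    const int NumZ = static_cast<int>(PosZ.size());
    const std::size_t cells = static_cast<std::size_t>(NT) * NumX * NumY * NumZ;
    assert(MagX.size() == cells && MagY.size() == cells && MagZ.size() == cells);

    #pragma omp parallel for collapse(4)
    for (int it = 0; it < NT; ++it)
        for (int ix = 0; ix < NumX; ++ix)
            for (int iy = 0; iy < NumY; ++iy)
                for (int iz = 0; iz < NumZ; ++iz) {
                    // Local to the iteration, so no thread shares it.
                    const double para[7] = {0.0, 0.0, 0.0,
                                            PosX[ix] - xp, PosY[iy] - yp,
                                            PosZ[iz] - zp, PosT[it]};
                    double f[3];
                    field_at(para, f);
                    const std::size_t position =
                        ((static_cast<std::size_t>(it) * NumX + ix) * NumY + iy) * NumZ + iz;
                    // Each position is written by exactly one iteration.
                    MagX[position] += chg * f[0];
                    MagY[position] += chg * f[1];
                    MagZ[position] += chg * f[2];
                }
}
```

The serial ip loop would read xp, yp, zp from the file and then call this function once per particle.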
Do not put a parallel pragma inside another parallel region. You might overload the machine by creating more threads than it can handle. I would parallelize the outer loop (if it is big enough):
#pragma omp parallel for
for(int ip=0; ip !=nparticle; ip++)
Also make sure you do not have any race conditions between threads (e.g. read-after-write hazards).
Advice: if you do not get a good speed-up, it often helps to have each thread process a chunk of iterations rather than a single one. For instance:
int num_threads = 1;
#pragma omp parallel
{
#pragma omp single
{
num_threads = omp_get_num_threads();
}
}
int chunkSize = 20; //Define your own chunk here
for (int position = 0; position < total; position+=(chunkSize*num_threads)) {
int endOfChunk = position + (chunkSize*num_threads);
if (endOfChunk > total) endOfChunk = total; // clamp the last block so it does not run past total
#pragma omp parallel for
for(int ip = position; ip < endOfChunk ; ip += chunkSize) {
//Code
}
}
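A simpler way to get chunked iterations, if it fits your case, is to let the runtime hand out the chunks through the schedule clause. The sketch below uses a made-up function, square_in_chunks, with a trivial loop body just to show the clause.

```cpp
#include <cassert>
#include <vector>

// Sketch: schedule(static, chunkSize) gives each thread blocks of
// chunkSize consecutive iterations, with no manual index arithmetic.
// square_in_chunks is a hypothetical name.
void square_in_chunks(std::vector<double>& data, int chunkSize)
{
    assert(chunkSize > 0);
    const int total = static_cast<int>(data.size());
    #pragma omp parallel for schedule(static, chunkSize)
    for (int ip = 0; ip < total; ++ip)
        data[ip] = data[ip] * data[ip];
}
```

If the iterations have very uneven cost, schedule(dynamic, chunkSize) is the usual alternative to try.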

OpenMP code C++ is slower than c++

I have the following piece of code. I run it on a sample of N=3000, and the sequential C++ code is faster by 3 seconds, which is not good at all.
This code fills the array jsd[N] with calculated values, and I want to find the maximum value and its position.
So:
1- Is this OpenMP conversion correct, and is there any suggestion to make it more professional?
2- Why is it slower than the equivalent sequential C++ code? Also, the more threads I create, the slower it gets.
Thanks in advance.
double maxval = 0;
int pos = -1;
double jsd[N];
#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; i++) {
double Hl = obj.function1(sequenceVctr, i, LEFT);
double Hr = obj.function1(sequenceVctr, i, RIGHT);
jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
if (jsd[i] >= maxval) {
#pragma omp critical
{
maxval = jsd[i];
pos = i;
}
}
} // for
Update:
Here is the new code. It is still slow, and it gets slower with more threads.
double maxval = 0;
int pos = -1;
double jsd[N];
#pragma omp parallel num_threads(50)
for (int i = 0; i < N; i++) {
double Hl = obj.function1(sequenceVctr, i, LEFT);
double Hr = obj.function1(sequenceVctr, i, RIGHT);
jsd[i]= obj.function2(H, i + 1, N, Hl, Hr);
} // for
#pragma omp master
{
vector<double> jsd2 (jsd,jsd+N);
vector<double>::iterator jsditer;
jsditer = std::max_element(jsd2.begin(), jsd2.end());
maxval=*jsditer;
pos=std::distance(jsd2.begin(),jsditer) ;
// cout<<"pos"<<pos<<endl;
}
#pragma omp barrier
The first optimization I would suggest is to compute all jsd values in the loop first, and then find the maximum element with std::max_element().
This way you are not forcing the threads to synchronize.
The second thing I would do is move from OpenMP to Intel TBB and use parallel_reduce().
But the biggest question is: how complex are the objective functions you are evaluating?
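A minimal sketch of the first suggestion, assuming the per-element computation is thread-safe. The score callable stands in for the function1/function2 calls from the question, and MaxResult and fill_and_find_max are made-up names. Note the pragma is parallel for: with a bare parallel, every thread would execute the whole loop.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical result type: maximum value and its index (-1 if empty).
struct MaxResult {
    double value;
    int pos;
};

// Sketch: fill jsd in parallel with no synchronization, then scan it once
// serially. score(i) is a stand-in for the poster's per-element work.
template <typename F>
MaxResult fill_and_find_max(std::vector<double>& jsd, F score)
{
    const int n = static_cast<int>(jsd.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        jsd[i] = score(i);  // each iteration writes only its own element

    MaxResult result = {0.0, -1};
    if (n > 0) {
        std::vector<double>::iterator it = std::max_element(jsd.begin(), jsd.end());
        result.value = *it;
        result.pos = static_cast<int>(it - jsd.begin());
    }
    return result;
}
```

With N=3000 the serial scan is negligible; if the parallel version is still slower, the per-element work is probably too small to pay for the thread start-up.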

Find max element in array OpenMP and PPL versions run much slower than serial code

I'm trying to implement two versions of a function that would find the max element in the array of floats. However, my parallel functions appeared to run much slower than the serial code.
With array of 4194304 (2048 * 2048) floats, I get the following numbers (in microseconds):
serial code: 9433
PPL code: 24184 (more than two times slower)
OpenMP code: 862093 (almost 100 times slower)
Here's the code:
PPL:
float find_largest_element_in_matrix_PPL(float* m, size_t dims)
{
float max_element;
int row, col;
concurrency::combinable<float> locals([] { return (float)INT_MIN; });
concurrency::parallel_for(size_t(0), dims * dims, [&locals](int curr)
{
float &localMax = locals.local();
localMax = max<float>(localMax, curr);
});
max_element = locals.combine([](float left, float right) { return max<float>(left, right); });
return max_element;
}
OpenMP:
float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
float max_value = 0.0;
int i, row, col, index;
#pragma omp parallel for private(i) shared(max_value, index)
for (i = 0; i < dims * dims; ++i)
{
#pragma omp critical
if (m[i] > max_value)
{
max_value = m[i];
index = i;
}
}
//row = index / dims;
//col = index % dims;
return max_value;
}
What's making the code run so slowly? Am I missing something?
Could you help me find out what I'm doing wrong?
So, as Baum mit Augen noticed, the problem with OpenMP was that I had a critical section around the whole loop body, so the code didn't actually run in parallel but serially.
Removing critical section did the trick.
As for PPL, I've found out that it does a lot more preparations (creating threads and stuff) than OpenMP does, hence the slowdown.
Update
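So, here's a corrected variant to find the max element with OpenMP (the critical section is still needed, but inside the if block, where the comparison has to be repeated because another thread may have updated max_value in the meantime):
float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
float max_value = 0.0;
int i, row, col, index;
#pragma omp parallel for
for (i = 0; i < dims * dims; ++i)
{
if (m[i] > max_value)
{
#pragma omp critical
{
// Re-check under the lock: max_value may have changed since the test above.
if (m[i] > max_value)
max_value = m[i];
}
}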
}
return max_value;
}
PS: not tested.
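The variant above still reads max_value outside the critical section while other threads may be writing it. A sketch that avoids this keeps a private maximum per thread and merges once per thread at the end. The name find_largest_element and the flat count parameter are mine; on compilers with OpenMP 3.1 or later, reduction(max:max_value) on the loop should give the same result with less code.

```cpp
#include <cassert>

// Sketch: per-thread local maximum, merged under a critical section once
// per thread. find_largest_element is a hypothetical name.
float find_largest_element(const float* m, int count)
{
    assert(m != nullptr && count > 0);
    float max_value = m[0];  // start from a real element, not 0.0
    #pragma omp parallel
    {
        float local_max = m[0];
        #pragma omp for nowait
        for (int i = 1; i < count; ++i) {
            if (m[i] > local_max)
                local_max = m[i];
        }
        // Executed once per thread, so contention is negligible.
        #pragma omp critical
        {
            if (local_max > max_value)
                max_value = local_max;
        }
    }
    return max_value;
}
```

Starting from m[0] instead of 0.0 also gives the right answer when all elements are negative.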

Parallel program using openMP

I am trying to calculate the integral of 4/(1+x^2) from 0 to 1 in c++ with multi-threading using openMP.
I took a serial program (which is correct) and changed it.
My idea is:
Assume that X is the number of threads.
Divide the area beneath the function into X parts, first from 0 to 1/X, 1/X to 2/X...
Each thread will calculate its area, and I will sum it all up.
This is how I implemented it:
// Number of threads to do the task
cout<<"Enter num of threads"<<endl;
int num_threads;
cin>>num_threads;
int i; double x,pi,sum=0.0;
step=1.0/(double)num_steps;
int steps_for_thread=num_steps/num_threads;
cout<<"Steps for thread : "<<steps_for_thread<<endl;
//Split to threads
omp_set_num_threads(num_threads);
#pragma omp parallel
{
int thread_id = omp_get_thread_num();
thread_id++;
if (thread_id == 1)
{
double sum1=0.0;
double x1;
for(i=0;i<num_steps/num_threads;i++)
{
x1=(i+0.5)*step;
sum1 = sum1+4.0/(1.0+x1*x1);
}
sum+=sum1;
}
else
{
double sum2=0.0;
double x2;
for(i=num_steps/thread_id;i<num_steps/(num_threads-thread_id+1);i++)
{
x2=(i+0.5)*step;
sum2 = sum2+4.0/(1.0+x2*x2);
}
sum+=sum2;
}
}
Explanation:
The i'th thread will calculate the area between i/n to (i+1)/n and add it to the sum.
The problem is that not only is the output wrong, but I also get a different output each time I run the program.
Any help is welcome.
Thanks
You're making this problem much harder than it needs to be. One of OpenMP's goals is to not have to change your serial code. You usually only need to add some pragma statements. So you should write the serial method first.
#include <stdio.h>
double pi(int n) {
int i;
double dx, sum = 0.0, x; // sum must start at 0: the reduction adds the partial sums to its initial value
dx = 1.0/n;
#pragma omp parallel for reduction(+:sum) private(x)
for(i=0; i<n; i++) {
x = i*dx;
sum += 1.0/(1+x*x);
}
sum *= 4.0/n;
return sum;
}
int main(void) {
printf("%f\n",pi(100000000));
}
Output: 3.141593
Notice that in the function pi the only difference between the serial code and the parallel version is the statement
#pragma omp parallel for reduction(+:sum) private(x)
You should also not normally worry about setting the number of threads.
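If you want to keep the midpoint rule from the question, with x = (i+0.5)*step, the same pattern applies. This is only a sketch and pi_midpoint is a made-up name. Declaring x inside the loop makes it private automatically, so no private clause is needed.

```cpp
#include <cassert>

// Sketch: midpoint-rule integration of 4/(1+x^2) on [0,1] with an OpenMP
// reduction. pi_midpoint is a hypothetical name.
double pi_midpoint(int n)
{
    assert(n > 0);
    const double dx = 1.0 / n;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        const double x = (i + 0.5) * dx;  // midpoint of the i-th strip
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * dx;
}
```

In the original attempt every thread added to the shared sum and used the shared loop counter i without any synchronization, which is why the result changed from run to run.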