OpenMP C++ - How to parallelize this function? - c++

I'd like to parallelize this function but I'm new with open mp and I'd be grateful if someone could help me :
void my_function(float** A,int nbNeurons,int nbOutput, float* p, float* amp){
float t=0;
for(int r=0;r<nbNeurons;r++){
t+=p[r];
}
for(int i=0;i<nbOutput;i++){
float coef=0;
for(int r=0;r<nbNeurons;r++){
coef+=p[r]*A[r][i];
}
amp[i]=coef/t;
}
}
I don't know how to parallelize it properly because of the double loop for, for the moment, I only thought about doing a :
#pragma omp parallel for reduction(+:t)
But I think it is not the best way to get the computing faster through openMp.
Thank in advance,

First of all: we need to know context. Where does your profiler tell you the most time is spent?
In general, coarse grained parallellization works best, so as #Alex said: parallellize the outer for loop.
void my_function(float** A,int nbNeurons,int nbOutput, float* p, float* amp)
{
float t=0;
for(int r=0;r<nbNeurons;r++)
t+=p[r];
#pragma parallel omp for
for(int i=0;i<nbOutput;i++){
float coef=0;
for(int r=0;r<nbNeurons;r++){
coef+=p[r]*A[r][i];
}
amp[i]=coef/t;
}
}
Depending on the actual volumes, it may be interesting to calculate t in the background, and move the division out of the parallel loop:
void my_function(float** A,int nbNeurons,int nbOutput, float* p, float* amp)
{
float t=0;
#pragma omp parallel shared(amp)
{
#pragma omp single nowait // only a single thread executes this
{
for(int r=0;r<nbNeurons;r++)
t+=p[r];
}
#pragma omp for
for(int i=0;i<nbOutput;i++){
float coef=0;
for(int r=0;r<nbNeurons;r++){
coef+=p[r]*A[r][i];
}
amp[i]=coef;
}
#pragma omp barrier
#pragma omp master // only a single thread executes this
{
for(int i=0; i<nbOutput; i++){
amp[i] /= t;
}
}
}
}
Note untested code. OMP has tricky semantics sometimes, so I might have missed a 'shared' declaration there. Nothing a profiler won't quickly notify you about, though.

Related

openmp tasks in nested loops

I am trying to write the following piece of code.
#pragma omp parallel
{
int .... some variables
for (int x:map){
int ...
#pragma omp single
{
#pragma omp task firstprivate(x,..) depend(out:a)
{
assigning the variables some values
}
for (int loop over j)
{
#pragma omp task firstprivate(j) depend (in:a) depend (out:b)
{
}
third loop over k
#pragma omp task depend(in:a,b)
{
}
}
}
}
Is this valid? The threads are getting formed but they are not getting into either (the loop over j i.e. the second loop) or the 3rd loop (I checked by print statements).
Please suggest how to correct this.
I tried printing inside the 2 loops and saw that nothing is getting printed which implies the threads don't enter into the loops at all. I was expecting that the work would be distributed among the threads but unfortunately I am unable to achieve this.
Minimum reproducible example:
(as asked in the comments I am making an example)
int a,b;
#pragma omp parallel
{
#pragma omp single
{for (int &x:map)
{
#pragma omp task
for(int i=0;i<x.second;++i)
{
vector<int> val = m2[i];
for (int j=0;j<val.size();++j)
{
#pragma omp critical
update a global map m3.
}
}
}
}

atomic inside a single construct

In an openMP framework, suppose I have a series of tasks that should be done by a single task. Each task is different, so I cannot fit into a #pragma omp for construct. Inside the single construct, each task updates a variable shared by all tasks. How can I protect the update of such a variable?
A simplified example:
#include <vector>
struct A {
std::vector<double> x, y, z;
};
int main()
{
A r;
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i);
// DANGER
r.x = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i);
// DANGER
r.y = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i + 2);
// DANGER
r.z = std::move(res);
}
#pragma omp barrier
return 0;
}
The code lines below // DANGER are problematic because they modify the memory contents of a shared variable.
In the example above, it might be that it still works without issues, because I am effectively modifying different members of r. Still the problem is: how can I make sure that tasks do not simultaineusly update r? Is there a "sort-of" atomic pragma for the single construct?
There is no data race in your original code, because x,y, and z are different vectors in struct A (as already emphasized by #463035818_is_not_a_number), so in this respect you do not have to change anything in your code.
However, a #pragma omp parallel directive is missing in your code, so at the moment it is a serial program. So, it should look like this:
#pragma omp parallel num_threads(3)
{
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i);
// DANGER
r.x = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i);
// DANGER
r.y = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i + 2);
// DANGER
r.z = std::move(res);
}
}
In this case #pragma omp barrier is not necessary as there is an implied barrier at the end of parallel region. Note that I have used num_threads(3) clause to make sure that only 3 threads are assigned to this parallel region. If you skip this clause then all other threads just wait at the barrier.
In the case of an actual data race (i.e. more than one single region/section changes the same struct member), you can use #pragma omp critical (name) to rectify this. But keep in mind that this kind of serialization can negate the benefits of multithreading when there is not enough real parallel work beside the critical section.
Note that, a much better solution is to use #pragma omp sections (as suggested by #PaulG). If the number of tasks to run parallel is known at compile time sections are the typical choice in OpenMP:
#pragma omp parallel sections
{
#pragma omp section
{
//Task 1 here
}
#pragma omp section
{
//Task 2
}
#pragma omp section
{
// Task 3
}
}
For the record, I would like to show that it is easy to do it by #pragma omp for as well:
#pragma omp parallel for
for(int i=0;i<3;i++)
{
if (i==0)
{
// Task 1
} else if (i==1)
{
// Task 2
}
else if (i==2)
{
// Task 3
}
}
each task updates a variable shared by all tasks.
Actually they don't. Consider you rewrite the code like this (you don't need the temporary vectors):
void foo( std::vector<double>& x, std::vector<double>& y, std::vector<double>& z) {
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
x.push_back(i);
}
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
y.push_back(i * i);
}
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
z.push_back(i * i + 2);
}
#pragma omp barrier
}
As long as the caller can ensure that x, y and z do not refer to the same object, there is no data race. Each part of the code modifies a seperate vector. No synchronization needed.
Now, it does not matter where those vectors come from. You can call the function like this:
A r;
foo(r.x, r.y, r.z);
PS: I am not familiar with omp anymore, I assumed the annotations correctly do what you want them to do.

Parallelizing many nested for loops in openMP c++

Hi i am new to c++ and i made a code which runs but it is slow because of many nested for loops i want to speed it up by openmp anyone who can guide me. i tried to use '#pragma omp parallel' before ip loop and inside this loop i used '#pragma omp parallel for' before it loop but it does not works
#pragma omp parallel
for(int ip=0; ip !=nparticle; ip++){
inf14>>r>>xp>>yp>>zp;
zp/=sqrt(gamma2);
counter++;
double para[7]={0,0,Vz,x0-xp,y0-yp,z0-zp,0};
if(ip>=0 && ip<=43){
#pragma omp parallel for
for(int it=0;it<NT;it++){
para[6]=PosT[it];
for(int ix=0;ix<NumX;ix++){
para[3]=PosX[ix]-xp;
for(int iy=0;iy<NumY;iy++){
para[4]=PosY[iy]-yp;
for(int iz=0;iz<NumZ;iz++){
para[5]=PosZ[iz]-zp;
int position=it*NumX*NumY*NumZ+ix*NumY*NumZ+iy*NumZ+iz;
rotation(para,&Field[3*position]);
MagX[position] +=chg*Field[3*position];
MagY[position] +=chg*Field[3*position+1];
MagZ[position] +=chg*Field[3*position+2];
}
}
}
}
}
}enter code here
and my rotation function also has infinite integration for loop as given below
for(int i=1;;i++){
gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
result+=temp;
if(abs(temp/result)<ACCURACY){
break;
}
}
i am using gsl libraries as well. so how to speed up this process or how to make openmp?
If you don't have inter-loop dependences, you can use the collapse keyword to parallelize multiple loops altoghether. Example:
void scale( int N, int M, float A[N][M], float B[N][M], float alpha ) {
#pragma omp for collapse(2)
for( int i = 0; i < N; i++ ) {
for( int j = 0; j < M; j++ ) {
A[i][j] = alpha * B[i][j];
}
}
}
I suggest you to check out the OpenMP C/C++ cheat sheet (PDF), which contain all the specifications for loop parallelization.
Do not set parallel pragmas inside another parallel pragma. You might overhead the machine creating more threads than it can handle. I would establish the parallelization in the outter loop (if it is big enough):
#pragma omp parallel for
for(int ip=0; ip !=nparticle; ip++)
Also make sure you do not have any race condition between threads (e.g. RAW).
Advice: if you do not get a great speed-up, a good practice is iterating by chunks and not only by one increment. For instance:
int num_threads = 1;
#pragma omp parallel
{
#pragma omp single
{
num_threads = omp_get_num_threads();
}
}
int chunkSize = 20; //Define your own chunk here
for (int position = 0; position < total; position+=(chunkSize*num_threads)) {
int endOfChunk = position + (chunkSize*num_threads);
#pragma omp parallel for
for(int ip = position; ip < endOfChunk ; ip += chunkSize) {
//Code
}
}

openmp latency for inside for

I have a piece of code that i want to parallelize and the openmp program is much slower than the serial version, so what is wrong with my implementation?. This is the code of the program
#include <iostream>
#include <gsl/gsl_math.h>
#include "Chain.h"
using namespace std;
int main(){
int const N=1000;
int timeSteps=100;
double delta=0.0001;
double qq[N];
Chain ch(N);
ch.initCond();
for (int t=0; t<timeSteps; t++){
ch.changeQ(delta*t);
ch.calMag_i();
ch.calForce001();
}
ch.printSomething();
}
The Chain.h is
class Chain{
public:
int N;
double *q;
double *mx;
double *my;
double *force;
Chain(int const Np);
void initCond();
void changeQ(double delta);
void calMag_i();
void calForce001();
};
And the Chain.cpp is
Chain::Chain(int const Np){
this->N = Np;
this->q = new double[Np];
this->mx = new double[Np];
this->my = new double[Np];
this->force = new double[Np];
}
void Chain::initCond(){
for (int i=0; i<N; i++){
q[i] = 0.0;
force[i] = 0.0;
}
}
void Chain::changeQ(double delta){
int i=0;
#pragma omp parallel
{
#pragma omp for
for (int i=0; i<N; i++){
q[i] = q[i] + delta*i + 1.0*i/N;
}
}
}
void Chain::calMag_i(){
int i =0;
#pragma omp parallel
{
#pragma omp for
for (i=0; i<N; i++){
mx[i] = cos(q[i]);
my[i] = sin(q[i]);
}
}
}
void Chain::calForce001(){
int i;
int j;
double fij =0.0;
double start_time = omp_get_wtime();
#pragma omp parallel
{
#pragma omp for private(j, fij)
for (i=0; i<N; i++){
force[i] = 0.0;
for (j=0; j<i; j++){
fij = my[i]*mx[j] - mx[i]*my[j];
#pragma omp critical
{
force[i] += fij;
force[j] += -fij;
}
}
}
}
double time = omp_get_wtime() - start_time;
cout <<"time = " << time <<endl;
}
So the methods changeQ() and calMag_i() are in fact faster than the serial code, but my problem is the calForce001(). The execution time are:
with openMP 3.939s
without openMP 0.217s
Now, clearly i'm doing something wrong or the code can't be parallelize. Please any help with be usefull.
Thanks in advance.
Carlos
Edit:
In order to clarify the question i add the functions omp_get_wtime() to calculate the execution time for the function calForce001() and the times for one execution are
with omp :0.0376656
without omp: 0.00196766
So with omp method is 20 times slower.
Otherwise, i'm also calculate the time for the calMag_i() method
with omp: 3.3845e-05
without omp: 9.9516e-05
for this method omp is 3 times faster.
I hope this confirm that the latency problem is in the calForce001() method.
There are three reasons why you don't benefit from any speedup.
you have #pragma omp parallel all over your code. What this pragma does, is start the "team of threads". At the end of the block, this team is disbanded. This is quite costly. Removing those and using #pragma omp parallel for instead of #pragma omp for will start the team upon first encounter and put it to sleep after each block. This made the application 4x faster for me.
you use #pragma omp critical. On most platforms, this will force the use of a mutex - which is heavily contended because all threads want to write to that variable at the same time. So, don't use a critical section here. You could use atomic updates, but in this case, that won't make much of a difference - see third item. Just removing the critical section improved the speed by another 3x.
Parallelism only makes sense when you have an actual workload. All of your code is too small to benefit from parallelism. There's simply too little workload to win back the time lost on starting/waking/destroying the threads. If your workload would be ten times this, some of the parallel for statements would make sense. But especially Chain::calForce001() will never be worth it if you have to do atomic updates.
With respect to programming style: you're programming in C++. Please use local scope variables wherever you can - in e.g. Chain::calForce001(), use a local double fij inside the inner loop. That saves you from having to write private clauses. Compilers are smart enough to optimize that. Correct scoping allows for better optimizations.

OpenMP: having a complete 'for' loop into each thread

I have this code:
#pragma omp parallel
{
#pragma omp single
{
for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
}
#pragma omp single
{
for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
}
}
// and so on... up to 5 or 6 of myObject_x
// Then I sum up the buffers and do something with them
float result;
for (int i=0; i<given_number; ++i)
result = myBuffer_1[i] + myBuffer_2[i];
// do something with result
If I run this code, I get what I expect but the CPU usage looks quite high. Instead, if I run it normally without OpenMP I get the same results but the CPU usage is much lower, despite running in a single thread.
I don't want to specify a number of threads, I wish the program pick the max number of threads according to the CPU capabilities, but I want that each for loop runs entirely in its own thread. How can I do that?
Also, my expectation is that the for loop for myBuffer_1 runs a thread, the other for loop runs another thread, and the rest runs in the 'master' thread. Is this correct?
#pragma omp single has an implicit barrier at the end, you need to use #pragma omp single nowait if you want the two single block run concurrently.
However, for your requirement, using section might be a better idea
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
}
#pragma omp section
{
for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
}
}
}