Multi-threading a for loop with OpenMP - C++

I am writing my first OpenMP project. This is my work:
void myFooFunction(/* ... */) {
    int64_t Gm = 0;
    double* dist = (double*)middleManDouble;
    int64_t LengthofData = Frames * Height * Width;
    mexEvalString("tic");
    if (BitDepth == 10) {
        const unsigned __int16* src__int16 = (unsigned __int16*)middleMan;
        //#pragma omp parallel
        //#pragma omp for
        #pragma omp parallel for
        for (Gm = 0; Gm < LengthofData; ++Gm) {
            dist[Gm] = (double)(src__int16[Gm]);
        }
    }
    else if (BitDepth == 8) {
        const unsigned __int8* src__int8 = (unsigned __int8*)middleMan;
        //#pragma omp parallel
        // #pragma omp for
        #pragma omp parallel for
        for (Gm = 0; Gm < LengthofData; ++Gm) {
            dist[Gm] = (double)(src__int8[Gm]);
        }
    }
    mexEvalString("toc");
}
But I don't see any improvement in the execution time of the for loop, even though the utilization of all my CPU cores is above 95%. What is wrong with my code?
Am I using OpenMP correctly? I just want to execute the for loop on multiple threads.
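For reference, the same conversion loop as a standalone test (a minimal sketch outside the MEX context, using standard fixed-width types and omp_get_wtime for timing; the buffer size and fill value are made up):
#include <cstdint>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int64_t LengthofData = 64LL * 1024 * 1024;    // assumed size for the test
    std::vector<uint16_t> src(LengthofData, 512);       // stands in for middleMan
    std::vector<double>   dist(LengthofData);           // stands in for middleManDouble

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int64_t Gm = 0; Gm < LengthofData; ++Gm) {
        dist[Gm] = (double)src[Gm];                     // same element-wise widening conversion
    }
    double t1 = omp_get_wtime();

    std::printf("conversion took %.3f s\n", t1 - t0);
    return 0;
}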

atomic inside a single construct

In an OpenMP framework, suppose I have a series of tasks that should each be executed by a single thread. Each task is different, so I cannot fit them into a #pragma omp for construct. Inside each single construct, the task updates a variable shared by all tasks. How can I protect the update of such a variable?
A simplified example:
#include <vector>

struct A {
    std::vector<double> x, y, z;
};

int main()
{
    A r;
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i);
        // DANGER
        r.x = std::move(res);
    }
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i);
        // DANGER
        r.y = std::move(res);
    }
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i + 2);
        // DANGER
        r.z = std::move(res);
    }
    #pragma omp barrier
    return 0;
}
The code lines below // DANGER are problematic because they modify the memory contents of a shared variable.
In the example above it might still work without issues, because I am effectively modifying different members of r. Still, the problem remains: how can I make sure that tasks do not simultaneously update r? Is there a "sort-of" atomic pragma for the single construct?
There is no data race in your original code, because x, y, and z are different vectors in struct A (as already emphasized by @463035818_is_not_a_number), so in this respect you do not have to change anything in your code.
However, a #pragma omp parallel directive is missing, so at the moment it is a serial program. It should look like this:
#pragma omp parallel num_threads(3)
{
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i);
        // DANGER
        r.x = std::move(res);
    }
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i);
        // DANGER
        r.y = std::move(res);
    }
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i + 2);
        // DANGER
        r.z = std::move(res);
    }
}
In this case #pragma omp barrier is not necessary, as there is an implied barrier at the end of the parallel region. Note that I have used the num_threads(3) clause to make sure that only 3 threads are assigned to this parallel region. If you skip this clause, all the extra threads simply wait at that barrier.
In the case of an actual data race (i.e. more than one single region/section changes the same struct member), you can use #pragma omp critical (name) to rectify it. But keep in mind that this kind of serialization can negate the benefits of multithreading when there is not enough real parallel work besides the critical section.
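For illustration, a minimal sketch of that pattern, assuming both single regions really did append to the same member (r.x here); the region name shared_update is arbitrary:
#include <omp.h>
#include <vector>

struct A { std::vector<double> x, y, z; };

int main() {
    A r;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single nowait
        {
            std::vector<double> res;
            for (int i = 0; i < 10; ++i) res.push_back(i);
            #pragma omp critical(shared_update)   // serialize the writes to r.x
            r.x.insert(r.x.end(), res.begin(), res.end());
        }
        #pragma omp single nowait
        {
            std::vector<double> res;
            for (int i = 0; i < 10; ++i) res.push_back(i * i);
            #pragma omp critical(shared_update)   // same named region, so the two updates cannot overlap
            r.x.insert(r.x.end(), res.begin(), res.end());
        }
    }
    return 0;
}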
Note that a much better solution is to use #pragma omp sections (as suggested by @PaulG). If the number of tasks to run in parallel is known at compile time, sections are the typical choice in OpenMP:
#pragma omp parallel sections
{
    #pragma omp section
    {
        // Task 1 here
    }
    #pragma omp section
    {
        // Task 2
    }
    #pragma omp section
    {
        // Task 3
    }
}
For the record, I would like to show that it is easy to do it with #pragma omp for as well:
#pragma omp parallel for
for (int i = 0; i < 3; i++)
{
    if (i == 0)
    {
        // Task 1
    }
    else if (i == 1)
    {
        // Task 2
    }
    else if (i == 2)
    {
        // Task 3
    }
}
each task updates a variable shared by all tasks.
Actually, they don't. Consider rewriting the code like this (you don't need the temporary vectors):
void foo(std::vector<double>& x, std::vector<double>& y, std::vector<double>& z) {
    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            x.push_back(i);
    }
    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            y.push_back(i * i);
    }
    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            z.push_back(i * i + 2);
    }
    #pragma omp barrier
}
As long as the caller can ensure that x, y and z do not refer to the same object, there is no data race. Each part of the code modifies a separate vector. No synchronization is needed.
Now, it does not matter where those vectors come from. You can call the function like this:
A r;
foo(r.x, r.y, r.z);
PS: I am not familiar with OpenMP anymore; I assumed the annotations correctly do what you want them to do.

OpenMP: call parallel function from parallel region

I'm trying to make my serial program parallel with OpenMP. Here is the code, where I have a big parallel region with a number of internal "#pragma omp for" sections. In the serial version I have a function fftw_shift() which contains "for" loops as well.
The question is how to rewrite the fftw_shift() function properly, so that the threads that already exist in the outer parallel region can split the "for" loops inside it without spawning nested threads.
I'm not sure that my implementation works correctly. There is the option of inlining the whole function into the parallel region, but I'm trying to understand how to handle it in the described situation.
int fftw_shift(fftw_complex *pulse, fftw_complex *shift_buf, int array_size)
{
    int j = 0; // counter
    if ((pulse != nullptr) || (shift_buf != nullptr)) {
        if (omp_in_parallel()) {
            // shift the array
            #pragma omp for private(j) //schedule(dynamic)
            for (j = 0; j < array_size / 2; j++) {
                // left to right
                shift_buf[(array_size / 2) + j][REAL] = pulse[j][REAL]; // real
                shift_buf[(array_size / 2) + j][IMAG] = pulse[j][IMAG]; // imaginary
                // right to left
                shift_buf[j][REAL] = pulse[(array_size / 2) + j][REAL]; // real
                shift_buf[j][IMAG] = pulse[(array_size / 2) + j][IMAG]; // imaginary
            }
            // rewrite the array
            #pragma omp for private(j) //schedule(dynamic)
            for (j = 0; j < array_size; j++) {
                pulse[j][REAL] = shift_buf[j][REAL]; // real
                pulse[j][IMAG] = shift_buf[j][IMAG]; // imaginary
            }
            return 0;
        }
    }
....
#pragma omp parallel firstprivate(x, phase) if(array_size >= OMP_THREASHOLD)
{
    // First half-step
    #pragma omp for schedule(dynamic)
    for (x = 0; x < array_size; x++) {
        ..
    }
    // Forward FFTW
    fftw_shift(pulse_x, shift_buf, array_size);
    #pragma omp master
    {
        fftw_execute(dft);
    }
    #pragma omp barrier
    fftw_shift(pulse_kx, shift_buf, array_size);
    ...
}
If you call fftw_shift from a parallel region, but not from within a work-sharing construct (i.e. not inside a parallel for), then you can use omp for just as if the loop were written directly inside the parallel region. This is called an orphaned directive.
However, your loops just copy data, so depending on your system's memory bandwidth you should not expect a perfect speedup.
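A minimal standalone sketch of an orphaned omp for (the helper name scale_halves and the array size are made up for illustration):
#include <omp.h>
#include <cstdio>

// The omp for below is "orphaned": it binds to whatever parallel
// region is active when the function is called.
void scale_halves(double *data, int n)
{
    #pragma omp for
    for (int i = 0; i < n; i++)
        data[i] *= 0.5;
    // implicit barrier here, shared by all threads of the enclosing region
}

int main()
{
    double data[1000];
    for (int i = 0; i < 1000; i++) data[i] = i;

    #pragma omp parallel           // one team of threads for the whole region
    {
        scale_halves(data, 1000);  // the existing threads split the loop, no nesting
    }

    std::printf("data[999] = %f\n", data[999]);
    return 0;
}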

How to parallelize these two loops using OpenMP?

Is this the correct way to parallelize two for loops, using "#pragma omp single nowait" for the first and "#pragma omp for" for the second? Or is there another way to do it?
#pragma omp single nowait
{
    for (i = 0; i < N; i++)
    {
        D[i] = x * A[i] + x * B[i];
    }
    #pragma omp for
    for (i = 0; i < N; i++)
        C[i] = c * D[i];
}
} // end omp parallel
You should note that you can actually merge the two for loops into one, since for each i you compute D[i], and the later computation of C[i] depends only on D[i].
Combine the loops that way, and then just use omp for as you did for your second loop. For example:
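A minimal self-contained sketch of that suggestion (the size N, the coefficients x and c, and the array contents are made up here):
#include <omp.h>
#include <vector>

int main()
{
    const int N = 1000;                 // made-up size
    const double x = 2.0, c = 3.0;      // made-up coefficients
    std::vector<double> A(N, 1.0), B(N, 2.0), C(N), D(N);

    #pragma omp parallel for            // one fused loop replaces the single + for pair
    for (int i = 0; i < N; i++)
    {
        D[i] = x * A[i] + x * B[i];     // first loop's work
        C[i] = c * D[i];                // second loop's work uses the D[i] just computed
    }
    return 0;
}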

Parallelizing many nested for loops in OpenMP C++

Hi, I am new to C++ and I wrote code which runs, but it is slow because of many nested for loops. I want to speed it up with OpenMP; can anyone guide me? I tried to use '#pragma omp parallel' before the ip loop, and inside that loop I used '#pragma omp parallel for' before the it loop, but it does not work.
#pragma omp parallel
for (int ip = 0; ip != nparticle; ip++) {
    inf14 >> r >> xp >> yp >> zp;
    zp /= sqrt(gamma2);
    counter++;
    double para[7] = {0, 0, Vz, x0-xp, y0-yp, z0-zp, 0};
    if (ip >= 0 && ip <= 43) {
        #pragma omp parallel for
        for (int it = 0; it < NT; it++) {
            para[6] = PosT[it];
            for (int ix = 0; ix < NumX; ix++) {
                para[3] = PosX[ix] - xp;
                for (int iy = 0; iy < NumY; iy++) {
                    para[4] = PosY[iy] - yp;
                    for (int iz = 0; iz < NumZ; iz++) {
                        para[5] = PosZ[iz] - zp;
                        int position = it*NumX*NumY*NumZ + ix*NumY*NumZ + iy*NumZ + iz;
                        rotation(para, &Field[3*position]);
                        MagX[position] += chg*Field[3*position];
                        MagY[position] += chg*Field[3*position+1];
                        MagZ[position] += chg*Field[3*position+2];
                    }
                }
            }
        }
    }
}
My rotation function also contains an unbounded integration loop, as given below:
for (int i = 1; ; i++) {
    gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
    result += temp;
    if (abs(temp/result) < ACCURACY) {
        break;
    }
}
I am using the GSL libraries as well. How can I speed this up, or how should I apply OpenMP here?
If you don't have inter-loop dependences, you can use the collapse clause to parallelize multiple loops together. Example:
void scale(int N, int M, float A[N][M], float B[N][M], float alpha) {
    #pragma omp for collapse(2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            A[i][j] = alpha * B[i][j];
        }
    }
}
I suggest you check out the OpenMP C/C++ cheat sheet (PDF), which contains all the specifications for loop parallelization.
Do not put parallel pragmas inside another parallel pragma; you might overload the machine by creating more threads than it can handle. I would establish the parallelization at the outer loop (if it is big enough):
#pragma omp parallel for
for (int ip = 0; ip != nparticle; ip++)
Also make sure you do not have any race conditions between threads (e.g. RAW).
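To illustrate that warning with the accumulation pattern from the question (several particles adding into the same MagX/MagY/MagZ element once the outer loop is parallel), here is a minimal self-contained sketch with made-up sizes, a single Mag array standing in for the three field arrays, and #pragma omp atomic protecting the shared read-modify-write:
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int nparticle = 1000, nbins = 16;     // made-up sizes for illustration
    std::vector<double> Mag(nbins, 0.0);        // shared accumulation array
    double* mag = Mag.data();

    #pragma omp parallel for
    for (int ip = 0; ip < nparticle; ip++) {
        int position = ip % nbins;              // several particles map to the same bin
        double contribution = 0.5 * ip;         // stands in for chg * Field[...]
        #pragma omp atomic                      // protects the read-modify-write on mag[position]
        mag[position] += contribution;
    }

    std::printf("Mag[0] = %f\n", Mag[0]);
    return 0;
}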
Advice: if you do not get a great speed-up, a good practice is to iterate in chunks rather than one element at a time. For instance:
int num_threads = 1;
#pragma omp parallel
{
    #pragma omp single
    {
        num_threads = omp_get_num_threads();
    }
}
int chunkSize = 20; // Define your own chunk here
for (int position = 0; position < total; position += (chunkSize * num_threads)) {
    int endOfChunk = position + (chunkSize * num_threads);
    if (endOfChunk > total) endOfChunk = total; // don't run past the end on the last pass
    #pragma omp parallel for
    for (int ip = position; ip < endOfChunk; ip += chunkSize) {
        // Code
    }
}

Multithreading, OpenMP, C

I saw the piece of code below in a forum, but when I started to compile it I got some errors. I want to parallelize the area from #pragma scop up to #pragma endscop.
/* Main computational kernel. The whole function will be timed,
   including the call and return. */
static
void kernel_fdtd_2d(int tmax,
                    int nx,
                    int ny,
                    DATA_TYPE POLYBENCH_2D(ex,NX,NY,nx,ny),
                    DATA_TYPE POLYBENCH_2D(ey,NX,NY,nx,ny),
                    DATA_TYPE POLYBENCH_2D(hz,NX,NY,nx,ny),
                    DATA_TYPE POLYBENCH_1D(_fict_,TMAX,tmax))
{
    int t, i, j;
#pragma scop
    #pragma omp parallel private (t,i,j)
    {
        #pragma omp master
        {
            for (t = 0; t < _PB_TMAX; t++)
            {
                #pragma omp for
                for (j = 0; j < _PB_NY; j++)
                    ey[0][j] = _fict_[t];
                #pragma omp barrier
                #pragma omp for collapse(2) schedule(static)
                for (i = 1; i < _PB_NX; i++)
                    for (j = 0; j < _PB_NY; j++)
                        ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]);
                #pragma omp barrier
                #pragma omp for collapse(2) schedule(static)
                for (i = 0; i < _PB_NX; i++)
                    for (j = 1; j < _PB_NY; j++)
                        ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]);
                #pragma omp barrier
                #pragma omp for collapse(2) schedule(static)
                for (i = 0; i < _PB_NX - 1; i++)
                    for (j = 0; j < _PB_NY - 1; j++)
                        hz[i][j] = hz[i][j] - 0.7*(ex[i][j+1] - ex[i][j] + ey[i+1][j] - ey[i][j]);
                #pragma omp barrier
            }
        }
    }
#pragma endscop
}
int main(int argc, char** argv)
{
    /* Retrieve problem size. */
    int tmax = TMAX;
    int nx = NX;
    int ny = NY;

    /* Variable declaration/allocation. */
    POLYBENCH_2D_ARRAY_DECL(ex,DATA_TYPE,NX,NY,nx,ny);
    POLYBENCH_2D_ARRAY_DECL(ey,DATA_TYPE,NX,NY,nx,ny);
    POLYBENCH_2D_ARRAY_DECL(hz,DATA_TYPE,NX,NY,nx,ny);
    POLYBENCH_1D_ARRAY_DECL(_fict_,DATA_TYPE,TMAX,tmax);

    /* Initialize array(s). */
    init_array (tmax, nx, ny,
                POLYBENCH_ARRAY(ex),
                POLYBENCH_ARRAY(ey),
                POLYBENCH_ARRAY(hz),
                POLYBENCH_ARRAY(_fict_));

    /* Start timer. */
    polybench_start_instruments;

    /* Run kernel. */
    kernel_fdtd_2d (tmax, nx, ny,
                    POLYBENCH_ARRAY(ex),
                    POLYBENCH_ARRAY(ey),
                    POLYBENCH_ARRAY(hz),
                    POLYBENCH_ARRAY(_fict_));

    /* Stop and print timer. */
    polybench_stop_instruments;
    polybench_print_instruments;

    /* Prevent dead-code elimination. All live-out data must be printed
       by the function call in argument. */
    polybench_prevent_dce(print_array(nx, ny, POLYBENCH_ARRAY(ex),
                                      POLYBENCH_ARRAY(ey),
                                      POLYBENCH_ARRAY(hz)));

    /* Be clean. */
    POLYBENCH_FREE_ARRAY(ex);
    POLYBENCH_FREE_ARRAY(ey);
    POLYBENCH_FREE_ARRAY(hz);
    POLYBENCH_FREE_ARRAY(_fict_);

    return 0;
}
The errors are like:
stencils/fdtd-2d/fdtd-2dp.c:80:9: error: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
#pragma omp for
^
stencils/fdtd-2d/fdtd-2dp.c:83:9: error: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
#pragma omp barrier
^
stencils/fdtd-2d/fdtd-2dp.c:84:9: error: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
#pragma omp for collapse(2) schedule(static)
^
stencils/fdtd-2d/fdtd-2dp.c:88:9: error: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
#pragma omp barrier
^
stencils/fdtd-2d/fdtd-2dp.c:89:9: error: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
#pragma omp for collapse(2) schedule(static)
^
stencils/fdtd-2d/fdtd-2dp.c:93:9: error: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
#pragma omp barrier
^
stencils/fdtd-2d/fdtd-2dp.c:94:9: error: work-sharing region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
#pragma omp for collapse(2) schedule(static)
^
stencils/fdtd-2d/fdtd-2dp.c:98:9: error: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
#pragma omp barrier
^
Any help with how I can compile this would be appreciated.
Honestly, this is pretty poor OpenMP code. It does not consider data usage throughout the algorithm. What you probably want is:
int t, i, j;
#pragma omp parallel private (t,i,j)
{
    for (t = 0; t < _PB_TMAX; t++)
    {
        #pragma omp for nowait
        for (j = 0; j < _PB_NY; j++)
            ey[0][j] = _fict_[t];
        #pragma omp for collapse(2) nowait schedule(static)
        for (i = 1; i < _PB_NX; i++)
            for (j = 0; j < _PB_NY; j++)
                ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]);
        #pragma omp for collapse(2) schedule(static)
        for (i = 0; i < _PB_NX; i++)
            for (j = 1; j < _PB_NY; j++)
                ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]);
        // #pragma omp barrier <- implicit if nowait not specified
        #pragma omp for collapse(2) schedule(static)
        for (i = 0; i < _PB_NX - 1; i++)
            for (j = 0; j < _PB_NY - 1; j++)
                hz[i][j] = hz[i][j] - 0.7*(ex[i][j+1] - ex[i][j] + ey[i+1][j] - ey[i][j]);
        // #pragma omp barrier <- implicit if nowait not specified
    }
}
The explicit barriers should be removed: an implicit barrier already exists at the end of each omp for construct unless nowait is specified.
Furthermore, I believe the first two barriers can be removed entirely because there is no dependence between the first three loops: if a thread finishes its portion of one loop and immediately starts a portion of the next loop, there is no chance of a race condition. You can add the nowait clause to override the implicit barrier at the end of an omp for directive.
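In isolation, the nowait idea looks like this (a minimal sketch with made-up arrays a and b; the two loops touch disjoint data, so threads may flow from the first loop into the second without waiting):
#include <omp.h>
#include <vector>

int main()
{
    const int N = 1000;
    std::vector<double> a(N), b(N);

    #pragma omp parallel
    {
        #pragma omp for nowait   // no barrier: threads move on as soon as their chunk is done
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        #pragma omp for          // implicit barrier at the end of this loop
        for (int i = 0; i < N; i++)
            b[i] = i + 1.0;
    }
    return static_cast<int>(a[0] + b[0]); // keep the results observable
}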
Finally, if _PB_NX and _PB_NY are large-ish, then you are very unlikely to gain any benefit by collapsing the nested loops. I would imagine that removing the collapse(2) could slightly improve the performance of the overall function.
Hope this helps.
Remove the #pragma omp master statement from your code; that will fix the compilation issue. You probably don't want to run that block only in the master thread anyway, because then you would not get any performance benefit from using OpenMP.