Parallelization with OpenMP: shared and critical clauses - C++

Below is a portion of code parallelized via OpenMP. The arrays ap[] and sc[] are subject to addition assignment (+=), so I decided to make them shared and then put the updates in a critical section, since the reduction clause does not accept arrays. But it gives a different result than its serial counterpart. Where is the problem?
Vector PN, Pf, Nf; // Vector is user-defined structure
Vector NNp, PPp;
Vector gradFu, gradFv, gradFw;
float dynVis_eff, SGSf;
float Xf_U, Xf_H;
float mf_P, mf_N;
float an_diff, an_conv_P, an_conv_N, an_trans;
float sc_cd, sc_pres, sc_trans, sc_SGS, sc_conv_P, sc_conv_N;
float ap_trans;
#pragma omp parallel for
for (int e=0; e<nElm; ++e)
{
ap[e] = 0.f;
sc[e] = 0.f;
}
#pragma omp parallel for shared(ap,sc)
for (int f=0; f<nFaces; ++f)
{
PN = cntE[face_N[f]] - cntE[face_P[f]];
Pf = cntF[f] - cntE[face_P[f]];
Nf = cntF[f] - cntE[face_N[f]];
PPp = Pf - (Pf|norm(PN))*norm(PN);
NNp = Nf - (Nf|norm(PN))*norm(PN);
mf_P = mf[f];
mf_N = -mf[f];
SGSf = (1.f-ifac[f]) * SGSvis[face_P[f]]
+ ifac[f] * SGSvis[face_N[f]];
dynVis_eff = dynVis + SGSf;
an_diff = dynVis_eff * Ad[f] / mag(PN);
an_conv_P = -neg(mf_P);
an_conv_N = -neg(mf_N);
an_P[f] = an_diff + an_conv_P;
an_N[f] = an_diff + an_conv_N;
// cross-diffusion
sc_cd = an_diff * ( (gradVel[face_N[f]]|NNp) - (gradVel[face_P[f]]|PPp) );
#pragma omp critical
{
ap[face_P[f]] += an_N[f];
ap[face_N[f]] += an_P[f];
sc[face_P[f]] += sc_cd + sc_conv_P;
sc[face_N[f]] += -sc_cd + sc_conv_N;
}
}

You have not declared whether all the other variables in your parallel region should be shared or not. You can do this generically with the default clause. If no default is specified, the variables are all shared, which is causing the problems in your code.
In your case, I'm guessing you should go for
#pragma omp parallel for default(none), shared(ap,sc,face_N,face_P,cntF,cntE,mf,ifac,Ad,an_P,an_N,SGSvis,dynVis,gradVel), private(PN,Pf,Nf,PPp,NNp,mf_P,mf_N,SGSf,dynVis_eff,an_diff,an_conv_P,an_conv_N,sc_cd,sc_conv_P,sc_conv_N)
I strongly recommend always using default(none) so that the compiler complains every time you don't declare a variable explicitly and forces you to think about each one.
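As a side note: if your compiler supports OpenMP 4.5 or later, array sections are accepted in the reduction clause, which would let you drop the critical section entirely. A minimal sketch, assuming ap and sc are plain float arrays (or pointers) of length nElm:
#pragma omp parallel for reduction(+ : ap[0:nElm], sc[0:nElm])
for (int f = 0; f < nFaces; ++f)
{
    // ... same face-by-face computations as above ...
    ap[face_P[f]] += an_N[f];
    ap[face_N[f]] += an_P[f];
    sc[face_P[f]] += sc_cd + sc_conv_P;
    sc[face_N[f]] += -sc_cd + sc_conv_N;   // no critical section needed
}
Each thread then works on its own private copies of ap and sc, which are summed into the originals when the loop ends; for large nElm this costs extra memory per thread, so it is worth measuring against the critical-section version.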

Related

How to solve dependencies in for loop in a multithread code?

I have trouble resolving dependencies in a for loop using OpenMP so that the program will execute faster. This is how I did it and it works, but I need a faster solution. Does anybody know how to do this so it will run faster?
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(openSet, maxVal, current, fScores)
for(i = 0;i < openSet.size();i++){
if(fScores[openSet[i].x * dim + openSet[i].y] < maxVal){
#pragma omp ordered
maxVal = fScores[openSet[i].x * dim + openSet[i].y];
current = openSet[i];
}
}
and the second for loop is this one:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(neighbours, openSet, gScores, fScores, tentative_gScore)
for(i = 0;i < neighbours.size();i++){
#pragma omp ordered
tentative_gScore = gScores[current.x * dim + current.y] + 1;
if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore(); //(p.x, p.y, xEnd, yEnd)
if(contains(openSet, neighbours[i]) == false){
openSet.push_back(neighbours[i]);
}
}
}
EDIT: I didn't mention what I was even doing here. I was implementing the A* algorithm, and this code is from Wikipedia. Plus, I want to add 2 more variables so I don't confuse anyone.
PAIR current = {};
int maxVal = INT32_MAX;
First of all, you need to make sure that this is your hot spot. Then use a proper test suite in order to make sure that you actually gain performance. Use a tool such as google_benchmark. Make sure you compiled in release mode, otherwise your measurements are completely spoiled.
This said, I think you are looking for the max reduction
#pragma omp parallel for reduction(max : maxVal )
for(i = 0;i < openSet.size();i++){
if(fScores[openSet[i].x * dim + openSet[i].y] > maxVal){
maxVal = fScores[openSet[i].x * dim + openSet[i].y];
}
}
current seems to be superfluous. I think the comparison has been mixed up.
Can you access the data in fScores in a linear fashion? You will have a lot of cache misses using the indirection over openSet. If you can get rid of this indirection somehow, you will get a large speedup in single- and multi-threaded scenarios.
In the second loop the push_back will spoil your performance. I had a similar problem. For me it was very beneficial to (see the sketch after this list):
create a vector with the maximal possible length,
initialise it with an empty value,
set it properly using OpenMP where a criterion was fulfilled, and
check for the empty value when using the vector.
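A rough sketch of that pattern (criterion, compute, use and EMPTY are placeholders, not names from the question):
std::vector<int> out(neighbours.size(), EMPTY); // maximal possible length, pre-filled with an empty value
#pragma omp parallel for
for (int i = 0; i < (int)neighbours.size(); i++) {
    if (criterion(i))
        out[i] = compute(i); // each thread writes only its own slot, no push_back
}
for (int v : out)
    if (v != EMPTY)
        use(v); // skip the empty slots when consuming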
It seems to me that you misunderstood what the OpenMP ordered clause actually does. From the OpenMP Standard one can read:
The ordered construct either specifies a structured block in a worksharing-loop, simd, or worksharing-loop SIMD region that will be executed in the order of the loop iterations, or it is a stand-alone directive that specifies cross-iteration dependences in a doacross loop nest. The ordered construct sequentializes and orders the execution of ordered regions while allowing code outside the region to run in parallel.
or more informally:
The ordered clause works like this: different threads execute concurrently until they encounter the ordered region, which is then executed sequentially in the same order as it would get executed in a serial loop.
Based on the way you have used it, it seems that you have mistaken the ordered clause for the OpenMP critical clause:
The critical construct restricts execution of the associated structured block to a single thread at a time.
Therefore, with the ordered clause your code is basically running sequentially, with the additional overhead of the parallelism. Nevertheless, even if you had used the critical construct instead, the overhead would be too high, since threads would be locking on every loop iteration.
At first glance, for the first loop you could use the OpenMP reduction clause (i.e., reduction(max:maxVal)), about which the standard says:
The reduction clause can be used to perform some forms of recurrence calculations (...) in parallel. For parallel and work-sharing constructs, a private copy of each list item is created, one for each implicit task, as if the private clause had been used. (...) The private copy is then initialized as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the combiner of the specified reduction-identifier.
For a more detailed explanation of how the reduction clause works, have a look at this SO thread.
Notwithstanding, you are updating two variables, namely maxVal and current, which makes it harder to solve those dependencies with the reduction clause alone. Nonetheless, one approach is to create a data structure shared among the threads, where each thread updates a given position of that shared structure. At the end of the parallel region, the master thread updates the original values of maxVal and current accordingly.
So instead of:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(openSet, maxVal, current, fScores)
for(i = 0;i < openSet.size();i++){
if(fScores[openSet[i].x * dim + openSet[i].y] < maxVal){ // <-- you meant '>' not '<'
#pragma omp ordered
maxVal = fScores[openSet[i].x * dim + openSet[i].y];
current = openSet[i];
}
}
you could try the following:
int shared_maxVal[tc] = {INT32_MAX};
PAIR shared_current[tc] = {};
#pragma omp parallel num_threads(tc) shared(openSet, fScores)
{
int threadID = omp_get_thread_num();
#pragma omp for
for(int i = 0;i < openSet.size();i++){
if(fScores[openSet[i].x * dim + openSet[i].y] > shared_maxVal[threadID]){
shared_maxVal[threadID] = fScores[openSet[i].x * dim + openSet[i].y];
shared_current[threadID] = openSet[i];
}
}
}
for(int i = 0; i < tc; i++){
if(maxVal < shared_maxVal[i]){
maxVal = shared_maxVal[i];
current = shared_current[i];
}
}
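An alternative worth mentioning, if your compiler supports OpenMP 4.0 or later: a user-defined reduction can carry the score and the element together, so no per-thread arrays are needed. This is only a sketch, assuming PAIR is default-constructible and copyable, and using the '<' comparison from the original code (switch it to '>' if the maximum is really what is wanted):
struct Best { int val; PAIR elem; };
#pragma omp declare reduction(bestmin : Best : \
    omp_out = (omp_in.val < omp_out.val ? omp_in : omp_out)) \
    initializer(omp_priv = Best{INT32_MAX, PAIR{}})
Best best{INT32_MAX, PAIR{}};
#pragma omp parallel for reduction(bestmin : best)
for (int i = 0; i < (int)openSet.size(); i++) {
    int f = fScores[openSet[i].x * dim + openSet[i].y];
    if (f < best.val) { best.val = f; best.elem = openSet[i]; }
}
maxVal = best.val;
current = best.elem;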
For your second loop:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(neighbours, openSet, gScores, fScores, tentative_gScore)
for(i = 0;i < neighbours.size();i++){
#pragma omp ordered
tentative_gScore = gScores[current.x * dim + current.y] + 1;
if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore(); //(p.x, p.y, xEnd, yEnd)
if(contains(openSet, neighbours[i]) == false){
openSet.push_back(neighbours[i]);
}
}
}
Some of the aforementioned advice still holds. Moreover, do not make the variable tentative_gScore shared among threads; otherwise, you need to guarantee mutual exclusion on accesses to that variable. As it is, your code has a race condition: threads may update the variable tentative_gScore while other threads are reading it. Simply declare the tentative_gScore variable inside the loop so that it is private to each thread.
Assuming that different threads cannot access the same positions of the arrays cameFrom, gScores and fScores, the next thing you need to do is to create an array of openSets, and assign each position of that array to a different thread. In this manner, threads can update their respective positions without having to use any synchronization mechanism.
At the end of the parallel region, merge the shared structure into the same (original) openSet.
Your second loop might look like the following:
// Create an array of "openSets"; let us name it "shared_openSet"
#pragma omp parallel num_threads(tc) shared(neighbours, gScores, fScores)
{
int threadID = omp_get_thread_num();
#pragma omp for
for(int i = 0;i < neighbours.size();i++){
// I just assume the type int, but you can change it to the real type
int tentative_gScore = gScores[current.x * dim + current.y] + 1;
if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore();
if(contains(openSet, neighbours[i]) == false){
shared_openSet[threadID].push_back(neighbours[i]);
}
}
}
}
// merge all the elements from shared_openSet into openSet.
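That merge step could be as simple as the following, assuming shared_openSet is an array (of length tc) of containers of the same type as openSet:
for (int t = 0; t < tc; t++) {
    openSet.insert(openSet.end(),
                   shared_openSet[t].begin(),
                   shared_openSet[t].end());
}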

Safe explicit vectorization of a seemingly simple loop

New here, hoping you can help. I am attempting to explicitly vectorize both of the for loops in the member function code below, as they are the main runtime bottleneck and auto-vectorization does not work due to dependencies. For the life of me, however, I cannot find a "safe" clause/reduction.
PSvector& RTMap::Apply(PSvector& X) const
{
PSvector Y(0);
double k=0;
// linear map
#pragma omp simd
for(RMap::const_itor r = rterms.begin(); r != rterms.end(); r++)
{
Y[r->i] += r->val * X[r->j];
}
// non-linear map
#pragma omp simd
for(const_itor t = tterms.begin(); t != tterms.end(); t++)
{
Y[t->i] += t->val * X[t->j] * X[t->k];
}
Y.location() = X.location();
Y.type() = X.type();
Y.id() = X.id();
Y.sd() = X.sd();
return X = Y;
}
Note that the #pragmas as written don't work because of race conditions. Is there a method of declaring reduction which could work? I tried something like:
#pragma omp declare reduction(+:PSvector:(*omp_out.getvec())+=(*omp_in.getvec()))
which compiles (icpc) but seems to produce nonsense.
/cheers
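One detail worth checking, offered only as a guess: without an initializer clause, each private copy of a class-type reduction variable is default-constructed, which for PSvector may not be a zero vector of the right size, and that alone can produce nonsense results. The usual shape would be something like the following sketch, reusing the combiner from the question and assuming PSvector(0) yields a properly sized zero vector:
#pragma omp declare reduction(psadd : PSvector : \
    (*omp_out.getvec()) += (*omp_in.getvec())) \
    initializer(omp_priv = PSvector(0))
// then: #pragma omp simd reduction(psadd : Y) on the loops above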

Why is my C code slower using OpenMP?

I'm trying to do multi-threaded programming on a CPU using OpenMP. I have lots of for loops which are good candidates to be parallelized. I attached a part of my code here. When I use the first #pragma omp parallel for reduction, my code is faster, but when I try to use the same directive to parallelize other loops it gets slower. Does anyone have any idea why?
.
.
.
omp_set_dynamic(0);
omp_set_num_threads(4);
float *h1=new float[nvi];
float *h2=new float[npi];
while(tol>0.001)
{
std::fill_n(h2, npi, 0);
int k,i;
float h222=0;
#pragma omp parallel for private(i,k) reduction (+: h222)
for (i=0;i<npi;++i)
{
int p1=ppi[i];
int m = frombus[p1];
for (k=0;k<N;++k)
{
h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
+ B[m-1][k]*sin(del[m-1]-del[k]));
}
h2[i]=h222;
}
//*********** h3*****************
std::fill_n(h3, nqi, 0);
float h333=0;
#pragma omp parallel for private(i,k) reduction (+: h333)
for (int i=0;i<nqi;++i)
{
int q1=qi[i];
int m = frombus[q1];
for (int k=0;k<N;++k)
{
h333 += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
- B[m-1][k]*cos(del[m-1]-del[k]));
}
h3[i]=h333;
}
.
.
.
}
I don't think your OpenMP code gives the same result as the version without OpenMP. Let's just concentrate on the h2[i] part of the code (since the h3[i] part has the same logic). There is a dependency of h2[i] on the index i (i.e. h2[1] = h2[1] + h2[0]), so the OpenMP reduction you're doing won't give the correct result. If you want to do the reduction with OpenMP you need to do it on the inner loop, like this:
float h222 = 0;
for (int i=0; i<npi; ++i) {
int p1=ppi[i];
int m = frombus[p1];
#pragma omp parallel for reduction(+:h222)
for (int k=0;k<N; ++k) {
h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
+ B[m-1][k]*sin(del[m-1]-del[k]));
}
h2[i] = h222;
}
However, I don't know if that will be very efficient. An alternative method is to fill h2[i] in parallel on the outer loop without a reduction and then take care of the dependency in serial. Even though the serial loop is not parallelized, it should have only a small effect on the computation time since it does not have the inner loop over k. This should give the same result with and without OpenMP and still be fast.
#pragma omp parallel for
for (int i=0; i<npi; ++i) {
int p1=ppi[i];
int m = frombus[p1];
float h222 = 0;
for (int k=0;k<N; ++k) {
h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
+ B[m-1][k]*sin(del[m-1]-del[k]));
}
h2[i] = h222;
}
//take care of the dependency serially
for(int i=1; i<npi; i++) {
h2[i] += h2[i-1];
}
Keep in mind that creating and destroying threads is a time-consuming process; clock the execution time of the process and see for yourself. You only use the parallel reduction twice, which may be faster than a serial reduction; however, the initial cost of creating the threads may still be higher. Try parallelizing the outermost loop (if possible) to see if you can obtain a speedup.
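If thread-creation overhead is indeed the issue, one further option is to open a single parallel region per iteration of the while loop and put both worksharing loops inside it, so the same thread team is reused. A sketch using the variable names from the question, and assuming the dependency on previous i has already been dealt with as above:
#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < npi; ++i) {
        int m = frombus[ppi[i]];
        float acc = 0;
        for (int k = 0; k < N; ++k)
            acc += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                 + B[m-1][k]*sin(del[m-1]-del[k]));
        h2[i] = acc; // per-i sum only; prefix-sum serially afterwards if needed
    }
    #pragma omp for
    for (int i = 0; i < nqi; ++i) {
        int m = frombus[qi[i]];
        float acc = 0;
        for (int k = 0; k < N; ++k)
            acc += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
                 - B[m-1][k]*cos(del[m-1]-del[k]));
        h3[i] = acc;
    }
}
The nowait on the first loop is safe here because the second loop does not read h2. Note that many OpenMP runtimes already keep the thread pool alive between parallel regions, so measure before and after.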

OpenMP and #pragma omp atomic

I have an issue with OpenMP. The MSVS compiler gives me a "pragma omp atomic has improper form" error.
I don't have any idea why.
Code: (program appoints PI number using integrals method)
#include <stdio.h>
#include <time.h>
#include <omp.h>
long long num_steps = 1000000000;
double step;
int main(int argc, char* argv[])
{
clock_t start, stop;
double x, pi, sum=0.0;
int i;
step = 1./(double)num_steps;
start = clock();
#pragma omp parallel for
for (i=0; i<num_steps; i++)
{
x = (i + .5)*step;
#pragma omp atomic //this part contains error
sum = sum + 4.0/(1.+ x*x);
}
pi = sum*step;
stop = clock();
// some printf to show results
return 0;
}
Your program is perfectly syntactically correct OpenMP code by the current OpenMP standards (e.g. it compiles unmodified with GCC 4.7.1), except that x should be declared private (which is not a syntactic but rather a semantic error). Unfortunately Microsoft Visual C++ implements a very old OpenMP specification (2.0 from March 2002), which only allows the following statements as acceptable in an atomic construct:
x binop= expr
x++
++x
x--
--x
Later versions included x = x binop expr, but MSVC is forever stuck at OpenMP version 2.0 even in VS2012. Just for comparison, the current OpenMP version is 3.1 and we expect 4.0 to come up in the following months.
In OpenMP 2.0 your statement should read:
#pragma omp atomic
sum += 4.0/(1.+ x*x);
But as already noticed, it would be better (and generally faster) to use reduction:
#pragma omp parallel for private(x) reduction(+:sum)
for (i=0; i<num_steps; i++)
{
x = (i + .5)*step;
sum = sum + 4.0/(1.+ x*x);
}
(you could also write sum += 4.0/(1.+ x*x);)
Try to change sum = sum + 4.0/(1. + x*x) to sum += 4.0/(1.+ x*x), but I'm afraid this won't work either. You can try to split the work like this:
x = (i + .5)*step;
double xx = 4.0/(1.+ x*x);
#pragma omp atomic //this part contains error
sum += xx;
this should work, but I am not sure whether it fits your needs.
Replace:
#pragma omp atomic
with #pragma omp critical, or better, move the summation into a reduction(+:sum) clause on the parallel for directive (reduction is a clause, not a standalone pragma).
But I guess reduction will be the better option as you have sum += var;
With critical it would look like this:
x = (i + .5)*step;
double z = 4.0/(1.+ x*x);
#pragma omp critical
sum += z;
You probably need a recap about #pragma more than the real solution to your problem.
#pragma directives are a set of non-standard, compiler-specific, and most of the time platform/system-specific instructions for the pre-processor, meaning that the behaviour can be different on different machines with the same OS or simply on machines with different setups.
As a consequence, any issue with a pragma can be solved only by looking at the official documentation for your compiler on your platform of choice; here are 2 links.
http://msdn.microsoft.com/en-us/library/d9x1s805.aspx
http://msdn.microsoft.com/en-us/library/0ykxx45t.aspx
The C/C++ standards themselves only say that #pragma introduces an implementation-defined directive; they do not define what any particular pragma does.

Are pointers private in OpenMP parallel sections?

I've added OpenMP to an existing code base in order to parallelize a for loop. Several variables are created inside the scope of the parallel for region, including a pointer:
#pragma omp parallel for
for (int i = 0; i < n; i++){
[....]
Model *lm;
lm->myfunc();
lm->anotherfunc();
[....]
}
In the resulting output files I noticed inconsistencies, presumably caused by a race condition. I ultimately resolved the race condition by using an omp critical. My question remains, though: is lm private to each thread, or is it shared?
Yes, all variables declared inside the OpenMP region are private. This includes pointers.
Each thread will have its own copy of the pointer.
It lets you do stuff like this:
int threads = 8;
int size_per_thread = 10000000;
int *ptr = new int[size_per_thread * threads];
#pragma omp parallel num_threads(threads)
{
int id = omp_get_thread_num();
int *my_ptr = ptr + size_per_thread * id;
// Do work on "my_ptr".
}
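The flip side, for completeness: a pointer declared before the parallel region is shared by default, so if each thread needs its own copy of the pointer value it has to be listed as private or firstprivate. A small illustrative sketch:
#include <omp.h>
int main() {
    int data[4] = {0, 0, 0, 0};
    int *p = data; // declared outside the region: shared unless told otherwise
    #pragma omp parallel num_threads(4) firstprivate(p)
    {
        // each thread has its own copy of the pointer value,
        // but all copies still point into the same shared array
        int id = omp_get_thread_num();
        p += id;
        *p = id;
    }
    return 0;
}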