How to solve dependencies in a for loop in multithreaded code? - C++

I have trouble resolving dependencies in a for loop using OpenMP so that the program will execute faster. This is how I did it and it works, but I need a faster solution. Does anybody know how to do this so it will run faster?
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(openSet, maxVal, current, fScores)
for(i = 0; i < openSet.size(); i++){
    if(fScores[openSet[i].x * dim + openSet[i].y] < maxVal){
        #pragma omp ordered
        maxVal = fScores[openSet[i].x * dim + openSet[i].y];
        current = openSet[i];
    }
}
and the second for loop is this one:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(neighbours, openSet, gScores, fScores, tentative_gScore)
for(i = 0; i < neighbours.size(); i++){
    #pragma omp ordered
    tentative_gScore = gScores[current.x * dim + current.y] + 1;
    if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
        cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
        gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
        fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore(); //(p.x, p.y, xEnd, yEnd)
        if(contains(openSet, neighbours[i]) == false){
            openSet.push_back(neighbours[i]);
        }
    }
}
EDIT: I didn't mention what I was even doing here. I was implementing the A* algorithm, and this code is from Wikipedia. I also want to add two more variables so I don't confuse anyone.
PAIR current = {};
int maxVal = INT32_MAX;

First of all, you need to make sure that this is your hot spot. Then use a proper test suite in order to make sure that you actually gain performance; a tool such as Google Benchmark helps here. Make sure you compiled in release mode, otherwise your measurements will be completely spoiled.
This said, I think you are looking for a max reduction:
#pragma omp parallel for reduction(max : maxVal)
for(i = 0; i < openSet.size(); i++){
    if(fScores[openSet[i].x * dim + openSet[i].y] > maxVal){
        maxVal = fScores[openSet[i].x * dim + openSet[i].y];
    }
}
current seems to be superfluous here, and I think the comparison has been mixed up.
Can you access the data in fScores in a linear fashion? You will get a lot of cache misses from the indirection over openSet. If you can get rid of this indirection somehow, you will see a large speedup in both single- and multi-threaded scenarios.
In the second loop the push_back will spoil your performance. I had a similar problem. For me it was very beneficial to (as sketched below):
- create a vector with the maximal possible length,
- initialise it with an empty value,
- set it properly using OpenMP where a criterion was fulfilled,
- check for the empty value when using the vector.
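A minimal, self-contained sketch of those four steps; the PAIR struct, the sentinel value and the selection criterion are made up for illustration and are not the asker's actual code:
#include <vector>
#include <cstdio>

struct PAIR { int x, y; };                              // mirrors the PAIR from the question (assumption)

int main()
{
    std::vector<PAIR> neighbours = {{0, 1}, {2, 3}, {4, 5}, {6, 7}};
    const PAIR empty = {-1, -1};                        // the chosen "empty" value
    std::vector<PAIR> slots(neighbours.size(), empty);  // maximal possible length, pre-filled

    #pragma omp parallel for
    for (int i = 0; i < (int)neighbours.size(); ++i) {
        if (neighbours[i].x % 4 == 0) {                 // stand-in for the real criterion
            slots[i] = neighbours[i];                   // each thread writes only its own slot
        }
    }

    // when using the vector, check for (or compact away) the empty value
    for (const PAIR& p : slots)
        if (p.x != empty.x || p.y != empty.y)
            std::printf("kept (%d, %d)\n", p.x, p.y);
}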

It seems to me that you misunderstood what the OpenMP ordered clause actually does. From the OpenMP Standard one can read:
The ordered construct either specifies a structured block in a
worksharing-loop, simd, or worksharing-loop SIMD region that will be
executed in the order of the loop iterations, or it is a stand-alone
directive that specifies cross-iteration dependences in a doacross
loop nest. The ordered construct sequentializes and orders the
execution of ordered regions while allowing code outside the region to
run in parallel.
or more informally:
The ordered clause works like this: different threads execute
concurrently until they encounter the ordered region, which is then
executed sequentially in the same order as it would get executed in a
serial loop.
Based on the way you have used it, it seems that you have mistaken the ordered clause for the OpenMP critical clause:
The critical construct restricts execution of the associated
structured block to a single thread at a time.
Therefore, with the ordered clause your code is basically running sequentially, with the additional overhead of the parallelism. Nevertheless, even if you had used the critical construct instead, the overhead would have been too high, since threads would be locking on every loop iteration.
At first glance, for the first loop you could use the OpenMP reduction clause (i.e., reduction(max:maxVal)), about which the standard reads:
The reduction clause can be used to perform some forms of recurrence
calculations (...) in parallel. For parallel and work-sharing
constructs, a private copy of each list item is created, one for each
implicit task, as if the private clause had been used. (...) The
private copy is then initialized as specified above. At the end of the
region for which the reduction clause was specified, the original list
item is updated by combining its original value with the final value
of each of the private copies, using the combiner of the specified
reduction-identifier.
For a more detailed explanation of how the reduction clause works have a look at this SO thread.
Notwithstanding, you are updating two variables, namely maxVal and current, which makes it harder to solve those dependencies with the reduction clause alone. Nonetheless, one approach is to create a data structure shared among the threads, where each thread updates a given position of it. At the end of the parallel region, the master thread updates the original values of maxVal and current accordingly.
So instead of:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(openSet, maxVal, current, fScores)
for(i = 0; i < openSet.size(); i++){
    if(fScores[openSet[i].x * dim + openSet[i].y] < maxVal){ // <-- you meant '>' not '<'
        #pragma omp ordered
        maxVal = fScores[openSet[i].x * dim + openSet[i].y];
        current = openSet[i];
    }
}
you could try the following:
std::vector<int>  shared_maxVal(tc, INT32_MIN); // one slot per thread; a std::vector avoids the non-standard variable-length array, and INT32_MIN lets the '>' test below succeed
std::vector<PAIR> shared_current(tc);           // PAIR, not int, since current is a PAIR
#pragma omp parallel num_threads(tc) shared(openSet, fScores, shared_maxVal, shared_current)
{
    int threadID = omp_get_thread_num();
    #pragma omp for
    for(int i = 0; i < openSet.size(); i++){
        if(fScores[openSet[i].x * dim + openSet[i].y] > shared_maxVal[threadID]){
            shared_maxVal[threadID] = fScores[openSet[i].x * dim + openSet[i].y];
            shared_current[threadID] = openSet[i];
        }
    }
}
for(int i = 0; i < tc; i++){
    if(maxVal < shared_maxVal[i]){ // for the '>' version, maxVal should also start from a minimal value (e.g., INT32_MIN)
        maxVal = shared_maxVal[i];
        current = shared_current[i];
    }
}
For your second loop:
#pragma omp parallel for num_threads(tc) ordered schedule(dynamic, 1) private(i) shared(neighbours, openSet, gScores, fScores, tentative_gScore)
for(i = 0; i < neighbours.size(); i++){
    #pragma omp ordered
    tentative_gScore = gScores[current.x * dim + current.y] + 1;
    if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
        cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
        gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
        fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore(); //(p.x, p.y, xEnd, yEnd)
        if(contains(openSet, neighbours[i]) == false){
            openSet.push_back(neighbours[i]);
        }
    }
}
Some of the aforementioned advice still holds. Moreover, do not make the variable tentative_gScore shared among threads; otherwise, you would need to guarantee mutual exclusion on the accesses to that variable. As it is, your code has a race condition: threads may update tentative_gScore while other threads are reading it. Simply declare tentative_gScore inside the loop so that it is private to each thread.
Assuming that different threads cannot access the same positions of the arrays cameFrom, gScores and fScores, the next thing you need to do is to create an array of openSets and assign each position of that array to a different thread. In this manner, threads can update their respective positions without needing any synchronization mechanism.
At the end of the parallel region, merge the per-thread structures back into the original openSet.
Your second loop might look like the following:
// Create an array of "openSets"; let us name it "shared_openSet"
#pragma omp parallel num_threads(tc) shared(neighbours, gScores, fScores, shared_openSet)
{
    int threadID = omp_get_thread_num();
    #pragma omp for
    for(int i = 0; i < neighbours.size(); i++){
        // I just assume the type int, but you can change it to the real type
        int tentative_gScore = gScores[current.x * dim + current.y] + 1;
        if(tentative_gScore < gScores[neighbours[i].x * dim + neighbours[i].y]){
            cameFrom[neighbours[i].x * dim + neighbours[i].y] = current;
            gScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore;
            fScores[neighbours[i].x * dim + neighbours[i].y] = tentative_gScore + hScore();
            if(contains(openSet, neighbours[i]) == false){
                shared_openSet[threadID].push_back(neighbours[i]);
            }
        }
    }
}
// merge all the elements from shared_openSet into openSet.
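For completeness, a minimal sketch of how shared_openSet could be declared before the parallel region and merged back afterwards; PAIR, tc and openSet are the names used in the question, so treat this as an outline rather than drop-in code:
std::vector<std::vector<PAIR>> shared_openSet(tc);  // one private "openSet" per thread

// ... the parallel region above fills shared_openSet[threadID] ...

for (int t = 0; t < tc; ++t) {                      // sequential merge after the threads join
    openSet.insert(openSet.end(),
                   shared_openSet[t].begin(), shared_openSet[t].end());
    shared_openSet[t].clear();
}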

Related

OpenMP. Parallelization of two consecutive cycles

I am studying OpenMP and have written an implementation of shaker sort. There are two consecutive loops here, and in order for them to be executed sequentially I added locks in the form of omp_init_lock and omp_destroy_lock, but the result is still incorrect. Please tell me how to parallelize two consecutive loops. My code is below:
int Left, Right;
Left = 1;
Right = ARR_SIZE;
while (Left <= Right)
{
    omp_init_lock(&lock);
    #pragma omp parallel reduction(+:Left) num_threads(4)
    {
        #pragma omp for
        for (int i = Right; i >= Left; i--) {
            if (Arr[i - 1] > Arr[i]) {
                int temp;
                temp = Arr[i];
                Arr[i] = Arr[i - 1];
                Arr[i - 1] = temp;
            }
        }
        Left++;
    }
    omp_destroy_lock(&lock);
    omp_init_lock(&lock);
    #pragma omp parallel reduction(+:Right) num_threads(4)
    {
        #pragma omp for
        for (int i = Left; i <= Right; i++) {
            if (Arr[i - 1] > Arr[i]) {
                int temp;
                temp = Arr[i];
                Arr[i] = Arr[i - 1];
                Arr[i - 1] = temp;
            }
        }
        Right--;
    }
    omp_destroy_lock(&lock);
}
You can only make something an omp for if the iterations are independent. Yours clearly aren't.
Your locks serve no purpose. Two parallel regions are always done in sequence. So you can remove the locks.
You seem to have several misconceptions about how OpenMP works.
Two parallel sections don't execute in parallel. This is fork-join parallelism. The parallel section itself is executed by multiple threads which then join back up at the end of the parallel section.
Your code looks like you expected them to work like pragma omp sections. Side note: Unless you have absolutely no other choice and/or you know exactly what you are doing, don't use sections. They don't scale well.
Your use of the lock API is wrong. omp_init_lock initializes a lock object. It doesn't acquire it. Likewise the destroy function deallocates it, it doesn't release the lock. If you ever want to acquire a lock, use omp_set_lock and omp_unset_lock on locks that you initialize once before you enter a parallel section.
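For reference, a minimal sketch of how the lock API is meant to be used: initialize once before the parallel region, acquire and release inside it, destroy once afterwards.
#include <omp.h>
#include <cstdio>

int main()
{
    omp_lock_t lock;
    omp_init_lock(&lock);          // allocate/initialize the lock once

    int shared_counter = 0;
    #pragma omp parallel num_threads(4)
    {
        omp_set_lock(&lock);       // acquire: only one thread at a time gets past this point
        ++shared_counter;          // protected update of shared state
        omp_unset_lock(&lock);     // release
    }

    omp_destroy_lock(&lock);       // deallocate the lock once, after the parallel region
    std::printf("counter = %d\n", shared_counter);
}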
Generally speaking, if you need a lock for an extended section of your code, it will not parallelize. Read up on Amdahl's law. Locks are only useful if used rarely or if the chance of two threads competing for the same lock at the same time is low.
Your code contains race conditions. Since you used pragma omp for, two different threads may execute the i'th and (i-1)'th iteration at the same time. That means they will touch the same integers. That's undefined behavior and will lead to them stepping on each other's toes, so to speak.
I have no idea what you wanted to do with those reductions.
How to solve this
Well, traditional shaker sort cannot work in parallel because within one iteration of the outer loop, an element may travel the whole distance to the end of the range. That requires an amount of inter-thread coordination that is infeasible.
What you can do is a variation of bubble sort where each thread looks at two values and swaps them. Move this window back and forth and values will slowly travel towards their correct position.
This should work:
#include <utility>
// using std::swap
void shake_sort(int* arr, int n) noexcept
{
    using std::swap;
    const int even_to_odd = n / 2;
    const int odd_to_even = (n - 1) / 2;
    bool any_swap;
    do {
        any_swap = false;
        #pragma omp parallel for reduction(|:any_swap)
        for(int i = 0; i < even_to_odd; ++i) {
            int left = i * 2;
            int right = left + 1;
            if(arr[left] > arr[right]) {
                swap(arr[left], arr[right]);
                any_swap = true;
            }
        }
        #pragma omp parallel for reduction(|:any_swap)
        for(int i = 0; i < odd_to_even; ++i) {
            int left = i * 2 + 1;
            int right = left + 1;
            if(arr[left] > arr[right]) {
                swap(arr[left], arr[right]);
                any_swap = true;
            }
        }
    } while(any_swap);
}
Note how you can't exclude the left and right border because one outer iteration cannot guarantee that the value there is correct.
Other remarks:
Others have already commented on how std::swap makes the code more readable
You don't need to specify num_threads. OpenMP can figure this out itself

How to run all threads in sequence as static without using OpenMP for?

I'm new to OpenMP and multi-threading.
I have been given a task to run a method with static, dynamic, and guided scheduling without using the OpenMP for worksharing construct, which means I can't use the schedule clauses. I could create parallel threads with parallel and assign loop iterations to the threads equally, but how do I make it static, dynamic (with a block size of 1000), and guided?
void static_scheduling_function(const int start_count,
                                const int upper_bound,
                                int *results)
{
    int i, tid, numt;
    #pragma omp parallel private(i,tid)
    {
        int from, to;
        tid = omp_get_thread_num();
        numt = omp_get_num_threads();
        from = (upper_bound / numt) * tid;
        to = (upper_bound / numt) * (tid + 1) - 1;
        if (tid == numt - 1)
            to = upper_bound - 1;
        for (i = from; i < to; i++)
        {
            //compute one iteration (i)
            int start = i;
            int end = i + 1;
            compute_iterations(start, end, results);
        }
    }
}
======================================
For dynamic I have tried something like this:
void chunk_scheduling_function(const int start_count, const int upper_bound, int* results) {
    int numt, shared_lower_iteration_counter = start_count;
    for (int shared_lower_iteration_counter = start_count; shared_lower_iteration_counter < upper_bound;) {
        #pragma omp parallel shared(shared_lower_iteration_counter)
        {
            int tid = omp_get_thread_num();
            int from, to;
            int chunk = 1000;
            #pragma omp critical
            {
                from = shared_lower_iteration_counter;         // 10, 1010
                to = (shared_lower_iteration_counter + chunk); // 1010,
                shared_lower_iteration_counter = shared_lower_iteration_counter + chunk; // 1100 // critical is important while incrementing the shared variable which decides the next iteration
            }
            for (int i = from; (i < to && i < upper_bound); i++) { // 10 to 1009; i < upper_bound prevents other threads from executing the call
                int start = i;
                int end = i + 1;
                compute_iterations(start, end, results);
            }
        }
    }
}
This looks like a university assignment (and a very good one IMO), so I will not provide the complete solution; instead I will point out what you should be looking for.
The static scheduler looks okay; notwithstanding, it can be improved by taking the chunk size into account as well.
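As a hint only (not the complete assignment solution), a chunked static distribution could look roughly like this; it reuses upper_bound, results and compute_iterations from the question, and the chunk size of 1000 is merely an example:
const int chunk = 1000;                       // example chunk size
#pragma omp parallel
{
    int tid  = omp_get_thread_num();
    int numt = omp_get_num_threads();
    // thread tid processes chunks tid, tid + numt, tid + 2*numt, ... (round-robin over chunks)
    for (int chunk_start = tid * chunk; chunk_start < upper_bound; chunk_start += numt * chunk)
    {
        int chunk_end = chunk_start + chunk;
        if (chunk_end > upper_bound)          // clamp the last chunk
            chunk_end = upper_bound;
        compute_iterations(chunk_start, chunk_end, results);
    }
}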
For the dynamic and guided schedulers, they can be implemented by using a variable (let us name it shared_iteration_counter) that marks the next loop iteration to be picked up by the threads. Therefore, when a thread needs to request a new task to work on (i.e., a new loop iteration), it queries that variable. In pseudo code it would look like the following:
int thread_current_iteration = shared_iteration_counter++;
while(thread_current_iteration < MAX_SIZE)
{
    // do work
    thread_current_iteration = shared_iteration_counter++;
}
The pseudo code assumes a chunk size of 1 (i.e., shared_iteration_counter++); you will have to adapt it to your use-case. Now, because that variable will be shared among threads, and every thread will be updating it, you need to ensure mutual exclusion during the updates of that variable. Fortunately, OpenMP offers means to achieve that, for instance #pragma omp critical, explicit locks, and atomic operations. The latter is the best option for your use-case:
#pragma omp atomic
shared_iteration_counter = shared_iteration_counter + 1;
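Note that the pseudo code above reads the old value and advances the counter in a single step; for that pattern the capture form of the atomic construct is what you want. A sketch, with chunk standing for your chunk size:
int thread_current_iteration;
#pragma omp atomic capture
{
    thread_current_iteration = shared_iteration_counter;  // read the old value ...
    shared_iteration_counter += chunk;                     // ... and advance it atomically
}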
For the guided scheduler:
Similar to dynamic scheduling, but the chunk size starts off large and
decreases to better handle load imbalance between iterations. The
optional chunk parameter specifies the minimum size chunk to use. By
default the chunk size is approximately loop_count/number_of_threads.
In this case, not only do you have to guarantee mutual exclusion on the variable that counts the current loop iteration to be picked up by the threads, but also on the chunk size variable, since it changes as well.
Without giving too much away, bear in mind that you may need to consider how to deal with edge cases such as thread_current_iteration = 1000 and chunk_size = 1000 with MAX_SIZE = 1500. In that case thread_current_iteration + chunk_size > MAX_SIZE, but there are still 500 iterations to be computed.
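For the guided case, a rough sketch of how the shrinking chunk size is usually computed; the names are reused from the pseudo code above and min_chunk stands for the optional minimum-chunk parameter:
int remaining = MAX_SIZE - shared_iteration_counter;  // iterations still to be handed out
int chunk = remaining / omp_get_num_threads();        // roughly remaining / number_of_threads
if (chunk < min_chunk) chunk = min_chunk;             // never go below the optional minimum
if (chunk > remaining) chunk = remaining;             // ... but do not overrun the end of the range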

Reduction(op:var) has the same effect as shared(var)

I've tried this code snippet as a proof of concept for reduction(op:var); it worked fine and gave the result 656700.
int i, n, chunk;
float a[100], b[100], result;

/* Some initializations */
n = 100; chunk = 10; result = 0.0;
for (i=0; i < n; i++) {
    a[i] = i * 1.0;
    b[i] = i * 2.0;
}

//Fork has only for loop
#pragma omp parallel for default(shared) private(i) schedule(static,chunk) reduction(+:result)
for (i=0; i < n; i++)
    result = result + (a[i] * b[i]);

printf("Final result= %f\n",result);
When I tried the same code but without reduction(+:result), it gave me the same result, 656700!
I think this makes sense, as reduction relies on a shared variable; in other words, a shared clause would be sufficient for such an operation.
I am confused!
Reduction uses a shared variable visible to you, but private copies of the variable internally. When you forget the reduction clause, multiple threads may try to update the value of the reduction variable at the same time. That is a race condition. The result will likely be wrong, and it will also be slow because of the competition for the same resource.
With reduction, every thread has a private copy of the variable and works with it. When the reduction region finishes, the private copies are reduced using the reduction operator to the final shared variable.
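Roughly speaking, the reduction behaves like the following manually written version (a sketch reusing n, a and b from your snippet, not the literal code the runtime generates):
float result = 0.0f;
#pragma omp parallel
{
    float private_result = 0.0f;   // per-thread copy, initialised with the identity element of '+'
    #pragma omp for
    for (int i = 0; i < n; i++)
        private_result += a[i] * b[i];
    #pragma omp critical           // combine step: one thread at a time adds its partial sum
    result += private_result;
}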
shared clause would be sufficient for such operation.
Nope.
When you remove reduction(+:result), the program has a data race on the result variable and the result is unstable.
This means you may get a wrong result, or only occasionally a correct one.

Parallel for loop in openmp

I'm trying to parallelize a very simple for loop, but this is my first attempt at using OpenMP in a long time. I'm baffled by the run times. Here is my code:
#include <vector>
#include <algorithm>
#include <cmath>
#include <iostream>
using namespace std;

int main ()
{
    int n=400000, m=1000;
    double x=0, y=0;
    double s=0;
    vector< double > shifts(n,0);

    #pragma omp parallel for
    for (int j=0; j<n; j++) {
        double r=0.0;
        for (int i=0; i < m; i++){
            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));
            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }

    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}
I compile it with
g++ -O3 testMP.cc -o testMP -I /opt/boost_1_48_0/include
that is, no "-fopenmp", and I get these timings:
real 0m18.417s
user 0m18.357s
sys 0m0.004s
when I do use "-fopenmp",
g++ -O3 -fopenmp testMP.cc -o testMP -I /opt/boost_1_48_0/include
I get these numbers for the times:
real 0m6.853s
user 0m52.007s
sys 0m0.008s
which doesn't make sense to me. How can using eight cores result in only a 3-fold performance increase? Am I coding the loop correctly?
You should make use of the OpenMP reduction clause for x and y:
#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
With reduction each thread accumulates its own partial sum in x and y and in the end all partial values are summed together in order to obtain the final values.
Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total
See - superlinear speed-up :)
Let's try to understand how to parallelize a simple for loop using OpenMP:
#pragma omp parallel
#pragma omp for
for(i = 1; i < 13; i++)
{
    c[i] = a[i] + b[i];
}
Assume that we have 3 available threads; this is what will happen: firstly, the threads are assigned an independent set of iterations, and finally, the threads must wait at the end of the work-sharing construct.
Because this question is highly viewed, I decided to add a bit of OpenMP background to help those visiting it.
The #pragma omp parallel creates a parallel region with a team of threads, where each thread executes the entire block of code that the parallel region encloses.
From the OpenMP 5.1 standard one can read a more formal description:
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region (..). The
thread that encountered the parallel construct becomes the primary
thread of the new team, with a thread number of zero for the duration
of the new parallel region. All threads in the new team, including the
primary thread, execute the region. Once the team is created, the
number of threads in the team remains constant for the duration of
that parallel region.
The #pragma omp parallel for creates a parallel region (as described before), and to the threads of that region the iterations of the loop that it encloses will be assigned, using the default chunk size, and the default schedule which is typically static. Bear in mind, however, that the default schedule might differ among different concrete implementation of the OpenMP standard.
From the OpenMP 5.1 standard you can read a more formal description:
The worksharing-loop construct specifies that the iterations of one or
more associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks. The iterations are
distributed across threads that already exist in the team that is
executing the parallel region to which the worksharing-loop region
binds.
Moreover,
The parallel loop construct is a shortcut for specifying a parallel
construct containing a loop construct with one or more associated
loops and no other statements.
Or informally, #pragma omp parallel for is a combination of the construct #pragma omp parallel with #pragma omp for. In your case, this would mean that:
#pragma omp parallel for
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
A team of threads will be created, and to each of those threads will be assigned chunks of the iterations of the outermost loop.
To make it more illustrative, with 4 threads the #pragma omp parallel for with a chunk_size=1 and a static schedule would result in something like: thread 0 takes iterations 0, 4, 8, ...; thread 1 takes 1, 5, 9, ...; thread 2 takes 2, 6, 10, ...; and thread 3 takes 3, 7, 11, ....
Code-wise the loop would be transformed to something logically similar to:
for(int i = omp_get_thread_num(); i < n; i += omp_get_num_threads())
{
    c[i] = a[i] + b[i];
}
where omp_get_thread_num()
The omp_get_thread_num routine returns the thread number, within the
current team, of the calling thread.
and omp_get_num_threads()
Returns the number of threads in the current team. In a sequential
section of the program omp_get_num_threads returns 1.
or in other words, for(int i = THREAD_ID; i < n; i += TOTAL_THREADS). With THREAD_ID ranging from 0 to TOTAL_THREADS - 1, and TOTAL_THREADS representing the total number of threads of the team created on the parallel region.
Armed with this knowledge, and looking at your code, one can see that you have a race condition on the updates of the variables 'x' and 'y'. Those variables are shared among threads and updated inside the parallel region, namely:
x += rand_g1;
y += rand_g2;
To solve this race-condition you can use OpenMP' reduction clause:
Specifies that one or more variables that are private to each thread
are the subject of a reduction operation at the end of the parallel
region.
Informally, the reduction clause, will create for each thread a private copy of the variables 'x' and 'y', and at the end of the parallel region perform the summation among all those 'x' and 'y' variables into the original 'x' and 'y' variables from the initial thread.
#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
What you can achieve at most(!) is a linear speedup.
Now I don't remember which is which with the times from Linux, but I'd suggest you use time.h or (in C++11) chrono and measure the runtime directly from the program. Best pack the entire code into a loop, run it 10 times and average, to get an approximate runtime of the program.
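For example, a minimal timing harness with chrono could look like this (the code to be measured goes where the comment is):
#include <chrono>
#include <cstdio>

int main()
{
    using clock = std::chrono::steady_clock;

    const int runs = 10;
    double total_ms = 0.0;
    for (int run = 0; run < runs; ++run) {
        auto t0 = clock::now();
        // ... code to measure ...
        auto t1 = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    std::printf("average runtime: %.3f ms\n", total_ms / runs);
}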
Furthermore, you've got, IMO, a problem with x and y, which do not adhere to the paradigm of data locality in parallel programming.

Parallelization with openMP: shared and critical clauses

Below is a portion of code parallelized via OpenMP. The arrays ap[] and sc[] are subject to addition assignment, so I decided to make them shared and then put those updates in a critical section, since the reduction clause does not accept arrays. But it gives a different result than its serial counterpart. Where is the problem?
Vector PN, Pf, Nf; // Vector is user-defined structure
Vector NNp, PPp;
Vector gradFu, gradFv, gradFw;
float dynVis_eff, SGSf;
float Xf_U, Xf_H;
float mf_P, mf_N;
float an_diff, an_conv_P, an_conv_N, an_trans;
float sc_cd, sc_pres, sc_trans, sc_SGS, sc_conv_P, sc_conv_N;
float ap_trans;
#pragma omp parallel for
for (int e=0; e<nElm; ++e)
{
    ap[e] = 0.f;
    sc[e] = 0.f;
}

#pragma omp parallel for shared(ap,sc)
for (int f=0; f<nFaces; ++f)
{
    PN = cntE[face_N[f]] - cntE[face_P[f]];
    Pf = cntF[f] - cntE[face_P[f]];
    Nf = cntF[f] - cntE[face_N[f]];
    PPp = Pf - (Pf|norm(PN))*norm(PN);
    NNp = Nf - (Nf|norm(PN))*norm(PN);
    mf_P = mf[f];
    mf_N = -mf[f];
    SGSf = (1.f-ifac[f]) * SGSvis[face_P[f]]
         + ifac[f] * SGSvis[face_N[f]];
    dynVis_eff = dynVis + SGSf;
    an_diff = dynVis_eff * Ad[f] / mag(PN);
    an_conv_P = -neg(mf_P);
    an_conv_N = -neg(mf_N);
    an_P[f] = an_diff + an_conv_P;
    an_N[f] = an_diff + an_conv_N;
    // cross-diffusion
    sc_cd = an_diff * ( (gradVel[face_N[f]]|NNp) - (gradVel[face_P[f]]|PPp) );
    #pragma omp critical
    {
        ap[face_P[f]] += an_N[f];
        ap[face_N[f]] += an_P[f];
        sc[face_P[f]] += sc_cd + sc_conv_P;
        sc[face_N[f]] += -sc_cd + sc_conv_N;
    }
}
You have not declared whether all the other variables in your parallel region should be shared or not. You can do this generically with the default clause. If no default is specified, the variables are all shared, which is causing the problems in your code.
In your case, I'm guessing you should go for
#pragma omp parallel for default(none), shared(ap,sc,face_N,face_P,cntF,cntE,mf,ifac,Ad,an_P,an_N,SGSvis,dynVis), private(PN,Pf,Nf,PPp,NNp,mf_P,mf_N,SGSf,dynVis_eff,an_diff,an_conv_P,an_conv_N,sc_cd)
I strongly recommend always using default(none) so that the compiler complains every time you don't declare a variable explicitly and forces you to think about it explicitly.
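As a small illustration of that habit, here is a self-contained sketch using default(none); removing any of the listed variables from the clauses makes the compiler reject the pragma, which is exactly the safety net you want:
#include <cstdio>

int main()
{
    int n = 10, sum = 0;
    // every variable used inside the region must now be listed explicitly
    #pragma omp parallel for default(none) shared(n) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += i;
    std::printf("sum = %d\n", sum);
}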