The behavior of omp critical with nested levels of parallelism - C++

Considering the following scenario:
Function A creates a layer of OMP parallel region, and each OMP thread makes a call to a function B, which itself contains another layer of OMP parallel region.
Now, if within the parallel region of function B there is an OMP critical region, is that region critical "globally" with respect to all threads created by functions A and B, or is it merely local to function B?
And what if B is a pre-built function (e.g. from a statically or dynamically linked library)?

Critical regions in OpenMP have global binding and their scope extends to all occurrences of the critical construct that have the same name (in that respect all unnamed constructs share the same special internal name), no matter where they occur in the code. You can read about the binding of each construct in the corresponding Binding section of the OpenMP specification. For the critical construct you have:
The binding thread set for a critical region is all threads. Region execution is restricted to a single thread at a time among all the threads in the program, without regard to the team(s) to which the threads belong.
(emphasis mine)
That's why it is strongly recommended that named critical regions should be used, especially if the sets of protected resources are disjoint, e.g.:
// This one is located inside a parallel region in fun1
#pragma omp critical(fun1)
{
    // Modify shared variables a and b
}
...
// This one is located inside a parallel region in fun2
#pragma omp critical(fun2)
{
    // Modify shared variables c and d
}
Naming the regions eliminates the chance that two unrelated critical constructs could block each other.
As to the second part of your question: to support the dynamic scoping requirements of the OpenMP specification, critical regions are usually implemented with named mutexes that are resolved at run time. It is therefore possible to have homonymous critical regions in a prebuilt library function and in your code, and they will work as expected as long as both codes use the same OpenMP runtime, e.g. both were built with the same compiler suite. Cross-suite OpenMP compatibility is usually not guaranteed. Also, if there is an unnamed critical region in B(), it will interfere with all unnamed critical regions in the rest of the code, no matter whether they are part of the same library code or belong to the user code.
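To make the global effect concrete, here is a minimal sketch in which B() is a hypothetical stand-in for the prebuilt library routine: the unnamed critical inside B() and the unnamed critical in the outer region exclude each other, even though they sit in different functions and at different nesting levels.

#include <omp.h>
#include <cstdio>

// Hypothetical stand-in for the prebuilt library function B().
void B()
{
    #pragma omp parallel num_threads(2)
    {
        // Unnamed critical: excludes every other unnamed critical
        // in the whole program, not just the ones inside B().
        #pragma omp critical
        {
            std::printf("inside B, thread %d\n", omp_get_thread_num());
        }
    }
}

int main()
{
    // Outer parallel region, playing the role of "function A".
    #pragma omp parallel num_threads(4)
    {
        // This unnamed critical and the one in B() are mutually exclusive.
        #pragma omp critical
        {
            std::printf("outer thread %d\n", omp_get_thread_num());
        }
        B();   // nested parallel region inside each outer thread
    }
    return 0;
}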

Related

Differences between Shared and Private in OpenMP (C++)

I am trying to parallelize my C++ code using OpenMP.
So this is my first time with OpenMP, and I have a couple of questions about how to use private/shared properly.
Below is just a sample code I wrote to understand what is going on. Correct me if I am wrong.
#pragma omp parallel for
for (int x = 0; x < 100; x++)
{
    for (int y = 0; y < 100; y++)
    {
        for (int z = 0; z < 100; z++)
        {
            a[x][y][z] = U[x] + U[y] + U[z];
        }
    }
}
So by using #pragma omp parallel for I can use multiple threads to do this loop, i.e. with 5 threads, thread #1 uses 0<=x<20, thread #2 uses 20<=x<40, ..., thread #5 uses 80<=x<100.
And each thread runs at the same time. So by using this, I can make this code faster.
Since x, y, and z are declared inside the loop, they are private (each thread will have its own copy of these variables), while a and U are shared.
So each thread reads a shared variable U and writes to a shared variable a.
I have a couple of questions.
What would be the difference between #pragma omp parallel for and #pragma omp parallel for private(y,z)? I think since x, y, and z are already private, they should be the same.
If I use #pragma omp parallel for private(a, U), does this mean each thread will have a copy of a and U?
For example, with 2 threads that have a copy of a and U, thread #1 use 0<=x<50 so that it writes from a[0][0][0] to a[49][99][99] and thread #2 writes from a[50][0][0] to a[99][99][99]. And after that they merge these two results so that they have complete version of a[x][y][z]?
Any variable declared within a parallel block will be private. Variables mentioned in the private clause of a parallel directive follow the normal rules for variables: the variable must already be declared at the point it is used.
The effect of private is to create a copy of the variable for each thread. The threads can then update their copies without worrying about changes that could be made by other threads. At the end of the parallel block, the values are generally lost unless other clauses are included in the parallel directive. The reduction clause is the most common, as it can combine the results from each thread into a final result for the loop.
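As a sketch of that last point, the loop above can be rewritten with a reduction clause so that each thread accumulates into its own copy of a sum and the partial results are combined when the region ends (the summed quantity here is invented purely for illustration):

#include <cstdio>

int main()
{
    const int N = 100;
    double U[N];
    for (int i = 0; i < N; ++i) U[i] = i * 0.5;

    double sum = 0.0;

    // y is listed in private() only to show the syntax; declaring it inside
    // the loop body would have the same effect. The reduction clause gives
    // each thread its own partial sum and combines them at the end.
    int y;
    #pragma omp parallel for private(y) reduction(+ : sum)
    for (int x = 0; x < N; ++x)
    {
        for (y = 0; y < N; ++y)
            sum += U[x] + U[y];
    }

    std::printf("sum = %f\n", sum);
    return 0;
}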

Nested parallel regions in OpenMP

What does it mean in OpenMP that
Nested parallel regions are serialized by default
Does it mean threads do it continuously? I also cannot understand this part:
A throw executed inside a parallel region must cause execution to resume within the dynamic extent of the same structured block, and it must be caught by the same thread that threw the exception.
As explained here (scroll down to "17.1 Nested parallelism"), by default a nested parallel region will not be parallelized and thus runs sequentially. Nested thread creation can be enabled using either OMP_NESTED=true (as an environment variable) or omp_set_nested(1) (in your code).
EDIT: also see this answer to a similar question.
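A small sketch of how nested thread creation is switched on (the thread counts are arbitrary; note that newer OpenMP versions deprecate omp_set_nested in favour of omp_set_max_active_levels):

#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_nested(1);        // enable nested parallelism (or set OMP_NESTED=true)
    omp_set_num_threads(2);

    #pragma omp parallel
    {
        int outer = omp_get_thread_num();

        // Without omp_set_nested(1) this inner region would be executed
        // by a team of one thread, i.e. it would be serialized.
        #pragma omp parallel num_threads(2)
        {
            std::printf("outer %d, inner %d, inner team size %d\n",
                        outer, omp_get_thread_num(), omp_get_num_threads());
        }
    }
    return 0;
}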

Usage of the OpenMP shared clause in C++

According to this
All variables defined outside a parallel construct become shared when the parallel region is encountered.
I am wondering what the usage of the OpenMP shared clause would be while developing in C++.
Even if variables are shared by default, the default can be changed by the default() clause. When you have default(none) or default(private) you have to declare shared variables explicitly.
There are many uses for shared variables.
A large array is typically shared, with different threads operating on different parts of it.
Or a configuration parameter that you only read and never modify; that can be shared.
Or a global variable defining some state or a flag, even one you change under some condition. You would keep it shared and modify it inside a critical or single section.
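A short sketch of where the shared clause becomes unavoidable: with default(none) nothing is shared implicitly, so the array, the flag and the read-only parameter all have to be listed explicitly (the variable names below are made up for illustration):

#include <cstdio>

int main()
{
    int n = 1000;
    double data[1000];
    bool error_flag = false;   // shared flag, only written inside a critical section
    double scale = 2.0;        // read-only parameter, safe to share

    // With default(none) every variable used in the region must be listed
    // explicitly, which is where the shared clause becomes necessary.
    #pragma omp parallel for default(none) shared(n, data, scale, error_flag)
    for (int i = 0; i < n; ++i)
    {
        data[i] = i * scale;
        if (data[i] < 0.0)
        {
            #pragma omp critical
            error_flag = true;
        }
    }

    std::printf("error_flag = %d\n", error_flag ? 1 : 0);
    return 0;
}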

How to Reuse OMP Thread Pool, Created by Main Thread, in Worker Thread?

Near the start of my C++ application, my main thread uses OMP to parallelize several for loops. After the first parallelized for loop, I see that the threads used remain in existence for the duration of the application and are reused for subsequent OMP for loops executed from the main thread. I observe this with the following command (on CentOS 7):
for i in $(pgrep myApplication); do ps -mo pid,tid,fname,user,psr -p $i;done
Later in my program, I launch a boost thread from the main thread, in which I parallelize a for loop using OMP. At this point, I see an entirely new set of threads are created, which has a decent amount of overhead.
Is it possible to make the OMP parallel for loop within the boost thread reuse the original OMP thread pool created by the main thread?
Edit: Some pseudo code:
myFun(data)
{
    // Want to reuse OMP thread pool from main here.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
    {
        // Work on data
    }
}

main
{
    // Thread pool created here.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
    {
        // do stuff
    }

    boost::thread myThread(myFun); // Constructor starts thread.
    // Do some serial stuff, no OMP.
    myThread.join();
}
The interaction of OpenMP with other threading mechanisms is deliberately left out of the specification and is therefore heavily dependent on the implementation. The GNU OpenMP runtime keeps a pointer to the thread pool in TLS and propagates it down the (nested) teams. Threads started via pthread_create (or boost::thread or std::thread) do not inherit the pointer and therefore spawn a fresh pool. The same is probably the case with other OpenMP runtimes.
There is a requirement in the standard that basically forces such behaviour in most implementations. It is about the semantics of the threadprivate variables and how their values are retained across the different parallel regions forked from the same thread (OpenMP standard, 2.15.2 threadprivate Directive):
The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all of the following conditions hold:
Neither parallel region is nested inside another explicit parallel region.
The number of threads used to execute both parallel regions is the same.
The thread affinity policies used to execute both parallel regions are the same.
The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.
If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.
This, besides performance, is probably the main reason for using thread pools in OpenMP runtimes.
Now, imagine that two parallel regions forked by two separate threads share the same worker thread pool. A parallel region was forked by the first thread and some threadprivate variables were set. Later a second parallel region is forked by the same thread, where those threadprivate variables are used. But somewhere between the two parallel regions, a parallel region is forked by the second thread and worker threads from the same pool are utilised. Since most implementations keep threadprivate variables in TLS, the above semantics can no longer be asserted. A possible solution would be to add new worker threads to the pool for each separate thread, which is not much different than creating new thread pools.
I'm not aware of any workaround to make the worker thread pool shared. Even if one were possible, it would not be portable, and therefore the main benefit of OpenMP would be lost.
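For what it's worth, the behaviour can be observed with a small experiment. The sketch below uses std::thread, but boost::thread should behave the same way for this purpose, and the exact mapping of OpenMP threads to OS threads is implementation dependent:

#include <omp.h>
#include <cstdio>
#include <functional>
#include <thread>

// Print which OS threads execute a parallel region; a rough way to observe
// that regions forked from different threads use different worker pools.
void report(const char* who)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp critical
        std::printf("%s: OMP thread %d runs on OS thread %zu\n",
                    who, omp_get_thread_num(),
                    std::hash<std::thread::id>{}(std::this_thread::get_id()));
    }
}

int main()
{
    report("main thread");           // the first pool is created and cached here

    std::thread t([] { report("worker thread"); });  // new pool on most runtimes
    t.join();

    report("main thread again");     // reuses the pool created earlier
    return 0;
}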

omp_set_max_active_levels() and function call

Does anyone know the scope of omp_set_max_active_levels()? Assume function A has an omp parallel region, and within the region each thread of A makes a call to library function B, and within library function B there are 2 levels of omp parallelism.
Then, if we set the active omp level in function A to 3 (1 in A and 2 in B), can that ensure that library function B's parallel regions work properly?
If omp_set_max_active_levels() is called from within an active parallel region, then the call will be (should be) ignored.
According to the OpenMP 4.0 standard (section 3.2.15):
When called from a sequential part of the program, the binding thread set for an omp_set_max_active_levels region is the encountering thread. When called from within any explicit parallel region, the binding thread set (and binding region, if required) for the omp_set_max_active_levels region is implementation defined.
and later on:
This routine has the described effect only when called from a sequential part of the program. When called from within an explicit parallel region, the effect of this routine is implementation defined.
Therefore, if you set the maximum number of nested parallel regions in the sequential part of your program, you can be sure that everything will work as expected on any compliant implementation of OpenMP.
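A minimal sketch of that usage, with B() standing in as a hypothetical version of the library routine (the thread counts are arbitrary): the limit is raised in the sequential part, before the outer parallel region in A is entered.

#include <omp.h>
#include <cstdio>

// Hypothetical stand-in for the library function B() with two nested levels.
void B()
{
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            std::printf("level %d, ancestor %d, thread %d\n",
                        omp_get_level(),
                        omp_get_ancestor_thread_num(1),
                        omp_get_thread_num());
        }
    }
}

int main()
{
    // Called from the sequential part, so the effect is well defined:
    // allow up to 3 nested active parallel levels (1 in A + 2 in B).
    omp_set_max_active_levels(3);
    omp_set_nested(1);   // nested parallelism must also be enabled

    #pragma omp parallel num_threads(2)   // the parallel region in "function A"
    {
        B();
    }
    return 0;
}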