If I want no more tasks to be created when the array length is less than 100, does if(r - l >= 100) or final(r - l < 100) satisfy this condition? (l = minIndex; r = maxIndex)
They both work. Relevant parts of the specification are:
undeferred task
A task for which execution is not deferred with respect to its generating task region. That is, its generating task region is suspended until execution of the undeferred task is completed.
included task
A task for which execution is sequentially included in the generating task region. That is, an included task is undeferred and executed immediately by the encountering thread.
final task
A task that forces all of its child tasks to become final and included tasks.
[...]
When an if clause is present on a task construct, and the if clause
expression evaluates to false, an undeferred task is generated, and
the encountering thread must suspend the current task region, for
which execution cannot be resumed until the generated task is
completed. The use of a variable in an if clause expression of a task
construct causes an implicit reference to the variable in all
enclosing constructs.
When a final clause is present on a task
construct and the final clause expression evaluates to true, the
generated task will be a final task. All task constructs encountered
during execution of a final task will generate final and included
tasks. Note that the use of a variable in a final clause expression of
a task construct causes an implicit reference to the variable in all
enclosing constructs.
----- OpenMP Architecture Review Board. “OpenMP Application Programming Interface.” Specification Version 4.5, November 2015.
This means that if(false) makes the generated task undeferred, so the encountering thread runs its body immediately, while final(true) makes the generated task final, so every task construct encountered inside it generates a final and included task. The difference becomes visible when there is another task construct inside your task.
#pragma omp task if(0)
{
    // the outer task is undeferred (executed immediately), but it is not
    // final: the inner task below is still a normal deferred task
    #pragma omp task
    foo();
}
#pragma omp task final(1)
{
    // the outer task is final, so the inner task below is "included",
    // i.e. executed sequentially and immediately by the thread running
    // the outer task
    #pragma omp task
    foo();
}
From the wording, it also seems that if(false) will create a task and run it immediately, while final will simply run the code sequentially without creating a task. However, I am not sure that this is true, nor whether there are performance implications.
I use OpenMP in my service to parallelize a loop. But every time a request comes in, my service creates a brand-new thread for it, and this thread uses OpenMP to create a thread pool. Can I ask when this thread pool will be destructed?
#include <thread>
#include <vector>

void foo() {
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; i++) {
        // Do something
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < x; i++) {
        threads.push_back(std::thread(foo));
    }
    for (auto& thread : threads) {
        thread.join();
    }
}
In this pseudo code, I noticed that:
In the for loop, the thread count is 8 * x + 1 (on an 8-core host: 8 OpenMP threads for each std::thread, plus the main thread).
After the for loop, the thread count goes back to 1, which means all the OpenMP thread pools were destructed.
This can be reproduced with this simple code, but in some more complex yet similar use cases, I noticed the thread pools did not get destructed after their parent threads finished, and it is hard for me to understand why.
So can I ask when the thread pool of omp will get destructed?
The creation and deletion of the native threads of an OpenMP parallel region are left to the OpenMP implementation (e.g. IOMP for ICC/Clang, GOMP for GCC) and are not defined by the OpenMP specification. The specification does not restrict implementations to creating native threads at the beginning of a parallel region, nor to deleting them at the end. In fact, most implementations keep the threads alive as long as possible, because creating threads is slow (especially on many-core architectures). The specification explicitly mentions the difference between native threads and basic OpenMP threads/tasks (everything is a task in OpenMP 5). Note that OMPT can be used to track when native threads are created and deleted. I expect mainstream implementations to create threads during runtime initialization (typically when the first parallel region is encountered) and to delete them when the program ends.
The specification states:
[A native thread is] a thread defined by an underlying thread implementation
If the parallel region creates a native thread, a native-thread-begin event occurs as the first event in the context of the new thread prior to the implicit-task-begin event.
If a native thread is destroyed at the end of a parallel region, a native-thread-end event occurs in the thread as the last event prior to destruction of the thread.
Note that implementations typically destroy and recreate threads when the number of threads of a parallel region differs from that of the previous one. This also happens in pathological cases like nesting.
The documentation of GOMP is available here, but it is not very detailed. The IOMP documentation is available here and is not much better... You can find interesting information directly in the code of the runtimes, for example in the GOMP code. Note that there are useful comments like:
We only allow the reuse of idle threads for non-nested PARALLEL regions
On the cppreference page on execution policies there is an example like this:
std::atomic<int> x{0};
int a[] = {1,2};
std::for_each(std::execution::par, std::begin(a), std::end(a), [&](int) {
    x.fetch_add(1, std::memory_order_relaxed);
    while (x.load(std::memory_order_relaxed) == 1) { } // Error: assumes execution order
});
As you can see, it is an example of (supposedly) erroneous code, but I do not really understand what the error is; it does not seem to me that any part of the code assumes a particular execution order. AFAIK, the first thread to fetch_add will wait for the second one, but that's it, no problematic behaviour. Am I missing something, or is there really an error there?
The execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel
algorithm's execution may be parallelized. The invocations of element
access functions in parallel algorithms invoked with this policy
(usually specified as std::execution::par) are permitted to execute in
either the invoking thread or in a thread implicitly created by the
library to support parallel algorithm execution. Any such invocations
executing in the same thread are indeterminately sequenced with
respect to each other.
As far as I can see, the issue here is that there is no guarantee about how many threads are used: if the implementation uses a single thread, there is an endless loop (while (x.load(std::memory_order_relaxed) == 1) { } never completes).
So I guess the comment means that this code wrongly relies on multiple threads executing concurrently, which would cause fetch_add to be called more than once at some point.
The only guarantee you get is that, within each thread, the invocations are not interleaved.
What does it mean in OpenMP that
Nested parallel regions are serialized by default
Does it mean the threads execute them one after another? I also cannot understand this part:
A throw executed inside a parallel region must cause execution to resume within
the dynamic extent of the same structured block, and it must be caught by the
same thread that threw the exception.
As explained here (scroll down to "17.1 Nested parallelism"), by default a nested parallel region will not be parallelized and thus runs sequentially. Nested thread creation can be enabled using either OMP_NESTED=true (as an environment variable) or omp_set_nested(1) (in your code).
EDIT: also see this answer to a similar question.
I am using OpenMP successfully to parallelize for loops in my C++ code. I tried to step further and use OpenMP tasks. Unfortunately my code behaves really strangely, so I wrote a minimal example and found a problem.
I would like to define a couple of tasks. Each task should be executed once by an idle thread.
Unfortunately I can only make all threads execute every task, or only one thread perform all tasks sequentially.
Here is my code, which basically runs sequentially:
#include <iostream>
#include <omp.h>

using namespace std;

int main() {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp single nowait
        {
            #pragma omp task
            cout << "My id is " << id << endl;
            #pragma omp task
            cout << "My id is " << id << endl;
            #pragma omp task
            cout << "My id is " << id << endl;
            #pragma omp task
            cout << "My id is " << id << endl;
        }
    }
    return 0;
}
Only worker 0 shows up and prints its id four times.
I expected to see "My id is 0; My id is 1; My id is 2; My id is 3".
If I delete #pragma omp single, I get 16 messages: all threads execute every single cout.
Is this a problem with my OpenMP setup, or did I not get something about tasks? I am using gcc 6.3.0 on Ubuntu with the -fopenmp flag.
Your basic usage of OpenMP tasks (parallel -> single -> task) is correct; what you misunderstand are the intricacies of data-sharing attributes for variables.
First, you can easily confirm that your tasks are run by different threads by moving omp_get_thread_num() inside the task instead of accessing id.
What happens in your example is that id becomes implicitly private within the parallel construct. However, inside the task, it becomes implicitly firstprivate. This means the task copies the value of id from the thread that executes the single construct. A more elaborate discussion of a similar issue can be found here.
Note that if you used private on a nested task construct, it would not be the same private variable as the one of the enclosing parallel construct. Simply said, private refers to the construct, not to the thread; that is the difference from threadprivate. However, threadprivate is not a clause on a construct but its own directive, and it only applies to variables with file scope, namespace scope, or static variables with block scope.
Does anyone know the scope of omp_set_max_active_levels()? Assume function A has an OpenMP parallel region, within that region each thread of A calls library function B, and within library function B there are 2 more levels of OpenMP parallelism.
Then, if we set the maximum number of active levels in function A to 3 (1 in A and 2 in B), does that ensure that library function B's parallel regions work properly?
If omp_set_max_active_levels() is called from within an active parallel region, then the call will be (should be) ignored.
According to the OpenMP 4.0 standard (section 3.2.15):
When called from a sequential part of the program, the binding
thread set for an omp_set_max_active_levels region is the encountering
thread. When called from within any explicit parallel region, the
binding thread set (and binding region, if required) for the
omp_set_max_active_levels region is implementation defined.
and later on:
This routine has the described effect only when called from a
sequential part of the program. When called from within an explicit
parallel region, the effect of this routine is implementation defined.
Therefore, if you set the maximum number of nested parallel regions in the sequential part of your program, you can be sure that everything will work as expected on any compliant implementation of OpenMP.