OpenMP and sections - C++

I have the following code:
#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    Function_1();
    #pragma omp section
    Function_2();
}
But within Function_1 and Function_2 I have a parallel for, and only one thread runs it.
So, how can I run Function_1 and Function_2 in parallel and also run several threads within these functions?
Thanks!

Having one parallel region inside another is called nesting. By default nested regions are inactive, which means that they execute serially. In order to make them active, you can:
set the environment variable OMP_NESTED to true
insert the following call before the enclosing parallel region: omp_set_nested(1);
One can also limit the number of levels at which nested parallelism is active by:
setting the environment variable OMP_MAX_ACTIVE_LEVELS to num, or
calling omp_set_max_active_levels(num);
where num is the desired maximum number of active levels, e.g. a value of 3 would render all parallel regions nested more than 3 levels deep inactive.
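For example, a minimal sketch of the asker's setup with nesting enabled (Function_1 and Function_2 are hypothetical stand-ins for the real functions):
#include <cstdio>
#include <omp.h>

// Hypothetical stand-ins for the asker's functions: each one contains
// its own parallel for that should run with several threads.
void Function_1()
{
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 8; i++)
        printf("F1: iteration %d on thread %d\n", i, omp_get_thread_num());
}

void Function_2()
{
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 8; i++)
        printf("F2: iteration %d on thread %d\n", i, omp_get_thread_num());
}

int main()
{
    omp_set_nested(1); // enable nested parallelism (or set OMP_NESTED=true)

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        Function_1();
        #pragma omp section
        Function_2();
    }
    return 0;
}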

Related

How do "omp single" and "omp task" provide parallelism?

I am confused about omp single and omp task directives. I have read several examples which use both of them. The following example shows how to use the task construct to process elements of a linked list.
1 #pragma omp parallel
2 {
3 #pragma omp single
4 {
5 for(node* p = head; p; p = p->next)
6 {
7 #pragma omp task
8 process(p);
9 }
10 }
11 }
I am failing to understand the parallelism in this example. With omp single, only one thread will execute the structured block related to the single construct, is that right? In this example, lines 4-10 are the structured block related to the single construct, and it can be executed only once, so why can we use omp task inside this structured block? How does it work in a parallel manner?
Adding to the other answers, let me dig a bit deeper into what happens during execution.
1 #pragma omp parallel
2 {
3 #pragma omp single
4 {
5 for(node* p = head; p; p = p->next)
6 {
7 #pragma omp task
8 process(p);
9 }
10 } // barrier of single construct
11 }
In the code, I have marked a barrier that is introduced at the end of the single construct.
What happens is this:
First, when encountering the parallel construct, the main thread spawns the parallel region and creates a bunch of worker threads. Then you have n threads running and executing the parallel region.
Second, the single construct picks any one of the n threads and executes the code inside the curly braces of the single construct. All other n-1 threads will proceed to the barrier in line 10. There, they will wait for the last thread to catch up and complete the barrier synchronization. While these threads are waiting there, they are not simply wasting time: they also wait for work (tasks) to arrive.
Third, the thread that was picked by the single construct (the "producer") executes the for loop and for each iteration it creates a new task. This task is then put into a task pool so that another thread (one of the ones in the barrier) can pick it up and execute it. Once the producer is done creating tasks, it will join the barrier and if there are still tasks in the task pool waiting for execution, it will help the other threads execute tasks.
Fourth, once all tasks have been generated and executed that way, all threads are done and the barrier synchronization is complete.
I have simplified a bit here and there, as there are more aspects to how an OpenMP implementation can execute tasks, but from a conceptual point of view, the above is what you can think of as happening, until you're ready to dive into the specific aspects of task scheduling in the OpenMP API.
#pragma omp task schedules a task from the thread that executes the single construct, but the task can be executed by any other thread. This is one purpose of OpenMP tasks: the thread that creates the work is not necessarily the one that executes the computations in parallel.
Note that tasks are executed at task scheduling points. The implicit barrier at the end of an omp parallel region is such a scheduling point. This is why all the other OpenMP threads can execute the scheduled tasks (as long as the tasks live that long).
So we will go step by step:
When you write #pragma omp parallel, the parallel region creates a team of threads.
Then you wrote #pragma omp single: only one thread of that team executes the enclosed block, and that thread creates the tasks.
Finally you wrote #pragma omp task. The thread that executes this code segment creates a task and adds it to a queue that belongs to the team; the task will later be executed by one of the threads in that team (possibly including the one that generated it). The exact timing of the execution of a task is up to the task scheduler, which operates invisibly to the user.
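To see this in action, here is a minimal self-contained sketch (the linked list and process() are hypothetical stand-ins) that prints which thread creates the tasks and which threads execute them:
#include <cstdio>
#include <omp.h>

struct node { int value; node* next; };

// Hypothetical work item: report which thread runs the task.
void process(node* p)
{
    printf("node %d processed by thread %d\n", p->value, omp_get_thread_num());
}

int main()
{
    // Build a small list 0 -> 1 -> ... -> 7.
    node nodes[8];
    for (int i = 0; i < 8; i++) {
        nodes[i].value = i;
        nodes[i].next = (i + 1 < 8) ? &nodes[i + 1] : nullptr;
    }
    node* head = &nodes[0];

    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("tasks created by thread %d\n", omp_get_thread_num());
            for (node* p = head; p; p = p->next)
            {
                // p is firstprivate by default on the task; spelled out for clarity.
                #pragma omp task firstprivate(p)
                process(p);
            }
        } // implicit barrier: all threads of the team help execute the queued tasks
    }
    return 0;
}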

Understanding #pragma omp parallel

I am reading about OpenMP and it sounds amazing. I came to a point where the author states that #pragma omp parallel can be used to create a new team of threads. So I wanted to know what difference #pragma omp parallel makes here. I read that #pragma omp for uses the current team of threads to process a for loop. So I have two examples.
First simple example:
#pragma omp for
for(int n=0; n<10; ++n)
{
printf(" %d", n);
}
printf(".\n");
Second example
#pragma omp parallel
{
#pragma omp for
for(int n=0; n<10; ++n) printf(" %d", n);
}
printf(".\n");
My question is: are those threads created on the fly every time, or once when the application starts? Also, when or why would I want to create a team of more threads?
Your first example wouldn't run in parallel like that. "#pragma omp for" advises the compiler to distribute the workload of the following loop within the team of threads, which you have to create first. A team of threads is created with the "#pragma omp parallel" statement, as you use it in the second example. You can combine the "omp parallel" and "omp for" directives by using "#pragma omp parallel for".
The team of threads is created at the parallel directive and is valid within that block.
TL;DR: The only difference is that the first code has two implicit barriers whereas the second has only one.
A more detailed answer, using the modern official OpenMP 5.1 standard as reference.
The
#pragma omp parallel:
will create a parallel region with a team of threads, where each thread will execute the entire block of code that the parallel region encloses.
From the OpenMP 5.1 standard one can read a more formal description:
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region (..). The
thread that encountered the parallel construct becomes the primary
thread of the new team, with a thread number of zero for the duration
of the new parallel region. All threads in the new team, including the
primary thread, execute the region. Once the team is created, the
number of threads in the team remains constant for the duration of
that parallel region.
The:
#pragma omp parallel for
will create a parallel region (as described before), and the iterations of the loop that it encloses will be assigned to the threads of that region, using the default chunk size and the default schedule, which is typically static. Bear in mind, however, that the default schedule might differ among concrete implementations of the OpenMP standard.
From the OpenMP 5.1 standard you can read a more formal description:
The worksharing-loop construct specifies that the iterations of one or
more associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks. The iterations are
distributed across threads that already exist in the team that is
executing the parallel region to which the worksharing-loop region
binds.
Moreover,
The parallel loop construct is a shortcut for specifying a parallel
construct containing a loop construct with one or more associated
loops and no other statements.
Or informally, #pragma omp parallel for is a combination of the construct #pragma omp parallel with #pragma omp for. Both versions that you have, with a chunk_size=1 and a static schedule, would result in something like the following. Code-wise the loop would be transformed to something logically similar to:
for(int i=omp_get_thread_num(); i < n; i+=omp_get_num_threads())
{
//...
}
where omp_get_thread_num()
The omp_get_thread_num routine returns the thread number, within the
current team, of the calling thread.
and omp_get_num_threads()
Returns the number of threads in the current team. In a sequential
section of the program omp_get_num_threads returns 1.
or in other words, for(int i = THREAD_ID; i < n; i += TOTAL_THREADS). With THREAD_ID ranging from 0 to TOTAL_THREADS - 1, and TOTAL_THREADS representing the total number of threads of the team created on the parallel region.
A "parallel" region can contain more than a simple "for" loop.
The first time your program encounters "parallel", the OpenMP thread team will be created; after that, every OpenMP construct will reuse those threads for loops, sections, tasks, etc.
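As a small illustration (my own sketch, not from the question), the split form and the combined form behave the same way; each thread of the team gets a share of the iterations:
#include <cstdio>
#include <omp.h>

int main()
{
    // Split form: create the team, then share the loop among its threads.
    #pragma omp parallel
    {
        #pragma omp for
        for (int n = 0; n < 10; ++n)
            printf("split: %d on thread %d\n", n, omp_get_thread_num());
    }

    // Combined form: an equivalent shortcut.
    #pragma omp parallel for
    for (int n = 0; n < 10; ++n)
        printf("combined: %d on thread %d\n", n, omp_get_thread_num());

    return 0;
}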

OpenMP construct to continue execution as soon as at least 1 thread is finished

I have a need to continue execution as soon as one of the threads has finished execution. The logic inside the parallel section will ensure that everything has been completed satisfactorily. I have nested parallelisation, therefore I put some of the top-level threads to sleep when data is not ready to be processed, so as not to consume computation power. So when one of the top-level threads finishes, I want to continue execution and not wait for the other threads to wake up and naturally return.
I use
#pragma omp parallel for num_threads(wanted_thread_no)
How do you parallelise? Do you use tasks, sections, or something else?
If I understood correctly, and if you are using the task primitive, you can use the nowait clause after the last task.
Check this PDF, on page 13:
http://openmp.org/wp/presos/omp-in-action-SC05.pdf
It explicitly says:
By default, there is a barrier at the end of the “omp for”. Use the
“nowait” clause to turn off the barrier.
#pragma omp for nowait
"nowait" is useful between two consecutive, independent omp for loops.
Is this what you want?
Also take a look at this as well, even though it says the same thing.
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf
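For instance, here is a minimal sketch (my own, under the assumption that the two loops are independent) of using nowait so threads do not idle at the barrier between the loops:
#include <cstdio>
#include <omp.h>

int main()
{
    const int n = 1000;
    static double a[1000], b[1000];

    #pragma omp parallel
    {
        // No barrier at the end of the first loop thanks to nowait,
        // so threads that finish early move straight on to the second loop.
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = i * 0.5;

        // The second loop is independent of the first (it does not read a[]);
        // otherwise removing the barrier would be unsafe.
        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = i * 2.0;
    }

    printf("a[10]=%f b[10]=%f\n", a[10], b[10]);
    return 0;
}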

Scheduling system for fitting

I would like to parallelise a linear operation (fitting a complicated mathematical function to some dataset) with multiple processors.
Assume I have 8 cores in my machine, and I want to fit 1000 datasets. What I expect is some system that takes the 1000 datasets as a queue and sends them to the 8 cores for processing, so it starts by taking the first 8 of the 1000, FIFO. The fitting time of each dataset is in general different from the others, so some of the 8 datasets being fitted could take longer than the rest. What I want from the system is to save the results of the fitted datasets, and then have each thread that is done take a new dataset from the big queue (1000 datasets). This has to continue until all 1000 datasets are processed, and then I can move on with my program.
What is such a system called? And are there models for that in C++?
I parallelise with OpenMP, and use advanced C++ techniques like templates and polymorphism.
Thank you for any efforts.
You can either use OpenMP parallel for with dynamic scheduling or OpenMP tasks. Both can be used to parallelise cases where each iteration takes a different amount of time to complete. With a dynamically scheduled for:
#pragma omp parallel
{
Fitter fitter;
fitter.init();
#pragma omp for schedule(dynamic,1)
for (int i = 0; i < numFits; i++)
fitter.fit(..., &results[i]);
}
schedule(dynamic,1) makes each thread execute one iteration at a time and threads are never left idle unless there are no more iterations to process.
With tasks:
#pragma omp parallel
{
Fitter fitter;
fitter.init();
#pragma omp single
for (int i = 0; i < numFits; i++)
{
#pragma omp task
fitter.fit(..., &results[i]);
}
#pragma omp taskwait
// ^^^ only necessary if more code before the end of the parallel region
}
Here one of the threads runs the for loop and produces 1000 OpenMP tasks. The tasks are kept in a queue and processed by idle threads. This works somewhat similarly to a dynamically scheduled for loop, but allows for greater freedom in the code constructs (e.g. with tasks you can parallelise recursive algorithms). The taskwait construct waits for all pending tasks to be done. It is implied at the end of the parallel region, so it is really necessary only if more code follows before the end of the parallel region.
In both cases each invocation of fit() may be executed by a different thread. You have to make sure that fitting one set of parameters does not affect fitting other sets, e.g. that fit() is a thread-safe method/function. Both cases also require that the time to execute fit() is much higher than the overhead of the OpenMP constructs.
OpenMP tasking requires OpenMP 3.0 compliant compiler. This rules out all versions of MS VC++ (even the one in VS2012), should you happen to develop on Windows.
If you'd like to have only one instance of fitter ever initialised per thread, then you should take somewhat different approach, e.g. make the fitter object global and threadprivate:
#include <omp.h>
Fitter fitter;
#pragma omp threadprivate(fitter)
...
int main()
{
// Disable dynamic teams
omp_set_dynamic(0);
// Initialise all fitters once per thread
#pragma omp parallel
{
fitter.init();
}
...
#pragma omp parallel
{
#pragma omp for schedule(dynamic,1)
for (int i = 0; i < numFits; i++)
fitter.fit(..., &results[i]);
}
...
return 0;
}
Here fitter is a global instance of the Fitter class. The omp threadprivate directive instructs the compiler to put it in thread-local storage, i.e. to make it a per-thread global variable. These persist between different parallel regions. You can also use omp threadprivate on static local variables. These too persist between different parallel regions (but only in the same function):
#include <omp.h>
int main()
{
// Disable dynamic teams
omp_set_dynamic(0);
static Fitter fitter; // must be static
#pragma omp threadprivate(fitter)
// Initialise all fitters once per thread
#pragma omp parallel
{
fitter.init();
}
...
#pragma omp parallel
{
#pragma omp for schedule(dynamic,1)
for (int i = 0; i < numFits; i++)
fitter.fit(..., &results[i]);
}
...
return 0;
}
The omp_set_dynamic(0) call disables dynamic teams, i.e. each parallel region will always execute with as many threads as specified by the OMP_NUM_THREADS environment variable.
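For completeness, here is a self-contained sketch of the dynamically scheduled version; the Fitter class, its fit() signature and the results vector are hypothetical stubs, just enough to make the pattern compile and run:
#include <cstdio>
#include <cmath>
#include <vector>
#include <omp.h>

// Hypothetical stand-in for the real fitting class; each thread uses its own instance.
struct Fitter {
    void init() { /* allocate buffers, set up solver state, etc. */ }
    void fit(int datasetId, double* result) {
        // Dummy work whose duration varies per dataset.
        double acc = 0.0;
        for (int k = 0; k < 1000 * (datasetId % 7 + 1); k++)
            acc += std::sin(k * 0.001);
        *result = acc;
    }
};

int main()
{
    const int numFits = 1000;
    std::vector<double> results(numFits);

    #pragma omp parallel
    {
        Fitter fitter;   // one private fitter per thread
        fitter.init();

        // dynamic,1: each thread grabs the next unprocessed dataset
        // as soon as it finishes its current one (FIFO-like behaviour).
        #pragma omp for schedule(dynamic,1)
        for (int i = 0; i < numFits; i++)
            fitter.fit(i, &results[i]);
    }

    printf("results[0]=%f results[999]=%f\n", results[0], results[999]);
    return 0;
}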
What you basically want is a pool of workers (or a thread pool) which takes a job from a queue, processes it, and proceeds with another job afterwards. OpenMP provides different approaches to handle such tasks, e.g. barriers (all workers run until a certain point and only proceed when a certain requirement is fulfilled) or reductions to accumulate values into a global variable after the workers have computed their respective parts.
Your question is very broad, but one more hint I can give you is to take a look at the MapReduce paradigm. In this paradigm, a function is mapped over a dataset and the results are sorted into buckets which are reduced using another function (possibly the same function again). In your case this would mean that each of your processors/cores/nodes maps a given function over its assigned set of data and sends the result buckets to another node responsible for combining them. I guess you would have to look into MPI if you want to use MapReduce with C++ without a specific MapReduce framework. As you are running the program on one node, you may be able to do something similar with OpenMP, so searching the web for that might help.
TL;DR search for pool of workers (thread pool), barriers and MapReduce.
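As a rough, single-node illustration of the map-then-reduce idea with OpenMP (my own sketch; square() is just a placeholder for the mapped function):
#include <cstdio>
#include <omp.h>

// Placeholder "map" function applied to each element.
static double square(double x) { return x * x; }

int main()
{
    const int n = 1000;
    static double data[1000];
    for (int i = 0; i < n; i++)
        data[i] = i * 0.01;

    double sum = 0.0;

    // Map square() over the data in parallel and reduce the results
    // into a single value; OpenMP combines the per-thread partial sums.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += square(data[i]);

    printf("sum of squares = %f\n", sum);
    return 0;
}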

How do I conditionally terminate a parallel region in OpenMP?

I have an OpenMP C++ program with parallel regions that contain #pragma omp task. Now, I would like to know how to terminate a parallel region depending on a condition that any of the running threads meets.
#pragma omp parallel
{
#pragma omp task
{
// upon reaching a condition I would like to break out of the parallel region (all threads should exit this parallel region)
}
}
You can't terminate a parallel construct prematurely. OpenMP has no construct for this and it specifies that parallel regions may have only one exit point (so no branching out of the region...).
I think the only (sane and portable) way to accomplish that is to have a variable which indicates whether the work is finished and have the threads check that variable regularly (using atomic operations and/or flushes to ensure correct visibility). If the variable indicates that the work is done, the threads can skip their remaining work (by putting the remaining work in an if body that isn't entered once the work is done).
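A minimal sketch of that flag-based pattern (the search loop and the termination condition are hypothetical placeholders; the atomic read/write form requires OpenMP 3.1):
#include <cstdio>
#include <omp.h>

int main()
{
    int done = 0;      // shared flag, set once any thread meets the condition
    int result = -1;

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 1000000; i++)
        {
            int finished;
            #pragma omp atomic read
            finished = done;

            if (!finished)           // skip the remaining work once someone is done
            {
                if (i == 123456)     // placeholder for the real termination condition
                {
                    #pragma omp atomic write
                    done = 1;
                    #pragma omp critical
                    result = i;
                }
            }
        }
    } // the region still has a single exit point: its implicit barrier

    printf("result = %d\n", result);
    return 0;
}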
It might be possible to write system-specific code which suspends the other threads and sets them to the end of the block (e.g. by manipulating stack and instruction pointers...), but that doesn't seem very advisable (meaning it's probably very brittle).
If you'd tell us a bit more about what you are trying to do (and why you need this), it might be easier to help you (e.g. by proposing a design which doesn't need to do this).