How do "omp single" and "omp task" provide parallelism? - c++

I am confused about omp single and omp task directives. I have read several examples which use both of them. The following example shows how to use the task construct to process elements of a linked list.
 1  #pragma omp parallel
 2  {
 3      #pragma omp single
 4      {
 5          for(node* p = head; p; p = p->next)
 6          {
 7              #pragma omp task
 8              process(p);
 9          }
10      }
11  }
I am failing to understand the parallelism in this example. With omp single, only one thread executes the structured block associated with the single construct, is that right? In this example, lines 4-10 form the structured block associated with the single construct and it is executed only once, so why can we use omp task inside this block? How can it work in a parallel manner?

Adding to the other answers, let me dig a bit deeper into what happens during execution.
 1  #pragma omp parallel
 2  {
 3      #pragma omp single
 4      {
 5          for(node* p = head; p; p = p->next)
 6          {
 7              #pragma omp task
 8              process(p);
 9          }
10      } // barrier of single construct
11  }
In the code, I have marked a barrier that is introduced at the end of the single construct.
What happens is this:
First, when encountering the parallel construct, the main thread spawns the parallel region and creates a bunch of worker threads. Then you have n threads running and executing the parallel region.
Second, the single construct picks any one of the n threads and executes the code inside the curly braces of the single construct. All other n-1 threads proceed to the barrier in line 10. There, they wait for the last thread to catch up and complete the barrier synchronization. While these threads are waiting at the barrier, they are not simply wasting time: they are also waiting for work to arrive that they can execute.
Third, the thread that was picked by the single construct (the "producer") executes the for loop and for each iteration it creates a new task. This task is then put into a task pool so that another thread (one of the ones in the barrier) can pick it up and execute it. Once the producer is done creating tasks, it will join the barrier and if there are still tasks in the task pool waiting for execution, it will help the other threads execute tasks.
Fourth, once all tasks have been generated and executed that way, all threads are done and the barrier synchronization is complete.
I have simplified a bit here and there, as there are more aspects to how an OpenMP implementation can execute tasks, but from a conceptual point of view, the above is what you can think of as happening until you're ready to dive into the specific aspects of task scheduling in the OpenMP API.
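To see this producer/consumer pattern in action, here is a small self-contained sketch (the eight-node list and the body of process() are made up for illustration) that prints which thread ends up running each task; when run with several threads, the nodes are typically spread across different thread IDs:
#include <cstdio>
#include <omp.h>

struct node {
    int   value;
    node* next;
};

// Hypothetical stand-in for process(): report which thread runs this task.
static void process(node* p) {
    std::printf("node %d processed by thread %d\n",
                p->value, omp_get_thread_num());
}

int main() {
    // Build a small linked list 0 -> 1 -> ... -> 7.
    node nodes[8];
    for (int i = 0; i < 8; ++i) {
        nodes[i].value = i;
        nodes[i].next  = (i + 1 < 8) ? &nodes[i + 1] : nullptr;
    }
    node* head = &nodes[0];

    #pragma omp parallel
    {
        #pragma omp single           // one producer thread generates the tasks
        {
            for (node* p = head; p; p = p->next) {
                #pragma omp task firstprivate(p)   // each task gets its own copy of p
                process(p);
            }
        }   // implicit barrier: the waiting threads pick up and execute the tasks
    }
    return 0;
}
Compile with OpenMP enabled (e.g. g++ -fopenmp); without it the pragmas are ignored and everything runs on one thread.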

#pragma omp task schedules a task from within the single construct, but the task can be executed by any thread of the team. This is one purpose of using OpenMP tasks: the thread that creates the work is not necessarily the one that executes the computations in parallel.
Note that tasks are executed at task scheduling points. The implicit barrier at the end of an omp parallel region is such a scheduling point. This is why the other OpenMP threads end up executing the scheduled tasks (as long as the tasks live long enough).

So we will go step by step:
When you write #pragma omp parallel, the parallel region creates a team of threads.
Then you wrote #pragma omp single: a single thread of that team executes the enclosed block, and it is this thread that creates the tasks.
Finally you wrote #pragma omp task. The thread that executes this code segment creates a task and adds it to a queue that belongs to the team; the task will later be executed, probably by a different thread, since any thread in that team can run it (possibly including the one that generated it). The exact timing of the execution of a task is up to the task scheduler, which operates invisibly to the user.

Related

Performance issues of multiple independent for loop with openMp

I am planning to use OpenMP threads for an intensive computation. However, I couldn't achieve the performance I expected on my first attempt. I think there are several issues here, but I am not sure yet. Generally, I suspect the performance bottleneck is caused by the fork-and-join model. Can you help me in some way?
First, in a routine cycle running on a consumer thread, there are 2 independent for loops and some additional functions. The functions are located at the end of the routine cycle and between the for loops, as seen below:
void routineFunction(short* xs, float* xf, float* yf, float* h)
{
    // Casting
    #pragma omp parallel for
    for (int n = 0; n < 1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }

    memset(yf, 0, 1024*1024*sizeof(float));

    // Filtering
    #pragma omp parallel for
    for (int n = 0; n < 1024*1024-1024; n++)
    {
        for (int nn = 0; nn < 1024; nn++)
        {
            yf[n] += xf[n+nn]*h[nn];
        }
    }

    status = DftiComputeBackward(hand, yf, yf); // Compute backward transform
}
Note: This code cannot be compiled, because I made it more readable by removing details.
The OpenMP thread number is set to 8 dynamically. I observed the threads in use in the Windows Task Manager. While the number of threads increased significantly, I didn't observe any performance improvement. I have some guesses, but I still want to discuss them with you before further implementations.
My questions are these.
Does the fork-and-join model correspond to thread creation and destruction? Does it have the same cost for the software?
Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?
During the running of routineFunction, do the OpenMP threads fork and join at each for loop? Or does the compiler let the second loop work with the existing threads? If the for loops cause two forks and joins, how should I restructure the code? Is combining the two loops into a single loop sensible for performance, or is using a parallel region (#pragma omp parallel) plus #pragma omp for (rather than #pragma omp parallel for) the better choice for sharing the work? My concern is that this forces me into static scheduling by using the thread id and the number of threads. According to the document at page 34, static scheduling can cause load imbalance. I am actually familiar with static scheduling because of CUDA programming, but I would still like to avoid it if there is any performance issue. I also read an answer on Stack Overflow (the last paragraph, written by Alexey Kukanov) which points out that smart OpenMP implementations do not join the master thread after a parallel region is completed. How can I use the busy-wait and sleep policies of OpenMP to avoid joining the master thread after the first loop is completed?
Is there another reason for the performance issue in the code?
This is mostly memory-bound code. Its performance and scalability are limited by the amount of data the memory channel can transfer per unit time. xf and yf take 8 MiB in total, which fits in the L3 cache of most server-grade CPUs but not of most desktop or laptop CPUs. If two or three threads are already able to saturate the memory bandwidth, adding more threads is not going to bring additional performance. Also, casting short to float is a relatively expensive operation - 4 to 5 cycles on modern CPUs.
Does the fork-and-join model correspond to thread creation and destruction? Does it have the same cost for the software?
Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?
No. Basically all OpenMP runtimes, including that of MSVC++, implement parallel regions using thread pools, as this is the easiest way to satisfy the requirement of the OpenMP specification that thread-private variables retain their values between the different parallel regions. Only the very first parallel region suffers the full cost of starting new threads. Subsequent regions reuse those threads, and an additional price is paid only if more threads are needed than in any of the previously executed parallel regions. There is still some overhead, but it is far lower than that of starting new threads each time.
During the running of routineFunction, do the OpenMP threads fork and join at each for loop? Or does the compiler let the second loop work with the existing threads?
Yes, in your case two separate parallel regions are created. You can manually merge them into one:
#pragma omp parallel
{
    #pragma omp for
    for (int n = 0; n < 1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }

    #pragma omp single
    {
        memset(yf, 0, 1024*1024*sizeof(float));
        //
        // Other code that was between the two parallel regions
        //
    }

    // Filtering
    #pragma omp for
    for (int n = 0; n < 1024*1024-1024; n++)
    {
        for (int nn = 0; nn < 1024; nn++)
        {
            yf[n] += xf[n+nn]*h[nn];
        }
    }
}
Is there another reason for performance issue in the code?
It is memory-bound, or at least the two loops shown here are.
Alright, it's been a while since I did OpenMP stuff so hopefully I didn't mess any of this up... but here goes.
Forking and joining is the same thing as creating and destroying threads. How the cost compares to other threads (such as a C++11 thread) will be implementation dependent. I believe in general OpenMP threads might be slightly lighter-weight than C++11 threads, but I'm not 100% sure about that. You'd have to do some testing.
Currently, each time routineFunction is called you will fork for the first for loop, join, do a memset, fork for the second loop, join, and then call DftiComputeBackward.
You would be better off creating a parallel region as you stated. Not sure why the scheduling is an extra concern. It should be as easy as moving your memset to the top of the function, starting a parallel region using your noted command, and making sure each for loop is marked with #pragma omp for as you mentioned. You may need to put an explicit #pragma omp barrier in between the two for loops to make sure all threads finish the first for loop before starting the second... OpenMP has some implicit barriers but I forgot if #pragma omp for has one or not.
Make sure that the OpenMP compile flag is turned on for your compiler. If it isn't, the pragmas will be ignored, it will compile, and nothing will be different.
Your operations are prime for SIMD acceleration. You might want to see if your compiler supports auto-vectorization and if it is doing it. If not, I'd look into SIMD a bit, perhaps using intrinsics.
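If the compiler does not auto-vectorize, one portable alternative to raw intrinsics would be OpenMP's own simd directive (available with OpenMP 4.0+ compilers). A minimal sketch of just the casting loop, assuming xs and xf are valid, non-overlapping buffers as in routineFunction:
// Sketch only: xs and xf are assumed to point to valid, non-overlapping
// buffers of at least count elements.
void castShortToFloat(const short* xs, float* xf, int count)
{
    // Distribute the loop across threads and vectorize each thread's chunk.
    #pragma omp parallel for simd
    for (int n = 0; n < count; n++)
    {
        xf[n] = (float)xs[n];
    }
}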
How much time does DftiComputeBackward take relative to this code?

Understanding #pragma omp parallel

I am reading about OpenMP and it sounds amazing. I came to a point where the author states that #pragma omp parallel can be used to create a new team of threads. So I wanted to know what difference #pragma omp parallel makes here. I read that #pragma omp for uses the current team of threads to process a for loop. So I have two examples.
First simple example:
#pragma omp for
for(int n=0; n<10; ++n)
{
    printf(" %d", n);
}
printf(".\n");
Second example
#pragma omp parallel
{
    #pragma omp for
    for(int n=0; n<10; ++n) printf(" %d", n);
}
printf(".\n");
My question is: are those threads created on the fly every time, or only once when the application starts? Also, when or why would I want to create a team of more threads?
Your first example won't do what you expect on its own. "#pragma omp for" tells the compiler to distribute the workload of the following loop within a team of threads, which you have to create first. A team of threads is created with the "#pragma omp parallel" statement, as you use it in the second example. You can combine the "omp parallel" and "omp for" directives by using "#pragma omp parallel for".
The team of threads is created at the parallel statement and is valid within that block.
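For reference, here is a minimal, self-contained version of your second example using the combined directive; the numbers may print in any order because the iterations run concurrently:
#include <cstdio>

int main()
{
    // "parallel for" creates the team of threads and distributes
    // the loop iterations among them in a single directive.
    #pragma omp parallel for
    for (int n = 0; n < 10; ++n)
        std::printf(" %d", n);
    std::printf(".\n");
    return 0;
}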
TL;DR: The only difference is that the first code runs into two implicit barriers whereas the second runs into only one.
A more detailed answer, using the official OpenMP 5.1 standard as a reference.
The
#pragma omp parallel:
will create a parallel region with a team of threads, where each thread will execute the entire block of code that the parallel region encloses.
From the OpenMP 5.1 standard one can read a more formal description:
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region (..). The
thread that encountered the parallel construct becomes the primary
thread of the new team, with a thread number of zero for the duration
of the new parallel region. All threads in the new team, including the
primary thread, execute the region. Once the team is created, the
number of threads in the team remains constant for the duration of
that parallel region.
The:
#pragma omp parallel for
will create a parallel region (as described before), and the iterations of the loop that it encloses will be assigned to the threads of that region, using the default chunk size and the default schedule, which is typically static. Bear in mind, however, that the default schedule may differ among concrete implementations of the OpenMP standard.
From the OpenMP 5.1 standard you can read a more formal description:
The worksharing-loop construct specifies that the iterations of one or
more associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks. The iterations are
distributed across threads that already exist in the team that is
executing the parallel region to which the worksharing-loop region
binds.
Moreover,
The parallel loop construct is a shortcut for specifying a parallel
construct containing a loop construct with one or more associated
loops and no other statements.
Or informally, #pragma omp parallel for is a combination of the construct #pragma omp parallel with #pragma omp for.
Both versions that you have, with a chunk_size=1 and a static schedule, would result in the iterations being dealt out round-robin to the threads. Code-wise, the loop would be transformed to something logically similar to:
for(int i=omp_get_thread_num(); i < n; i+=omp_get_num_threads())
{
//...
}
where omp_get_thread_num()
The omp_get_thread_num routine returns the thread number, within the
current team, of the calling thread.
and omp_get_num_threads()
Returns the number of threads in the current team. In a sequential
section of the program omp_get_num_threads returns 1.
or in other words, for(int i = THREAD_ID; i < n; i += TOTAL_THREADS). With THREAD_ID ranging from 0 to TOTAL_THREADS - 1, and TOTAL_THREADS representing the total number of threads of the team created on the parallel region.
A "parallel" region can contain more than a simple "for" loop.
The first time your program encounters "parallel", the OpenMP thread team will be created. After that, every OpenMP construct will reuse those threads for loops, sections, tasks, etc.

OpenMP construct to continue execution as soon as at least 1 thread is finished

I need to continue execution as soon as one of the threads has finished execution. The logic inside the parallel section will ensure that everything has been completed satisfactorily. I have nested parallelisation, so I put some of the top-level threads to sleep when data is not ready to be processed, so as not to consume computation power. So when one of the top-level threads finishes, I want to continue execution and not wait for the other threads to wake up and naturally return.
I use
#pragma omp parallel for num_threads(wanted_thread_no)
How do you parallelise? Do you use tasks, sections, or something else?
If I understood correctly, and if you are using the task primitive, you can use a nowait clause after the last task.
Check this pdf on page 13 (of the pdf).
http://openmp.org/wp/presos/omp-in-action-SC05.pdf
It explicitly says:
By default, there is a barrier at the end of the “omp for”. Use the
“nowait” clause to turn off the barrier.
#pragma omp for nowait
"nowait" is useful between two consecutive, independent omp for loops.
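For illustration, a minimal sketch of that situation (arrays and sizes are made up): the two loops below are independent, so nowait on the first removes the implicit barrier and lets threads start the second loop as soon as they finish their share of the first:
#include <vector>

int main()
{
    const int N = 1000;
    std::vector<double> a(N), b(N);

    #pragma omp parallel
    {
        // nowait removes the implicit barrier at the end of this loop.
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            a[i] = 2.0 * i;

        // Independent of the first loop, so threads may start it early.
        #pragma omp for
        for (int i = 0; i < N; ++i)
            b[i] = 3.0 * i;
    }
    return 0;
}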
Is this what you want?
Also take a look on this as well, even if it says the same thing.
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Scheduling system for fitting

I would like to parallelise a linear operation (fitting a complicated mathematical function to some dataset) with multiple processors.
Assume I have 8 cores in my machine, and I want to fit 1000 datasets. What I expect is some system that takes the 1000 datasets as a queue and sends them to the 8 cores for processing, so it starts by taking the first 8 of the 1000, FIFO. The fitting time of each dataset is in general different from the others, so some of the 8 datasets being fitted could take longer than others. What I want from the system is to save the results of the fitted datasets and then have each thread that is done take a new dataset from the big queue (of 1000 datasets). This has to continue until all 1000 datasets are processed. Then I can move on with my program.
What is such a system called? And are there models for that in C++?
I parallelise with OpenMP, and use advanced C++ techniques like templates and polymorphism.
Thank you for any efforts.
You can either use an OpenMP parallel for with dynamic schedule or OpenMP tasks. Both can be used to parallelise cases where each iteration takes a different amount of time to complete. With a dynamically scheduled for:
#pragma omp parallel
{
    Fitter fitter;
    fitter.init();

    #pragma omp for schedule(dynamic,1)
    for (int i = 0; i < numFits; i++)
        fitter.fit(..., &results[i]);
}
schedule(dynamic,1) makes each thread execute one iteration at a time and threads are never left idle unless there are no more iterations to process.
With tasks:
#pragma omp parallel
{
    Fitter fitter;
    fitter.init();

    #pragma omp single
    for (int i = 0; i < numFits; i++)
    {
        #pragma omp task
        fitter.fit(..., &results[i]);
    }

    #pragma omp taskwait
    // ^^^ only necessary if more code before the end of the parallel region
}
Here one of the threads runs a for loop which produces 1000 OpenMP tasks. The tasks are kept in a queue and processed by idle threads. This works somewhat similarly to dynamic for loops but allows greater freedom in the code constructs (e.g. with tasks you can parallelise recursive algorithms). The taskwait construct waits for all pending child tasks to be done. All tasks are also guaranteed to have completed at the implicit barrier at the end of the parallel region, so an explicit taskwait is really necessary only if more code follows before the end of the parallel region.
In both cases each invocation to fit() will be done in a different thread. You have to make sure that fitting one set of parameters does not affect fitting other sets, e.g. that fit() is a thread-safe method/function. Both cases also require that the time to execute fit() is much higher than the overhead of the OpenMP constructs.
OpenMP tasking requires OpenMP 3.0 compliant compiler. This rules out all versions of MS VC++ (even the one in VS2012), should you happen to develop on Windows.
If you'd like to have only one instance of fitter ever initialised per thread, then you should take somewhat different approach, e.g. make the fitter object global and threadprivate:
#include <omp.h>

Fitter fitter;
#pragma omp threadprivate(fitter)

...

int main()
{
    // Disable dynamic teams
    omp_set_dynamic(0);

    // Initialise all fitters once per thread
    #pragma omp parallel
    {
        fitter.init();
    }

    ...

    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic,1)
        for (int i = 0; i < numFits; i++)
            fitter.fit(..., &results[i]);
    }

    ...

    return 0;
}
Here fitter is a global instance of the Fitter class. The omp threadprivate directive instructs the compiler to put it in thread-local storage, i.e. to make it a per-thread global variable. These copies persist between the different parallel regions. You can also use omp threadprivate on static local variables. These too persist between the different parallel regions (but only within the same function):
#include <omp.h>

int main()
{
    // Disable dynamic teams
    omp_set_dynamic(0);

    static Fitter fitter; // must be static
    #pragma omp threadprivate(fitter)

    // Initialise all fitters once per thread
    #pragma omp parallel
    {
        fitter.init();
    }

    ...

    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic,1)
        for (int i = 0; i < numFits; i++)
            fitter.fit(..., &results[i]);
    }

    ...

    return 0;
}
The omp_set_dynamic(0) call disables dynamic teams, i.e. each parallel region will always execute with as many threads as specified by the OMP_NUM_THREADS environment variable.
What you basically want is a pool of workers (or a thread pool) which take a job from a queue, process it, and proceed with another job afterwards. OpenMP provides different approaches to handle such tasks, e.g. barriers (all workers run until a certain point and only proceed when a certain requirement is fulfilled) or reductions to accumulate values into a global variable after the workers managed to compute their respective parts.
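As a quick illustration of the reduction idea (a toy sum, not your fitting problem): each thread accumulates a private partial result and OpenMP combines the partial results into the shared variable at the end of the loop.
#include <cstdio>

int main()
{
    const int N = 1000;
    double sum = 0.0;

    // Each thread gets a private copy of sum; the copies are added
    // together into the shared variable when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += 0.5 * i;

    std::printf("sum = %f\n", sum);
    return 0;
}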
Your question is very broad, but one more hint I can give you is to take a look at the MapReduce paradigm. In this paradigm, a function is mapped over a dataset and the results are sorted into buckets which are reduced using another function (possibly the same function again). In your case this would mean that each of your processors/cores/nodes maps a given function over its assigned set of data and sends the result buckets to another node responsible for combining them. I guess you would have to look into MPI if you want to use MapReduce with C++ without using a specific MapReduce framework. Since you are running the program on one node, perhaps you can do something similar with OpenMP, so searching the web for that might help.
TL;DR search for pool of workers (thread pool), barriers and MapReduce.

How do I conditionally terminate a parallel region in OpenMP?

I have an OpenMP C++ program. There are parallel regions that contain #pragma omp task directives. Now, I would like to know how to terminate the parallel region depending on a condition that any one of the running threads meets.
#pragma omp parallel
{
    #pragma omp task
    {
        // Upon reaching a condition I would like to break out of the parallel region
        // (all threads should exit this parallel region).
    }
}
You can't terminate a parallel construct prematurely. OpenMP has no construct for this and it specifies that parallel regions may have only one exit point (so no branching out of the region...).
I think the only (sane and portable) way to accomplish that is to have a variable which indicates whether the work is finished and have the threads check that variable regularly (using atomic operations and/or flushes to ensure correct visibility). If the variable indicates that the work is done, the threads can skip their remaining work (by wrapping the remaining work in an if whose body is not entered once the work is done).
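A minimal sketch of that idea (work_item() and the loop bound are made up; the atomic read/write forms require OpenMP 3.1): a shared flag is set atomically by the thread that meets the condition, and the other threads skip the body of their remaining iterations once they see it. Note that you cannot break out of the loop itself; you can only make the leftover iterations cheap.
#include <cstdio>

// Hypothetical unit of work; returns true when the sought condition is met.
static bool work_item(int i)
{
    return i == 1234;   // stand-in for a real termination condition
}

int main()
{
    int done = 0;       // shared flag, accessed only via atomics

    #pragma omp parallel for
    for (int i = 0; i < 100000; ++i)
    {
        int local_done;
        #pragma omp atomic read
        local_done = done;

        if (!local_done)            // skip the work once someone has finished
        {
            if (work_item(i))
            {
                #pragma omp atomic write
                done = 1;
            }
        }
    }

    std::printf("done = %d\n", done);
    return 0;
}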
It might be possible to write system-specific code which suspends the other threads and sets them to the end of the block (e.g. by manipulating stack and instruction pointers...), but that doesn't seem very advisable (meaning it's probably very brittle).
If you'd tell us a bit more about what you are trying to do (and why you need this), it might be easier to help you (e.g. by proposing a design which doesn't need to do this).