I have a C++ program that uses OpenMP. It has parallel regions that contain #pragma omp task constructs. I would like to know how to terminate a parallel region as soon as any of the running threads meets a certain condition.
#pragma omp parallel
{
#pragma omp task
{
//upon reaching a condition i would like to break out of the parallel region. (all threads should exit this parallel region)
}
}
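You can't branch out of a parallel construct prematurely. OpenMP specifies that parallel regions may have only one exit point (so no branching out of the region...), and before version 4.0 it had no construct for this at all. (OpenMP 4.0 added a cancel construct, but it only takes effect at cancellation points and only if cancellation is enabled in the runtime.)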
I think the only (sane and portable) way to accomplish that is to have a variable which indicates whether the work is finished and have the threads check that variable regularly (using atomic instructions and/or flushes to ensure correct visibility). If the variable indicates that the work is done, the threads can skip their remaining work (by putting the remaining work in an if body which isn't entered once the work is done).
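A minimal sketch of this flag approach, assuming a made-up workload (searching a vector for a value, one task per chunk). The names find_any_match, done, match and store_match are hypothetical and not from the question; the flag is read and written with atomic constructs as described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workload: look for `target` in `data`, one task per chunk.
// Returns the index of a match, or -1 if there is none.
long find_any_match(const std::vector<int>& data, int target,
                    std::size_t chunk = 1024)
{
    assert(chunk > 0);
    int done = 0;     // shared flag: non-zero once the work is finished
    long match = -1;  // result, written by the task that finds a match

    #pragma omp parallel
    #pragma omp single
    {
        for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
            #pragma omp task firstprivate(begin) shared(data, done, match)
            {
                const std::size_t end = std::min(begin + chunk, data.size());
                for (std::size_t i = begin; i < end; ++i) {
                    int stop;
                    #pragma omp atomic read
                    stop = done;
                    if (stop) break;  // work is done elsewhere: skip the rest

                    if (data[i] == target) {
                        #pragma omp critical(store_match)
                        { if (match < 0) match = static_cast<long>(i); }
                        #pragma omp atomic write
                        done = 1;
                        break;
                    }
                }
            }
        }
    }  // implicit barrier: all tasks have completed here
    return match;
}
```

Note that no thread leaves the parallel region early. Tasks that start after the flag is set still run, but they return almost immediately, so the region ends through its single regular exit point.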
It might be possible to write system-specific code which suspends the other threads and sets them to the end of the block (e.g. by manipulating stack and instruction pointers...), but that doesn't seem very advisable (meaning it's probably very brittle).
If you'd tell us a bit more about what you are trying to do (and why you need this), it might be easier to help you (e.g. by proposing a design which doesn't need to do this).
Related
I am trying to use an OpenMP directive to parallelize a piece of code, but I am not able to achieve any speedup. The following is the piece of code that I am trying to parallelize:
#pragma omp parallel private(i,j) shared(a,x,n) default(none)
{
for(j=n-1;j>=0;j--)
{
x[j] = A(j,n,n)/A(j,j,n);
#pragma omp for schedule(dynamic)
for (i=0;i<=j-1;i++)
{
A(i,n,n )= A(i,n,n) - A(i,j,n)*x[j];
}
}
}
The value of n is 1000. A(i,n,n) is a macro that is used to access the array a.
As I increase the number of threads, the execution time increases or remains the same. The machine I am working on has 4 cores. I am surprised that there is no speedup even when the number of threads is 2.
I am not able to figure out what I am doing wrong.
Since n>>#CPUs (I don't think you have 1000 CPUs), it is not wise to parallelize the inner loop. In your example, you redistribute the work at each iteration.
Instead, it is wiser to parallelize the outer loop. This way, the value of x[j] won't be updated concurrently by different threads (as Zulan mentioned), and you will have much less work re-distribution.
In that case, using dynamic scheduling is wise since the quantity of work changes at each iteration.
Note: You will have to change the order of the calculation; the current implementation does not allow you to move the parallelization to the outer loop, since all of the threads would update the same value (A(i,n,n)).
Although it is true that creating threads takes time, the threads are not recreated at each iteration. They are created only once, at the top #pragma statement, and run concurrently for the entire following block.
I am planning to use OpenMP threads for an intensive computation. However, I couldn't get the performance I expected in my first trial. I think there are several issues, but I have not confirmed them yet. In general, I suspect the performance bottleneck is caused by the fork and join model. Can you help me with this?
First, in a routine cycle, running on a consumer thread, there are 2 independent for loops and some additional functions. The functions are located at the end of the routine cycle and between the for loops, as shown below:
void routineFunction(short* xs, float* xf, float* yf, float* h)
{
// Casting
#pragma omp parallel for
for (int n = 0; n<1024*1024; n++)
{
xf[n] = (float)xs[n];
}
memset(yf,0,1024*1024*sizeof( float ));
// Filtering
#pragma omp parallel for
for (int n = 0; n<1024*1024-1024; n++)
{
for(int nn = 0; nn<1024; nn++)
{
yf[n]+=xf[n+nn]*h[nn];
}
}
status = DftiComputeBackward(hand, yf, yf); // Compute backward transform
}
Note: This code cannot be compiled, because I removed details to make it more readable.
The number of OpenMP threads is set to 8 dynamically. I observed the threads in use in the Windows taskbar. Although the number of threads increased significantly, I didn't observe any performance improvement. I have some guesses, but I would still like to discuss them with you before going further.
My questions are these.
Does the fork and join model correspond to thread creation and destruction? Does it have the same cost for the software?
Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?
During the run of routineFunction, do the OpenMP threads fork and join at each for loop, or does the compiler help the second loop by reusing the existing threads? If the for loops cause two forks and joins, how should I restructure the code? Is combining the two loops into a single loop sensible for performance, or is using a parallel region (#pragma omp parallel) with #pragma omp for (not #pragma omp parallel for) a better choice for sharing the work? My concern is that this would force me into static scheduling based on thread id and thread count. According to the document (page 34), static scheduling can cause load imbalance. I am familiar with static scheduling from CUDA programming, but I would still like to avoid it if it causes any performance issue. I also read an answer on Stack Overflow by Alexey Kukanov whose last paragraph points out that smart OpenMP algorithms do not join the master thread after a parallel region is completed. How can I use the busy-wait and sleep attributes of OpenMP to avoid joining the master thread after the first loop is completed?
Is there another reason for the performance issue in the code?
This is mostly memory-bound code. Its performance and scalability are limited by the amount of data the memory channel can transfer per unit time. xf and yf take 8 MiB in total, which fits in the L3 cache of most server-grade CPUs but not of most desktop or laptop CPUs. If two or three threads are already able to saturate the memory bandwidth, adding more threads is not going to bring additional performance. Also, casting short to float is a relatively expensive operation - 4 to 5 cycles on modern CPUs.
Does the fork and join model correspond to thread creation and destruction? Does it have the same cost for the software?
Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?
No, basically all OpenMP runtimes, including that of MSVC++, implement parallel regions using thread pools, as this is the easiest way to satisfy the requirement of the OpenMP specification that thread-private variables retain their value between different parallel regions. Only the very first parallel region suffers the full cost of starting new threads. Subsequent regions reuse those threads, and an additional price is paid only if more threads are needed than in any of the previously executed parallel regions. There is still some overhead, but it is far lower than that of starting new threads each time.
During the run of routineFunction, do the OpenMP threads fork and join at each for loop, or does the compiler help the second loop by reusing the existing threads?
Yes, in your case two separate parallel regions are created. You can manually merge them into one:
#pragma omp parallel
{
#pragma omp for
for (int n = 0; n<1024*1024; n++)
{
xf[n] = (float)xs[n];
}
#pragma omp single
{
memset(yf,0,1024*1024*sizeof( float ));
//
// Other code that was between the two parallel regions
//
}
// Filtering
#pragma omp for
for (int n = 0; n<1024*1024-1024; n++)
{
for(int nn = 0; nn<1024; nn++)
{
yf[n]+=xf[n+nn]*h[nn];
}
}
}
Is there another reason for performance issue in the code?
It is memory-bound, or at least the two loops shown here are.
Alright, it's been a while since I did OpenMP stuff so hopefully I didn't mess any of this up... but here goes.
Forking and joining is the same thing as creating and destroying threads. How the cost compares to other threads (such as a C++11 thread) will be implementation dependent. I believe in general OpenMP threads might be slightly lighter-weight than C++11 threads, but I'm not 100% sure about that. You'd have to do some testing.
Currently each time routineFunction is called you will fork for the first for loop, join, do a memset, fork for the second loop, join, and then call DftiComputeBackward
You would be better off creating a parallel region as you stated. Not sure why the scheduling is an extra concern. It should be as easy as moving your memset to the top of the function, starting a parallel region using your noted command, and making sure each for loop is marked with #pragma omp for as you mentioned. You should not need an explicit #pragma omp barrier in between the two for loops to make sure all threads finish the first for loop before starting the second: #pragma omp for has an implicit barrier at its end unless nowait is specified.
Make sure that the OpenMP compile flag is turned on for your compiler. If it isn't, the pragmas will be ignored, it will compile, and nothing will be different.
Your operations are prime for SIMD acceleration. You might want to see if your compiler supports auto-vectorization and if it is doing it. If not, I'd look into SIMD a bit, perhaps using intrinsics.
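As an alternative to intrinsics, here is a rough sketch of the filtering loop with an explicit OpenMP simd hint (OpenMP 4.0 or later). The function name filter and the use of std::vector are assumptions made for this illustration; whether the compiler really vectorizes the inner loop still has to be checked in its optimization report.

```cpp
#include <cstddef>
#include <vector>

// y[n] = sum over k of x[n + k] * h[k], for n in [0, x.size() - h.size()),
// the same index range as the filtering loop in the question.
std::vector<float> filter(const std::vector<float>& x,
                          const std::vector<float>& h)
{
    if (x.size() <= h.size()) return {};
    const long count = static_cast<long>(x.size() - h.size());
    const long taps  = static_cast<long>(h.size());
    const float* xp = x.data();
    const float* hp = h.data();
    std::vector<float> y(count);

    #pragma omp parallel for schedule(static)
    for (long n = 0; n < count; ++n) {
        float acc = 0.0f;  // private accumulator per output sample
        #pragma omp simd reduction(+:acc)
        for (long k = 0; k < taps; ++k)
            acc += xp[n + k] * hp[k];
        y[n] = acc;        // one store per output sample
    }
    return y;
}
```

Accumulating into a local variable instead of yf[n] also removes the need for the memset and reduces the memory traffic of the inner loop.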
How much time does DftiComputeBackward take relative to this code?
I am trying to implement a parallel algorithm with OpenMP.
In principle I should have many threads writing and reading different components of a shared vector in an asynchronous way.
There is a FOR loop in which the threads cycle and when a thread is in, let's say, row A of the loop it writes on a random component of the shared vector, while when it is in row B it reads a random component of the same shared vector.
It may happen that a thread would try to read a component of the shared vector while this component is written by another thread.
How to avoid inconsistency?
I read about locks and critical sections, but I think this is not the solution. For example, I can set a lock around row A in which the threads write in the shared vector, but does this prevent inconsistency if at the same time a thread in row B is trying to read that component?
If the vector modifications are very simple single-value assignment operations and are not actually function calls, what you need are probably atomic reads and writes. With atomic operations, a read from an array element that is simultaneously being written to will either return the new value or the previous value; it will never return some kind of a bit mash of the old and the new value. OpenMP provides the atomic construct for that purpose. On certain architectures, including x86, atomics are far more lightweight than critical sections.
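A small sketch of atomic reads and writes on a shared vector, under the assumption that the elements are plain int values. The function name and the index formulas (stand-ins for the random indices from the question) are made up for this illustration. The function counts reads that returned something other than the old or the new value, which should never happen.

```cpp
#include <vector>

// Hypothetical harness: `shared_vec` must start out filled with zeros.
// Iteration `it` writes the value w + 1 to slot w ("row A") and then reads
// some slot r ("row B"). Returns how many reads saw a value that is neither
// the old one (0) nor the new one (r + 1).
int count_inconsistent_reads(std::vector<int>& shared_vec, int iterations)
{
    const int size = static_cast<int>(shared_vec.size());
    if (size == 0) return 0;
    int* v = shared_vec.data();
    int bad = 0;

    #pragma omp parallel for reduction(+:bad)
    for (int it = 0; it < iterations; ++it) {
        const int w = (it * 7) % size;       // stand-in for a random index
        const int r = (it * 13 + 5) % size;  // stand-in for a random index

        #pragma omp atomic write
        v[w] = w + 1;

        int value;
        #pragma omp atomic read
        value = v[r];

        if (value != 0 && value != r + 1) ++bad;
    }
    return bad;
}
```

The atomic constructs only guarantee that each individual read or write is indivisible. Which of the two values a reader gets still depends on timing.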
With more complex modifications you have to use critical sections. Those could be either named or anonymous. The latter are created using
#pragma omp critical
code block
Anonymous critical sections all map to the same synchronisation object, no matter what the position of the construct in the source code, therefore it is possible for unrelated code sections to get synchronised with all possible ill effects, like performance degradation or even unexpected deadlocks. That's why it is advisable to always use named critical sections. For example, the following two code segments will not get synchronised:
// -- thread i --                  // -- thread j --
...                                ...
#pragma omp critical(foo)      <   #pragma omp critical(foo)
do_something();                <   do_something();
...                                ...
#pragma omp critical(bar)          #pragma omp critical(bar)      <
do_something_else();               do_something_else();           <
...                                ...
(the code currently being executed by each thread is marked with <)
Note that critical sections bind to all threads of the program, without regard to the team to which the threads belong. It means that even code that executes in different parallel regions (a situation that mainly arises when nested parallelism is used) gets synchronised.
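A minimal sketch of named critical sections, assuming two unrelated shared containers that need more than a single-value assignment. The function name classify and the section names update_counts and update_evens are hypothetical.

```cpp
#include <map>
#include <vector>

// Hypothetical example: two unrelated shared containers, each protected by
// its own named critical section, so updating one never blocks the other.
void classify(int n, std::map<int, int>& last_digit_count,
              std::vector<int>& evens)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical(update_counts)
        { ++last_digit_count[i % 10]; }

        if (i % 2 == 0) {
            #pragma omp critical(update_evens)
            { evens.push_back(i); }
        }
    }
}
```

With anonymous critical sections the two updates would share one lock, and a thread appending to evens would also block threads that only want to touch last_digit_count.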
I am writing computational code that more or less has the following schematic:
#pragma omp parallel
{
#pragma omp for nowait
// Compute elements of some array A[i] in parallel
#pragma omp single
for (i = 0; i < N; ++i) {
// Do some operation with A[i].
// This time it is important that operations are sequential. e.g.:
result = compute_new_result(result, A[i]);
}
}
Both computing A[i] and compute_new_result are rather expensive. So my idea is to compute the array elements in parallel and, if any of the threads becomes free, have it start doing the sequential operations. There is a good chance that the first array elements are already computed and the others will be provided by the other threads still working on the first loop.
However, to make the concept work I have to achieve two things:
To make OpenMP split the loops in alternative way, i.e. for two threads: thread 1 computing A[0], A[2], A[4] and thread 2: A[1], A[3], A[5], etc.
To provide some signaling system. I am thinking about an array of flags indicating that A[i] has already been computed. Then compute_new_result should wait for the flag for respective A[i] to be released before proceeding.
I would be glad for any hints how to achieve both goals. I need the solution to be portable across Linux, Windows and Mac. I am writing the whole code in C++11.
Edit:
I have figured out the answer to the first question. It looks like it is sufficient to add a schedule(static,1) clause to the #pragma omp for directive.
However, I am still thinking about an elegant solution to the second issue...
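For the first point, the effect of schedule(static,1) can be checked with a small sketch like the one below. The function name iteration_owners is made up for this illustration, and the _OPENMP guard is only there so that the code also builds when OpenMP is switched off.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns, for each iteration, the number of the thread that executed it.
// `team_size` receives the number of threads in the team (1 without OpenMP).
std::vector<int> iteration_owners(int n, int& team_size)
{
    std::vector<int> owner(n > 0 ? n : 0, -1);
    int threads = 1;

    #pragma omp parallel
    {
        int me = 0;
#ifdef _OPENMP
        me = omp_get_thread_num();
        #pragma omp single
        threads = omp_get_num_threads();
#endif
        // Chunks of one iteration are handed out round-robin by thread
        // number, so iteration i is expected to run on thread i % threads.
        #pragma omp for schedule(static, 1)
        for (int i = 0; i < n; ++i)
            owner[i] = me;
    }
    team_size = threads;
    return owner;
}
```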
If you don't mind replacing the OpenMP for worksharing construct with a loop that generates tasks instead, you can use OpenMP task to implement both parts of your application.
In the first loop you would create (instead of the loop chunks) tasks that take on the compute load of the iterations. Each iteration of the second loop then also becomes an OpenMP task. The important part then is to synchronize the tasks between the different phases.
For that you can use task dependencies (introduced in OpenMP 4.0):
#pragma omp task depend(out:A[0])
{ A[0] = a(); }
#pragma omp task depend(in:A[0])
{ b(A[0]); }
This makes sure that task b does not run before task a has completed.
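Extended to the whole array, the idea could look roughly like the sketch below. The two helper functions are hypothetical stand-ins for the expensive steps from the question, and the extra depend(inout: result) is one possible way (an assumption of this sketch, not something stated above) to keep the second phase in sequential order.

```cpp
#include <vector>

// Hypothetical stand-ins for the two expensive steps.
static unsigned long long compute_element(int i)
{
    return static_cast<unsigned long long>(i) * i + 1;
}
static unsigned long long compute_new_result(unsigned long long result,
                                             unsigned long long a)
{
    return result * 31 + a;  // depends on the order of the calls
}

unsigned long long run_pipeline(int n)
{
    std::vector<unsigned long long> storage(n > 0 ? n : 0);
    unsigned long long* A = storage.data();
    unsigned long long result = 0;

    #pragma omp parallel
    #pragma omp single
    {
        // Phase 1: one producer task per element, independent of each other.
        for (int i = 0; i < n; ++i) {
            #pragma omp task depend(out: A[i]) firstprivate(i) shared(A)
            { A[i] = compute_element(i); }
        }
        // Phase 2: consumer i waits for A[i]; the inout dependence on
        // `result` makes the consumers run one after another, in order.
        for (int i = 0; i < n; ++i) {
            #pragma omp task depend(in: A[i]) depend(inout: result) \
                             firstprivate(i) shared(A, result)
            { result = compute_new_result(result, A[i]); }
        }
    }  // implicit barrier: all tasks are finished here
    return result;
}
```

A consumer task can start as soon as its element and the previous consumer are done, so the sequential phase overlaps with the remaining producer tasks without any hand-written flags.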
Cheers,
-michael
This is probably an extended comment rather than an answer ...
So, you have a two-phase computation. In phase 1 you can compute, independently, each entry in your array A. It is straightforward therefore to parallelise this using an OpenMP parallel for loop. But there is an issue here: naive allocations of work to threads are likely to lead to a (severely ?) unbalanced load across threads.
In phase 2 there is a computation which is not so easily parallelised and which you plan to give to the first thread to finish its share of phase 1.
Personally I'd split this into 2 phases. In the first, use a parallel for loop. In the second drop OpenMP and just have a sequential code. Sort out the load balancing within phase 1 by tuning the arguments to a schedule clause; I'd be tempted to try schedule(guided) first.
If tuning the schedule can't provide the balance you want then investigate replacing parallel for by task-ing.
Do not complicate the code for phase 2 by rolling your own signalling technique. I'm not concerned that the complication will overwhelm you, though you might be concerned about that, but that the complication will fail to deliver any benefits unless you sort out the load balance in phase 1. And when you've done that you don't need to put phase 2 inside an OpenMP parallel region.
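A minimal sketch of that two-phase layout, with hypothetical stand-ins (phase1_element, phase2_step) for the two expensive computations:

```cpp
#include <vector>

// Hypothetical stand-ins for the two expensive steps.
static unsigned long long phase1_element(int i)
{
    return static_cast<unsigned long long>(i) * i + 1;
}
static unsigned long long phase2_step(unsigned long long result,
                                      unsigned long long a)
{
    return result * 31 + a;  // depends on the order of the calls
}

unsigned long long run_two_phases(int n)
{
    std::vector<unsigned long long> A(n > 0 ? n : 0);

    // Phase 1: independent elements; tune the schedule for load balance.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i)
        A[i] = phase1_element(i);

    // Phase 2: plain sequential code, outside any parallel region.
    unsigned long long result = 0;
    for (int i = 0; i < n; ++i)
        result = phase2_step(result, A[i]);
    return result;
}
```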
I need to continue execution as soon as one of the threads has finished execution. The logic inside the parallel section will ensure that everything has been completed satisfactorily. I have nested parallelisation, so I put some of the top-level threads to Sleep when data is not ready to be processed, so as not to consume computation power. So when one of the top-level threads finishes, I want to continue execution and not wait for the other threads to wake up and return naturally.
I use
#pragma omp parallel for num_threads(wanted_thread_no)
How do you parallelise? Do you use tasks, sections, or something else?
If I understood correctly and you are using the task primitive, you can use the nowait clause (as in #pragma omp for nowait) after the last task.
Check this pdf on page 13 (of the pdf).
http://openmp.org/wp/presos/omp-in-action-SC05.pdf
It explicitly says:
By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier.
#pragma omp for nowait
“nowait” is useful between two consecutive, independent omp for loops.
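As a small sketch of what the quoted slide describes, with made-up names (scale_both, factor): two independent loops in one parallel region, where nowait lets a thread move on to the second loop without waiting for the others to finish the first.

```cpp
#include <vector>

// Hypothetical example: the two loops touch different vectors, so the
// barrier after the first loop is not needed.
void scale_both(std::vector<float>& a, std::vector<float>& b, float factor)
{
    const long na = static_cast<long>(a.size());
    const long nb = static_cast<long>(b.size());

    #pragma omp parallel
    {
        #pragma omp for nowait   // no barrier after this loop
        for (long i = 0; i < na; ++i)
            a[i] *= factor;

        #pragma omp for          // implicit barrier at the end of this loop
        for (long i = 0; i < nb; ++i)
            b[i] *= factor;
    }  // the parallel region itself still ends with a barrier
}
```

Note that nowait only removes the barrier of a worksharing construct inside the region. The barrier at the end of the parallel region remains, so execution after the region continues only once all threads have arrived.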
Is this what you want?
Also take a look at this, even though it says the same thing.
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf