Performance problems using OpenMP in nested loops

I'm using the following code, which contains an OpenMP parallel for loop nested in another for loop. Somehow this code is 4 times slower than the sequential version (omitting #pragma omp parallel for).
Is it possible that OpenMP has to create threads every time the method is called? In my test it is called 10000 times in direct succession.
I heard that OpenMP will sometimes keep the threads spinning. I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the OpenMP pragmas, the code is about 10 times faster.
for (round k = 1; k < processor.max; ++k) {
    initialise_round(k);
    for (std::vector<int> bucket : color_buckets) {
        #pragma omp parallel for schedule (dynamic)
        for (int i = 0; i < bucket.size(); ++i) {
            if (processor.mark.is_marked_item(bucket[i])) {
                processor.process(k, bucket[i]);
            }
        }
        processor.finish_round(k);
    }
}

You say that your sequential code is much faster, which makes me think that your processor.process function does too little work per call. In that case handing the data to each thread does not pay off: the scheduling and data-exchange overhead is simply larger than the actual computation on that thread.
Other than that, I think parallelizing the middle loop instead won't affect the algorithm but will increase the amount of work per thread.

I think you are creating a team of threads on each iteration of the loop... (although I'm not sure what for alone does - I thought it should be parallel for). In this case, it would probably be better to separate the parallel from the for so the work of forking and creating the threads is done just once rather than being repeated in the other loops. So you could try to put a parallel pragma before your outermost loop so the overhead of forking and thread creation is just done once.
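A minimal sketch of what separating the parallel from the for could look like. The Processor struct, is_marked, process and run_rounds are hypothetical stand-ins for the question's code, not the asker's real classes, and whether this helps at all depends on how much work each call does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's processor object; it only counts calls.
struct Processor {
    int max_round = 4;
    std::vector<long> hits;  // one counter per item id
    explicit Processor(std::size_t items) : hits(items, 0) {}
    bool is_marked(int item) const { return item % 2 == 0; }
    void process(int /*round*/, int item) {
        assert(item >= 0 && static_cast<std::size_t>(item) < hits.size());
        ++hits[item];  // safe only while item ids within a bucket are unique
    }
};

void run_rounds(Processor& p, const std::vector<std::vector<int>>& buckets) {
    // The team is started once; every thread then walks the two outer loops.
    #pragma omp parallel
    for (int k = 1; k < p.max_round; ++k) {
        for (const std::vector<int>& bucket : buckets) {
            const int n = static_cast<int>(bucket.size());
            // Worksharing inside the existing region; ends with an implicit barrier.
            #pragma omp for schedule(dynamic)
            for (int i = 0; i < n; ++i) {
                if (p.is_marked(bucket[i])) p.process(k, bucket[i]);
            }
        }
    }
}
```

Per-round hooks such as initialise_round(k) would need a #pragma omp single inside the region so that only one thread runs them.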

The actual problem was not related to OpenMP directly.
As the system has two CPUs, half of the threads were spawned on one CPU and the other half on the other. Therefore there was no shared L3 cache. Combined with the fact that the algorithm doesn't scale well, this led to a performance decrease, especially when using 2-4 threads.
The solution was to use thread pinning, for example via the Linux tool taskset.

Related

Not able to achieve desired speed up using openmp

I am trying to use OpenMP directives to parallelize a piece of code, but I am not able to achieve any speedup. The following is the piece of code that I am trying to parallelize:
#pragma omp parallel private(i,j) shared(a,x,n) default(none)
{
    for(j=n-1;j>=0;j--)
    {
        x[j] = A(j,n,n)/A(j,j,n);
        #pragma omp for schedule(dynamic)
        for (i=0;i<=j-1;i++)
        {
            A(i,n,n )= A(i,n,n) - A(i,j,n)*x[j];
        }
    }
}
The value of n is 1000. A(i,n,n) is a macro that is used to access the array a.
As I increase the number of threads, the execution time increases or stays the same. The machine I am working on has 4 cores. I am surprised that there is no speedup even when the number of threads is 2.
I am not able to figure out what I am doing wrong.

Since n >> #CPUs (I don't think you have 1000 CPUs), it is not wise to parallelize the inner loop. In your example, the work is redistributed at every iteration of the outer loop.
Instead, it is wiser to parallelize the outer loop. This way, the value of x[j] won't be updated concurrently by different threads (as Zulan mentioned), and you will have much less work redistribution.
In that case, using dynamic scheduling is wise, since the quantity of work changes at each iteration.
Note: You will have to change the order of the calculation. The current implementation does not allow you to move the parallelization to the outer loop, since all of the threads would update the same values (A(i,n,n)).
Although it is true that thread creation takes time, the threads are not recreated at each iteration. They are created only once, at the top #pragma statement, and run concurrently for the entire block that follows.

Performance issues of multiple independent for loop with openMp

I am planning to use OpenMP threads for an intensive computation. However, I did not get the performance I expected on the first trial. I suspect there are several issues, but I have not confirmed them yet. In general, I think the performance bottleneck is caused by the fork-join model. Can you help me with this?
First, in a routine cycle running on a consumer thread, there are 2 independent for loops and some additional functions. The functions are located at the end of the routine cycle and between the for loops, as shown below:
void routineFunction(short* xs, float* xf, float* yf, float* h)
{
    // Casting
    #pragma omp parallel for
    for (int n = 0; n<1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }
    memset(yf,0,1024*1024*sizeof( float ));
    // Filtering
    #pragma omp parallel for
    for (int n = 0; n<1024*1024-1024; n++)
    {
        for(int nn = 0; nn<1024; nn++)
        {
            yf[n]+=xf[n+nn]*h[nn];
        }
    }
    status = DftiComputeBackward(hand, yf, yf); // Compute backward transform
}
Note: This code does not compile as shown; I removed details to make it more readable.
The OpenMP thread count is set to 8 dynamically. I observed the threads in use in the Windows taskbar. Although the thread count increased significantly, I didn't observe any performance improvement. I have some guesses, but I would still like to discuss them with you before going further.
My questions are these.
Does the fork-join model correspond to thread creation and termination? Does it have the same cost for the software?
Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?
While routineFunction is running, do the OpenMP threads fork and join at each for loop? Or does the compiler help the second loop by working with the existing threads? If the for loops cause two forks and joins, how should I restructure the code? Is combining the two loops into a single loop sensible for performance, or is using a parallel region (#pragma omp parallel) with #pragma omp for (not #pragma omp parallel for) a better choice for sharing the work? My concern is that this forces me into static scheduling using thread IDs and thread counts. According to the document at page 34, static scheduling can cause load imbalance. I am actually familiar with static scheduling from CUDA programming, but I still want to avoid it if there is any performance issue. I also read an answer on Stack Overflow, written by Alexey Kukanov, whose last paragraph points out that smart OpenMP implementations do not join the master thread after a parallel region is completed. How can I use OpenMP's busy-wait and sleep settings to avoid joining the master thread after the first loop is completed?
Is there another reason for the performance issue in the code?

This is mostly memory-bound code. Its performance and scalability are limited by the amount of data the memory channel can transfer per unit time. xf and yf take 8 MiB in total, which fits in the L3 cache of most server-grade CPUs but not of most desktop or laptop CPUs. If two or three threads are already able to saturate the memory bandwidth, adding more threads is not going to bring additional performance. Also, casting short to float is a relatively expensive operation - 4 to 5 cycles on modern CPUs.
Does fork and join model correspond to thread creation and abortion? Is it same cost for the software?
Once routineFunction is called by consumer, Does OpenMP thread fork and join every time?
No, basically all OpenMP runtimes, including that of MSVC++, implement parallel regions using thread pools, as this is the easiest way to satisfy the requirement of the OpenMP specification that thread-private variables retain their value between the different parallel regions. Only the very first parallel region suffers the full cost of starting new threads. Subsequent regions reuse those threads, and an additional price is paid only if more threads are needed than in any of the previously executed parallel regions. There is still some overhead, but it is far lower than that of starting new threads each time.
During the running of routineFunction, does OpenMP thread fork and join at each for loop? Or, does compiler help the second loop as working with existing threads?
Yes, in your case two separate parallel regions are created. You can manually merge them into one:
#pragma omp parallel
{
    #pragma omp for
    for (int n = 0; n<1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }
    #pragma omp single
    {
        memset(yf,0,1024*1024*sizeof( float ));
        //
        // Other code that was between the two parallel regions
        //
    }
    // Filtering
    #pragma omp for
    for (int n = 0; n<1024*1024-1024; n++)
    {
        for(int nn = 0; nn<1024; nn++)
        {
            yf[n]+=xf[n+nn]*h[nn];
        }
    }
}
Is there another reason for performance issue in the code?
It is memory-bound, or at least the two loops shown here are.

Alright, it's been a while since I did OpenMP stuff so hopefully I didn't mess any of this up... but here goes.
Forking and joining is the same thing as creating and destroying threads. How the cost compares to other threads (such as a C++11 thread) will be implementation dependent. I believe in general OpenMP threads might be slightly lighter-weight than C++11 threads, but I'm not 100% sure about that. You'd have to do some testing.
Currently, each time routineFunction is called you will fork for the first for loop, join, do a memset, fork for the second loop, join, and then call DftiComputeBackward.
You would be better off creating a parallel region as you stated. Not sure why the scheduling is an extra concern. It should be as easy as moving your memset to the top of the function, starting a parallel region using your noted command, and making sure each for loop is marked with #pragma omp for as you mentioned. You may need to put an explicit #pragma omp barrier in between the two for loops to make sure all threads finish the first for loop before starting the second... OpenMP has some implicit barriers but I forgot if #pragma omp for has one or not.
Make sure that the OpenMP compile flag is turned on for your compiler. If it isn't, the pragmas will be ignored, it will compile, and nothing will be different.
Your operations are prime for SIMD acceleration. You might want to see if your compiler supports auto-vectorization and if it is doing it. If not, I'd look into SIMD a bit, perhaps using intrinsics.
How much time does DftiComputeBackward take relative to this code?
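Before reaching for intrinsics, the SIMD suggestion can be tried with a directive. This is only a sketch under assumptions: the function name filter and its vector-based signature are made up for illustration, and #pragma omp simd needs OpenMP 4.0 support, which not every compiler enables by default. The local accumulator also avoids rewriting yf[n] on every inner iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the question's filtering loop.
// The last h.size() outputs stay zero, as they do after the memset in the question.
std::vector<float> filter(const std::vector<float>& xf, const std::vector<float>& h) {
    assert(!h.empty() && xf.size() >= h.size());
    const int taps = static_cast<int>(h.size());
    const int outputs = static_cast<int>(xf.size()) - taps;
    std::vector<float> yf(xf.size(), 0.0f);  // zero-initialised, replaces the memset

    // Threads split the outputs; each output is accumulated in a register.
    #pragma omp parallel for schedule(static)
    for (int n = 0; n < outputs; ++n) {
        float acc = 0.0f;
        // Ask the compiler to vectorise the dot product over the taps.
        #pragma omp simd reduction(+ : acc)
        for (int nn = 0; nn < taps; ++nn) acc += xf[n + nn] * h[nn];
        yf[n] = acc;  // a single store per output element
    }
    return yf;
}
```

Whether this beats the original depends on the compiler and on memory bandwidth, so it is worth measuring both versions.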

Signaling in OpenMP

I am writing computational code that more or less has the following structure:
#pragma omp parallel
{
    #pragma omp for nowait
    // Compute elements of some array A[i] in parallel
    #pragma omp single
    for (i = 0; i < N; ++i) {
        // Do some operation with A[i].
        // This time it is important that operations are sequential. e.g.:
        result = compute_new_result(result, A[i]);
    }
}
Both computing A[i] and compute_new_result are rather expensive. So my idea is to compute the array elements in parallel and, as soon as any of the threads becomes free, have it start on the sequential operations. There is a good chance that the first array elements are already computed by then, and the others will be provided by the other threads that are still working on the first loop.
However, to make the concept work I have to achieve two things:
1. Make OpenMP split the loop in an alternating way, i.e. for two threads: thread 1 computes A[0], A[2], A[4] and thread 2 computes A[1], A[3], A[5], etc.
2. Provide some signaling system. I am thinking about an array of flags indicating that A[i] has already been computed. Then compute_new_result should wait for the flag of the respective A[i] to be set before proceeding.
I would be glad for any hints how to achieve both goals. I need the solution to be portable across Linux, Windows and Mac. I am writing the whole code in C++11.
Edit:
I have figured out the answer to the first question. It looks like it is sufficient to add a schedule(static,1) clause to the #pragma omp for directive.
However, I am still thinking about an elegant solution to the second issue...
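For reference, a small sketch that records which thread runs each iteration under schedule(static,1). The function name record_owners is made up for illustration. With a chunk size of 1, the OpenMP specification assigns chunks to threads round-robin in thread-number order, so iteration i should land on thread i modulo the team size.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Fills owner[i] with the number of the thread that executed iteration i.
// Returns the team size; without OpenMP everything runs on "thread 0".
int record_owners(std::vector<int>& owner) {
    int team = 1;
    const int n = static_cast<int>(owner.size());
    #pragma omp parallel
    {
#ifdef _OPENMP
        // One thread stores the team size; single ends with a barrier.
        #pragma omp single
        team = omp_get_num_threads();
#endif
        #pragma omp for schedule(static, 1)
        for (int i = 0; i < n; ++i) {
#ifdef _OPENMP
            owner[i] = omp_get_thread_num();
#else
            owner[i] = 0;
#endif
        }
    }
    assert(team >= 1);
    return team;
}
```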

If you don't mind replacing the OpenMP for worksharing construct with a loop that generates tasks instead, you can use OpenMP task to implement both parts of your application.
In the first loop you would create tasks (instead of loop chunks) that take on the compute load of the iterations. Each iteration of the second loop then also becomes an OpenMP task. The important part is then to synchronize the tasks between the different phases.
For that you can use task dependencies (introduced in OpenMP 4.0):
#pragma omp task depend(out:A[0])
{ A[0] = a(); }
#pragma omp task depend(in:A[0])
{ b(A[0]); }
This makes sure that task b does not run before task a has completed.
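Extended to the whole array, a sketch could look like the following. The bodies of compute_element and compute_new_result are placeholders I made up so the example runs; only the depend clauses matter. The extra depend(inout: result) keeps the second-phase tasks in their original order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the two expensive steps in the question.
static long compute_element(int i) { return i + 1; }
static long compute_new_result(long result, long a) { return 3 * result + a; }  // order-sensitive

long pipeline(int n) {
    assert(n >= 0);
    std::vector<long> values(n);
    long* A = values.data();  // plain pointer so A[i] can be named in depend clauses
    long result = 0;

    #pragma omp parallel
    #pragma omp single
    {
        // Phase 1: one task per element, each announcing that it writes A[i].
        for (int i = 0; i < n; ++i) {
            #pragma omp task depend(out: A[i]) firstprivate(i) shared(A)
            A[i] = compute_element(i);
        }
        // Phase 2: each task waits for its A[i] and for the previous result.
        for (int i = 0; i < n; ++i) {
            #pragma omp task depend(in: A[i]) depend(inout: result) firstprivate(i) shared(A, result)
            result = compute_new_result(result, A[i]);
        }
    }  // the barrier ending the single construct waits for all tasks
    return result;
}
```

Task creation has its own overhead, so this only pays off when each element is expensive to compute, as the question says it is.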
Cheers,
-michael

This is probably an extended comment rather than an answer ...
So, you have a two-phase computation. In phase 1 you can compute, independently, each entry in your array A. It is therefore straightforward to parallelise this using an OpenMP parallel for loop. But there is an issue here: naive allocations of work to threads are likely to lead to a (severely?) unbalanced load across threads.
In phase 2 there is a computation which is not so easily parallelised and which you plan to give to the first thread to finish its share of phase 1.
Personally I'd split this into 2 phases. In the first, use a parallel for loop. In the second drop OpenMP and just have a sequential code. Sort out the load balancing within phase 1 by tuning the arguments to a schedule clause; I'd be tempted to try schedule(guided) first.
If tuning the schedule can't provide the balance you want then investigate replacing parallel for by task-ing.
Do not complicate the code for phase 2 by rolling your own signalling technique. I'm not concerned that the complication will overwhelm you, though you might be concerned about that, but that the complication will fail to deliver any benefits unless you sort out the load balance in phase 1. And when you've done that you don't need to put phase 2 inside an OpenMP parallel region.
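A sketch of that two-phase layout, with made-up placeholder functions (make_element has a deliberately uneven cost so that the schedule clause has something to balance; fold_step stands in for compute_new_result):

```cpp
#include <cassert>
#include <vector>

// Hypothetical element computation whose cost grows with i.
static long make_element(int i) {
    long sum = 0;
    for (int k = 1; k <= i; ++k) sum += k;
    return sum;
}

// Hypothetical sequential step; the result depends on the order of calls.
static long fold_step(long result, long a) { return 3 * result + a; }

long two_phase(int n) {
    assert(n >= 0);
    std::vector<long> A(n);

    // Phase 1: independent elements, load balance left to the schedule clause.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i) A[i] = make_element(i);

    // Phase 2: plain sequential code, outside any parallel region.
    long result = 0;
    for (int i = 0; i < n; ++i) result = fold_step(result, A[i]);
    return result;
}
```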

OpenMP and optimising vector operations

I'm running an algorithm at the moment that is very heavy but extremely parallel.
I've been looking at ways to speed it up and I've noticed that the slowest operation I have is my VecAdd function (It gets called thousands of times on a 6000 or so wide vector).
It is implemented as follows:
bool VecAdd( float* pOut, const float* pIn1, const float* pIn2, unsigned int num )
{
    for( int idx = 0; idx < num; idx++ )
    {
        pOut[idx] = pIn1[idx] + pIn2[idx];
    }
    return true;
}
It's a very simple loop but all the additions can be performed in parallel. My first optimisation option is to move over to using SIMD as I can easily get a near 4 times speed up doing this.
However I'm also interested in the possibility of using OpenMP and having it automatically thread the for loop (potentially giving me a further 4x speedup for a total of 16x with SIMD).
However it really runs slowly. With the loop straight it takes around 3.2 seconds to process my example data. If I insert
#pragma omp parallel for
before the for loop I was assuming it would farm out several blocks of additions to other threads.
Unfortunately the result is that it takes ~7 seconds to process my example data.
Now I understand that a lot of my problem here will be caused by overheads with setting up threads and so forth but I'm still surprised just how much slower it makes things run.
Is it possible to speed this up by somehow setting up the thread pool in advance or will I never be able to combat these overheads?
Any thoughts or advice as to whether I can thread this nicely with OpenMP will be much appreciated!

Your loop should parallelize fine with the #pragma omp parallel for.
However, I think the problem is that you shouldn't parallelize at that level. You said that the function gets called thousands of times, but only operates on 6000 floats. Parallelize at the higher level, so that each thread is responsible for thousands/4 calls to VecAdd. Right now you have this algorithm:
1. serial execution
2. (re)start threads
3. do short computation
4. synchronize threads (at the end of the for loop)
5. back to serial code
Change it so that it's parallel at the highest possible level.
Memory bandwidth of course matters, but there is no way it would result in slower than serial execution.
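A rough sketch of what parallelizing at the higher level could look like. The caller add_all_rows and the row-of-vectors layout are assumptions for illustration, since the question does not show how VecAdd is called; it also assumes the calls are independent of each other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial kernel, as in the question; compilers can usually vectorize this loop.
void VecAdd(float* out, const float* in1, const float* in2, std::size_t num) {
    for (std::size_t idx = 0; idx < num; ++idx) out[idx] = in1[idx] + in2[idx];
}

// Hypothetical caller: many independent additions, one per row.
void add_all_rows(std::vector<std::vector<float>>& out,
                  const std::vector<std::vector<float>>& a,
                  const std::vector<std::vector<float>>& b) {
    assert(out.size() == a.size() && a.size() == b.size());
    const int rows = static_cast<int>(a.size());
    // One fork/join for all rows instead of one per VecAdd call.
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; ++r) {
        assert(out[r].size() == a[r].size() && a[r].size() == b[r].size());
        VecAdd(out[r].data(), a[r].data(), b[r].data(), a[r].size());
    }
}
```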

Why are all iterations in a loop parallelized using OpenMP schedule(dynamic) given to one thread? (MSVS 2010)

Direct Question: I've got a simple loop with, what can be, a computationally intensive function. Let's assume that each iteration takes the same amount of time (so load balancing should be easy).
#pragma omp parallel
{
    #pragma omp for schedule(dynamic)
    for ( int i=0; i < 30; i++ )
    {
        MyExpensiveFunction();
    }
} // parallel block
Why are all of the iterations assigned to a single thread? I can add a:
std::cout << "tID = " << omp_get_thread_num() << "\n\n";
and it prints a bunch of zeros with only the last iteration assigned to thread 1.
My System: I must support cross compiling. So I'm using gcc 4.4.3 & 4.5.0 and they both work as expected, but for MS 2010, I see the above behavior where 29 iterations are assigned to thread 0 and one iteration is assigned to thread 1.
Really Odd: It took me a bit to realize that this might simply be a scheduling problem. I google'd and found this website, which if you skip to the bottom has an example with what must be auto-generated output. All iterations using dynamic and guided scheduling are assigned to thread zero??!?
Any guidance would be greatly appreciated!!

Most likely, this is because the OMP implementation in Visual Studio decided that you did nowhere near enough work to merit putting it on more than one thread. If you simply increase the quantity of iterations, then you may well find that the other threads have more utilization. Dynamic scheduling means that the implementation only forks new threads if it needs them, so if it doesn't need them, it doesn't make them or assign them work.

If each iteration takes the same amount of time, then you actually don't need dynamic scheduling, which causes more scheduling overhead than the static scheduling policies. (static, 1) and (static) should be okay.
Could you let me know how long each iteration takes? Regarding the example you cited (MSDN's scheduling example), the amount of work in each iteration is so small that the first thread ended up getting almost all of the work. If you really increase the work of each iteration (to at least the order of a millisecond), then you will see the difference.
I did a lot of experiments related to OpenMP scheduling policies. MSVC's implementation of dynamic scheduling works well. I'm pretty sure the work in each of your iterations was too small.
