OpenMP Thread count question - c++

So I'm doing a bit of parallel programming of the trapezoidal rule for my OS class. This is a homework question, but I'm not looking for source code. After a bit of research I decided to use each thread to compute a subinterval, using:
h = (b - a) / n;
integral += (func(a) + func(b)) / 2.0;
# pragma omp parallel for schedule(static) default(none) \
    shared(a, h, n) private(i, x) \
    reduction(+: integral) num_threads(thread_count)
for (i = 1; i <= n - 1; i++) {
    x = a + i * h;
    integral += func(x);
}
In my integral function, func(x) is the function that I read in from a file.
So I emailed my professor to ask how he wants to go about choosing the number of threads (since I assumed n would need to be evenly divisible by the thread count for the trapezoidal rule), but he says I don't need to define it, and that OpenMP will set it based on the number of cores on my machine. So, needless to say, I'm a bit confused.

Your professor is correct: by default OpenMP chooses a reasonable number of threads for you, usually one per core.
You don't need to worry about the number of threads dividing n exactly: OpenMP automatically distributes the iterations among the threads, and if they don't divide evenly, some threads simply end up doing one iteration more or less than the others.
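For instance, here is a minimal sketch (assuming func, a, b, h and n are set up as in your snippet) that simply omits num_threads and lets the runtime decide; omp_get_max_threads() reports the team size the runtime would use:
#include <omp.h>
#include <cstdio>

// Minimal sketch: no num_threads clause, so the runtime picks the default
// team size (usually one thread per core, or whatever OMP_NUM_THREADS says).
// Assumes func, a, b, h and n are defined as in the question.
printf("default team size: %d\n", omp_get_max_threads());
double integral = (func(a) + func(b)) / 2.0;
#pragma omp parallel for schedule(static) reduction(+: integral)
for (int i = 1; i <= n - 1; i++) {
    integral += func(a + i * h);
}
integral *= h;   // usual final scaling of the trapezoidal rule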

Related

OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didn't work, so I want to set up a conditional to decide whether to parallelise the outer or inner loop, depending on how many outer iterations there are.
Here is the code:
int N_OUTER;     // typically 1-8
int N_INNER;     // typically > 100
int PAR_THRESH;  // this is the parameter I am interested in setting

std::vector<Sol> seeds;                    // vector with initial solutions
std::vector<Sol> sols(N_OUTER*N_INNER);    // vector for output solutions

#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer){
    #pragma omp parallel for if (N_OUTER < PAR_THRESH)
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}
This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what is the best value for PAR_THRESH.
My intuition says that if N_OUTER is 1, then it shouldn't parallelise the outer loop, and if N_OUTER is greater than the number of threads available, then the outer loop should be the one parallelised, because it uses the maximum available threads and the threads are as long as possible. My question is about when N_OUTER is either 2 or 3 (4 being the number of threads available).
Is it better to run, say, 2 or 3 threads that are long, in parallel; but not use up all of the available threads? Or is it better to run the 2 or 3 outer loops in serial, while utilising the maximum number of threads for the inner loop?
Or is there a trade-off in play, where maybe 2 outer loop iterations would be wasting threads, but with 3 outer loop iterations, having longer threads is beneficial despite the fact that one thread remains unused?
EDIT:
edited code to replace N_ITER with N_INNER in two places
I don't have much experience with OpenMP, but I have found something like the collapse clause:
https://software.intel.com/en-us/articles/openmp-loop-collapse-directive
Understanding the collapse clause in openmp
It seems even more appropriate when the number of inner loop iterations differs from one outer iteration to the next.
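For example, a rough sketch of what that might look like on your loops (untested; assumes solve, seeds, sols, N_OUTER and N_INNER from your snippet):
// collapse(2) merges the two loops into a single iteration space of
// N_OUTER*N_INNER iterations, which OpenMP then divides among the threads.
// The loops must be perfectly nested, as they are here.
#pragma omp parallel for collapse(2)
for (int outer = 0; outer < N_OUTER; ++outer){
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}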
--
On the other hand:
It seems to me that solve(...) is side-effect free. It also seems that N_ITER is N_INNER.
Currently you call solve N_INNER*N_OUTER times.
While reducing that won't change the big-O complexity, solve presumably has a very large constant factor, so it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:
std::vector<Sol> sols_tmp (N_OUTER);
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) {
    sols_tmp[i] = solve(seeds[i]);
}
This calls solve only N_OUTER times. Then, because solve returns the same value for every element of a row, broadcast the results:
#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
    sols[i] = sols_tmp[i/N_INNER];
}
Of course, you must measure whether parallelisation is worthwhile for these loops at all.

Parallelizing nested loop with OpenMP

I have a relatively simple loop where I'm calculating the net acceleration of a system of particles using a brute-force method.
I have a working OpenMP loop which loops over each particle and compares it to every other particle, for O(n^2) complexity:
!$omp parallel do private(i) shared(bodyArray, n) default(none)
do i = 1, n
!acc is real(real64), dimension(3)
bodyArray(i)%acc = bodyArray(i)%calcNetAcc(i, bodyArray)
end do
which works just fine.
What I'm trying to do now is to reduce my calculation time by computing the force on each pair of bodies only once, using the fact that F(a->b) = -F(b->a); that halves the number of interactions to calculate (n^2 / 2). I do that in this loop:
call clearAcceleration(bodyArray) !zero out acceleration
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i, n
if ( i /= j .and. j > i) then
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end if
end do
end do
But I'm having a lot of difficulty parallelizing this loop; I keep getting junk results. I think it has to do with this line:
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
and that the forces are not being added up properly, with all the different 'j' iterations writing to it.
I've tried using the atomic directive, but that's not allowed on array variables. So then I tried critical, but that increases the run time by roughly a factor of 20 and still doesn't give correct results. I also tried adding an ordered clause, but then I just get NaN for all my results.
Is there an easy fix to get this loop working with OpenMP?
Working code below; it has a slight speed improvement, but not the ~2x I was looking for.
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(guided)
do i = 1, n
do j = 1, i-1
forces(j, i)%vec = bodyArray(i)%accTo(bodyArray(j))
forces(i, j)%vec = -forces(j, i)%vec
end do
end do
!$omp parallel do private(i, j) shared(bodyArray, n, forces) schedule(static)
do i = 1, n
do j = 1, n
bodyArray(i)%acc = bodyArray(i)%acc + forces(j, i)%vec
end do
end do
With your current approach and data structures you're going to struggle to get good speedup with OpenMP. Consider the loop nest
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i, n
if ( i /= j .and. j > i) then
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end if
end do
end do
[Actually, before you consider it, revise it as follows ...
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i+1, n
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end do
end do
..., now back to the issues]
There are two problems here:
As you've already twigged, you've got a data race updating bodyArray(j)%acc; multiple threads will try to update the same element and there is no coordination of those updates. Junk results. Using critical sections or ordering the statements serialises the code; when you get it right you also get it as slow as it was before you started with OpenMP.
The pattern of access to elements of bodyArray is cache-unfriendly. It wouldn't surprise me to find that, even if you address the data race without serialising the computation, the impact of the cache-unfriendliness is to produce code slower than the original. Modern CPUs are crazy-fast in computation but the memory systems struggle to feed the beasts so cache effects can be massive. Trying to run two loops over the same rank-1 array simultaneously, which is in essence what your code does, is never (?) going to shift data through cache at maximum speed.
Personally I would try the following. I'm not going to guarantee that this will be faster, but it will be easier (I think) than fixing your current approach and fit OpenMP like a glove. I do have a nagging doubt that this is overcomplicating matters, but I haven't had a better idea yet.
First, create a 2D array of reals, call it forces, where element forces(i,j) is the force that particle i exerts on particle j. Then, some code like this (untested, that's your responsibility if you care to follow this line):
forces = 0.0 ! Parallelise this if you want to
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none)
do i = 1, n
   do j = 1, i-1
      forces(i,j) = bodyArray(i)%accTo(bodyArray(j)) ! if I understand correctly
   end do
end do
then sum the forces on each particle (and get the following right, I haven't checked carefully)
!$omp parallel do private(i) shared(bodyArray, forces, n) default(none)
do i = 1, n
   bodyArray(i)%acc = sum(forces(:,i))
end do
As I wrote above, computation is extremely fast and if you have the memory to spare it's often well worth trading some space for time.
Now what you have is, probably, a problem with load balancing in the loop nest over forces. Most OpenMP implementations will, by default, perform a static distribution of work (this is not required by the standard but seems to be most common, check your documentation). So thread 1 will get the first n/num_threads rows to deal with, but these are the itty-bitty little rows at the top of the triangle you're computing. Thread 2 will get more work, thread 3 still more, and so forth. You might get away with simply adding a schedule(dynamic) clause to the parallel directive, you might have to work a bit harder to balance the load.
You may also want to review my code snippets wrt cache-friendliness and adjust as appropriate. And you may well find, if you do as I suggest, that you were better off with your original code, that halving the amount of computation doesn't actually save much time.
Another approach would be to pack the lower (or upper) triangle of forces into a rank-1 array and use some fancy indexing arithmetic to transform 2D (i,j) indices into a 1D index into that array. This would save storage space, and might be easier to make cache-friendly.
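For what it's worth, here is a minimal sketch of that index arithmetic (shown in C++ with 0-based indices purely for illustration; packed_index is a hypothetical helper, not part of the original code):
#include <cstddef>
#include <vector>

// Map a strict lower-triangle pair (i, j), i > j, to a position in a packed
// rank-1 array of n*(n-1)/2 entries. With 0-based indices, rows 0..i-1
// contribute i*(i-1)/2 entries before row i starts.
inline std::size_t packed_index(std::size_t i, std::size_t j) {
    return i * (i - 1) / 2 + j;
}

int main() {
    const std::size_t n = 4;
    std::vector<double> packed(n * (n - 1) / 2, 0.0); // packed lower triangle
    packed[packed_index(3, 1)] = 1.0;                 // plays the role of forces(3, 1)
    return 0;
}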

call a function and loops in parallel

I don't have any experience with OpenMP, so I want to know how to do the following:
for (int i = 1; i <= NumImages; i++) {
    //call a function
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            //do 2 summing up calculations inside a while loop
        }//end l loop
    }//end k loop
}//end i loop
Now, I have 40 cores at my disposal.
NumImages will be from 50 to 150, most usually 150.
SumNumber will be around 200.
ElNum will be around 5000.
So, is the best way of dealing with this to assign each thread a function call and also execute the l loop in parallel?
And if yes, will it be like this:
#pragma omp parallel for num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        #pragma omp for
        for (int l = 0; l < ElNum; l++) {
And does the above mean (for NumImages = 150) that myfunction will be executed 40 times in parallel, and the l loop as well, and then, when the l and k loops finish, the next 40 threads will call the function again, and then the next 40, so 3*40 = 120, and then the remaining 30?
Generally the best way is the one that splits the work evenly, to maintain efficiency (no cores are left waiting). E.g. in your case static scheduling is probably not a good idea, because 40 does not divide 150 evenly; in the last batch of iterations you would lose 25% of your computing power. So it might turn out that it would be better to put the parallel clause before the second loop. It all depends on the scheduling mode you choose and on how the work is really distributed within the loops. E.g. if myfunction does 99% of the work, then parallelizing the inner loops is a bad idea; if 99% of the work is within the two inner loops, it might be good.
Not really. There are 3 scheduling modes, but none of them works in a way that blocks other threads. There is a pool of tasks (iterations) that is distributed among the threads, and the scheduling mode describes the strategy for assigning tasks to threads. When one thread finishes, it just gets the next task, with no waiting. The strategies are described in more detail here: http://en.wikipedia.org/wiki/OpenMP#Scheduling_clauses (I am not sure a blatant copy-paste from the wiki is a good idea, so I'll just leave the link; it's good material.)
What may not be written there is that the modes are listed in order of the amount of overhead they introduce: static is fastest, then dynamic, then guided. My advice on when to use which (not exact, but a good rule of thumb IMO):
static if you know the tasks will be divided evenly among the threads and take about the same amount of time
dynamic if you know the tasks will not be divided evenly or their execution times are uneven
guided for rather long tasks about which you pretty much cannot predict anything in advance
If your tasks are rather small you can see overhead even with static scheduling (e.g. why my OpenMP C++ code is slower than a serial code?), but I think in your case dynamic should be fine and the best choice.
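For instance, a minimal sketch of that advice applied to your loops (assuming NumImages, SumNumber, ElNum and myfunction(...) as in your snippet; only the outer loop is parallelized):
// Only the outer loop is parallelized; schedule(dynamic) lets idle threads
// grab remaining images, so 150 not dividing evenly by 40 matters less.
#pragma omp parallel for schedule(dynamic) num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);                     // placeholder call, as in the question
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            //the two summing-up calculations go here
        }
    }
}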

Multithread C++ program to speed up a summatory loop

I have a loop that iterates from 1 to N and accumulates a modular sum as it goes. However, N is very large, so I am wondering if there is a way to speed it up by taking advantage of multithreading.
To give sample program
for (long long i = 1; i < N; ++i)
total = (total + f(i)) % modulus;
f(i) in my case isn't an actual function, but a long expression that would take up too much room here; I'm putting it there to illustrate the purpose.
Yes, try this:
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
total %= modulus; // the reduction adds the threads' partial sums, so apply the modulus once more
Compile with:
g++ -fopenmp your_program.cpp
It's that simple! No extra headers are required. The #pragma line automatically spins up a team of threads, divides the iterations of the loop evenly among them, and then recombines everything after the loop. Note, though, that you must know the number of iterations beforehand.
This code uses OpenMP, which provides easy-to-use parallelism that's quite suitable to your case. OpenMP is even built into the GCC and MSVC compilers.
This page shows some of the other reduction operations that are possible.
If you need nested for loops, you can just write:
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
    for (long long j = 1; j < N; ++j)
        total = (total + f(i)*j) % modulus;
And the outer loop will be parallelised, with each thread running its own copy of the inner loop.
But you could also use the collapse directive:
#pragma omp parallel for reduction(+:total) collapse(2)
and then the iterations of both loops will be automagically divvied up.
If each thread needs its own copy of a variable defined prior to the loop, use the private clause (note that private copies start uninitialized; use firstprivate if each copy should inherit the value assigned before the loop):
long long total = 0;
double cheese = 4;
#pragma omp parallel for reduction(+:total) private(cheese)
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
Note that you don't need to use private(total) because this is implied by reduction.
As presumably the f(i) are independent but take roughly the same time to run, you could create yourself 4 threads, get each to sum up 1/4 of the total, return each partial sum as a value, and join them all. This isn't a very flexible method, especially if the f(i) running times can vary randomly.
You might also want to consider a thread pool, and make each thread calculate f(i) then get the next i to sum.
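For example, a minimal sketch of that manual 4-way split using std::async (assuming N, modulus and a callable f as in the question; not a full thread pool):
#include <future>
#include <vector>

// Each task sums its own quarter of the range modulo `modulus`; the partial
// results are joined and combined at the end. Assumes f, N and modulus exist
// as in the question.
long long partial_sum(long long begin, long long end) {
    long long s = 0;
    for (long long i = begin; i < end; ++i)
        s = (s + f(i)) % modulus;
    return s;
}

long long parallel_total() {
    const int num_tasks = 4;
    const long long chunk = (N - 1) / num_tasks;   // terms run from 1 to N-1
    std::vector<std::future<long long>> tasks;
    for (int t = 0; t < num_tasks; ++t) {
        long long begin = 1 + t * chunk;
        long long end   = (t == num_tasks - 1) ? N : begin + chunk;
        tasks.push_back(std::async(std::launch::async, partial_sum, begin, end));
    }
    long long total = 0;
    for (auto& task : tasks)
        total = (total + task.get()) % modulus;    // join and combine
    return total;
}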
Try openMP's parallel for with the reduction clause for your total http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause
If f(long long int) is a function that relies solely on its input and no global state, and the abelian properties of addition hold, you can gain a significant advantage like this:
long long total1 = 0, total2 = 0;
for (long long i = 1, j = 2; j < N; i += 2, j += 2)
{
    total1 = (total1 + f(i)) % modulus;
    total2 = (total2 + f(j)) % modulus;
}
if ((N - 1) % 2 != 0)   // odd number of terms: pick up the last one
    total1 = (total1 + f(N - 1)) % modulus;
total = (total1 + total2) % modulus;
Breaking this out like that should help by allowing the compiler to improve code generation and the CPU to use more resources (the two operations can be handled in parallel) and pump more data out and reduce stalls. [I am assuming an x86 architecture here]
Of course, without knowing what f really looks like, it's hard to be sure if this is possible or if it will really help or make a measurable difference.
There may be other similar tricks that exploit special knowledge of your input and your platform - for example, SSE instructions could allow you to do even more. Platform-specific functionality might also be useful. For example, a modulo operation may not be required at all, and your compiler may provide a special intrinsic function to perform addition modulo N.
I must ask, have you profiled your code and found this to be a hotspot?
You could use Threading Building Blocks. Note that a plain tbb::parallel_for writing to a shared total would be a data race; the idiomatic form for a sum is tbb::parallel_reduce, which gives each worker its own partial sum and then combines them:
// needs <tbb/parallel_reduce.h> and <tbb/blocked_range.h>
total = tbb::parallel_reduce(
    tbb::blocked_range<long long>(1, N), 0LL,
    [&](const tbb::blocked_range<long long>& r, long long partial) {
        for (long long i = r.begin(); i != r.end(); ++i)
            partial = (partial + f(i)) % modulus;
        return partial;
    },
    [](long long a, long long b) { return (a + b) % modulus; });
The same structure works without the per-iteration modulus (accumulate plain sums and apply total %= modulus; once at the end), at the cost of possible intermediate overflow.

openMP histogram comparison

I am working on code that compares image histograms by calculating correlation, intersection, ChiSquare and a few other methods. These functions all look very similar to each other.
Usually I work with pthreads, but this time I decided to build a small prototype with openMP (due to its simplicity) and see what kind of results I would get.
This is an example of comparing by correlation; the code is identical to the serial implementation except for the single openMP loop line.
double comp(CHistogram* h1, CHistogram* h2){
    double Sa = 0;
    double Sb = 0;
    double Saa = 0;
    double Sbb = 0;
    double Sab = 0;
    double a, b;
    int N = h1->length;
    #pragma omp parallel for reduction(+:Sa,Sb,Saa,Sbb,Sab) private(a,b)
    for (int i = 0; i < N; i++){
        a = h1->data[i];
        b = h2->data[i];
        Sa  += a;
        Sb  += b;
        Saa += a*a;
        Sbb += b*b;
        Sab += a*b;
    }
    double sUp = Sab - Sa*Sb / N;
    double sDown = (Saa - Sa*Sa / N) * (Sbb - Sb*Sb / N);
    return sUp / sqrt(sDown);
}
Are there more ways to speed up this function with openMP?
Thanks!
PS: I know that the fastest way would be just to compare different pairs of histograms across multiple threads, but this is not applicable to my situation since only 2 histograms are available at a time.
Tested on a quad-core machine:
I have a little bit of uncertainty: over a longer run openMP seems to perform better than serial. But if I compare just a single histogram and measure the time in microseconds, then serial is about 20 times faster.
I guess openMP applies some optimization once it sees the outer for loop. But in the real solution I will have some code in between the histogram comparisons, and I am not sure it will behave the same way.
OpenMP takes some time to set up the parallel region. You need to be careful that this overhead isn't greater than the performance gained by parallelising the loop. In your case this means that only when N reaches a certain size will openMP speed up the calculation.
You should also think about ways to reduce the total number of parallel regions you create; for instance, is it possible to set up a parallel region outside this function so that you compare different histograms in parallel?
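For example, a rough sketch of that idea (assuming a hypothetical batch of histogram pairs is available to compare; comp is the function above):
#include <utility>
#include <vector>

// One parallel region for the whole batch instead of one per comparison.
// Note that comp's own inner parallel for will then run serially unless
// nested parallelism is enabled, which is usually what you want here.
std::vector<double> compare_all(const std::vector<std::pair<CHistogram*, CHistogram*> >& pairs)
{
    std::vector<double> results(pairs.size());
    #pragma omp parallel for
    for (int i = 0; i < (int)pairs.size(); i++){
        results[i] = comp(pairs[i].first, pairs[i].second);
    }
    return results;
}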