call a function and loops in parallel - c++

I don't have any experience with OpenMP, so I want to know how to do the following:
for (int i = 1; i <= NumImages; i++) {
    //call a function
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            //do 2 summing up calculations inside a while loop
        }//end l loop
    }//end k loop
}//end i loop
Now, I have 40 cores at my disposal.
NumImages will be from 50 to 150, most usually 150.
SumNumber will be around 200.
ElNum will be around 5000.
So, is the best way to deal with this to assign every thread to a function call and also to execute the l loop in parallel?
And if yes, will it look like this:
#pragma omp parallel for num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        #pragma omp for
        for (int l = 0; l < ElNum; l++) {
And does the above mean (for NumImages = 150) that myfunction will be executed 40 times in parallel, along with the l loop, and then, when the l loop and the k loop finish, the next 40 threads will call the function again, and then the next 40, so 3*40 = 120, and then the remaining 30?

Generally the best way is the one that splits the work evenly, to maintain efficiency (no cores are left waiting). E.g. in your case static scheduling is probably not a good idea, because 40 does not divide 150 evenly; for the last batch of iterations you would lose 25% of the computing power. So it might turn out that it is better to put the parallel clause before the second loop. It all depends on the mode you choose and on how the work is really distributed within the loops. E.g., if myfunction does 99% of the work, it's a bad idea; if 99% of the work is within the 2 inner loops, it might be good.
Not really. There are 3 scheduling modes, but none of them works in a way that blocks other threads. There is a pool of tasks (iterations) that is distributed among the threads. The scheduling mode describes the strategy for assigning tasks to threads. When one thread finishes, it just gets the next task, with no waiting. The strategies are described in more detail here: http://en.wikipedia.org/wiki/OpenMP#Scheduling_clauses (I am not sure if a blatant copy-paste from the wiki is a good idea, so I'll leave a link. It's good material.)
Maybe what is not written there is that the modes are presented in order of the amount of overhead they introduce: static is fastest, then dynamic, then guided. My advice on when to use which would be (this is not exact, but a good rule of thumb IMO):
static if you know the tasks will be divided evenly among the threads and take the same amount of time
dynamic if you know the tasks will not be divided evenly or their execution times are uneven
guided for rather long tasks about which you pretty much cannot tell anything in advance
If your tasks are rather small you can see an overhead even for static scheduling (e.g. Why is my OpenMP C++ code slower than a serial code?), but I think in your case dynamic should be fine and is the best choice.
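For example, a minimal sketch of the outer-loop-only version with dynamic scheduling (myfunction, NumImages, SumNumber and ElNum are the placeholders from the question, and the signature of myfunction is assumed):
#include <omp.h>

void myfunction(int image); // assumed placeholder signature for the real call

void process_images(int NumImages, int SumNumber, int ElNum) {
    // Each thread grabs whole images; dynamic scheduling keeps all 40 cores busy
    // even though 40 does not divide 150 evenly.
    #pragma omp parallel for schedule(dynamic) num_threads(40)
    for (int i = 1; i <= NumImages; i++) {
        myfunction(i);
        for (int k = 0; k < SumNumber; k++) {
            for (int l = 0; l < ElNum; l++) {
                // the two summing calculations run serially inside this thread
            }
        }
    }
}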

Related

OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didn't work, so I want to set up a conditional to decide whether to parallelise the outer or inner loop, depending on how many outer iterations there are.
Here is the code:
int N_OUTER; // typically 1-8
int N_INNER; // typically > 100
int PAR_THRESH; // this is the parameter I am interested in setting
std::vector<Sol> seeds; // vector with initial solutions
std::vector<Sol> sols (N_OUTER*N_INNER); // vector for output solutions
#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer){
    #pragma omp parallel for if (N_OUTER < PAR_THRESH)
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}
This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what is the best value for PAR_THRESH.
My intuition says that if N_OUTER is 1, then it shouldn't parallelise the outer loop, and that if N_OUTER is greater than the number of threads available, then the outer loop should be the one parallelised, because it uses the maximum available threads and the threads are as long as possible. My question is about when N_OUTER is either 2 or 3 (4 being the number of threads available).
Is it better to run, say, 2 or 3 threads that are long, in parallel; but not use up all of the available threads? Or is it better to run the 2 or 3 outer loops in serial, while utilising the maximum number of threads for the inner loop?
Or is there a kind of trade-off in play: maybe 2 outer loop iterations would be wasting threads, but if there are 3 outer loop iterations, then having longer threads is beneficial, despite the fact that one thread remains unused?
EDIT:
edited code to replace N_ITER with N_INNER in two places
I didn't have much experience with OpenMP, but I have found something like the collapse directive:
https://software.intel.com/en-us/articles/openmp-loop-collapse-directive
Understanding the collapse clause in openmp
It seems to be even more appropriate when the number of inner loop iterations differs.
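For illustration, a minimal sketch of what collapse would look like on the code from the question (solve, seeds and sols are taken from the question as-is):
#pragma omp parallel for collapse(2)
for (int outer = 0; outer < N_OUTER; ++outer){
    for (int inner = 0; inner < N_INNER; ++inner){
        // collapse(2) merges both loops into one iteration space of N_OUTER*N_INNER
        // items, so the threads stay busy even when N_OUTER is small
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}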
--
On the other hand:
It seems to me that solve(...) is side-effect free. It also seems that N_ITER is N_INNER.
Currently you calculate solve N_INNER*N_OUTER times.
While reducing that won't lower the big-O complexity, assuming solve has a very large constant factor it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:
std::vector<Sol> sols_tmp (N_OUTER);
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) {
    sols_tmp[i] = solve(seeds[i]);
}
This calls solve only N_OUTER times.
Because solve returns the same value for every entry of a row:
#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
    sols[i] = sols_tmp[i/N_INNER];
}
Of course, it must be measured whether parallelization is suitable for those loops.

Openmp nested for loop with ordered output

I'm currently trying to find a fast and reliable way to parallelize a set of loops with if conditions, where I need to save a result in the inner loop.
The code is supposed to go through a huge number of points in a 3D grid. For some points within this volume I have to check another condition (checking an angle), and if this condition is fulfilled I have to calculate a density.
The fastest ways so far were #pragma omp parallel for private(x,y,z) collapse(3) outside of all the for loops, or #pragma omp parallel for on the innermost loop (phiInd), which is not only the largest loop but also calls a CPU-intensive function.
I need to store the density value in densityarr within the inner loop. The density array is then saved separately later.
My problem now is that, depending on the number of threads I set, I get different results in my density array. The serial version and an OpenMP run with just 1 thread have identical results.
Increasing the number of threads leads to results at the same points, but those results differ from the serial version.
I know there is #pragma omp for ordered, but this makes the calculation far too slow.
Is there a way to parallelize this loop while still getting my results ordered according to my points (x,y,z)?
Or maybe clearer: Why does increasing the thread number change my result?
double phipoint, Rpoint, zpoint;
double phiplane;
double distphi = 2.0 * M_PI / nPlanes; //set desired distance to phi to assign point to flux tube plane
double* densityarr = new double[max_x_steps * max_y_steps * max_z_steps];
for (z = 0; z < max_z_steps; z++) {
    for (x = 0; x < max_x_steps; x++) {
        for (y = 0; y < max_y_steps; y++) {
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate* pos = new cartesianCoordinate(x_center, y_center, z_center);
            linearToroidalCoordinate* tor = linearToroidal(*pos);
            simpleToroidalCoordinate* stc = simpleToroidal(*pos);
            phipoint = tor->phi;
            if (stc->r <= 0.174/*0.175*/) { //check if point is in vessel
                for (int phiInd = 0; phiInd < nPlanes; ++phiInd) {
                    phiplane = phis[phiInd];
                    if (abs(phipoint - phiplane) <= distphi) { //find right plane for point
                        Rpoint = tor->R;
                        zpoint = tor->z;
                        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] = TubePlanes[phiInd].getMinDistDensity(Rpoint, zpoint);
                    }
                }
            }
            delete pos;
            delete tor;
            delete stc;
        }
    }
}
First, you need to address the errors in your parallel versions. You have race conditions writing to the shared variables phipoint (when the outer loops are parallel) and phiplane, Rpoint, zpoint (with either loop parallel). You must declare those private, or better yet, declare them locally in the first place (which makes them implicitly private). That way the code is much easier to reason about, which is very important for parallel code.
Parallelizing the outer loops like you describe is the obvious and very likely most efficient approach. If there are severe load imbalances (stc->r <= 0.174 not being evenly distributed among the points), you might want to use schedule(dynamic).
Parallelizing the inner loop seems unnecessary in your case. Generally, outer loops provide better efficiency because of lower overhead - unless they don't expose enough parallel work, have race conditions or dependencies, or cause cache issues. It would still be a worthwhile exercise to try it and measure. Note that there may be a race condition when writing to densityarr if more than one of the phis satisfies the condition. Overall that loop seems a bit odd: since you only use at most one of the results in densityarr, you could instead reverse the loop and cancel once you have found the first match. That helps serial execution a lot, but may inhibit parallelization. Also, if you don't find a phi that satisfies the condition, or if the point is not in the vessel, then the respective entry in densityarr remains uninitialized. That can be very dangerous, because you cannot later determine whether the value is valid or not.
A general remark: don't allocate objects with new unless you need to. Just put pos on the stack; that will likely give you better performance. Allocating memory within a (parallel) loop can be a performance issue, so you might want to rethink the way you get your Toroidals.
Note that I assume that TubePlanes[phiInd].getMinDistDensity has no side effects; otherwise parallelization would be problematic.
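Putting those points together, a minimal sketch of the fixed outer-loop version (the coordinate types, TubePlanes and phis come from the question; 0.0 is used here as an assumed sentinel so every entry of densityarr gets written):
#pragma omp parallel for collapse(3) schedule(dynamic)
for (int z = 0; z < max_z_steps; z++) {
    for (int x = 0; x < max_x_steps; x++) {
        for (int y = 0; y < max_y_steps; y++) {
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate pos(x_center, y_center, z_center); // on the stack, no new/delete
            linearToroidalCoordinate* tor = linearToroidal(pos);
            simpleToroidalCoordinate* stc = simpleToroidal(pos);
            double phipoint = tor->phi; // declared locally => implicitly private
            double density = 0.0;       // assumed sentinel, so the entry is always initialized
            if (stc->r <= 0.174) {
                for (int phiInd = 0; phiInd < nPlanes; ++phiInd) {
                    if (abs(phipoint - phis[phiInd]) <= distphi) {
                        density = TubePlanes[phiInd].getMinDistDensity(tor->R, tor->z);
                    }
                }
            }
            densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] = density;
            delete tor;
            delete stc;
        }
    }
}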

Why is my for loop of cilk_spawn doing better than my cilk_for loop?

I have
cilk_for (int i = 0; i < 100; i++)
x = fib(35);
the above takes 6.151 seconds
and
for (int i = 0; i < 100; i++)
x = cilk_spawn fib(35);
takes 5.703 seconds
The fib(x) is the horribly slow recursive Fibonacci function. If I dial down the fib argument, cilk_for does better than cilk_spawn, but it seems to me that regardless of how long fib(x) takes, cilk_for should do better than cilk_spawn.
What don't I understand?
Per the comments, the issue was a missing cilk_sync. I'll expand on that to point out how the ratio of times can be predicted with surprising accuracy.
On a system with P hardware threads (typically 8 on an i7), the for/cilk_spawn code will execute as follows:
The initial thread will execute the iteration for i=0, and leave a continuation that is stolen by some other thread.
Each thief will steal an iteration and leave a continuation for the next iteration.
When each thief finishes an iteration, it goes back to step 2, unless there are no more iterations to steal.
Thus the threads execute the loop hand-over-hand, and the loop exits at a point where P-1 threads are still working on iterations. So the loop can be expected to finish after evaluating only about 100-(P-1) iterations.
So for 8 hardware threads, the for/cilk_spawn with the missing cilk_sync should take about 93/100 of the time of the cilk_for, quite close to the observed ratio of 5.703/6.151 ≈ 0.927.
In contrast, in a "child steal" system such as TBB or PPL task_group, the loop will race to completion, generating 100 tasks, and then keep going until a call to task_group::wait. In that case, forgetting the synchronization would have led to a much more dramatic ratio of times.
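For reference, a minimal sketch of the spawning loop with the missing synchronization added (fib is the recursive function from the question; the write to x races, as in the original, but it is only a throwaway result):
#include <cilk/cilk.h>

long fib(int n); // the same recursive Fibonacci as in the question

void timed_loop() {
    long x = 0;
    for (int i = 0; i < 100; i++)
        x = cilk_spawn fib(35); // spawn; the continuation may be stolen by another worker
    cilk_sync;                  // without this, timing ends while ~P-1 iterations are still running
    (void)x;
}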

parallel_for - Which loop to parallelize?

I have a 3-times nested loop, where the two outer loops iterate only very few times compared to the innermost loop. Something like this:
for (int i = 0; i < I; i++) {
    for (int j = 0; j < J; j++) {
        for (int k = 0; k < K; k++) {
            //Do stuff
        }
    }
}
I ~= J << K, i.e. I roughly equals J, but K is very much larger (by a factor of a few thousand).
Since all of the data are independent of each other, I would like to parallelize the loops using parallel_for from the PPL (ppl.h). The question now arises: which loop do I parallelize? I'm leaning towards the innermost loop, since it's the largest, but I assume that every time the outer loops iterate, the whole threading overhead starts again. So which is more efficient?
The question now arises, which loop do I parallelize?
Typically, you'd want to parallelize the outermost loop that makes sense. If you parallelize the inner loops, you are introducing extra overhead. By having the "loop bodies" be as large as possible, you'll get better overall throughput. This really boils down to Amdahl's law - in this case, the overhead involved in scheduling the parallel work items is not parallelizable, so the more of that work you do, the lower the potential efficiency overall.
The risk is that, if there are too few iterations in the outer loop, you may end up unable to run enough work items in parallel, since at some point there will be fewer items than processing cores in your system.
Provided that your outer loop has enough to keep the cores busy, it's the best place to go - especially if the amount of work done in each loop body is relatively consistent.
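A minimal sketch of that outer-loop version with PPL (do_stuff is a hypothetical placeholder for the real loop body; the bounds I, J, K are from the question):
#include <ppl.h>

void do_stuff(int i, int j, int k); // hypothetical stand-in for "Do stuff"

void run(int I, int J, int K) {
    // Parallelize only the outermost loop; each task then gets a large, consistent body.
    concurrency::parallel_for(0, I, [&](int i) {
        for (int j = 0; j < J; ++j) {
            for (int k = 0; k < K; ++k) {
                do_stuff(i, j, k);
            }
        }
    });
}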

OpenMP Thread count question

So I'm doing a bit of parallel programming of the trapezoidal rule for my OS class. This is a homework question, but I'm not looking for source code.
After a bit of research I decided to use each thread to compute a subinterval,
using:
g = (b-a)/n;
integral += (func(a) + func(b))/2.0;
# pragma omp parallel for schedule(static) default(none) \
    shared(a, g, n) private(i, x) \
    reduction(+: integral) num_threads(thread_count)
for (i = 1; i <= n-1; i++) {
    x = a + i*g;
    integral += func(x);
}
In my integral function, func(x) is the function that I read in from the file.
So I emailed my professor to ask how he wants to go about choosing the number of threads (since the iterations will need to be evenly divisible by the number of threads for the trapezoidal rule),
but he says I don't need to define them, and OpenMP will define them based on the number of cores on my machine... so needless to say I'm a bit confused.
Your professor is correct: OpenMP will choose an optimum number of threads by default, which is usually the number of cores.
You don't need to worry about n being exactly divisible by the number of threads: OpenMP will automatically distribute the iterations among the threads, and if they're not evenly divisible, one thread will end up performing a little more or less work.
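For instance, a minimal sketch of the same reduction loop relying on the default thread count (func, a, b and n are the question's placeholders; omp_get_max_threads() just reports what OpenMP chose):
#include <cstdio>
#include <omp.h>

double trapezoid(double (*func)(double), double a, double b, int n) {
    double g = (b - a) / n;
    double integral = (func(a) + func(b)) / 2.0;
    #pragma omp parallel for reduction(+: integral) // no num_threads: OpenMP picks the default
    for (int i = 1; i <= n - 1; i++) {
        integral += func(a + i * g);
    }
    std::printf("up to %d threads were available\n", omp_get_max_threads());
    return integral * g;
}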