Why is my for loop of cilk_spawn doing better than my cilk_for loop?

I have
cilk_for (int i = 0; i < 100; i++)
    x = fib(35);
the above takes 6.151 seconds
and
for (int i = 0; i < 100; i++)
    x = cilk_spawn fib(35);
takes 5.703 seconds
Here fib(x) is the horrible exponential recursive Fibonacci function. If I dial down the argument to fib, cilk_for does better than cilk_spawn, but it seems to me that regardless of how long fib(x) takes, cilk_for should do better than cilk_spawn.
What don't I understand?

Per the comments, the issue was a missing cilk_sync. I'll expand on that to point out exactly how the ratio of times can be predicted with surprising accuracy.
On a system with P hardware threads (typically 8 on an i7), the for/cilk_spawn code will execute as follows:
1. The initial thread will execute the iteration for i=0, and leave a continuation that is stolen by some other thread.
2. Each thief will steal an iteration and leave a continuation for the next iteration.
3. When each thief finishes an iteration, it goes back to step 2, unless there are no more iterations to steal.
Thus the threads will execute the loop hand-over-hand, and the loop exits at a point where P-1 threads are still working on iterations. So the loop can be expected to finish after evaluating only 100-(P-1) iterations.
So for 8 hardware threads, the for/cilk_spawn with missing cilk_sync should take about 93/100 of the time for the cilk_for, quite close to the observed ratio of about 5.703/6.151 = 0.927.
In contrast, in a "child steal" system such as TBB or PPL task_group, the loop will race to completion, generating 100 tasks, and then keep going until a call to task_group::wait. In that case, forgetting the synchronization would have led to a much more dramatic ratio of times.
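For reference, a rough sketch of what the fixed for/cilk_spawn version might look like, assuming Cilk Plus (<cilk/cilk.h>) and the naive recursive fib from the question; run_spawn_loop is just a wrapper name for illustration:
#include <cilk/cilk.h>

// Naive exponential recursive Fibonacci, as described in the question.
long fib(int n) {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

void run_spawn_loop() {
    long x = 0;
    for (int i = 0; i < 100; i++)
        x = cilk_spawn fib(35);   // each spawned call writes x, as in the question
    cilk_sync;                    // without this, the loop "finishes" while ~P-1 iterations are still running
}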

Related

OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didn't work, so I want to set up a conditional to decide whether to parallelise the outer or inner loop, depending on how many outer iterations there are.
Here is the code:
int N_OUTER; // typically 1-8
int N_INNER; // typically > 100
int PAR_THRESH; // this is the parameter I am interested in setting

std::vector<Sol> seeds; // vector with initial solutions
std::vector<Sol> sols (N_OUTER*N_INNER); // vector for output solutions

#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer){
    #pragma omp parallel for if (N_OUTER < PAR_THRESH)
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}
This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what is the best value for PAR_THRESH.
My intuition says that if N_OUTER is 1, then it shouldn't parallelise the outer loop, and if N_OUTER is greater than the number of threads available, then the outer loop should be the one parallelised, because that uses the maximum number of available threads and the threads are as long as possible. My question is about when N_OUTER is 2 or 3 (with 4 being the number of threads available).
Is it better to run, say, 2 or 3 long threads in parallel, but not use all of the available threads? Or is it better to run the 2 or 3 outer iterations in serial, while utilising the maximum number of threads for the inner loop?
Or is there a trade-off in play, where 2 outer loop iterations might be wasting threads, but with 3 outer loop iterations having longer threads is beneficial, despite the fact that one thread remains unused?
EDIT:
edited code to replace N_ITER with N_INNER in two places
I don't have much experience with OpenMP, but I have found the collapse clause:
https://software.intel.com/en-us/articles/openmp-loop-collapse-directive
Understanding the collapse clause in openmp
It seems to be even more appropriate when the number of inner loop iterations differs.
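For example, a rough sketch of what the collapsed version might look like, using the variables from the question (the two loops must stay perfectly nested for collapse to apply):
#pragma omp parallel for collapse(2)
for (int outer = 0; outer < N_OUTER; ++outer){
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}
This gives OpenMP N_OUTER*N_INNER iterations to distribute among the threads, so the PAR_THRESH logic is no longer needed.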
--
On the other hand:
It seems to me that solve(...) is side-effect free, and that N_ITER is the same as N_INNER.
Currently you call solve N_INNER*N_OUTER times.
While reducing that won't change the big-O complexity, solve presumably has a very large constant factor, so it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:
std::vector<Sol> sols_tmp (N_OUTER); // one cached result per seed
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) {
    sols_tmp[i] = solve(seeds[i]);
}
This calls solve only N_OUTER times.
Because solve returns the same value for each row:
#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
    sols[i] = sols_tmp[i/N_INNER];
}
Of course it must be measured whether parallelization is suitable for those loops.
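One simple way to measure it is to time each variant with omp_get_wtime; a rough sketch, where run_variant is just a placeholder for whichever version of the loops you want to test:
#include <cstdio>
#include <omp.h>

void time_variant() {
    double t0 = omp_get_wtime();
    // run_variant();              // placeholder: the collapsed loop or the two-pass version
    double t1 = omp_get_wtime();
    std::printf("elapsed: %.3f s\n", t1 - t0);   // elapsed wall-clock seconds
}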

operator ++ (prefix) with threads

A bet between friends.
The sum variable is defined as a global,
and we have 2 threads that each run over a loop from 1 to 100 and increment sum by 1 on every iteration.
What will be printed as "sum="?
#include <iostream>
#include <thread>

int sum = 0;

void func(){
    for (int i = 1; i <= 100; i++){
        sum++;
    }
}

int main(){
    std::thread t1(func);
    std::thread t2(func);
    t1.join();
    t2.join();
    std::cout << "sum = " << sum;
    return 0;
}
It is undefined behavior, so I am going to say 42. When you have more than one thread accessing a shared variable and at least one of them is a writer, then you need synchronization. If you do not have that synchronization, then you have undefined behavior and we cannot tell you what will happen.
You could use a std::mutex or you could use a std::atomic to get synchronization and make the program's behavior defined.
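For example, a rough sketch of the std::atomic version (a std::mutex around the increment would work just as well):
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> sum{0};

void func(){
    for (int i = 1; i <= 100; i++){
        ++sum;                      // atomic read-modify-write, no data race
    }
}

int main(){
    std::thread t1(func);
    std::thread t2(func);
    t1.join();
    t2.join();
    std::cout << "sum = " << sum;   // now always prints sum = 200
    return 0;
}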
There is no single value for sum. If no increments are lost, the value will be 200. If an increment is lost on every iteration of the loop (unlikely), it could be as low as 100. Or it could be anywhere in between.
You probably think of sum++ as an atomic operation, but it is actually shorthand for sum = sum + 1: a read, an increment, and a write back. There is the possibility of a race within that sequence, so sum could be different every time you run it.
Imagine the current value of sum is 10. Then t1 gets into the loop and reads the value of sum (10), and then is stopped to let t2 begin running. t2 will then read the same value (10) of sum as t1. Then when each thread increments, they will both increment it to 11, so one increment is lost. If there are no other races, the end value of sum would be 199.
Here's an even worse case. Imagine the current value of sum is 10 again. t1 gets into the loop and reads the value of sum (10), then is stopped to let t2 begin running. t2, again, reads the value of sum (10) and then itself is stopped. Now t1 begins running again and it loops through 10 times setting the value of sum to 20. Now t2 starts up again and increments sum to 11, so you've actually decremented the value of sum.
Since incrementation is not atomic, it will result in undefined behaviour.
It will be an unpredictable value between 100 and 200. There is a race condition between the two threads because there is no mutual exclusion, so some ++ operations will be lost. That is why you will get 100 when all ++ operations of one thread are lost, and 200 when nothing is lost. Anything in between may happen.

call a function and loops in parallel

I don't have any experience with OpenMP, so I want to know how to do the following:
for (int i = 1; i <= NumImages; i++) {
    // call a function
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            // do 2 summing-up calculations inside a while loop
        } // end l loop
    } // end k loop
} // end i loop
Now, I have 40 cores at my disposal.
NumImages will be from 50 to 150, most usually 150.
SumNumber will be around 200.
ElNum will be around 5000.
So, is the best way of dealing with this to assign every thread to a function call and also execute the l loop in parallel?
And if yes, would it be like this:
#pragma omp parallel for num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        #pragma omp for
        for (int l = 0; l < ElNum; l++) {
And does the above mean (for NumImages = 150) that myfunction will be executed 40 times in parallel, and the l loop too, and then, when the l loop and k loop finish, the next 40 iterations will call the function again, and then the next 40, so 3*40 = 120, and then the remaining 30?
Generally the best way is the one that splits the work evenly, to maintain efficiency (no cores are left waiting). E.g. in your case static scheduling is probably not a good idea, because 40 does not divide 150 evenly, so in the last round of iterations you would lose 25% of the computing power. So it might turn out that it would be better to put the parallel clause before the second loop. It all depends on the scheduling mode you choose and on how the work is really distributed within the loops. E.g., if myfunction does 99% of the work then it's a bad idea; if 99% of the work is within the 2 inner loops it might be good.
Not really. There are 3 scheduling modes, but none of them works in a way that blocks other threads. There is a pool of tasks (iterations) that is distributed among the threads, and the scheduling mode describes the strategy of assigning tasks to threads. When one thread finishes, it just gets the next task, with no waiting. The strategies are described in more detail here: http://en.wikipedia.org/wiki/OpenMP#Scheduling_clauses (I am not sure a blatant copy-paste from the wiki is a good idea, so I'll leave the link. It's good material.)
Maybe what is not written there is that the modes are listed in order of the amount of overhead they introduce: static is fastest, then dynamic, then guided. My advice on when to use which would be (not the exact best, but a good rule of thumb IMO):
static if you know the work will be divided evenly among the threads and each task takes about the same amount of time
dynamic if you know the tasks will not be divided evenly or their execution times are uneven
guided for rather long tasks about which you pretty much cannot tell anything in advance
If your tasks are rather small you can see overhead even for static scheduling (e.g. why my OpenMP C++ code is slower than a serial code?), but I think in your case dynamic should be fine and the best choice.
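Applied to your code, a dynamic schedule on the outer loop might look like the sketch below (myfunction(...) is left exactly as in your snippet):
#pragma omp parallel for schedule(dynamic) num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            // the two summing-up calculations go here
        }
    }
}
Each of the 150 outer iterations then becomes a task handed to the next free thread, so no thread sits idle waiting for a batch of 40 to finish.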

parallel_for - Which loop to parallelize?

I have a triply nested loop, where the two outer loops iterate only a few times compared to the innermost loop. Something like this:
for (int i = 0; i < I; i++) {
    for (int j = 0; j < J; j++) {
        for (int k = 0; k < K; k++) {
            // Do stuff
        }
    }
}
I ~= J << K, i.e. I roughly equals J, but K is very much larger (by a factor of a few thousand).
Since all of the data are independent of each other, I would like to parallelize the loops using parallel_for from the ppl.h library. The question now arises, which loop do I parallelize? I'm tending towards the innermost loop, since it's the largest, but I assume that every time the outer loops iterate, the whole threading overhead starts again. So what is more efficient?
The question now arises, which loop do I parallelize?
Typically, you'd want to parallelize the outermost loop that makes sense. If you parallelize the inner loops, you are introducing extra overhead. By having the "loop bodies" be as large as possible, you'll get better overall throughput. This really boils down to Amdahl's law - in this case, the overhead involved in scheduling the parallel work items is not parallelizable, so the more of that work you do, the lower the potential efficiency overall.
The risk is that, if there are too few items in the outer loop, you may end up in a situation where the work items can't fully occupy the machine, since there will be a point where there are fewer items than processing cores in your system.
Provided that your outer loop has enough to keep the cores busy, it's the best place to go - especially if the amount of work done in each loop body is relatively consistent.
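For example, a rough sketch of parallelizing the outermost loop with PPL (assuming Visual Studio's <ppl.h> and that the loop bodies are independent, as stated in the question):
#include <ppl.h>

concurrency::parallel_for(0, I, [&](int i) {
    for (int j = 0; j < J; j++) {
        for (int k = 0; k < K; k++) {
            // Do stuff
        }
    }
});
If I is too small to keep all cores busy, parallelizing the j loop instead (or flattening i and j into a single range of I*J items) would be the next thing to try.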

performance difference in two almost same loop

I have two almost identical loops, but with a remarkable difference in performance. Both were tested with MSVC 2010, on a system with a ~2.4 GHz CPU and 8 GB RAM.
The loop below takes around 2500 ms to execute
for (double count = 0; count < ((2.9*4/555+3/9)*109070123123.8); count++)
;
And this loop executes in less than 1 ms
for (double count = ((2.9*4/555+3/9)*109070123123.8); count >0; --count)
;
What is making such a huge difference here? One uses post-increment and the other pre-increment; can that result in such a huge difference?
You're compiling without optimizations, so the comparison is futile. (If you did have optimizations on, that code would just be cut out completely).
Without optimization, the computation is likely executed at each iteration in the first loop, whereas the second loop only does the computation once, when it first initializes count.
Try changing the first loop to
auto max = ((2.9*4/555+3/9)*109070123123.8);
for (double count = 0; count < max; count++)
;
and then stop profiling debug builds.
In the first loop, count < ((2.9*4/555+3/9)*109070123123.8) is computed every time round the loop, whereas in the second, count = ((2.9*4/555+3/9)*109070123123.8) is calculated once and count is simply decremented each time round the loop.