parallel_for - Which loop to parallelize? - c++

I have a triply nested loop, where the two outer loops iterate only a few times compared to the innermost loop. Something like this:
for (int i = 0; i < I; i++) {
    for (int j = 0; j < J; j++) {
        for (int k = 0; k < K; k++) {
            // Do stuff
        }
    }
}
I ~= J << K, i.e. I is roughly equal to J, but K is much larger (by a factor of a few thousand).
Since all of the data are independent of each other, I would like to parallelize the loops using parallel_for from the ppl.h library. The question now arises: which loop do I parallelize? I'm tending towards the innermost loop, since it's the largest, but I assume that every time the outer loops iterate, the whole threading overhead starts again. So what is more efficient?

The question now arises, which loop do I parallelize?
Typically, you'd want to parallelize the outermost loop that makes sense. If you parallelize the inner loops, you are introducing extra overhead. By having the "loop bodies" be as large as possible, you'll get better overall throughput. This really boils down to Amdahl's law - in this case, the overhead involved in scheduling the parallel work items is not parallelizable, so the more of that work you do, the lower the potential efficiency overall.
The risk is that, if there are too few items in the outer loop, you may end up unable to exploit all the available parallelism, since there will be fewer work items than processing cores in your system.
Provided that your outer loop has enough to keep the cores busy, it's the best place to go - especially if the amount of work done in each loop body is relatively consistent.
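For instance, a minimal sketch of that with PPL (using the loop bounds from your question) could look like this:

#include <ppl.h>

// Sketch only: parallelize the outermost loop and keep the inner loops
// serial, so each parallel task carries as much work as possible.
concurrency::parallel_for(0, I, [&](int i) {
    for (int j = 0; j < J; j++) {
        for (int k = 0; k < K; k++) {
            // Do stuff
        }
    }
});

If I alone isn't large enough to keep all cores busy, you could instead run parallel_for over the j loop, or fuse i and j into a single range of I*J iterations.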

How do I translate this ACC code to SYCL?

My question is:
I have this code:
#pragma acc parallel loop
for (i = 0; i < bands; i++)
{
    #pragma acc loop seq
    for (j = 0; j < lines_samples; j++)
        r_m[i] += image_vector[i*lines_samples+j];
    r_m[i] /= lines_samples;
    #pragma acc loop
    for (j = 0; j < lines_samples; j++)
        R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
}
I'm trying to translate it to SYCL, and I thought about replacing the first parallel loop with a kernel, using the typical "queue.submit(...)" over "i". But then I realized that inside the first big loop there is a loop that must be executed serially. Is there a way to tell SYCL to execute a loop inside a kernel serially?
I can't think of another way to solve this, as I need to make both the first big for and the last for inside the main one parallel.
Thank you in advance.
You have a couple of options here. The first one, as you suggest, is to create a kernel with a 1D range over i:
q.submit([&](sycl::handler &cgh){
    // r_m, image_vector and R_o are assumed to be device-accessible
    // (USM pointers or accessors obtained from buffers)
    cgh.parallel_for(sycl::range<1>(bands), [=](sycl::item<1> i){
        for (int j = 0; j < lines_samples; j++)
            r_m[i] += image_vector[i*lines_samples+j];
        r_m[i] /= lines_samples;
        for (int j = 0; j < lines_samples; j++)
            R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
    });
});
Note that for the inner loops, the kernel will just iterate serially over j in both cases. SYCL doesn't apply any magic to your loops like a #pragma would - loops are loops.
This is fine, but you're missing out on a higher degree of parallelism which could be achieved by writing a kernel with a 2D range over i and j: sycl::range<2>(bands, lines_samples). This can be made to work relatively easily, assuming your first loop is doing what I think it's doing, which is computing the average of a line of an image. In this case, you don't really need a serial loop - you can achieve this using work-groups.
Work-groups in SYCL have access to fast on-chip shared memory, and are able to synchronise. This means that you can have a work-group load all the pixels from a line of your image, then the work-group can collaboratively compute the average of that line, synchronize, then each member of the work-group uses the computed average to compute a single value of R_o, your output. This approach maximises available parallelism.
The collaborative reduction operation to find the average of the given line is probably best achieved through tree-reduction. Here are a couple of guides which go through this workgroup reduction approach:
https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/examples
https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/kernels/reduction.html
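For illustration, here is a rough sketch of that work-group approach in SYCL 2020 syntax. This is an assumption-laden sketch, not the only way to write it: it assumes float data, that image_vector, r_m and R_o are device-accessible USM pointers, and that wg_size is a power of two.

const size_t wg_size = 128; // assumed power-of-two work-group size
q.submit([&](sycl::handler &cgh){
    // one work-group per band; scratch holds partial sums for the reduction
    sycl::local_accessor<float, 1> scratch(sycl::range<1>(wg_size), cgh);
    cgh.parallel_for(sycl::nd_range<1>(sycl::range<1>(bands * wg_size),
                                       sycl::range<1>(wg_size)),
                     [=](sycl::nd_item<1> it){
        const size_t i   = it.get_group(0);    // band (line) index
        const size_t lid = it.get_local_id(0);
        // each work-item sums a strided slice of the line
        float partial = 0.0f;
        for (size_t j = lid; j < lines_samples; j += wg_size)
            partial += image_vector[i * lines_samples + j];
        scratch[lid] = partial;
        // collaborative tree reduction in local memory
        for (size_t s = wg_size / 2; s > 0; s /= 2) {
            sycl::group_barrier(it.get_group());
            if (lid < s) scratch[lid] += scratch[lid + s];
        }
        sycl::group_barrier(it.get_group());
        const float mean = scratch[0] / lines_samples;
        if (lid == 0) r_m[i] = mean;
        // every work-item subtracts the mean from its slice of the line
        for (size_t j = lid; j < lines_samples; j += wg_size)
            R_o[i * lines_samples + j] = image_vector[i * lines_samples + j] - mean;
    });
});

Where supported, the hand-written tree reduction can also be replaced by sycl::reduce_over_group.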

Is this Insertion Sort implementation worst case O(n)?

I know that Insertion Sort is supposed to be worst case O(n^2), but I'm wondering why the following implementation isn't O(n).
void main()
{
//insertion sort runs from i = 1 to i = n, thus is worst case O(n)
for (
int i = 1,
placeholder = 0,
A[] = { 10,9,8,7,6,5,4,3,2,1 },
j = i;
i <= 10;
j-- > 0 && A[j - 1] > A[j]
? placeholder = A[j], A[j] = A[j - 1], A[j - 1] = placeholder
: j = ++i
)
{
for (
int x = 0;
x < 10; x++
)
cout << A[x] << ' ';
cout << endl;
}
system("pause");
}
There is only one for loop involved here and it runs from 1 to n. It seems to me that this would be the definition of O(n). What exactly am I missing here?
Sloppy terminology has led many people to false conclusions. This appears to be an example.
There is only one for loop involved here and it runs from 1 to n.
Yes, there is only one loop, but what is this "it" to which you refer? I really do mean for you to think about it. Should "it" refer to the loop? That would match a fairly common, yet sloppy, use of terminology, but a loop does not evaluate to a value. So a loop cannot actually run from one value to another. The sloppiness can be overlooked in simpler contexts, but not in yours.
Normally, the "it" would really refer to the loop control variable. With a simple loop, like for (int i = 0; i < 10; ++i), there is a one-to-one correspondence between iterations of the loop and values assigned to the control variable (which is i in my example). So there is an equivalence present, allowing one to refer to the loop when one really means the control variable. Saying that a loop runs from x to y really means that the control variable runs from x to y, and that there is one iteration of the loop per value assigned to the control variable. This correspondence fails in your code.
In your loop, the thing that runs from 1 to n is i. However, i is not incremented with each iteration of the loop, so "it runs from 1 to n" is not an accurate assessment of your loop. When i is 1, there are up to 2 iterations. That's not a one-to-one correspondence between iterations and values of i. As i increases, the divergence from one-to-one grows. Each value of i potentially corresponds to i+1 iterations, as j counts down from i to 0. The total number of iterations in the worst case scenario for n entries is the sum of the potential number of iterations for each value of i: 2 + 3 + ⋯ + (n+1) = (n² + 3n)/2. That's O(n²).
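For the 10-element array in the question, that works out to 2 + 3 + ⋯ + 11 = 65 iterations in the worst case, matching (10² + 3·10)/2 = 65.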
Moral of the story: writing compact, cryptic code does not magically change the complexity of the algorithm being implemented. Cryptic code can make the complexity harder to pin down, but the main thing you've accomplished is making your code harder to read.
That's a very odd way to write code. But you have 2 for loops in the definition. It is not always necessary to have nested loops to get O(n^2); you can also get it with recursion.
In simple terms, O(n^2) means the number of operations performed grows quadratically with the input size n.
The code given is not correct C++ and is not even close to pseudocode.
The correct code should look like this:
#include <iostream>
using namespace std;

int main()
{
    int i, j, key;
    int A[] = {10, 9, 8, 7, 6, 5, 4, 3, 2, 1};
    //cout << "Array before sorting:" << endl;
    //for (i = 0; i < 10; i++)
    //    cout << A[i] << "\t";
    //cout << endl;
    for (i = 1; i < 10; i++)
    {
        key = A[i];                       // element to insert
        for (j = i - 1; j >= 0 && A[j] > key; j--)
        {
            A[j + 1] = A[j];              // shift larger elements right
        }
        A[j + 1] = key;                   // insert key into its place
    }
    //cout << "Array after sorting:" << endl;
    //for (i = 0; i < 10; i++)
    //    cout << A[i] << "\t";
    //cout << endl;
    return 0;
}
See, insertion sort has two loops. The outer loop maintains the key variable, and the inner loop compares the elements before the key with the key itself. Therefore the worst-case time complexity is O(n^2) and not O(n), since the basic insertion sort algorithm contains two loops, both of which iterate on the order of n times in the worst case, i.e. when the array is reversed.

OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didn't work, so I want to set up a conditional to decide whether to parallelise the outer or inner loop, depending on how many outer iterations there are.
Here is the code:
int N_OUTER;    // typically 1-8
int N_INNER;    // typically > 100
int PAR_THRESH; // this is the parameter I am interested in setting

std::vector<Sol> seeds;                   // vector with initial solutions
std::vector<Sol> sols(N_OUTER * N_INNER); // vector for output solutions

#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer) {
    #pragma omp parallel for if (N_OUTER < PAR_THRESH)
    for (int inner = 0; inner < N_INNER; ++inner) {
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}
This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what is the best value for PAR_THRESH.
My intuition says that if N_OUTER is 1, then it shouldn't parallelise the outer loop; and if N_OUTER is greater than the number of threads available, then the outer loop should be the one to parallelise, because that uses the maximum available threads and each thread runs as long as possible. My question is about when N_OUTER is either 2 or 3 (4 being the number of threads available).
Is it better to run, say, 2 or 3 threads that are long, in parallel; but not use up all of the available threads? Or is it better to run the 2 or 3 outer loops in serial, while utilising the maximum number of threads for the inner loop?
Or is there a kind of trade-off in play, where maybe 2 outer loop iterations would waste threads, but with 3 outer loop iterations the longer threads are beneficial, despite the fact that one thread remains unused?
EDIT:
edited code to replace N_ITER with N_INNER in two places
I don't have much experience with OpenMP, but I have found something like the collapse directive:
https://software.intel.com/en-us/articles/openmp-loop-collapse-directive
Understanding the collapse clause in openmp
It seems even more appropriate when the number of inner loop iterations differs.
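As a rough sketch on the loops from the question (assuming the same variable names), collapse(2) fuses the two loops into a single iteration space of N_OUTER*N_INNER chunks of work:

#pragma omp parallel for collapse(2)
for (int outer = 0; outer < N_OUTER; ++outer) {
    for (int inner = 0; inner < N_INNER; ++inner) {
        sols[outer * N_INNER + inner] = solve(seeds[outer]);
    }
}

Note that collapse requires the loops to be perfectly nested, which is the case here.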
--
On the other hand:
It seems to me that solve(...) is side-effect free. It seems also that N_ITER is N_INNER.
Currently you calculate solve N_INNER*N_OUTER times.
While reducing that won't change the big-O complexity, assuming solve has a very large constant factor it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:
std::vector<Sol> sols_tmp(N_OUTER); // one solution per outer iteration
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) {
    sols_tmp[i] = solve(seeds[i]);
}
This calls solve only N_OUTER times.
Because solve returns the same value for every element of a row:
#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
    sols[i] = sols_tmp[i/N_INNER];
}
Of course, you must measure whether parallelization is worthwhile for those loops.

call a function and loops in parallel

I don't have any experience with OpenMP, so I want to know how to do the following:
for (int i = 1; i <= NumImages; i++) {
    // call a function
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            // do 2 summing up calculations inside a while loop
        } // end l loop
    } // end k loop
} // end i loop
Now, I have 40 cores at my disposal.
NumImages will be from 50 to 150, most often 150.
SumNumber will be around 200.
ElNum will be around 5000.
So, is the best way to deal with this to assign every thread to a function call and also execute the l loop in parallel?
And if yes, will it be like this:
#pragma omp parallel for num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        #pragma omp for
        for (int l = 0; l < ElNum; l++) {
And does the above mean (for NumImages = 150) that myfunction and the l loop will be executed 40 times in parallel, and then, when the l loop and k loop finish, the next 40 threads will call the function again, and then the next 40, so 3*40 = 120, and then the remaining 30?
Generally the best way is the one that splits the work evenly, to maintain efficiency (no cores are left waiting). E.g. in your case static scheduling is probably not a good idea, because 40 does not divide 150 evenly; during the last batch of iterations you would lose 25% of the computing power. So it might turn out to be better to put the parallel clause before the second loop. It all depends on the scheduling mode you choose and on how the work is really distributed within the loops. E.g., if myfunction does 99% of the work, then it's a bad idea; if 99% of the work is within the two inner loops, it might be good.
Not really. There are 3 scheduling modes, but none of them works in a way that blocks other threads. There is a pool of tasks (iterations) that is distributed among the threads, and the scheduling mode describes the strategy used to assign tasks to threads. When one thread finishes, it just grabs the next task, with no waiting. The strategies are described in more detail here: http://en.wikipedia.org/wiki/OpenMP#Scheduling_clauses (I am not sure a blatant copy-paste from the wiki is a good idea, so I'll just leave the link. It's good material.)
Maybe what is not written there is that the modes are listed in order of the amount of overhead they introduce: static is fastest, then dynamic, then guided. My advice on when to use which (not an exact rule, but a good rule of thumb IMO):
static if you know the tasks will be divided evenly among the threads and take the same amount of time
dynamic if you know the tasks will not be divided evenly or their execution times are not even
guided for rather long tasks about which you pretty much cannot tell anything
If your tasks are rather small, you can see overhead even with static scheduling (e.g. "Why is my OpenMP C++ code slower than serial code?"), but I think in your case dynamic should be fine and is the best choice.
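To make that concrete, here is a rough, self-contained sketch; the loop body is just a stand-in for myfunction plus the summing loops (not your real code), and results is assumed to be pre-sized to NumImages:

#include <vector>
#include <cmath>

void process_images(int NumImages, int SumNumber, int ElNum,
                    std::vector<double> &results) {
    // dynamic scheduling: each thread grabs the next image as soon as it
    // finishes one, so no core sits idle even though 40 does not divide
    // 150 evenly
    #pragma omp parallel for schedule(dynamic) num_threads(40)
    for (int i = 0; i < NumImages; i++) {
        double sum = 0.0; // stand-in for myfunction(...) and the k/l loops
        for (int k = 0; k < SumNumber; k++)
            for (int l = 0; l < ElNum; l++)
                sum += std::sqrt(double(i + k + l));
        results[i] = sum;
    }
}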

Why is my for loop of cilk_spawn doing better than my cilk_for loop?

I have
cilk_for (int i = 0; i < 100; i++)
x = fib(35);
the above takes 6.151 seconds
and
for (int i = 0; i < 100; i++)
x = cilk_spawn fib(35);
takes 5.703 seconds
fib(x) is the horribly inefficient recursive Fibonacci function. If I dial down the argument to fib, cilk_for does better than cilk_spawn, but it seems to me that regardless of how long fib(x) takes, cilk_for should do better than cilk_spawn.
What don't I understand?
Per comments, the issue was a missing cilk_sync. I'll expand on that to point out exactly how the ratio of time can be predicted with surprising accuracy.
On a system with P hardware threads (typically 8 on an i7), the for/cilk_spawn code will execute as follows:
The initial thread will execute the iteration for i=0, and leave a continuation that is stolen by some other thread.
Each thief will steal an iteration and leave a continuation for the next iteration.
When each thief finishes an iteration, it goes back to step 2, unless there are no more iterations to steal.
Thus the threads will execute the loop hand-over-hand, and the loop exits at a point where P-1 threads are still working on iterations. So the loop can be expected to finish after evaluating only about 100-(P-1) iterations.
So for 8 hardware threads, the for/cilk_spawn with missing cilk_sync should take about 93/100 of the time for the cilk_for, quite close to the observed ratio of about 5.703/6.151 = 0.927.
In contrast, in a "child steal" system such as TBB or PPL task_group, the loop will race to completion, generating 100 tasks, and then keep going until a call to task_group::wait. In that case, forgetting the synchronization would have led to a much more dramatic ratio of times.
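For reference, the spawn loop with the missing synchronization restored would look roughly like this; the cilk_sync makes the parent wait for all spawned fib calls before the timing stops:

for (int i = 0; i < 100; i++)
    x = cilk_spawn fib(35);
cilk_sync; // wait for all spawned fib calls before proceeding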