I have a nested loop that looks as follows:
for (int i = 0; i < limit1; i++)
{
    for (int j = 0; j < limit2; j++)
    {
        /*
           Code with custom function calls, branching, etc.
        */
    }
}
The inner and outer loop iterations are independent of each other. Since this loop cannot trivially be vectorized, I would like to collapse it into one huge iteration space so that it can be unrolled more efficiently, in the sense that, instead of unrolling only the inner loop, the combined iteration space can be unrolled.
I do not want to parallelize it. From my understanding, OpenMP does not provide a standalone collapsing pragma such as #pragma omp collapse(2); rather, the collapse clause is used in conjunction with other pragmas like simd or for.
Can I achieve this via OpenMP, or perhaps some other method?
PS - I am not sure about the pros and cons of this method, or whether one would get a significant performance increase, etc. If someone could explain how one can (or, for that matter, whether one can) ascertain this theoretically (i.e. before benchmarking), that would be great.
TIA
First of all, OpenMP is "a model for parallel programming that is portable across architectures from different vendors" (source: OpenMP 5.1 specification). Thus, if you do not want to parallelize the loops, then it is probably not the right tool.
However, please note that OpenMP 5.1 provides the unroll directive to fully or partially unroll the outermost loop of a loop nest. The full clause, which fully unrolls a loop, requires the iteration count (e.g. limit1) to be a compile-time constant (it is not theoretically possible otherwise anyway). The partial clause can be parametrized with an integer to control how many iterations are unrolled. As of now, neither GCC, ICC nor Clang supports these two clauses.
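For illustration, a sketch of the OpenMP 5.1 syntax, keeping in mind the lack of compiler support mentioned above (limit1 and limit2 are the bounds from the question):

#pragma omp unroll full        // requires a compile-time-constant trip count
for (int i = 0; i < 8; i++) { /* ... */ }

#pragma omp unroll partial(4)  // partially unroll the outer loop by a factor of 4
for (int i = 0; i < limit1; i++)
{
    for (int j = 0; j < limit2; j++) { /* ... */ }
}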
There is also the tile directive, which can reorder the iterations of your two nested loops so as to produce fixed-size 2D tiles. It does not tell the compiler that the loop should be unrolled; nevertheless, an aggressively-optimizing compiler will likely unroll the fixed-size tile loops, at least when they do not contain many branches or function calls (since unrolling those is often detrimental to performance). This directive cannot be used in conjunction with unroll (because the generated tile loops do not have a canonical loop nest form).
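A sketch of what that could look like on the question's loop nest (the 4x4 tile size is an arbitrary choice for illustration):

#pragma omp tile sizes(4, 4)
for (int i = 0; i < limit1; i++)
{
    for (int j = 0; j < limit2; j++)
    {
        // The generated inner tile loops have a fixed trip count of 4,
        // which an optimizing compiler can unroll on its own.
    }
}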
Note that collapsing the two nested loops (manually, or automatically using a tool) will almost certainly not help the compiler if the iteration counts are not compile-time constants, and typically not if limit2 is not a power of two. Indeed, in such a case, recovering the indices requires either a slow modulus by a variable or a rather slow branchless conditional move. Unrolling such a loop would be very hard for the compiler (if possible at all).
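To make the cost concrete, here is what a manual collapse looks like when the bounds are runtime variables; note the per-iteration division and modulus needed to recover i and j:

for (long k = 0; k < (long)limit1 * limit2; k++)
{
    int i = (int)(k / limit2);  // division by a runtime variable: slow
    int j = (int)(k % limit2);  // modulus by a runtime variable: slow
    /* Code with custom function calls, branching, etc. */
}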
Let us consider the following code snippet in C++ to print the first 10 positive integers:
for (int i = 1; i < 11; i++)
{
    cout << i;
}
Will this be faster or slower than sequentially printing each integer one by one, as follows:
x = 1;
cout << x;
x++;
cout << x;
And so on...
Is there any reason why it should be faster or slower? Does it vary from one language to another?
This question is similar to this one; I've copied an excerpt of my answer to that question below... (the numbers are different; 11 vs. 50; the analysis is the same)
What you're considering doing is a manual form of loop unrolling. Loop unrolling is an optimization that compilers sometimes use for reducing the overhead involved in a loop. Compilers can do it only if the number of iterations of the loop can be known at compile time (i.e. the number of iterations is a constant, even if the constant involves computation based on other constants). In some cases, the compiler may determine that it is worthwhile to unroll the loop, but often it won't unroll it completely. For instance, in your example, the compiler may determine that it would be a speed advantage to unroll the loop from 50 iterations out to only 10 iterations with 5 copies of the loop body. The loop variable would still be there, but instead of doing 50 comparisons of the loop counter, now the code only has to do the comparison 10 times. It's a tradeoff, because the 5 copies of the loop body eat up 5 times as much space in the cache, which means that loading those extra copies of the same instructions forces the cache to evict (throw out) that many instructions that are already in the cache and which you might have wanted to stay in the cache. Also, loading those 4 extra copies of the loop body instructions from main memory takes much, much longer than simply grabbing the already-loaded instructions from the cache in the case where the loop isn't unrolled at all.
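For illustration, a hand-written sketch of the 5-way partial unroll described in that excerpt: 50 iterations reduced to 10 comparisons of the loop counter, with 5 copies of the body:

for (int i = 1; i <= 50; i += 5) {
    cout << i;
    cout << i + 1;
    cout << i + 2;
    cout << i + 3;
    cout << i + 4;
}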
So all in all, it's often more advantageous to just use only one copy of the loop body and go ahead and leave the loop logic in place. (I.e. don't do any loop unrolling at all.)
In a loop, the actual machine-level instructions are the same, and therefore at the same addresses. With explicit statements, the instructions will have different addresses. So it is possible that, for loops, the CPU's instruction cache will provide a performance boost that might not happen in the latter case.
For a really small range (10) the difference will most likely be negligible. For a loop of significant length it could show up more clearly.
I'm new to OpenMP. When I parallelize a for loop using
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 4; i++) {
    // some parallelizable code
}
Is it guaranteed that every thread takes one and only one value of i? In general, how is the loop work divided among the threads when num_threads does not equal, or does not evenly divide, the total number of iterations of the for loop? Is there a clause I can use to specify that each thread takes only one value of i, or to control the number of values of i each thread takes?
The work division in a loop construct is decided by the schedule. If no schedule clause is present, the def-sched-var schedule is used, which is implementation defined.
You could use schedule(static, 1), which in your case guarantees that each thread will get exactly one value.
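Applied to your loop, a minimal sketch:

#pragma omp parallel for num_threads(4) schedule(static, 1)
for (int i = 0; i < 4; i++) {
    // chunks of size 1 are dealt out round-robin, so with 4 threads
    // and 4 iterations each thread executes exactly one value of i
}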
I highly recommend taking a look at the OpenMP specification, Table 2.5 and Section 2.7.1.1.
There may be legitimate reasons for making this kind of assumption, but in general the correctness of your loop code should not depend on it. Primarily, I would treat this as a performance hint.
Depending on your use case, you may want to consider tasks or plain parallel constructs. If you rely on such details for loops, make sure the behavior is well specified in the standard, and not something that merely works in your particular implementation.
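For instance, a plain parallel construct makes the one-iteration-per-thread mapping explicit instead of relying on the schedule (a sketch, assuming the iteration count equals num_threads):

#include <omp.h>

#pragma omp parallel num_threads(4)
{
    int i = omp_get_thread_num();  // each thread handles exactly one i
    // some parallelizable code
}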
I've been trying to optimize some extremely performance-critical code (a quick sort algorithm that's being called millions and millions of times inside a monte carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while(true) {
    if (myArray[++index1] >= pivot) break;
    if (myArray[++index1] >= pivot) break;
    // More unrolling
}
while(true) {
    if (pivot >= myArray[--index2]) break;
    if (pivot >= myArray[--index2]) break;
    // More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?
Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i = 0; i < n; i++)
{
    sum += data[i];
}
Here the dependency chain of the arguments is very short. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.
On the other hand, this code:
for (int i = 0; i < n - 3; i += 4)  // note the n-3 bound for starting i + 0..3
{
    sum1 += data[i + 0];
    sum2 += data[i + 1];
    sum3 += data[i + 2];
    sum4 += data[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n % 4 != 0, handle the final 0..3 elements with a rolled-up loop or whatever
could run faster. If you get a cache miss or another stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details of an FP dot product on modern x86-64 CPUs with the throughput vs. latency characteristics of their pipelined floating-point SIMD FMA ALUs. Hiding the latency of FP addition or FMA is the major benefit of multiple accumulators, since FP latencies are longer than integer ones while SIMD throughput is often similar.)
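Putting the pieces together, a minimal sketch of the full multiple-accumulator sum, including the rolled-up remainder loop for the final n % 4 elements mentioned in the comment above:

double unrolled_sum(const double* data, int n)
{
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    int i = 0;
    for (; i < n - 3; i += 4) {  // four independent dependency chains
        sum1 += data[i + 0];
        sum2 += data[i + 1];
        sum3 += data[i + 2];
        sum4 += data[i + 3];
    }
    for (; i < n; i++)           // handle the final 0..3 elements
        sum1 += data[i];
    return sum1 + sum2 + sum3 + sum4;
}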
Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i = 0; i < 200; i++) {
    doStuff();
}
write:
for (int i = 0; i < 50; i++) {
    doStuff();
    doStuff();
    doStuff();
    doStuff();
}
Even then it almost certainly won't matter but you are now doing 50 comparisons instead of 200 (imagine the comparison is more complex).
Manual loop unrolling, however, is now largely an artifact of history. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it to whatever is best.
Basically there's increasingly less need to second-guess your compiler.
Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out what optimizations your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.
As far as I understand it, modern compilers already unroll loops where appropriate. An example is gcc: if passed the right optimisation flags, the manual says it will:
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
So, in practice it's likely that your compiler will do the trivial cases for you. It's up to you, therefore, to make sure that as many of your loops as possible make it easy for the compiler to determine how many iterations will be needed.
Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.
Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
Here's an example of how to get maximum performance.
Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!
It can, for instance, allow scalar replacement and efficient insertion of software prefetching. You would actually be surprised how useful it can be: you can easily get a 10% speedup on most loops, even with -O3, by aggressive unrolling.
As was said before, though, it depends a lot on the loop and the compiler, and experimenting is necessary. It's hard to make a rule (otherwise the compiler's heuristic for unrolling would be perfect).
Loop unrolling depends entirely on your problem size: it requires that your algorithm can reduce the work into smaller groups. What you did above does not look like that. I am not sure a Monte Carlo simulation can even be unrolled.
A good scenario for loop unrolling would be rotating an image, since there you can process separate groups of work. To get this to work, you would have to reduce the number of iterations.
Loop unrolling is still useful if there are a lot of local variables both in and around the loop: it lets you reuse those registers instead of dedicating one to the loop index.
In your example, you use a small number of local variables, so you are not overusing the registers.
Comparisons (against the loop end) are also a major drawback if the comparison is heavy (i.e. not a simple test instruction), especially if it depends on an external function.
Loop unrolling also helps the CPU with branch prediction, but branches occur anyway.
Does the C++ compiler take care of cases like the following, where buildings is a vector:
for (int i = 0; i < buildings.size(); i++) {}
that is, does it notice whether buildings is modified in the loop, and, based on that, avoid evaluating size() each iteration? Or should I do this myself? It's not that pretty, but:
int n = buildings.size();
for (int i = 0; i < n; i++) {}
buildings.size() will likely be inlined by the compiler into a direct access to the private size field of the vector<T> class, so you shouldn't separate out the call to size(). This kind of micro-optimization is something you don't want to worry about anyway (unless you're in some really tight loop identified as a bottleneck by profiling).
Don't decide whether to go for one or the other by thinking in terms of performance; your compiler may or may not inline the call - and std::vector::size() has constant complexity, too.
What you should really consider is correctness, because the two versions will behave very differently if you add or remove elements while iterating.
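A small sketch of the difference (hypothetical values, for illustration only): the re-evaluated version visits elements appended during iteration, a cached bound does not.

std::vector<int> buildings = {1, 2, 3};
// size() re-evaluated each iteration: the loop also visits elements
// appended while iterating.
for (std::size_t i = 0; i < buildings.size(); i++) {
    if (buildings[i] % 2 != 0)
        buildings.push_back(buildings[i] + 1);  // grows the vector
}
// With int n = buildings.size(); cached up front, only the original
// three elements would be visited.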
If you don't modify the vector in any way in the loop, stick with the former version to avoid a little bit of state (the n variable).
If the compiler can determine that buildings isn't mutated within the loop (for example, if it's a simple loop with no function calls that could have side effects), it will probably optimize the computation away. But computing the size of a vector is a single subtraction anyway, which should be pretty cheap as well.
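For context, a simplified sketch (hypothetical, not actual standard-library source) of why size() is just a subtraction in typical pointer-based vector implementations:

#include <cstddef>

template <typename T>
struct simple_vector {  // illustration only
    T* first;  // start of the storage
    T* last;   // one past the last element
    std::size_t size() const { return std::size_t(last - first); }
};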
Write the code in the obvious way (size inside the loop) and only if profiling shows you that it's too slow should you consider an alternative mechanism.
I write loops like this:
for (int i = 0, maxI = buildings.size(); i < maxI; ++i)
This takes care of many issues at once: it suggests the bound is fixed up front, there's no more worrying about lost performance, and the types are consolidated. Keeping the evaluation in the middle expression would suggest the loop changes the collection's size.
Too bad the language does not allow a sensible use of const here, or it would be const maxI.
OTOH, for more and more cases I'd rather use some algorithm; a lambda even allows making it look almost like traditional code.
Assuming the size() function is an inline function of the class template, one can also assume that it has very little overhead. That is far different from, say, strlen() in C, which can have major overhead.
It is possible that int n = buildings.size(); is still faster, because the compiler can see that n does not change inside the loop and so can load it into a register instead of indirectly fetching the vector size. But the gain is very marginal, and only really tight, highly optimized loops would need this treatment (and only after analyzing and finding that it's a benefit), since things don't ALWAYS work as well as you expect in this regard.
Only start to manually optimize stuff like that if it's really a performance problem; then measure the difference. Otherwise you'll get lots of unmaintainable, ugly code that's harder to debug and less productive to work with. Most leading compilers will probably optimize it away if the size doesn't change within the loop.
But even if it's not optimized away, the call will probably be inlined (since templates are inlined by default) and cost almost nothing.
I realize that reduction is only usable for POD types in C++. What would you do to implement a reduction for a complex type accumulator?
complex<double> x(0.0, 0.0), y(1.0, 1.0);
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < 5; i++)
{
    x += y;
}
(noting that I may have left some syntax out). It seems an obvious solution would be to split the real and imaginary components into temporary doubles, then accumulate on those. I guess I'm looking for elegance, and that seems... less than pretty. Would that be the typical approach here?
The typical workaround in the absence of user-defined reductions in OpenMP is even uglier than what you suggested. Usually, prior to the parallel region, people create an array of (at least) as many elements as there will be threads in the region, accumulate partial results separately for each thread using omp_get_thread_num() as an index into the array, and do a final reduction of the accumulated results in a loop after the parallel region.
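A minimal sketch of that workaround applied to your loop (using omp_get_max_threads() to size the array; note that adjacent array elements may suffer from false sharing):

#include <complex>
#include <vector>
#include <omp.h>

std::complex<double> x(0.0, 0.0), y(1.0, 1.0);
std::vector<std::complex<double> > partial(omp_get_max_threads());

#pragma omp parallel for
for (int i = 0; i < 5; i++)
{
    partial[omp_get_thread_num()] += y;  // per-thread accumulation
}

for (std::size_t t = 0; t < partial.size(); t++)
    x += partial[t];                     // serial final reduction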
As far as I know, the OpenMP language committee is working on adding user-defined reductions to the specification, so perhaps it will finally be resolved in a few years.
Sorry, OpenMP simply doesn't support that at this time. Unfortunately, you need to do the parallel reduction in the ugly way you already described.
However, if such a parallel reduction is really frequent, I'd write a construct similar to parallel_reduce in TBB. Implementing such a construct is fairly straightforward. Cilk Plus has a more powerful reducer object, but I didn't check whether it supports non-POD types.
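For reference, a sketch of the same accumulation written with TBB's parallel_reduce (lambda form; assumes TBB is available):

#include <complex>
#include <functional>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

std::complex<double> y(1.0, 1.0);
std::complex<double> x = tbb::parallel_reduce(
    tbb::blocked_range<int>(0, 5),            // the iteration space
    std::complex<double>(0.0, 0.0),           // identity value
    [&](const tbb::blocked_range<int>& r, std::complex<double> local) {
        for (int i = r.begin(); i != r.end(); ++i)
            local += y;                       // same body as the question
        return local;
    },
    std::plus<std::complex<double> >());      // combine partial results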
FYI, this kind of restriction can also be found in the threadprivate pragma. I've tested with VC++ 2008/2010 and Intel compilers (icc). VC++ can't support threadprivate with a struct/class that has a constructor or destructor (or a scalar variable that requires a function call to be initialized); it throws error C3057, "dynamic initialization of 'threadprivate' symbols". You may read this MSDN link as well. However, icc is okay with the C3057 case. As you can see, at least two major implementations differ here.
I guess that supporting parallel reduction on non-POD types would run into a similar problem. In order to support parallel reduction, each parallel section needs to allocate a thread-local copy of the reduction variable. So, if a given reduction variable is non-POD, the runtime may need to call a user-defined constructor, which raises the same problem I mentioned in the case of C3057.