Should I use loops for small iterations? [duplicate] - c++

I've been trying to optimize some extremely performance-critical code (a quicksort that's being called millions and millions of times inside a Monte Carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while (true) {
    if (myArray[++index1] >= pivot) break;
    if (myArray[++index1] >= pivot) break;
    // More unrolling
}
while (true) {
    if (pivot >= myArray[--index2]) break;
    if (pivot >= myArray[--index2]) break;
    // More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?

Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i = 0; i < n; i++)
{
    sum += data[i];
}
Here the loop carries a single, very short dependency chain through sum. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.
On the other hand this code:
for (int i = 0; i < n - 3; i += 4) // note the n-3 bound for starting i + 0..3
{
    sum1 += data[i + 0];
    sum2 += data[i + 1];
    sum3 += data[i + 2];
    sum4 += data[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n % 4 != 0, handle the final 0..3 elements with a rolled-up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details of an FP dot product on modern x86-64 CPUs, where the throughput-vs.-latency characteristics of pipelined floating-point SIMD FMA ALUs matter. Hiding the latency of FP addition or FMA is the major benefit of multiple accumulators, since FP latencies are longer than integer latencies while SIMD throughput is often similar.)
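Putting those pieces together, here is a minimal self-contained sketch of the technique (my illustrative names, including the rolled-up cleanup loop for the leftover n % 4 elements mentioned in the comment above):

#include <cstddef>

// Four independent accumulators give the out-of-order core four
// dependency chains it can overlap; a short rolled-up loop handles
// the final n % 4 elements.
int sum_unrolled(const int* data, std::size_t n)
{
    int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        sum1 += data[i + 0];
        sum2 += data[i + 1];
        sum3 += data[i + 2];
        sum4 += data[i + 3];
    }
    int sum = sum1 + sum2 + sum3 + sum4;
    for (; i < n; ++i)  // cleanup loop
        sum += data[i];
    return sum;
}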

Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i = 0; i < 200; i++) {
    doStuff();
}
write:
for (int i = 0; i < 50; i++) {
    doStuff();
    doStuff();
    doStuff();
    doStuff();
}
Even then it almost certainly won't matter, but now you are doing 50 comparisons instead of 200 (imagine if the comparison were more complex).
In general, though, manual loop unrolling is largely an artifact of history. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it for you to whatever is best.
Basically, there's increasingly less need to second-guess your compiler.

Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out how much optimization your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.

As far as I understand it, modern compilers already unroll loops where appropriate. An example is gcc: passed the right optimisation flags (-funroll-loops), the manual says it will:
"Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop."
So, in practice it's likely that your compiler will do the trivial cases for you. It's up to you, therefore, to make sure that as many of your loops as possible have an iteration count the compiler can easily determine.
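For example, a hedged sketch (kN and sum_fixed are illustrative names): the first loop's trip count is a compile-time constant, so the compiler is free to unroll it; a loop whose exit depends on the data, like the questioner's pivot scans, gives it nothing to work with.

constexpr int kN = 64;  // trip count known at compile time

float sum_fixed(const float* data)
{
    float sum = 0.0f;
    for (int i = 0; i < kN; ++i)  // eligible for compiler unrolling
        sum += data[i];
    return sum;
}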

Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.

Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
Here's an example of how to get maximum performance.

Loop unrolling can be helpful in specific cases, and skipping some tests isn't the only gain!
It can, for instance, enable scalar replacement or the efficient insertion of software prefetching. You would actually be surprised how useful aggressive unrolling can be (you can easily get a 10% speedup on most loops, even with -O3).
As was said before, though, it depends a lot on the loop and on the compiler, and experimenting is necessary. It's hard to make a rule (otherwise the compiler's unrolling heuristic would be perfect).
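To illustrate the scalar-replacement point, a hedged sketch (smooth2 is an illustrative name): in the rolled version of this loop, a[i+1] is loaded twice, once as a[i+1] and once as the next iteration's a[i]; unrolling by two lets the shared value stay in a register.

void smooth2(float* out, const float* a, int n)
{
    int i = 0;
    for (; i + 2 < n; i += 2)
    {
        float x0 = a[i];
        float x1 = a[i + 1];      // loaded once...
        float x2 = a[i + 2];
        out[i]     = x0 + x1;     // ...used here
        out[i + 1] = x1 + x2;     // ...and reused here without a reload
    }
    for (; i + 1 < n; ++i)        // remainder
        out[i] = a[i] + a[i + 1];
}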

Loop unrolling depends entirely on your problem size, and on your algorithm being able to reduce the size into smaller groups of work. What you did above does not look like that. I am not sure a Monte Carlo simulation can even be unrolled.
A good scenario for loop unrolling would be rotating an image, since there you can process separate groups of work. To get this to work, you would have to reduce the number of iterations.

Loop unrolling is still useful if there are a lot of local variables in and around the loop: more registers can be reused for them, instead of one being reserved for the loop index.
In your example, you use a small number of local variables, so the registers are not over-subscribed.
The comparison (against the loop end) is also a major drawback if the comparison is heavy (i.e. not a simple test instruction), especially if it depends on an external function.
Loop unrolling also reduces the number of branches the CPU's branch predictor has to handle, although those branches are usually predicted well anyway.

Related

Using one loop vs two loops

I was reading this blog: https://developerinsider.co/why-is-one-loop-so-much-slower-than-two-loops/ and decided to check it out using C++ and Xcode. So I wrote the simple program given below, and when I executed it I was surprised by the result: the second function was actually slower than the first, contrary to what is stated in the article. Can anyone please help me figure out why this is the case?
#include <iostream>
#include <vector>
#include <chrono>
using namespace std::chrono;

void function1() {
    const int n = 100000;
    int a1[n], b1[n], c1[n], d1[n];
    for (int j = 0; j < n; j++) {
        a1[j] = 0;
        b1[j] = 0;
        c1[j] = 0;
        d1[j] = 0;
    }
    auto start = high_resolution_clock::now();
    for (int j = 0; j < n; j++) {
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << duration.count() << " Microseconds." << std::endl;
}

void function2() {
    const int n = 100000;
    int a1[n], b1[n], c1[n], d1[n];
    for (int j = 0; j < n; j++) {
        a1[j] = 0;
        b1[j] = 0;
        c1[j] = 0;
        d1[j] = 0;
    }
    auto start = high_resolution_clock::now();
    for (int j = 0; j < n; j++) {
        a1[j] += b1[j];
    }
    for (int j = 0; j < n; j++) {
        c1[j] += d1[j];
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << duration.count() << " Microseconds." << std::endl;
}

int main(int argc, const char* argv[]) {
    function1();
    function2();
    return 0;
}
TL;DR: The loops are basically the same, and if you are seeing differences, then your measurement is wrong. Performance measurement and, more importantly, reasoning about performance require a lot of computer knowledge, some scientific rigor, and much engineering acumen. Now for the long version...
Unfortunately, there is some very inaccurate information in the article to which you've linked, as well as in the answers and some comments here.
Let's start with the article. There won't be any disk caching that has any effect on the performance of these functions. It is true that virtual memory is paged to disk when demand on physical memory exceeds what's available, but that's not a factor you have to consider for a program that touches 1.6MB of memory (4 arrays * 4 bytes * 100K elements).
And if paging comes into play, the performance difference won't exactly be subtle either. If these arrays were paged to disk and back, the performance difference would be on the order of 1000x for the fastest disks, not 10% or 100%.
Paging and page faults and their effect on performance are neither trivial nor intuitive. You need to read about them, and experiment with them seriously. What little information that article has is completely inaccurate, to the point of being misleading.
The second issue is your profiling strategy and the micro-benchmark itself. Clearly, with such simple operations on the data (an add), the bottleneck will be memory bandwidth itself (or maybe instruction-retire limits or something like that, with such a simple loop). And since you only read memory linearly and use everything you read, whether it's in 4 interleaved streams or 2, you are making use of all the bandwidth that is available.
However, if you call your function1 or function2 in a loop, you will be measuring the bandwidth of different parts of the memory hierarchy depending on N, from L1 all the way to L3 and main memory. (You should know the size of all levels of cache on your machine, and how they work.) This is obvious if you know how CPU caches work, and really mystifying otherwise. Do you want to know how fast this is when you do it the first time, when the arrays are cold, or do you want to measure the hot access?
Is your real use case copying the same mid-sized array over and over again?
If not, what is it? What are you benchmarking? Are you trying to measure something or just experimenting?
Shouldn't you be measuring the fastest run through a loop, rather than the average, since the average can be massively affected by a (basically random) context switch or an interrupt?
Have you made sure you are using the correct compiler switches? Have you looked at the generated assembly code to make sure the compiler is not adding debug checks and whatnot, and is not optimizing away things it shouldn't (after all, you are just executing useless loops, and an optimizing compiler wants nothing more than to avoid generating code that is not needed)?
Have you looked at the theoretical memory/cache bandwidth numbers for your hardware? Your specific CPU and RAM combination will have theoretical limits. And be it 5, 50, or 500 GiB/s, they will give you an upper bound on how much data you can move around and work with. The same goes for the number of execution units, the IPC of your CPU, and a few dozen other numbers that will affect the performance of this kind of micro-benchmark.
If you are reading 4 integers (4 bytes each, from a, b, c, and d) and then doing two adds and writing the two results back, and doing it 100'000 times, then you are - roughly - looking at 2.4MB of memory read and write. If you do it 10 times in 300 micro-seconds, then your program's memory (well, store buffer/L1) throughput is about 80 GB/s. Is that low? Is that high? Do you know? (You should have a rough idea.)
And let me tell you that the other two answers here at the time of this writing (namely this and this) do not make sense. I can't make heads or tails of the first one, and the second one is almost completely wrong (conditional branches in a 100'000-iteration for loop are bad? allocating an additional iterator variable is costly? cold access to an array on the stack vs. on the heap has "serious performance implications"?).
And finally, as written, the two functions have very similar performance. It is really hard to separate the two, and unless you can measure a real difference in a real use case, I'd say write whichever one makes you happier.
If you really, really want a theoretical difference between them, I'd say the one with two separate loops is very slightly better, because it is usually not a good idea to interleave access to unrelated data.
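Following the measurement advice above, a minimal sketch of a more defensible timing harness (min_time_us is an illustrative name; something like Google Benchmark with its DoNotOptimize helper is the more robust tool): take the minimum over many repetitions rather than timing a single run.

#include <chrono>
#include <climits>
#include <algorithm>

// Time `kernel` `reps` times and keep the fastest run, which is far
// less noisy than a single run or the average under context switches
// and interrupts. The caller must make sure the kernel's result is
// observed, or the optimizer may delete the work entirely.
template <typename F>
long long min_time_us(F&& kernel, int reps = 100)
{
    using namespace std::chrono;
    long long best = LLONG_MAX;
    for (int r = 0; r < reps; ++r)
    {
        auto t0 = high_resolution_clock::now();
        kernel();
        auto t1 = high_resolution_clock::now();
        best = std::min<long long>(best, duration_cast<microseconds>(t1 - t0).count());
    }
    return best;
}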
This has nothing to do with caching or instruction efficiency. Simple iterations over long vectors are purely a matter of bandwidth. (Google: stream benchmark.) And modern CPUs have enough bandwidth to satisfy, if not all of their cores, then at least a good deal of them.
So if you combine the two loops, executing them on a single core, there is probably enough bandwidth for all the loads and stores at the rate that memory can sustain. But if you use two separate loops, you leave bandwidth unused, and the runtime will be a little less than double.
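For reference, this is the kind of loop that is purely bandwidth-bound, in the spirit of the STREAM benchmark mentioned above (a minimal sketch, not the actual benchmark):

#include <cstddef>

// Two streams read, one stream written, one add per element: the ALU
// is nearly idle, so the sustained rate is set by the memory system.
void stream_add(double* c, const double* a, const double* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}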
The reason why the second is faster in your case (I do not think this holds on every machine) is better CPU caching: as long as your CPU has enough cache to store the arrays, the stuff your OS requires, and so on, the split loops can win. Beyond that point, the second function will probably be much slower than the first from a performance standpoint. I also doubt that the two-loop code will give better performance if enough other programs are running as well, because the second function is obviously less efficient than the first, and if there is enough other stuff cached, its performance lead through caching will be eliminated.
I'll just chime in here with a little something to keep in mind when looking into performance: unless you are writing embedded software for a real-time device, the performance of low-level code like this should not be a concern.
In 99.9% of all other cases, it will be fast enough.

Is looping faster than traversing one by one

Let us consider the following code snippet in C++ to print the first 10 positive integers:
for (int i = 1; i < 11; i++)
{
    cout << i;
}
Will this be faster or slower than sequentially printing each integer one by one as follows:
x =1;
cout<< x;
x++;
cout<< x;
And so on ..
Is there any reason as to why it should be faster or slower ? Does it vary from one language to another ?
This question is similar to this one; I've copied an excerpt of my answer to that question below... (the numbers are different; 11 vs. 50; the analysis is the same)
What you're considering doing is a manual form of loop unrolling. Loop unrolling is an optimization that compilers sometimes use to reduce the overhead involved in a loop. Compilers can do it only if the number of iterations of the loop can be known at compile time (i.e. the number of iterations is a constant, even if the constant involves computation based on other constants).
In some cases, the compiler may determine that it is worthwhile to unroll the loop, but often it won't unroll it completely. For instance, in your example, the compiler may determine that it would be a speed advantage to unroll the loop from 50 iterations down to only 10 iterations with 5 copies of the loop body. The loop variable would still be there, but instead of doing 50 comparisons of the loop counter, now the code only has to do the comparison 10 times.
It's a tradeoff, because the 5 copies of the loop body eat up 5 times as much space in the cache, which means that loading those extra copies of the same instructions forces the cache to evict (throw out) that many instructions that are already in the cache and which you might have wanted to stay in the cache. Also, loading those 4 extra copies of the loop body instructions from main memory takes much, much longer than simply grabbing the already-loaded instructions from the cache in the case where the loop isn't unrolled at all.
So all in all, it's often more advantageous to just use only one copy of the loop body and go ahead and leave the loop logic in place. (I.e. don't do any loop unrolling at all.)
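Concretely, the partial unrolling described above would look something like this (a sketch; body is a hypothetical stand-in for the loop body):

// 10 comparisons of the loop counter instead of 50; the cost is that
// the body is replicated 5 times in the instruction stream.
for (int i = 0; i < 50; i += 5)
{
    body(i);
    body(i + 1);
    body(i + 2);
    body(i + 3);
    body(i + 4);
}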
In a loop, the actual machine-level instructions are the same, and therefore at the same addresses. With explicit statements, the instructions all have different addresses. So it is possible that for the loop, the CPU's instruction cache provides a performance boost that might not happen in the latter case.
For a really small range (10) the difference will most likely be negligible. For a significant loop length it could show up more clearly.

Speeding up gather

I have a computation that produces a coefficient vector and returns the dot product of this vector with a data vector taken from a large array. To speed things up, I do this for eight vectors at a time using AVX2 SIMD intrinsics. The problem is that the bulk of the time ends up being consumed by the gather operation getting the data for the dot product.
I tried different ways of implementing the gather, and the intrinsic seems to work best. I would greatly appreciate some advice on speeding this up.
Here is a sketch:
__m256 Compute(__m256 input)
{
    __m256 coefficients[56] = ComputeCoefficients(input);
    __m256i indices = ComputeIndices(input);
    __m256 sum = _mm256_setzero_ps();
    for (size_t i = 0; i != 56; ++i)
    {
        // the expensive part: gather 8 floats from bigArray
        __m256 data = _mm256_i32gather_ps(bigArray + i, indices, sizeof(float));
        sum = _mm256_fmadd_ps(coefficients[i], data, sum);
    }
    return sum;
}
I would first make sure that you are using the most recent Intel processor possible. Intel has invested a lot of engineering in improving the gather instruction.
This being said, it is not magical. If there are cache misses, you will pay a price for them.
I would try to write the same code without SIMD instructions. Is it about the same speed? If it is, then chances are good that you are limited by memory access. Vectorization is good for solving computational limitations, and for loading and storing data in vector-sized units, but even in principle it cannot be expected to help much with problems bound by random access and cache issues. (A scalar reference sketch follows at the end of this answer.)
Your code repeatedly calls VPGATHERDPS. According to Agner Fog, this instruction has a latency of 12 cycles and a throughput of one instruction every 4 cycles. The latency is, of course, a best-case figure; cache misses will increase it.
You should benchmark your code and check that you are close to 4 cycles per loop iteration: 56 iterations at 4 cycles each is roughly 224 cycles, so the main loop should complete in about 300 cycles, and that's quite fast all things considered.
You do not tell us a lot about your problem but we can guess that it is much slower than 300 cycles. If so, then you are probably having cache issues. If your table is large and you are accessing it randomly, then it is a hard problem. If you need better performance, you may need to reengineer the problem.
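As a concrete version of the "try it without SIMD" suggestion above, a hedged sketch of a scalar reference for a single lane (assuming the same hypothetical bigArray and index layout as the question's sketch, with coefficients holding that lane's 56 coefficients): if the AVX2 version is not close to eight times faster than this, the gathers, i.e. memory access, are the bottleneck rather than the arithmetic.

#include <cstddef>

// Scalar dot product for one lane: bigArray[i + index] mirrors what
// _mm256_i32gather_ps(bigArray + i, indices, sizeof(float)) loads for
// a single element of `indices`.
float ComputeScalarLane(const float* bigArray, int index,
                        const float coefficients[56])
{
    float sum = 0.0f;
    for (std::size_t i = 0; i != 56; ++i)
        sum += coefficients[i] * bigArray[i + index];
    return sum;
}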

Accelerate programme with multiple processors

I found that sometimes it's faster to divide one loop into two or more
for (i = 0; i < AMT; i++) {
    a[i] += c[i];
    b[i] += d[i];
}

        ||
        \/

for (i = 0; i < AMT; i++) {
    //a[i] += c[i];
    b[i] += d[i];
}
for (i = 0; i < AMT; i++) {
    a[i] += c[i];
    //b[i] += d[i];
}
On my desktop (Windows 7, AMD Phenom X6 1055T), the two-loop version runs in around 1/3 less time.
But if I am dealing with assignment,
for (i = 0; i < AMT; i++) {
    b[i] = rand() % 100;
    c[i] = rand() % 100;
}
dividing the assignments of b and c into two loops is no faster than doing them in one loop.
I think that there are some rules the OS uses to determine whether certain code can be run on multiple processors.
I want to ask if my guess is right, and if it is, what are the rules or occasions under which multiple processors will be used automatically (without thread programming) to speed up my programmes?
It is possible that your compiler is vectorizing the simpler loops. In the assembler output you would see this as the compiled program using SIMD instructions (like Intel's SSE) to process larger chunks of data than one number at a time. Automatic vectorization is a hard problem, and it's plausible that the compiler would not be able to vectorize the loop that updates both a and b at the same time. This could partially explain why breaking the complex loop into two would be faster.
In the "assignment" loops, each invocation of rand() depends on the output of the previous invocation, which means that vectorization is inherently impossible. Breaking the loop into two would not make it benefit from SIMD instructions as in the first case, so you wouldn't see it run any faster. Looking at the assembler code the compiler generates would tell you what optimizations the compiler performed and what instructions it used.
Even if the compiler is vectorizing the loop, the program is not using more than one CPU or thread; there is no concurrency. What happens is that the one CPU there is can run the single thread of execution on multiple data points in parallel. The distinction between parallel and concurrent programming is subtle but important.
Cache locality might also explain why breaking the first loop into two makes it run faster, but not why breaking the "assignment" loop into two doesn't. It is possible that b and c in the "assignment" loop are sufficiently small that they fit into the cache, which would mean that the loop already has optimal performance and breaking it further brings no benefit. If this were the case, making b and c larger would force the loop to start thrashing the cache, and breaking the loop into two would have the expected benefit.
The optimization is done by the compiler (http://en.wikipedia.org/wiki/Loop_optimization).
If you are using GCC, check this page for the list of available optimization options: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.
On another note, you are using the rand() function, which consumes a lot of CPU time.
I want to ask if my guess is right, and if I'm right, what are such rules or occasions that multiple processors will be automatically (without thread programming) used to speed up my programmes?
No, the guess is not right. In all three cases the code is run on a single core.
It is for some other reason that splitting the first loop into two makes it faster. Perhaps your compiler is able to generate better code, or the CPU has an easier time prefetching the right data, etc. It is hard to tell without analysing the generated machine code.

Comparing forward and reverse loop for int with one limit as 0

Consider this example with a for loop:
for(int i = 0; i <= NUM; i++); // forward
for(int i = NUM; i >= 0; i--); // reverse
I tested these loops with gcc (linux-64). Without any optimization flags, the forward loop was faster, and with optimization at O3/O4, the reverse loop was faster.
Somewhere I heard that, due to better cache replacement techniques, the forward loop is faster.
Personally, I think the reverse loop should be faster (whether NUM is a constant or a variable), because any microprocessor will have a single instruction for the comparison with 0 in i >= 0 (i.e. JLZ (jump if less than zero) and its equivalents).
Is there any deterministic answer to this ?
No, there is absolutely no deterministic answer for this. You're looking at two different levels of abstraction.
C++ has absolutely nothing to say about what happens under the covers, performance-wise. It specifies a virtual machine which executes C++ code and, while it covers functionality, it does not cover performance of the underlying environment (a).
Which of those is faster will depend on a variety of factors. You may find yourself running on a CPU which makes no distinction between comparing with an arbitrary value and comparing with zero.
You may find an architecture where incrementing a register is ten times faster than decrementing one, bizarre though that may seem.
You may even find a brain-dead architecture that has no decrement, add or subtract instructions at all, and you have to emulate decrement by calling increment 2^n - 1 times (where n is the word size).
Bottom line: you can't presume to know what's going on under the hood unless you want to look at a very specific CPU, compiler, etc.
You should optimise your code for readability first. If you need to process things in an increasing manner, use the first option. If a decreasing manner, use the latter. If either way seems equally natural, then choose the fastest one, discovered by benchmarking or analysis of the underlying architecture and assembler code. But only do this if you have a specific performance problem, otherwise you're wasting effort.
In any case, since you're almost certainly going to be using i for something, it's likely that whatever tiny increase in performance you get by going the fastest way will be more than swamped by the fact that you now have to calculate NUM-i inside the loop (unless, of course, the compiler is smarter than the developer, which, based on what I've seen from gcc, is quite possible).
(a) It does specify certain performance-related things such as the time complexity of some things in the containers library, but not specifically the thing you're asking about, whether forward loops or reverse ones are faster.
The cache replacement techniques only come in effect if there is a conflict. Perhaps NUM isn't big enough for it to have an effect, or perhaps the mapping of virtual to physical memory happens to be favorable for the cache replacement algorithm.
Trying to potentially save a single machine instruction is showing lack of trust for the compiler. If it was that easy, surely the optimizer would know that!
Maybe incrementing a loop variable is so much more common that CPUs' branch prediction works better on it.
With compiler optimization, your loop might simply be unrolled (assuming I am right that your NUM is a #define constant) and therefore faster.
Although it doesn't really answer your question, here's a thought. How about this loop:
int i = NUM + 1;
while ( i --> 0 ) // it looks as if i "goes to" zero (like in calculus)!
{
}
(There is no "-->" operator, of course; it simply parses as (i--) > 0.)