For loop performance difference, and compiler optimization - c++

I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what happens when setting the optimization flags on.
Jerry Coffin's answer explained what happens when setting the optimization flags for this example. What remains unanswered is why superCalculationA runs slower than superCalculationB, when B performs one extra memory reference and one addition for each iteration. Nemo's post shows the assembler output. I confirmed this by compiling with the -S flag on my PC, a 2.9GHz Sandy Bridge (i5-2310) running Ubuntu 12.04 64-bit, as suggested by Matteo Italia.
I was experimenting with for-loops performance when I stumbled upon the following case.
I have the following code that does the same computation in two different ways.
#include <cstdint>
#include <chrono>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init + todo; i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
double elapsed;
std::printf("=====================================================\n");
start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationA(111, 1000000111);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationB(111, 1000000000);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}
Compiling this code with
g++ main.cpp -o output --std=c++11
leads to the following result:
=====================================================
Elapsed time: 2.859 s | 2859.441 ms | 2859440.968 us
Elapsed time: 2.204 s | 2204.059 ms | 2204059.262 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
My first question is: why is the second loop running 23% faster than the first?
On the other hand, if I compile the code with
g++ main.cpp -o output --std=c++11 -O1
The results improve a lot,
=====================================================
Elapsed time: 0.318 s | 317.773 ms | 317773.142 us
Elapsed time: 0.314 s | 314.429 ms | 314429.393 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
and the difference in time almost disappears.
But I could not believe my eyes when I set the -O2 flag,
g++ main.cpp -o output --std=c++11 -O2
and got this:
=====================================================
Elapsed time: 0.000 s | 0.000 ms | 0.328 us
Elapsed time: 0.000 s | 0.000 ms | 0.208 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
So, my second question is: What is the compiler doing when I set -O1 and -O2 flags that leads to this gigantic performance improvement?
I checked Optimize Options - Using the GNU Compiler Collection (GCC), but that did not clarify things.
By the way, I am compiling this code with g++ (GCC) 4.9.1.
EDIT to confirm Basile Starynkevitch's assumption
I edited the code, now main looks like this:
int main(int argc, char **argv)
{
int start = atoi(argv[1]);
int end = atoi(argv[2]);
int delta = end - start + 1;
std::chrono::time_point<std::chrono::high_resolution_clock> t_start, t_end;
double elapsed;
std::printf("=====================================================\n");
t_start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationB(start, delta);
t_end = std::chrono::high_resolution_clock::now();
elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
t_start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationA(start, end);
t_end = std::chrono::high_resolution_clock::now();
elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
std::printf("Results were %s\n", (ret1 == ret2) ? "the same!" : "different!");
std::printf("=====================================================\n");
return 0;
}
These modifications really increased computation time, both for -O1 and -O2. Both are giving me around 620 ms now, which proves that -O2 was really doing some computation at compile time.
I still do not understand what these flags are doing to improve performance, and -Ofast does even better, at about 320ms.
Also notice that I have changed the order in which functions A and B are called to test Jerry Coffin's assumption. Compiling this code with no optimizer flags still gives me around 2.2 secs in B and 2.8 secs in A, so I figure it is not a cache thing. Just to reinforce: I am not talking about optimization in the first case (the one with no flags); I just want to know what makes the second loop run faster than the first.

My immediate guess would be that the second is faster, not because of the changes you made to the loop, but because it's second, so the cache is already primed when it runs.
To test the theory, I re-arranged your code to reverse the order in which the two calculations were called:
#include <cstdint>
#include <chrono>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init + todo; i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
double elapsed;
std::printf("=====================================================\n");
start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationB(111, 1000000000);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationA(111, 1000000111);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}
The result I got was:
=====================================================
Elapsed time: 0.286 s | 286.000 ms | 286000.000 us
Elapsed time: 0.271 s | 271.000 ms | 271000.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
So, when version A runs first, it's slower. When version B runs first, it's slower.
To confirm, I added an extra call to superCalculationB before doing the timing on either version A or B. After that, I tried running the program three times. For those three runs, I'd judge the results a tie (version A was faster once and version B was faster twice, but neither won dependably nor by a wide enough margin to be meaningful).
That doesn't prove that it's actually a cache situation as such, but does give a pretty strong indication that it's a matter of the order in which the functions are called, not the difference in the code itself.
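For reference, a minimal sketch of that warm-up idea (an extra, untimed call in main before the first measured call, with its result discarded):

// Untimed warm-up so the first measured routine does not pay the "cold"
// cost (caches, branch predictors, frequency ramp-up) alone.
(void)superCalculationB(111, 1000000000);

start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationB(111, 1000000000);
end = std::chrono::high_resolution_clock::now();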
As far as what the compiler does to make the code faster: the main thing it does is unroll a few iterations of the loop. We can get pretty much the same effect if we unroll a few iterations by hand:
uint64_t superCalculationC(int init, int end)
{
int f_end = end - ((end - init) & 7);
int i;
uint64_t total = 0;
for (i = init; i < f_end; i += 8) {
total += i;
total += i + 1;
total += i + 2;
total += i + 3;
total += i + 4;
total += i + 5;
total += i + 6;
total += i + 7;
}
for (; i < end; i++)
total += i;
return total;
}
This has a property that some might find rather odd: it's actually faster when compiled with -O2 than with -O3. When compiled with -O2, it's also about five times faster than either of the other two are when compiled with -O3.
The primary reason for the ~5x speed gain compared to the compiler's loop unrolling is that we've unrolled the loop somewhat differently (and more intelligently, IMO) than the compiler does. We compute f_end to tell us how many times the unrolled loop should execute. We execute those iterations, then we execute a separate loop to "clean up" any odd iterations at the end.
The compiler instead generates code that's roughly equivalent to something like this:
for (i = init; i < end; i += 8) {
total += i;
if (i + 1 >= end) break;
total += i + 1;
if (i + 2 >= end) break;
total += i + 2;
// ...
}
Although this is quite a bit faster than when the loop hasn't been unrolled at all, it's quite a bit faster still to eliminate those extra checks from the main loop, and execute a separate loop for any odd iterations.
Given such a trivial loop body being executed such a large number of times, you can also improve speed (when compiled with -O2) still further by unrolling more iterations of the loop. With 16 iterations unrolled, it was about twice as fast as the code above with 8 iterations unrolled:
uint64_t superCalculationC(int init, int end)
{
int first_end = end - ((end - init) & 0xf);
int i;
uint64_t total = 0;
for (i = init; i < first_end; i += 16) {
total += i + 0;
total += i + 1;
total += i + 2;
// code for `i+3` through `i+13` goes here
total += i + 14;
total += i + 15;
}
for (; i < end; i++)
total += i;
return total;
}
I haven't tried to explore the limit of gains from unrolling this particular loop, but unrolling 32 iterations nearly doubles the speed again. Depending on the processor you're using, you might get some small gains by unrolling 64 iterations, but I'd guess we're starting to approach the limits--at some point, performance gains will probably level off, then (if you unroll still more iterations) probably drop off, quite possibly dramatically.
Summary: with -O3 the compiler unrolls a number of iterations of the loop. This is extremely effective in this case, primarily because we have many executions of nearly the most trivial possible loop body. Unrolling the loop by hand is even more effective than letting the compiler do it--we can unroll more intelligently, and we can simply unroll more iterations than the compiler does. The extra intelligence can give us an improvement of around 5:1, and the extra iterations another 4:1 or so¹ (at the expense of somewhat longer, slightly less readable code).
Final caveat: as always with optimization, your mileage may vary. Differences in compilers and/or processors mean you're likely to get at least somewhat different results than I did. I'd expect my hand-unrolled loop to be substantially faster than the other two in most cases, but exactly how much faster is likely to vary.
1. But note that this is comparing the hand-unrolled loop with -O2 to the original loop with -O3. When compiled with -O3, the hand-unrolled loop runs much more slowly.

Checking the assembly output is really the only way to illuminate such things.
Compiler optimisations will do a great deal of things, including things that are not strictly "standard compliant" (although that is not the case with -O1 and -O2, to my knowledge) - for instance, check the -Ofast switch.
I have found this helpful: http://gcc.godbolt.org/, and with your demo code here

-O2
Explaining the -O2 result is easy by looking at the code from godbolt, changed to -O2:
main:
pushq %rbx
movl $.LC2, %edi
call puts
call std::chrono::_V2::system_clock::now()
movq %rax, %rbx
call std::chrono::_V2::system_clock::now()
pxor %xmm0, %xmm0
subq %rbx, %rax
movsd .LC4(%rip), %xmm2
movl $.LC6, %edi
movsd .LC5(%rip), %xmm1
cvtsi2sdq %rax, %xmm0
movl $3, %eax
mulsd .LC3(%rip), %xmm0
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm1
call printf
call std::chrono::_V2::system_clock::now()
movq %rax, %rbx
call std::chrono::_V2::system_clock::now()
pxor %xmm0, %xmm0
subq %rbx, %rax
movsd .LC4(%rip), %xmm2
movl $.LC6, %edi
movsd .LC5(%rip), %xmm1
cvtsi2sdq %rax, %xmm0
movl $3, %eax
mulsd .LC3(%rip), %xmm0
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm1
call printf
movl $.LC7, %edi
call puts
movl $.LC8, %edi
call puts
movl $.LC2, %edi
call puts
xorl %eax, %eax
popq %rbx
ret
There is no call to the two functions, and furthermore there is no comparison of the results.
Now why can that be? It is, of course, the power of optimization; the program is simply too simple...
First, inlining is applied, after which the compiler can see that all the parameters are in fact literal values (111, 1000000111, 1000000000, 500000110500000000) and therefore constants.
It finds out that init + todo is a loop invariant and replaces it with end, defining end before the loop from B as end = init + todo = 111 + 1000000000 = 1000000111.
Both loops are now known to contain only compile-time values. They are furthermore completely identical:
uint64_t total = 0;
for (int i = 111; i < 1000000111; i++)
total += i;
return total;
The compiler sees that it is a summation, that total is the accumulator, and that it is a stride-1 sum, so it performs the ultimate loop unrolling, namely all of it, because it knows this form can be summed directly.
Rewriting with Gauss's pairing trick (add each term to its mirror, then halve):
111 + 1000000110 = 1000000221
112 + 1000000109 = 1000000221
...
1000000109 + 112 = 1000000221
1000000110 + 111 = 1000000221
loops = 1000000111 - 111 = 1E9
Halve it, since each term has been counted twice:
1000000221 * 1E9 / 2 = 500000110500000000
which is exactly the result we are looking for, 500000110500000000.
Now that the compiler has the result as a compile-time constant, it can compare it with the expected answer, see that the comparison always holds, and remove it as well.
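For reference, the same arithmetic written out as a closed form (a sketch of what the constant folding amounts to, not the code GCC literally emits):

#include <cstdint>

// Sum of the integers first..last inclusive via Gauss's pairing trick.
// Fine for this range; for much larger ranges the product could overflow.
std::uint64_t closedFormSum(std::uint64_t first, std::uint64_t last)
{
    std::uint64_t n = last - first + 1;   // 1000000110 - 111 + 1 = 1e9 terms
    return (first + last) * n / 2;        // 1000000221 * 1e9 / 2
}
// closedFormSum(111, 1000000110) == 500000110500000000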
The time noted is the minimum time for system_clock on your PC.
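If you want to see the loops survive -O2 without changing the algorithm, here is one sketch of how to make the arguments opaque to the optimizer inside main (the empty asm trick is GCC/Clang specific; passing the values in via argv, as the question's edit does, achieves the same thing):

int a_init = 111, a_end = 1000000111;
asm volatile("" : "+r"(a_init), "+r"(a_end)); // values are now "unknown" at compile time

start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationA(a_init, a_end);
end = std::chrono::high_resolution_clock::now();

Since ret1 is still compared against answer and printed, the call can no longer be folded away.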
-O0
The -O0 timing is more difficult to explain; most likely it is an artifact of the missing alignment of functions and jumps. Both the µop cache and the loop buffer like 32-byte alignment. You can test that by adding some
asm("nop");
in front of A's loop; 2-3 might do the trick.
Store forwarding also prefers values that are naturally aligned.
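For example, a hypothetical way to test the alignment theory (how many nops actually shift the loop onto a better-aligned address will vary):

uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    asm("nop"); // padding to shift the loop's code address
    asm("nop"); // add or remove nops until the timing changes
    for (int i = init; i < end; i++)
        total += i;
    return total;
}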

EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.
It appears that the performance difference in the -O0 case is due to processor pipelining.
First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:
superCalculationA(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # end
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L7
.L8:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
jl .L8
movq -8(%rbp), %rax
popq %rbp
ret
superCalculationB(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # todo
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L11
.L12:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L11:
movl -20(%rbp), %edx # copy init to register rdx
movl -24(%rbp), %eax # copy todo to register rax
addl %edx, %eax # [rax] += [rdx] (so [rax] = init+todo)
cmpl -12(%rbp), %eax # i < [rax]
jg .L12
movq -8(%rbp), %rax
popq %rbp
ret
In both functions, the stack layout looks like this:
Addr Content
24 end/todo
20 init
16 <empty>
12 i
08 total
04
00 <base pointer>
(Note that total is a 64-bit int and so occupies two 4-byte slots.)
These are the key lines of superCalculationA():
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().
You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison i<end is performed by reading i from a register, while in superCalculationB(), the comparison i<init+todo is performed by reading i from the stack.
To demonstrate that this is just an artifact, let's replace
for (int i = init; i < end; i++)
with
for (int i = init; end > i; i++)
in superCalculationA(). The generated assembly then looks the same, with just the following change to the key lines:
addl $1, -12(%rbp) # i++
.L7:
movl -24(%rbp), %eax # copy end to register rax
cmpl -12(%rbp), %eax # i < [rax]
Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:
=====================================================
Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions in such a way so as to minimize pipeline blocks, and we don't need to worry about whether to write i<end or end>i.

(This is not exactly an answer, but it does include more data, including some that conflicts with Jerry Coffin's.)
The interesting question is why the unoptimized routines perform so differently and counter-intuitively. The -O2 and -O3 cases are relatively simple to explain, and others have done so.
For completeness, here is the assembly (thanks @Rutan Kax) for superCalculationA and superCalculationB produced by GCC 4.9.1:
superCalculationA(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movl %esi, -24(%rbp)
movq $0, -8(%rbp)
movl -20(%rbp), %eax
movl %eax, -12(%rbp)
jmp .L7
.L8:
movl -12(%rbp), %eax
cltq
addq %rax, -8(%rbp)
addl $1, -12(%rbp)
.L7:
movl -12(%rbp), %eax
cmpl -24(%rbp), %eax
jl .L8
movq -8(%rbp), %rax
popq %rbp
ret
superCalculationB(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movl %esi, -24(%rbp)
movq $0, -8(%rbp)
movl -20(%rbp), %eax
movl %eax, -12(%rbp)
jmp .L11
.L12:
movl -12(%rbp), %eax
cltq
addq %rax, -8(%rbp)
addl $1, -12(%rbp)
.L11:
movl -20(%rbp), %edx
movl -24(%rbp), %eax
addl %edx, %eax
cmpl -12(%rbp), %eax
jg .L12
movq -8(%rbp), %rax
popq %rbp
ret
It sure looks to me like B is doing more work.
My test platform is a 2.9GHz Sandy Bridge EP processor (E5-2690) running Red Hat Enterprise 6 Update 3. My compiler is GCC 4.9.1 and produces the assembly above.
To make sure Turbo Boost and related CPU-frequency-diddling technologies are not interfering with the measurement, I ran:
pkill cpuspeed # if you have it running
grep MHz /proc/cpuinfo # to see where you start
modprobe acpi_cpufreq # if you do not have it loaded
cd /sys/devices/system/cpu
for cpuN in cpu[0-9]* ; do
echo userspace > $cpuN/cpufreq/scaling_governor
echo 2000000 > $cpuN/cpufreq/scaling_setspeed
done
grep MHz /proc/cpuinfo # to see if it worked
This pins the CPU frequency to 2.0 GHz and disables Turbo Boost.
Jerry observed these two routines running faster or slower depending on the order in which he executed them. I could not reproduce that result. For me, superCalculationB consistently runs 25-30% faster than superCalculationA, regardless of the Turbo Boost or clock speed settings. That includes running them multiple times in arbitrary order. For example, at 2.0GHz superCalculationA consistently takes a little over 4500ms and superCalculationB consistently takes a little under 3600ms.
I have yet to see any theory that even begins to explain this.

Processors are complicated. Execution time depends on many things, many of which are outside your control. Just a few possibilities:
a. Your computer probably doesn't have a constant clock speed. It could be that the clock speed is usually set rather low to avoid wasting energy / battery life / producing excessive heat. When your program starts running, the OS figures out that power is needed and increases the clock speed. To verify, change the order of the calls - if the second loop executed is always faster than the first one, that may be the reason (see the sketch after this list for one way to control for this).
b. The exact execution speed, especially for a tight loop like yours, depends on how instructions are aligned in memory. Some processors may run a loop faster if it is completely contained within one cache line instead of two, or in two cache lines instead of three. Some compilers will add nop instructions to align loops on cache lines to optimise for this, most don't. Quite possible that one of the loops was aligned better by pure luck and therefore runs faster.
c. The exact execution speed may depend on the exact order in which instructions are dispatched. Slightly different code may run at different speeds due to subtle differences in the code which may be processor dependent, and anyway may be hard for the compiler to consider.
d. There is some evidence that Intel processors may have problems with artificially short loops which may happen only with artificial benchmarks. Your code is quite close to "artificial". There have been cases discussed in other threads where very short loops ran unexpectedly slow, and adding instructions made them run faster.
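A hedged sketch of one way to control for (a) and (b), assuming the two functions from the question are available: time each routine several times, in both orders, and keep the minimum.

#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical helper: run a callable several times and keep the fastest run,
// which filters out one-off effects such as clock-speed ramp-up.
template <typename F>
double best_of(int runs, F f)
{
    double best = 1e300;
    for (int r = 0; r < runs; ++r)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        volatile std::uint64_t sink = f(); // keep the result observable
        (void)sink;
        auto t1 = std::chrono::high_resolution_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// usage:
// double tA = best_of(5, [] { return superCalculationA(111, 1000000111); });
// double tB = best_of(5, [] { return superCalculationB(111, 1000000000); });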

Answer to the first question:
1 - It runs faster after a for loop has already been executed once, but I am not sure; I am only commenting based on my experiment results. (Experiment 1: swap their names (B->A, A->B). Experiment 2: run one function containing a for loop before the time checks. Experiment 3: start one for loop before the time checks.)
2 - The first function should be faster, because the second function does two operations per iteration where the first function does one.
I leave the updated code below, which illustrates my answer.
Answer to the second question:
I am not sure, but two possibilities come to mind:
The compiler may be able to rewrite your functions in some closed form and get rid of the loops entirely, which would remove the difference (something like "return end-init" or "return todo"; I don't know, I'm not sure).
-O1 enables -fauto-inc-dec, and that could make the difference, because these functions are all about increments and decrements.
I hope this helps.
#include <cstdint>
#include <ctime>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init+todo; i++)
total += i;
return total;
}
int add(int a1,int a2){printf("multiple times added\n");return a1+a2;}
uint64_t superCalculationC(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < add(init , todo); i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::clock_t start=clock();
double elapsed;
std::printf("=====================================================\n");
superCalculationA(111, 1000000111);
start = clock();
uint64_t ret1 = superCalculationA(111, 1000000111);
elapsed = ((std::clock()-start)*1.0/CLOCKS_PER_SEC);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = clock();
uint64_t ret2 = superCalculationB(111, 1000000000);
elapsed = ((std::clock()-start)*1.0/CLOCKS_PER_SEC);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}

Related

Is it faster to iterate through the elements of an array with pointers incremented by 1? [duplicate]

This question already has answers here:
Efficiency: arrays vs pointers
(14 answers)
Closed 7 years ago.
Is it faster to do something like
for ( int * pa(arr), * pb(arr+n); pa != pb; ++pa )
{
// do something with *pa
}
than
for ( size_t k = 0; k < n; ++k )
{
// do something with arr[k]
}
???
I understand that arr[k] is equivalent to *(arr+k), but in the first method you are using the current pointer which has incremented by 1, while in the second case you are using a pointer which is incremented from arr by successively larger numbers. Maybe hardware has special ways of incrementing by 1 and so the first method is faster? Or not? Just curious. Hope my question makes sense.
If the compiler is smart enough (and most compilers are), then the performance of both loops should be roughly equal.
For example, I compiled the following code with gcc 5.1.0 and generated the assembly:
int __attribute__ ((noinline)) compute1(int* arr, int n)
{
int sum = 0;
for(int i = 0; i < n; ++i)
{
sum += arr[i];
}
return sum;
}
int __attribute__ ((noinline)) compute2(int* arr, int n)
{
int sum = 0;
for(int * pa(arr), * pb(arr+n); pa != pb; ++pa)
{
sum += *pa;
}
return sum;
}
And the result assembly is:
compute1(int*, int):
testl %esi, %esi
jle .L4
leal -1(%rsi), %eax
leaq 4(%rdi,%rax,4), %rdx
xorl %eax, %eax
.L3:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdx, %rdi
jne .L3
rep ret
.L4:
xorl %eax, %eax
ret
compute2(int*, int):
movslq %esi, %rsi
xorl %eax, %eax
leaq (%rdi,%rsi,4), %rdx
cmpq %rdx, %rdi
je .L10
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
.L10:
rep ret
main:
xorl %eax, %eax
ret
As you can see, the heaviest part (the loop) of both functions is identical:
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
But in more complex examples or with other compilers the results might be different, so you should test and measure; still, most compilers generate similar code.
The full code sample: https://goo.gl/mpqSS0
This cannot be answered. It depends on your compiler AND on your machine.
A very naive compiler would translate the code as is to machine code. Most machines indeed provide an increment operation that is very fast. They normally also provide relative addressing for an address with an offset. This could take a few cycles more than absolute addressing. So, yes, the version with pointers could potentially be faster.
But take into account that every machine is different AND that compilers are allowed to optimize as long as the observable behavior of your program doesn't change. Given that, I would suggest a reasonable compiler will create code from both versions that doesn't differ in performance.
Any reasonable compiler will generate code that is identical inside the loop for these two choices. I looked at the code generated for iterating over a std::vector, using a for loop with an integer index or a for (auto i : vec) style construct [std::vector internally holds two pointers for the begin and end of the stored values, much like your pa and pb]. Both gcc and clang generate identical code inside the loop itself [the exact details of the loop are subtly different between the compilers, but other than that there is no difference]. The setup of the loop was subtly different, but unless you OFTEN run loops of fewer than 5 items [and if so, why do you worry?], the actual content of the loop is what matters, not the bit just before the loop.
As with ALL code where performance is important, the exact code, compiler make and version, compiler options, processor make and model, will make a difference to how the code performs. But for the vast majority of processors and compilers, I'd expect no measurable difference. If the code is really critical, measure different alternatives and see what works best in your case.

Does redeclaring variables in C++ cost anything?

For readability, I think the first code block below is better. But is the second code block faster?
First Block:
for (int i = 0; i < 5000; i++){
int number = rand() % 10000 + 1;
string fizzBuzz = GetStringFromFizzBuzzLogic(number);
}
Second Block:
int number;
string fizzBuzz;
for (int i = 0; i < 5000; i++){
number = rand() % 10000 + 1;
fizzBuzz = GetStringFromFizzBuzzLogic(number);
}
Does redeclaring variables in C++ cost anything?
Any modern compiler will notice this and do the optimization work.
When in doubt, always go for the readability. Declare variables in as inner-most scope as you can.
I benchmarked this particular code, and even WITHOUT optimisation, it came to almost the same runtime for both variants. And as soon as the lowest level of optimisation is turned on, the result is very close to identical (+/- a bit of noise in the time measurement).
Edit: the analysis of the generated assembler code below shows that it's hard to guess which form is faster; the answer most people would probably give is func2, but it turns out this function is a tiny bit slower, at least when compiling with clang++ and -O2. It is good evidence that "write code, benchmark, change code, benchmark" is the correct way to deal with performance, not guessing based on reading the code. And remember what someone told me: optimising is a bit like taking an onion apart in layers - once you optimise one part, you end up looking at something very similar, just a little smaller... ;)
However, my initial analysis made func1 significantly slower - that turns out to be because the compiler, for some bizarre reason, doesn't optimise the rand() % 10000 + 1 in func1 but does in func2 when optimisation is turned off, which made func1 slower in that run. However, once optimisation is enabled, both functions get a "fast" modulo.
Using the linux performance tool perf shows that with clang++ and -O2 we get the following for func1
15.76% a.out libc-2.20.so free
12.31% a.out libstdc++.so.6.0.20 std::string::_S_construct<char cons
12.29% a.out libc-2.20.so _int_malloc
10.05% a.out a.out func1
7.26% a.out libc-2.20.so __random
6.36% a.out libc-2.20.so malloc
5.46% a.out libc-2.20.so __random_r
5.01% a.out libstdc++.so.6.0.20 std::basic_string<char, std::char_t
4.83% a.out libstdc++.so.6.0.20 std::string::_Rep::_S_create
4.01% a.out libc-2.20.so strlen
and for func2:
17.88% a.out libc-2.20.so free
10.73% a.out libc-2.20.so _int_malloc
9.77% a.out libc-2.20.so malloc
9.03% a.out a.out func2
7.63% a.out libstdc++.so.6.0.20 std::string::_S_construct<char con
6.96% a.out libstdc++.so.6.0.20 std::string::_Rep::_S_create
4.48% a.out libc-2.20.so __random
4.39% a.out libc-2.20.so __random_r
4.10% a.out libc-2.20.so strlen
There are some subtle differences, but I would call those as being more to do with the relatively short runtime of the benchmark, rather than the difference in actual code generated by the compiler.
This is with the following code:
#include <iostream>
#include <string>
#include <cstdlib>
#define N 500000
extern std::string GetStringFromFizzBuzzLogic(int number);
void func1()
{
for (int i = 0; i < N; i++){
int number = rand() % 10000 + 1;
std::string fizzBuzz = GetStringFromFizzBuzzLogic(number);
}
}
void func2()
{
int number;
std::string fizzBuzz;
for (int i = 0; i < N; i++){
number = rand() % 10000 + 1;
fizzBuzz = GetStringFromFizzBuzzLogic(number);
}
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
int main(int argc, char **argv)
{
void (*f)();
if (argc == 1)
f = func1;
else
f = func2;
for(int i = 0; i < 5; i++)
{
unsigned long long t1 = rdtsc();
f();
t1 = rdtsc() - t1;
std::cout << "time=" << t1 << std::endl;
}
}
and in a separate file:
#include <string>
std::string GetStringFromFizzBuzzLogic(int number)
{
return "SomeString";
}
Running with func1:
./a.out
time=876016390
time=824149942
time=826812600
time=825266315
time=826151399
Running with func2:
./a.out
time=905721532
time=895393507
time=886537634
time=879836476
time=883887384
This is with another 0 added to N - so 10 times longer runtime - it seems that it's fairly consistently a little SLOWER, but it's a few percent, and probably within the noise, really - in time, the whole benchmark takes around 1.30-1.39 seconds.
Edit: Looking at the assembly code of the actual loop [this is only a portion of the loop, but the rest is identical in terms of what the code actually does]
Func1:
.LBB0_1: # %for.body
callq rand
movslq %eax, %rcx
imulq $1759218605, %rcx, %rcx # imm = 0x68DB8BAD
movq %rcx, %rdx
shrq $63, %rdx
sarq $44, %rcx
addl %edx, %ecx
imull $10000, %ecx, %ecx # imm = 0x2710
negl %ecx
leal 1(%rax,%rcx), %esi
movq %r15, %rdi
callq _Z26GetStringFromFizzBuzzLogici
movq (%rsp), %rax
leaq -24(%rax), %rdi
cmpq %rbx, %rdi
jne .LBB0_2
.LBB0_7: # %_ZNSsD2Ev.exit
decl %ebp
jne .LBB0_1
Func2:
.LBB1_1:
callq rand
movslq %eax, %rcx
imulq $1759218605, %rcx, %rcx # imm = 0x68DB8BAD
movq %rcx, %rdx
shrq $63, %rdx
sarq $44, %rcx
addl %edx, %ecx
imull $10000, %ecx, %ecx # imm = 0x2710
negl %ecx
leal 1(%rax,%rcx), %esi
movq %rbx, %rdi
callq _Z26GetStringFromFizzBuzzLogici
movq %r14, %rdi
movq %rbx, %rsi
callq _ZNSs4swapERSs
movq (%rsp), %rax
leaq -24(%rax), %rdi
cmpq %r12, %rdi
jne .LBB1_4
.LBB1_9: # %_ZNSsD2Ev.exit19
incl %ebp
cmpl $5000000, %ebp # imm = 0x4C4B40
So, as can be seen, the func2 version contains an extra function call:
callq _ZNSs4swapERSs
which translates to std::basic_string<char, std::char_traits<char>, std::allocator<char> >::swap(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&) or std::string::swap(std::string&) - which is presumably the result of calling std::string::operator=(std::string &s). This would explain why func2 is slightly slower than func1.
I'm sure it is possible to find cases where constructing/destroying an object takes significant amounts of time in a loop, but in general, it will make little or no difference at all, and having clearer code will actually help the reader. It will also often help the compiler with "life-time analysis", since it's less code to "walk" to find out if the variable is used later (in this case, the code is short anyway, but that's obviously not always the case in real life examples)
The 1st code block should be considered faster, since you don't have any overhead for calling the std::string default constructor once.
Actually you don't have a redeclaration of the variables in your 2nd code block. These are just plain assignment operations.
A redeclaration would actually mean you have something like this
int number;
string fizzBuzz;
for (int i = 0; i < 5000; i++){
int number = rand() % 10000 + 1;
// ^^^
string fizzBuzz = GetStringFromFizzBuzzLogic(number);
// ^^^^^^
}
In this case the overhead would be optimized out by the compiler, since the outer scope variables aren't used at all.
There is no such thing as a redeclaration in C++. In your second code snippet, number and fizzBuzz are only declared and initialised once. The = which follow later on are assignments.
As with all optimisation questions, you can only guess or preferably measure. And then of course it all entirely depends on your compiler and the settings you invoke it with. And of course, there can be a tradeoff between speed optimisation and space optimisation.
I know of no serious C++ programmer who would not prefer the first form, because it is easier to read and simply more concise.
Only if the program would be considered too slow and if there was measuring on which parts of the code cause the slowdown and if those measurements pointed to this loop, only then would they consider changing it.
However, as the others said, this is an unrealistic scenario. It is extremely unlikely that a modern compiler would treat the two snippets in a different way with regards to optimisation and that you would experience any measurable speed difference.
(edit: Sorry for the typo, had confused "first" and "second" there)
All declaring (value) variables does is increment the stack by the combined size of all the local variables in that function/method.
There may be a cost to calling constructors /destructors more than the optimal amount of times with object types (your string).
In this case there is no difference. The optimizer will give you the best solution anyway if using a decent compiler.
You might want the code to read in the optimal way so your peers don't think you write bad code!

Why '==' is slow on std::string?

While profiling my application I realized that a lot of time is spent on string comparisons. So I wrote a simple benchmark and I was surprised that '==' is much slower than string::compare and strcmp! Here is the code; can anyone explain why that is, or what's wrong with my code? According to the standard, '==' is just an operator overload that simply returns !lhs.compare(rhs).
#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr = 10000000000;//10 Billion
int len = 100;
int main() {
srand(time(0));
string s1(len,random()%128);
string s2(len,random()%128);
uint64_t a = 0;
Timer t;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(s1 == s2)
a = i;
}
t.end();
cout<<"== took:"<<t.elapsedMillis()<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(s1.compare(s2)==0)
a = i;
}
t.end();
cout<<".compare took:"<<t.elapsedMillis()<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(strcmp(s1.c_str(),s2.c_str()))
a = i;
}
t.end();
cout<<"strcmp took:"<<t.elapsedMillis()<<endl;
return a;
}
And here is the result:
== took:5986.74
.compare took:0.000349
strcmp took:0.000778
And my compile flags:
CXXFLAGS = -O3 -Wall -fmessage-length=0 -std=c++1y
I use gcc 4.9 on a x86_64 linux machine.
Obviously using -O3 does some optimizations which, I guess, eliminate the last two loops entirely; however, using -O2 the results are still weird:
for 1 billion iterations:
== took:19591
.compare took:8318.01
strcmp took:6480.35
P.S. Timer is just a wrapper class to measure spent time; I am absolutely sure about it :D
Code for Timer class:
#include <chrono>
#ifndef SRC_TIMER_H_
#define SRC_TIMER_H_
class Timer {
std::chrono::steady_clock::time_point start;
std::chrono::steady_clock::time_point stop;
public:
Timer(){
start = std::chrono::steady_clock::now();
stop = std::chrono::steady_clock::now();
}
virtual ~Timer() {}
inline void begin() {
start = std::chrono::steady_clock::now();
}
inline void end() {
stop = std::chrono::steady_clock::now();
}
inline double elapsedMillis() {
auto diff = stop - start;
return std::chrono::duration<double, std::milli> (diff).count();
}
inline double elapsedMicro() {
auto diff = stop - start;
return std::chrono::duration<double, std::micro> (diff).count();
}
inline double elapsedNano() {
auto diff = stop - start;
return std::chrono::duration<double, std::nano> (diff).count();
}
inline double elapsedSec() {
auto diff = stop - start;
return std::chrono::duration<double> (diff).count();
}
};
#endif /* SRC_TIMER_H_ */
UPDATE: output of improved benchmark at http://ideone.com/rGc36a
== took:21
.compare took:21
strcmp took:14
== took:21
.compare took:25
strcmp took:14
The thing that proved crucial to get it working meaningfully was "outwitting" the compiler's ability to predict the strings being compared at compile time:
// more strings that might be used...
string s[] = { {len,argc+'A'}, {len,argc+'A'}, {len, argc+'B'}, {len, argc+'B'} };
if(s[i&3].compare(s[(i+1)&3])==0) // trickier to optimise
a += i; // cumulative observable side effects
Note that in general, strcmp is not functionally equivalent to == or .compare when the text may embed NULs, as the former will get to "exit early". (That's not the reason it's "faster" above, but do read below for comments re possible variations with string length/content etc..)
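A small, hypothetical illustration of that semantic difference:

#include <cstring>
#include <iostream>
#include <string>

int main()
{
    // The strings differ only after an embedded NUL, so strcmp stops early and
    // calls them equal, while operator== compares all five bytes and disagrees.
    std::string a("ab\0cd", 5);
    std::string b("ab\0xy", 5);
    std::cout << (a == b) << '\n';                                   // 0
    std::cout << (std::strcmp(a.c_str(), b.c_str()) == 0) << '\n';   // 1
}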
Discussion / Earlier answer
Just have a look at your implementation - e.g.
echo '#include <string>' > stringE.cc
g++ -E stringE.cc | less
Search for the basic_string template, then for the operator== working on two string instances - mine is:
template<class _Elem,
class _Traits,
class _Alloc> inline
bool __cdecl operator==(
const basic_string<_Elem, _Traits, _Alloc>& _Left,
const basic_string<_Elem, _Traits, _Alloc>& _Right)
{
return (_Left.compare(_Right) == 0);
}
Notice that operator== is inline and simply calls compare. There's no way it's consistently significantly slower with normal optimisation levels enabled, though the optimiser might occasionally happen to optimise one loop better than another due to subtle side effects of surrounding code.
Your ostensible problem will have been caused by e.g. your code being optimised beyond the point of doing the intended work, for loops arbitrarily unrolled to different degrees, or other quirks or bugs in the optimisation or your timing. That's not unusual when you have unvarying inputs and loops that don't have any cumulative side effects (i.e. the compiler can work out that intermediate values of a are not used, so only the last a = i need take effect).
So, learn to write better benchmarks. In this case, that's a bit tricky as having lots of distinct strings in memory ready to invoke the comparisons on, and selecting them in a way that the optimiser can't predict at compile time that's still fast enough not to overwhelm and obscure the impact of the string comparison code, is not an easy task. Further, beyond a point - comparing things spread across more memory makes cache affects more relevant to the benchmark, which further obscures the real comparison performance.
Still, if I were you I'd read some strings from a file - pushing each to a vector, then loop over the vector doing each of the three comparison operations between adjacent elements. Then the compiler can't possibly predict any pattern in the outcomes. You might find compare/== faster/slower than strcmp for strings often differing in the first character or three, but the other way around for long strings that are equal or only differing near the end, so make sure you try different kinds of input before you conclude you understand the performance profile.
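A rough sketch of that kind of benchmark (the input file name is hypothetical; time the loop once per comparison style):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> words;
    std::ifstream in("words.txt");          // hypothetical input file
    for (std::string w; in >> w; )
        words.push_back(w);

    std::size_t equal = 0;
    for (std::size_t i = 0; i + 1 < words.size(); ++i)
        if (words[i] == words[i + 1])       // swap in .compare()/strcmp to time each
            ++equal;

    std::cout << equal << '\n';             // observable side effect
}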
Either your timings are screwy, or your compiler has optimised some of your code out of existence.
Think about it: ten billion operations in 0.000349 milliseconds (I'll use 0.000500 milliseconds, or half a microsecond, to make my calculations easier) means that you're performing twenty quadrillion operations per second.
Even if one operation could be done in a single clock cycle, that would be 20,000,000 GHz, a bit beyond the current crop of CPUs, even with their massively optimised pipelines and multiple cores.
And, given that the -O2 optimised figures are more on par with each other (== taking about double the time of compare), the "code optimised out of existence" possibility is looking far more likely.
The doubling of time could easily be explained as ten billion extra function calls, due to operator== needing to call compare to do its work.
As further support, examine the following table, showing figures in milliseconds (third column is simple divide-by-ten scale of second column so that both first and third columns are for a billion iterations):
         -O2/1billion  -O3/10billion  -O3/1billion  Improvement
             (a)            (b)       (c = b / 10)    (a / c)
         ============  =============  ============  ===========
oper==      19151           5987           599            32
compare      8319           0.0005         0.00005  166,380,000
It beggars belief that -O3 could speed up the == code by a factor of about 32 but manage to speed up the compare code by a factor of a few hundred million.
I strongly suggest you have a look at the assembler code generated by your compiler (such as with the gcc -S option) to verify that it's actually doing that work it's claiming to do.
The problem is that the compiler is making a lot of serious optimizations to your code.
Here's the modified code:
#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr = 500000000;//500 Million
int len = 100;
int main() {
srand(time(0));
string s1(len,random()%128);
string s2(len,random()%128);
uint64_t a = 0;
Timer t;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(s1 == s2)
a += i;
}
t.end();
cout<<"== took:"<<t.elapsedMillis()<<",a="<<a<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(s1.compare(s2)==0)
a+=i;
}
t.end();
cout<<".compare took:"<<t.elapsedMillis()<<",a="<<a<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(strcmp(s1.c_str(),s2.c_str()) == 0)
a+=i;
}
t.end();
cout<<"strcmp took:"<<t.elapsedMillis()<<",a="<<a<< endl;
return a;
}
where I've added asm volatile("" : "+g"(s2)); to force the compiler to run the comparison. I've also added <<",a="<<a to force the compiler to compute a.
The output is now:
== took:10221.5,a=0
.compare took:10739,a=0
strcmp took:9700,a=0
Can you explain why strcmp is faster than .compare, which is slower than ==? However, the speed differences are marginal, but significant.
It actually makes sense! :p
The speed analysis below is wrong - thanks to Tony D for pointing out my error. The criticisms and advice for better benchmarks still apply though.
All the previous answers deal with the compiler optimisation issues in your benchmark, but don't answer why strcmp is still slightly faster.
strcmp is likely faster (in the corrected benchmarks) due to the strings sometimes containing zeros. Since strcmp uses C-strings it can exit when it comes across the string termination char '\0'. std::string::compare() treats '\0' as just another char and continues until the end of the string array.
Since you have non-deterministically seeded the RNG, and only generated two strings, your results will change with every run of the code. (I'd advise against this in benchmarks.) Given the numbers, 28 times out of 128, there ought to be no advantage. 10 times out of 128 you will get more than a 10-fold speed up. And so on.
As well as defeating the compiler's optimiser, I would suggest that, next time, you generate a new string for each comparison iteration, allowing you to average away such effects.
I compiled the code with gcc -O3 -S --std=c++1y. The result is here. The gcc version is:
gcc (Ubuntu 4.9.1-16ubuntu6) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Looking at it, we can see that the first loop (operator ==) is like this (comments added by me):
movq itr(%rip), %rbp
movq %rax, %r12
movq %rax, 56(%rsp)
testq %rbp, %rbp
je .L25
movq 16(%rsp), %rdi
movq 32(%rsp), %rsi
xorl %ebx, %ebx
movq -24(%rsi), %rdx ; length of string1
cmpq -24(%rdi), %rdx ; compare lengths
je .L53 ; compare content only when length is the same
.L10
; end of loop, print out follows
;....
.L53:
.cfi_restore_state
call memcmp ; compare content
xorl %edx, %edx ; zero loop count
.p2align 4,,10
.p2align 3
.L13:
testl %eax, %eax ; check result
cmove %rdx, %rbx ; a = i
addq $1, %rdx ; i++
cmpq %rbp, %rdx ; i < itr?
jne .L13
jmp .L10
; ....
.L25:
xorl %ebx, %ebx
jmp .L10
We can see that operator== is inlined; only a call to memcmp is there. And for operator==, if the lengths are different, the contents are not compared.
Most importantly, the comparison is done only once. The loop body only contains i++;, a=i;, and i<itr;.
For the second loop (compare()):
movq itr(%rip), %r12
movq %rax, %r13
movq %rax, 56(%rsp)
testq %r12, %r12
je .L14
movq 16(%rsp), %rdi
movq 32(%rsp), %rsi
movq -24(%rdi), %rbp
movq -24(%rsi), %r14 ; read and compare length
movq %rbp, %rdx
cmpq %rbp, %r14
cmovbe %r14, %rdx ; save the shorter length of the two string to %rdx
subq %r14, %rbp ; length difference in %rbp
call memcmp ; content is always compared
movl $2147483648, %edx ; 0x80000000 sign extended
addq %rbp, %rdx ; revert the sign bit of %rbp (length difference) and save to %rdx
testl %eax, %eax ; memcmp returned 0?
jne .L14 ; no, string different
testl %ebp, %ebp ; memcmp returned 0. Are lengths the same (%ebp == 0)?
jne .L14 ; no, string different
movl $4294967295, %eax ; string compare equal
subq $1, %r12 ; itr - 1
cmpq %rax, %rdx
cmovbe %r12, %rbx ; a = itr - 1
.L14:
; output follows
There is no loop at all here.
In compare(), since it must return plus, minus, or zero based on the comparison, the string content is always compared. memcmp is called once.
For the third loop (strcmp()), the assembly is the most simple:
movq itr(%rip), %rbp ; itr to %rbp
movq %rax, %r12
movq %rax, 56(%rsp)
testq %rbp, %rbp
je .L16
movq 32(%rsp), %rsi
movq 16(%rsp), %rdi
subq $1, %rbp ; itr - 1 to %rbp
call strcmp
testl %eax, %eax ; test compare result
cmovne %rbp, %rbx ; if not equal, save itr - 1 to %rbx (a)
.L16:
There is also no loop at all. strcmp is called, and if the strings are not equal (as in your code), itr-1 is saved to a directly.
So your benchmark cannot test the running time of operator==, compare(), or strcmp(). They are all called only once, which cannot show a running-time difference.
As to why operator== takes the most time: for operator== the compiler, for some reason, did not eliminate the loop, and the loop takes time (even though it no longer contains any string comparison at all).
And from the assembly shown, we may assume that operator== may be the fastest, because it won't do a string comparison at all if the lengths of the two strings are different. (Of course, under gcc 4.9.1 -O3.)
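To illustrate that last point, this is roughly the shape of the check (a sketch, not the actual libstdc++ source): equality can reject strings of different lengths before looking at a single character.

#include <cstring>
#include <string>

// Sketch of the typical equality strategy: compare sizes first, then bytes.
bool roughly_how_equal_works(const std::string &a, const std::string &b)
{
    return a.size() == b.size()
        && std::memcmp(a.data(), b.data(), a.size()) == 0;
}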

C++ conversion optimization

I would like to ask if there is a quicker way to do my audio conversion than by iterating through all values one by one and dividing them by 32768.
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
for ( int i = 0; i <=uIntegers.size()-1;i++)
{
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
My approach works fine, but it could be quicker. However I did not find any way to speed it up.
Thank you for the help!
If your array is large enough it may be worthwhile to parallelize this for loop. OpenMP's parallel for statement is what I would use.
The function would then be:
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
#pragma omp parallel for
for (int i = 0; i < uIntegers.size(); i++)
{
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
With gcc you need to pass -fopenmp when you compile for the pragma to be used; on MSVC it is /openmp. Since spawning threads has a noticeable overhead, this will only be faster if you are processing large arrays, YMMV.
For maximum speed you want to convert more than one value per loop iteration. The easiest way to do that is with SIMD. Here's roughly how you'd do it with SSE2:
// Requires <emmintrin.h> (SSE2 intrinsics)
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
__m128d scale = _mm_set_pd( 1.0 / 32768.0, 1.0 / 32768.0 );
int n = (int)uIntegers.size();
int i = 0;
for ( ; i + 4 <= n; i += 4)
{
__m128i x = _mm_loadu_si128((const __m128i*)&uIntegers[i]); // load four ints
__m128i y = _mm_shuffle_epi32(x, _MM_SHUFFLE(3,2,3,2)); // move elements 2,3 into the low half
__m128d dx = _mm_cvtepi32_pd(x); // convert elements 0,1
__m128d dy = _mm_cvtepi32_pd(y); // convert elements 2,3
dx = _mm_mul_pd(dx, scale);
dy = _mm_mul_pd(dy, scale);
_mm_storeu_pd(&uDoubles[i], dx); // destination address first, value second
_mm_storeu_pd(&uDoubles[i + 2], dy);
}
// Finish off the last 0-3 elements the slow way
for ( ; i < n; i++)
{
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
We process four integers per loop iteration. As we can only fit two doubles in the registers there's some duplicated work, but the extra unrolling will help performance unless the arrays are tiny.
Changing the data types to smaller ones (say short and float) should also help performance, because they cut down on memory bandwidth, and you can fit four floats in an SSE register. For audio data you shouldn't need the precision of a double.
Note that I've used unaligned loads and stores. Aligned ones will be slightly quicker if the data is actually aligned (which it won't be by default, and it's hard to make stuff aligned inside a std::vector).
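For completeness, here is a hedged sketch of the float variant mentioned above (it assumes the output buffer holds float rather than double, so it is not a drop-in replacement for the original signature):

#include <emmintrin.h>  // SSE2
#include <vector>

// Sketch only: convert int samples to float, four at a time.
void ConvertToFloat(const std::vector<int> &uIntegers, std::vector<float> &uFloats)
{
    const __m128 scale = _mm_set1_ps(1.0f / 32768.0f);
    int n = (int)uIntegers.size();
    int i = 0;
    for ( ; i + 4 <= n; i += 4)
    {
        __m128i x = _mm_loadu_si128((const __m128i*)&uIntegers[i]);
        __m128  f = _mm_cvtepi32_ps(x);                 // four ints -> four floats
        _mm_storeu_ps(&uFloats[i], _mm_mul_ps(f, scale));
    }
    for ( ; i < n; ++i)                                  // scalar tail
        uFloats[i] = uIntegers[i] / 32768.0f;
}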
Your function is highly parallelizable. On a modern Intel CPU there are three independent ways to parallelize: instruction level parallelism (ILP), thread level parallelism (TLP), and SIMD. I was able to use all three to get big boosts in your function. The results are compiler dependent, though. The boost is much smaller with GCC since it already vectorizes the function. See the table of numbers below.
However, the main limiting factor in your function is that its time complexity is only O(n), so there is a drastic drop in efficiency when the size of the array you're running over crosses each cache-level boundary. If you look, for example, at large dense matrix multiplication (GEMM), it's an O(n^3) operation, so if one does things right (using e.g. loop tiling) the cache hierarchy is not a problem: you can get close to the maximum flops/s even for very large matrices (which seems to indicate that GEMM is one of the things Intel thinks of when they design the CPU). The way to fix this in your case is to find a way to do your function on an L1 cache block right after/before you do a more complex operation (for example one that goes as O(n^2)) and then move to another L1 block. Of course I don't know what you're doing, so I can't do that.
ILP is partially done for you by the CPU hardware. However, often carried loop dependencies limit the ILP so it often helps to do loop unrolling to take full advantage of the ILP. For TLP I use OpenMP, and for SIMD I used AVX (however the code below works for SSE as well). I used 32 byte aligned memory and made sure the array was a multiple of 8 so that no clean up was necessary.
Here are the results from Visual Studio 2012 64-bit with AVX and OpenMP (release mode, obviously) on a Sandy Bridge EP, 4 cores (8 HW threads) @ 3.6 GHz. The variable n is the number of items. I repeat the function several times as well, so the total time includes that. The function convert_vec4_unroll2_openmp gives the best results except in the L1 region. You can also clearly see that the efficiency drops significantly each time you move to a new cache level, but even for main memory it's still better.
l1 chache, n 2752, repeat 300000
covert time 1.34, error 0.000000
convert_vec4 time 0.16, error 0.000000
convert_vec4_unroll2 time 0.16, error 0.000000
convert_vec4_unroll2_openmp time 0.31, error 0.000000
l2 chache, n 21856, repeat 30000
covert time 1.14, error 0.000000
convert_vec4 time 0.24, error 0.000000
convert_vec4_unroll2 time 0.24, error 0.000000
convert_vec4_unroll2_openmp time 0.12, error 0.000000
l3 chache, n 699072, repeat 1000
covert time 1.23, error 0.000000
convert_vec4 time 0.44, error 0.000000
convert_vec4_unroll2 time 0.45, error 0.000000
convert_vec4_unroll2_openmp time 0.14, error 0.000000
main memory , n 8738144, repeat 100
covert time 1.56, error 0.000000
convert_vec4 time 0.95, error 0.000000
convert_vec4_unroll2 time 0.89, error 0.000000
convert_vec4_unroll2_openmp time 0.51, error 0.000000
Results with g++ foo.cpp -mavx -fopenmp -ffast-math -O3 on an i5-3317 (Ivy Bridge) @ 2.4 GHz, 2 cores (4 HW threads). GCC seems to vectorize this on its own, and the only benefit comes from OpenMP (which, however, gives a worse result in the L1 region).
l1 cache, n 2752, repeat 300000
convert time 0.26, error 0.000000
convert_vec4 time 0.22, error 0.000000
convert_vec4_unroll2 time 0.21, error 0.000000
convert_vec4_unroll2_openmp time 0.46, error 0.000000
l2 cache, n 21856, repeat 30000
convert time 0.28, error 0.000000
convert_vec4 time 0.27, error 0.000000
convert_vec4_unroll2 time 0.27, error 0.000000
convert_vec4_unroll2_openmp time 0.20, error 0.000000
l3 cache, n 699072, repeat 1000
convert time 0.80, error 0.000000
convert_vec4 time 0.80, error 0.000000
convert_vec4_unroll2 time 0.80, error 0.000000
convert_vec4_unroll2_openmp time 0.83, error 0.000000
main memory, n 8738144, repeat 100
convert time 1.10, error 0.000000
convert_vec4 time 1.10, error 0.000000
convert_vec4_unroll2 time 1.10, error 0.000000
convert_vec4_unroll2_openmp time 1.00, error 0.000000
Here is the code. I use Agner Fog's vectorclass (http://www.agner.org/optimize/vectorclass.zip) for the SIMD, which will use either AVX to write 4 doubles at once or SSE to write 2 doubles at once.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "vectorclass.h"
void convert(const int *uIntegers, double *uDoubles, const int n) {
    for ( int i = 0; i<n; i++) {
        uDoubles[i] = uIntegers[i] / 32768.0;
    }
}
void convert_vec4(const int *uIntegers, double *uDoubles, const int n) {
    Vec4d div = 1.0/32768;
    for ( int i = 0; i<n; i+=4) {
        Vec4i u4i = Vec4i().load(&uIntegers[i]);
        Vec4d u4d = to_double(u4i);
        u4d *= div;
        u4d.store(&uDoubles[i]);
    }
}
void convert_vec4_unroll2(const int *uIntegers, double *uDoubles, const int n) {
    Vec4d div = 1.0/32768;
    for ( int i = 0; i<n; i+=8) {
        Vec4i u4i_v1 = Vec4i().load(&uIntegers[i]);
        Vec4d u4d_v1 = to_double(u4i_v1);
        u4d_v1 *= div;
        u4d_v1.store(&uDoubles[i]);

        Vec4i u4i_v2 = Vec4i().load(&uIntegers[i+4]);
        Vec4d u4d_v2 = to_double(u4i_v2);
        u4d_v2 *= div;
        u4d_v2.store(&uDoubles[i+4]);
    }
}
void convert_vec4_openmp(const int *uIntegers, double *uDoubles, const int n) {
    #pragma omp parallel for
    for ( int i = 0; i<n; i+=4) {
        Vec4i u4i = Vec4i().load(&uIntegers[i]);
        Vec4d u4d = to_double(u4i);
        u4d /= 32768.0;
        u4d.store(&uDoubles[i]);
    }
}
void convert_vec4_unroll2_openmp(const int *uIntegers, double *uDoubles, const int n) {
    Vec4d div = 1.0/32768;
    #pragma omp parallel for
    for ( int i = 0; i<n; i+=8) {
        Vec4i u4i_v1 = Vec4i().load(&uIntegers[i]);
        Vec4d u4d_v1 = to_double(u4i_v1);
        u4d_v1 *= div;
        u4d_v1.store(&uDoubles[i]);

        Vec4i u4i_v2 = Vec4i().load(&uIntegers[i+4]);
        Vec4d u4d_v2 = to_double(u4i_v2);
        u4d_v2 *= div;
        u4d_v2.store(&uDoubles[i+4]);
    }
}
double compare(double *a, double *b, const int n) {
    double diff = 0.0;
    for(int i=0; i<n; i++) {
        double tmp = a[i] - b[i];
        //printf("%d %f %f \n", i, a[i], b[i]);
        if(tmp<0) tmp *= -1;
        diff += tmp;
    }
    return diff;
}
void loop(const int n, const int repeat, const int ifunc) {
    void (*fp[4])(const int *uIntegers, double *uDoubles, const int n);
    int *a = (int*)_mm_malloc(sizeof(int)*n, 32);
    double *b1_cmp = (double*)_mm_malloc(sizeof(double)*n, 32);
    double *b1 = (double*)_mm_malloc(sizeof(double)*n, 32);
    double dtime;

    const char *fp_str[] = {
        "convert",
        "convert_vec4",
        "convert_vec4_unroll2",
        "convert_vec4_unroll2_openmp",
    };

    for(int i=0; i<n; i++) {
        a[i] = rand();  // fill with arbitrary test data (rand()*RAND_MAX would overflow)
    }

    fp[0] = convert;
    fp[1] = convert_vec4;
    fp[2] = convert_vec4_unroll2;
    fp[3] = convert_vec4_unroll2_openmp;

    convert(a, b1_cmp, n);  // scalar reference result for the error check

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) {
        fp[ifunc](a, b1, n);
    }
    dtime = omp_get_wtime() - dtime;
    printf("\t%s time %.2f, error %f\n", fp_str[ifunc], dtime, compare(b1_cmp,b1,n));

    _mm_free(a);
    _mm_free(b1_cmp);
    _mm_free(b1);
}
int main() {
    int l1 = (32*1024)/(sizeof(int) + sizeof(double));
    int l2 = (256*1024)/(sizeof(int) + sizeof(double));
    int l3 = (8*1024*1024)/(sizeof(int) + sizeof(double));
    int lx = (100*1024*1024)/(sizeof(int) + sizeof(double));
    int n[] = {l1, l2, l3, lx};
    int repeat[] = {300000, 30000, 1000, 100};
    const char *cache_str[] = {"l1", "l2", "l3", "main memory"};

    for(int c=0; c<4; c++) {
        int lda = ((n[c]+7) & -8);  // make sure the array length is a multiple of 8
        printf("%s cache, n %d\n", cache_str[c], lda);
        for(int i=0; i<4; i++) {
            loop(lda, repeat[c], i);
        }
        printf("\n");
    }
}
Lastly, for anyone who has read this far and feels like reminding me that my code looks more like C than C++: please read http://www.stroustrup.com/sibling_rivalry.pdf first before you decide to comment.
You might also try:
uDoubles[i] = ldexp((double)uIntegers[i], -15);
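For instance, a minimal sketch of that suggestion, keeping the same Convert signature used elsewhere in the question (ldexp(x, -15) computes x * 2^-15, i.e. x / 32768.0, without an actual division):

#include <cmath>    // std::ldexp
#include <cstddef>
#include <vector>

void Convert(const std::vector<int> &uIntegers, std::vector<double> &uDoubles)
{
    for (std::size_t i = 0; i < uIntegers.size(); ++i)
        uDoubles[i] = std::ldexp(static_cast<double>(uIntegers[i]), -15);  // exact scaling by 2^-15
}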
Edit: See Adam's answer above for a version using SSE intrinsics. Better than what I had here ...
To make this more useful, let's look at compiler-generated code here. I'm using gcc 4.8.0 and yes, it is worth checking your specific compiler (version) as there are quite significant differences in output for, say, gcc 4.4, 4.8, clang 3.2 or Intel's icc.
Your original, using g++ -O8 -msse4.2 ... (gcc clamps optimization levels above 3, so -O8 is effectively -O3) translates into the following loop:
.L2:
cvtsi2sd (%rcx,%rax,4), %xmm0
mulsd %xmm1, %xmm0
addl $1, %edx
movsd %xmm0, (%rsi,%rax,8)
movslq %edx, %rax
cmpq %rdi, %rax
jbe .L2
where %xmm1 holds 1.0/32768.0, so the compiler automatically turns the division into a multiplication by the reciprocal.
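That transformation is exact here because 32768 is a power of two, so 1.0/32768.0 is exactly representable and the two forms give identical results for every int input; a tiny check (names mine) illustrates this:

#include <cassert>

double divide(int x)   { return x / 32768.0; }
double multiply(int x) { return x * (1.0 / 32768.0); }   // what the compiler emits

int main()
{
    for (int x = -100000; x <= 100000; ++x)
        assert(divide(x) == multiply(x));   // exact: scaling by 2^-15 loses no bits
}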
On the other hand, using g++ -msse4.2 -O8 -funroll-loops ..., the code created for the loop changes significantly:
[ ... ]
leaq -1(%rax), %rdi
movq %rdi, %r8
andl $7, %r8d
je .L3
[ ... insert a duff's device here, up to 6 * 2 conversions ... ]
jmp .L3
.p2align 4,,10
.p2align 3
.L39:
leaq 2(%rsi), %r11
cvtsi2sd (%rdx,%r10,4), %xmm9
mulsd %xmm0, %xmm9
leaq 5(%rsi), %r9
leaq 3(%rsi), %rax
leaq 4(%rsi), %r8
cvtsi2sd (%rdx,%r11,4), %xmm10
mulsd %xmm0, %xmm10
cvtsi2sd (%rdx,%rax,4), %xmm11
cvtsi2sd (%rdx,%r8,4), %xmm12
cvtsi2sd (%rdx,%r9,4), %xmm13
movsd %xmm9, (%rcx,%r10,8)
leaq 6(%rsi), %r10
mulsd %xmm0, %xmm11
mulsd %xmm0, %xmm12
movsd %xmm10, (%rcx,%r11,8)
leaq 7(%rsi), %r11
mulsd %xmm0, %xmm13
cvtsi2sd (%rdx,%r10,4), %xmm14
mulsd %xmm0, %xmm14
cvtsi2sd (%rdx,%r11,4), %xmm15
mulsd %xmm0, %xmm15
movsd %xmm11, (%rcx,%rax,8)
movsd %xmm12, (%rcx,%r8,8)
movsd %xmm13, (%rcx,%r9,8)
leaq 8(%rsi), %r9
movsd %xmm14, (%rcx,%r10,8)
movsd %xmm15, (%rcx,%r11,8)
movq %r9, %rsi
.L3:
cvtsi2sd (%rdx,%r9,4), %xmm8
mulsd %xmm0, %xmm8
leaq 1(%rsi), %r10
cmpq %rdi, %r10
movsd %xmm8, (%rcx,%r9,8)
jbe .L39
[ ... out ... ]
So it blocks the operations up, but still converts one value at a time.
If you change your original loop to operate on a few elements per iteration:
size_t i;
// Note: this assumes uIntegers.size() >= 3; size() is unsigned, so
// size() - 3 would wrap around for smaller vectors.
for (i = 0; i < uIntegers.size() - 3; i += 4)
{
    uDoubles[i]   = uIntegers[i]   / 32768.0;
    uDoubles[i+1] = uIntegers[i+1] / 32768.0;
    uDoubles[i+2] = uIntegers[i+2] / 32768.0;
    uDoubles[i+3] = uIntegers[i+3] / 32768.0;
}
for (; i < uIntegers.size(); i++)
    uDoubles[i] = uIntegers[i] / 32768.0;
the compiler, gcc -msse4.2 -O8 ... (i.e. even without requesting unrolling), identifies the potential to use CVTDQ2PD/MULPD and the core of the loop becomes:
.p2align 4,,10
.p2align 3
.L4:
movdqu (%rcx), %xmm0
addq $16, %rcx
cvtdq2pd %xmm0, %xmm1
pshufd $238, %xmm0, %xmm0
mulpd %xmm2, %xmm1
cvtdq2pd %xmm0, %xmm0
mulpd %xmm2, %xmm0
movlpd %xmm1, (%rdx,%rax,8)
movhpd %xmm1, 8(%rdx,%rax,8)
movlpd %xmm0, 16(%rdx,%rax,8)
movhpd %xmm0, 24(%rdx,%rax,8)
addq $4, %rax
cmpq %r8, %rax
jb .L4
cmpq %rdi, %rax
jae .L29
[ ... duff's device style for the "tail" ... ]
.L29:
rep ret
I.e. the compiler now recognizes the opportunity to put two doubles per SSE register and do the conversion and multiplication in parallel. This is pretty close to the code that Adam's SSE intrinsics version would generate.
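For comparison, a hand-written SSE2 intrinsics loop doing the same thing (a sketch; not necessarily identical to Adam's version, and assuming n is a multiple of 4) would look along these lines:

#include <emmintrin.h>  // SSE2

// Converts four ints per iteration, two doubles per XMM register,
// mirroring the CVTDQ2PD / PSHUFD / MULPD loop above.
void convert_sse2(const int *in, double *out, int n)
{
    const __m128d scale = _mm_set1_pd(1.0 / 32768.0);
    for (int i = 0; i < n; i += 4)
    {
        __m128i v  = _mm_loadu_si128((const __m128i*)&in[i]);
        __m128d lo = _mm_cvtepi32_pd(v);                           // elements 0,1
        __m128d hi = _mm_cvtepi32_pd(_mm_shuffle_epi32(v, 0xEE));  // elements 2,3 (0xEE == 238, as in pshufd above)
        _mm_storeu_pd(&out[i],     _mm_mul_pd(lo, scale));
        _mm_storeu_pd(&out[i + 2], _mm_mul_pd(hi, scale));
    }
}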
The code in total (I've shown only about 1/6th of it) is much more complex than the "direct" intrinsics, because, as mentioned, the compiler prepends/appends "heads" and "tails" to the loop to handle unaligned or non-block-multiple data. Whether this is beneficial largely depends on the average/expected sizes of your vectors; for the "generic" case (vectors more than twice the size of the block processed by the "innermost" loop), it will help.
The upshot of this exercise is, largely, that if you coerce (by compiler options/optimization levels) or hint (by slightly rearranging the code) your compiler into doing the right thing, then for this specific kind of copy/convert loop it comes up with code that's not going to be far behind hand-written intrinsics.
Final experiment ... make the code:
#include <algorithm>  // std::transform

static double c(int x) { return x / 32768.0; }

void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
    std::transform(uIntegers.begin(), uIntegers.end(), uDoubles.begin(), c);
}
and (for the nicest-to-read assembly output, this time using gcc 4.4 with gcc -O8 -msse4.2 ...) the generated assembly core loop (again, there's a pre/post bit) becomes:
.p2align 4,,10
.p2align 3
.L8:
movdqu (%r9,%rax), %xmm0
addq $1, %rcx
cvtdq2pd %xmm0, %xmm1
pshufd $238, %xmm0, %xmm0
mulpd %xmm2, %xmm1
cvtdq2pd %xmm0, %xmm0
mulpd %xmm2, %xmm0
movapd %xmm1, (%rsi,%rax,2)
movapd %xmm0, 16(%rsi,%rax,2)
addq $16, %rax
cmpq %rcx, %rdi
ja .L8
cmpq %rbx, %rbp
leaq (%r11,%rbx,4), %r11
leaq (%rdx,%rbx,8), %rdx
je .L10
[ ... ]
.L10:
[ ... ]
ret
With that, what do we learn? If you want to use C++, really use C++ ;-)
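As an aside, with C++11 the same transform can be written with a lambda instead of a named helper (a small sketch; one would expect essentially the same generated code):

#include <algorithm>
#include <vector>

void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
    std::transform(uIntegers.begin(), uIntegers.end(), uDoubles.begin(),
                   [](int x) { return x / 32768.0; });
}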
Let me try another way:
If multiplying is genuinely better from the perspective of assembly instructions, then writing the multiplication explicitly should guarantee that it actually gets multiplied.
void CAudioDataItem::Convert(const vector<int> &uIntegers, vector<double> &uDoubles)
{
    for (size_t i = 0; i < uIntegers.size(); i++)
    {
        // 0.000030517578125 is exactly 1.0/32768.0 (i.e. 2^-15)
        uDoubles[i] = uIntegers[i] * 0.000030517578125;
    }
}

Difficulties in measuring C/C++ performance

I wrote a piece of C code to make a point in a discussion about optimizations and branch prediction. Then I noticed even more varied results than I had expected. My goal was to write it in the common subset of C++ and C, so that it is standard-compliant for both languages and fairly portable. It was tested on different Windows PCs:
#include <stdio.h>
#include <time.h>
/// @return time difference between start and stop in milliseconds
int ms_elapsed( clock_t start, clock_t stop )
{
    return (int)( 1000.0 * ( stop - start ) / CLOCKS_PER_SEC );
}
int const Billion = 1000000000;
/// & with numbers up to Billion gives 0, 0, 2, 2 repeating pattern
int const Pattern_0_0_2_2 = 0x40000002;
/// @return half of Billion
int unpredictableIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // true, true, false, false ...
        if ( ( i & Pattern_0_0_2_2 ) == 0 )
        {
            ++sum;
        }
    }
    return sum;
}
/// @return half of Billion
int noIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // 1, 1, 0, 0 ...
        sum += ( i & Pattern_0_0_2_2 ) == 0;
    }
    return sum;
}
int main()
{
    clock_t volatile start;
    clock_t volatile stop;
    int volatile sum;

    printf( "Puzzling measurements:\n" );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = noIfs();
    stop = clock();
    printf( "Same without ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );
}
Compiled with VS2010, /O2 optimizations, on an Intel Core 2 under WinXP. Results:
Puzzling measurements:
Unpredictable ifs took 1344 msec; answer was 500000000
Unpredictable ifs took 1016 msec; answer was 500000000
Same without ifs took 1031 msec; answer was 500000000
Unpredictable ifs took 4797 msec; answer was 500000000
Edit: Full switches of compiler:
/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\Trying.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue
Another person posted these results, compiled with MinGW g++ 4.7.1, -O1 optimizations, on an Intel Core 2 under WinXP:
Puzzling measurements:
Unpredictable ifs took 1656 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Same without ifs took 1969 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
He also posted these results for -O3 optimizations:
Puzzling measurements:
Unpredictable ifs took 1890 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Same without ifs took 1422 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Now I have a question: what is going on here?
More specifically, how can a fixed function take such different amounts of time? Is there something wrong in my code? Is there something tricky about the Intel processor? Are the compilers doing something odd? Could it be because 32-bit code is being run on a 64-bit processor?
Thanks for your attention!
Edit:
I accept that g++ -O1 just reuses the returned value in the two other calls. I also accept that g++ -O2 and -O3 have a defect that leaves that optimization out. Still, the significant diversity of measured speeds (450%!!!) seems mysterious.
I looked at the disassembly of the code produced by VS2010. It inlined unpredictableIfs three times. The inlined code was fairly similar; the loop was the same. It did not inline noIfs, but it did unroll it a bit, doing four steps per iteration. noIfs calculates exactly as written, while unpredictableIfs uses jne to jump over the increment.
With -O1, gcc-4.7.1 calls unpredictableIfs only once and reuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did; I verified it by looking at the generated assembly.)
With higher optimisation levels, the functions are inlined and the compiler no longer recognizes that it's the same code, so it runs each time a function call appears in the source.
Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between different runs of the same code are, however, consistent here - equal or differing by 10 milliseconds (the granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs that you reported for -O3.
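Incidentally, if you want to keep -O1 from folding the repeated calls into one, a simple counter-measure (a sketch of mine, not part of the original code) is to give each call an argument read from a volatile object, so the compiler cannot assume the calls return the same value:

#include <stdio.h>

volatile int g_billion = 1000000000;   // must be re-read on every access

int unpredictableIfs( int limit )
{
    int sum = 0;
    for ( int i = 0; i < limit; ++i )
    {
        if ( ( i & 0x40000002 ) == 0 )
        {
            ++sum;
        }
    }
    return sum;
}

int main( void )
{
    // Each call takes a freshly-read volatile value, so the calls cannot be merged.
    printf( "%d\n", unpredictableIfs( g_billion ) );
    printf( "%d\n", unpredictableIfs( g_billion ) );
    return 0;
}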
With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):
.L12:
movl %eax, %ecx
andl $1073741826, %ecx
cmpl $1, %ecx
adcl $0, %edx
addl $1, %eax
cmpl $1000000000, %eax
jne .L12
and for noIfs it's similar:
.L15:
xorl %ecx, %ecx
testl $1073741826, %eax
sete %cl
addl $1, %eax
addl %ecx, %edx
cmpl $1000000000, %eax
jne .L15
where it was
.L7:
testl $1073741826, %edx
sete %cl
movzbl %cl, %ecx
addl %ecx, %eax
addl $1, %edx
cmpl $1000000000, %edx
jne .L7
with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.
With -O3, the loop for unpredictableIfs becomes worse,
.L14:
leal 1(%rdx), %ecx
testl $1073741826, %eax
cmove %ecx, %edx
addl $1, %eax
cmpl $1000000000, %eax
jne .L14
and for noIfs (including the setup-code here), it becomes better:
pxor %xmm2, %xmm2
movq %rax, 32(%rsp)
movdqa .LC3(%rip), %xmm6
xorl %eax, %eax
movdqa .LC2(%rip), %xmm1
movdqa %xmm2, %xmm3
movdqa .LC4(%rip), %xmm5
movdqa .LC5(%rip), %xmm4
.p2align 4,,10
.p2align 3
.L18:
movdqa %xmm1, %xmm0
addl $1, %eax
paddd %xmm6, %xmm1
cmpl $250000000, %eax
pand %xmm5, %xmm0
pcmpeqd %xmm3, %xmm0
pand %xmm4, %xmm0
paddd %xmm0, %xmm2
jne .L18
.LC2:
.long 0
.long 1
.long 2
.long 3
.align 16
.LC3:
.long 4
.long 4
.long 4
.long 4
.align 16
.LC4:
.long 1073741826
.long 1073741826
.long 1073741826
.long 1073741826
.align 16
.LC5:
.long 1
.long 1
.long 1
.long 1
It computes four iterations at once, and accordingly noIfs runs almost four times as fast there.
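Written out with intrinsics, that vectorized loop corresponds roughly to the following sketch (my own reconstruction, assuming SSE2 and that Billion is divisible by 4):

#include <emmintrin.h>  // SSE2

int noIfs_sse2( void )
{
    __m128i i4         = _mm_set_epi32( 3, 2, 1, 0 );    // current indices i .. i+3 (.LC2)
    const __m128i step = _mm_set1_epi32( 4 );            // .LC3
    const __m128i mask = _mm_set1_epi32( 0x40000002 );   // .LC4
    const __m128i one  = _mm_set1_epi32( 1 );            // .LC5
    __m128i sum4       = _mm_setzero_si128();

    for ( int n = 0; n < 1000000000 / 4; ++n )
    {
        // (i & mask) == 0 ? 1 : 0, for four lanes at once (pand / pcmpeqd / pand)
        __m128i hit = _mm_cmpeq_epi32( _mm_and_si128( i4, mask ), _mm_setzero_si128() );
        sum4 = _mm_add_epi32( sum4, _mm_and_si128( hit, one ) );  // paddd
        i4   = _mm_add_epi32( i4, step );
    }

    // horizontal sum of the four lanes
    int lanes[4];
    _mm_storeu_si128( (__m128i*)lanes, sum4 );
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}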
Right - looking at the assembler code from gcc on 64-bit Linux: in the first case, with -O1, the function UnpredictableIfs is indeed called only once and the result is reused.
With -O2 and -O3, the functions are inlined, and the time they take should be identical. There are also no actual branches in either bit of code, but the translation of the two is somewhat different; I've excerpted the lines that update "sum" [in %edx in both cases]:
UnpredictableIfs:
movl %eax, %ecx
andl $1073741826, %ecx
cmpl $1, %ecx
adcl $0, %edx
addl $1, %eax
NoIfs:
xorl %ecx, %ecx
testl $1073741826, %eax
sete %cl
addl $1, %eax
addl %ecx, %edx
As you can see, it's not quite identical, but it does very similar things.
Regarding the range of results on Windows (from 1016 ms to 4797 ms): You should know that clock() in MSVC returns elapsed wall time. The standard says clock() should return an approximation of CPU time spent by the process, and other implementations do a better job of this.
Given that MSVC is giving wall time, if your process got pre-empted while running one iteration of the test, it could give a much larger result, even if the code ran in approximately the same amount of CPU time.
Also note that clock() on many Windows PCs has a pretty lousy resolution, often like 11-19 ms. You've done enough iterations that that's only about 1%, so I don't think it's part of the discrepancy, but it's good to be aware of when trying to write a benchmark. I understand you're going for portability, but if you needed a better measurement on Windows, you can use QueryPerformanceCounter which will almost certainly give you much better resolution, though it's still just elapsed wall time.
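For reference, a minimal QueryPerformanceCounter sketch (Windows-only, still wall time, but with much finer resolution than clock()) could look like this:

#include <windows.h>
#include <stdio.h>

int main( void )
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency( &freq );   // ticks per second
    QueryPerformanceCounter( &t0 );

    // ... code under test ...

    QueryPerformanceCounter( &t1 );
    double ms = 1000.0 * (double)( t1.QuadPart - t0.QuadPart ) / (double)freq.QuadPart;
    printf( "elapsed: %.3f ms\n", ms );
    return 0;
}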
UPDATE: After I learned that the long runtime on the one case was happening consistently, I fired up VS2010 and reproduced the results. I was typically getting something around 1000 ms for some runs, 750 ms for others, and 5000+ ms for the inexplicable ones.
Observations:
In all cases the unpredictableIfs() code was inlined.
Removing the noIfs() code had no impact (so the long time wasn't a side effect of that code).
Setting thread affinity to a single processor had no effect.
The 5000 ms times were invariably the later instances. I noted that the later instances had an extra instruction before the beginning of the loop: lea ecx,[ecx]. I don't see why that should make a 5x difference. Other than that the early and later instances were identical code.
Removing the volatile from the start and stop variables yielded fewer long runs, more of the 750 ms runs, and no 1000 ms runs. (The generated loop code now looks exactly the same in all cases, with no extra lea instructions.)
Removing the volatile from the sum variable (but keeping it for the clock timers), the long runs can happen in any position.
If you remove all of the volatile qualifiers, you get consistent, fast (750 ms) runs. (The code looks identical to the earlier ones, but edi was chosen for sum instead of ecx.)
I'm not sure what to conclude from all this, except that volatile has unpredictable performance consequences with MSVC, so you should apply it only when necessary.
UPDATE 2: I'm seeing consistent runtime differences tied to the use of volatile, even though the disassembly is almost identical.
With volatile:
Puzzling measurements:
Unpredictable ifs took 643 msec; answer was 500000000
Unpredictable ifs took 1248 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 4611 msec; answer was 500000000
Unpredictable ifs took 4706 msec; answer was 500000000
Unpredictable ifs took 4516 msec; answer was 500000000
Unpredictable ifs took 4382 msec; answer was 500000000
The disassembly for each instance looks like this:
start = clock();
010D1015 mov esi,dword ptr [__imp__clock (10D20A0h)]
010D101B add esp,4
010D101E call esi
010D1020 mov dword ptr [start],eax
sum = unpredictableIfs();
010D1023 xor ecx,ecx
010D1025 xor eax,eax
010D1027 test eax,40000002h
010D102C jne main+2Fh (10D102Fh)
010D102E inc ecx
010D102F inc eax
010D1030 cmp eax,3B9ACA00h
010D1035 jl main+27h (10D1027h)
010D1037 mov dword ptr [sum],ecx
stop = clock();
010D103A call esi
010D103C mov dword ptr [stop],eax
Without volatile:
Puzzling measurements:
Unpredictable ifs took 644 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
start = clock();
00321014 mov esi,dword ptr [__imp__clock (3220A0h)]
0032101A add esp,4
0032101D call esi
0032101F mov dword ptr [start],eax
sum = unpredictableIfs();
00321022 xor ebx,ebx
00321024 xor eax,eax
00321026 test eax,40000002h
0032102B jne main+2Eh (32102Eh)
0032102D inc ebx
0032102E inc eax
0032102F cmp eax,3B9ACA00h
00321034 jl main+26h (321026h)
stop = clock();
00321036 call esi
// The only optimization I see is here, where eax isn't explicitly stored
// in stop but is instead immediately used to compute the value for the
// printf that follows.
Other than register selection, I don't see a significant difference.