I wrote a piece of C code to make a point in a discussion about optimizations and branch prediction. Then I noticed an even more diverse outcome than I expected. My goal was to write it in a language that is a common subset of C++ and C, that is standard-compliant for both languages, and that is fairly portable. It was tested on different Windows PCs:
#include <stdio.h>
#include <time.h>
/// #return - time difference between start and stop in milliseconds
int ms_elapsed( clock_t start, clock_t stop )
{
return (int)( 1000.0 * ( stop - start ) / CLOCKS_PER_SEC );
}
int const Billion = 1000000000;
/// & with numbers up to Billion gives 0, 0, 2, 2 repeating pattern
int const Pattern_0_0_2_2 = 0x40000002;
/// #return - half of Billion
int unpredictableIfs()
{
int sum = 0;
for ( int i = 0; i < Billion; ++i )
{
// true, true, false, false ...
if ( ( i & Pattern_0_0_2_2 ) == 0 )
{
++sum;
}
}
return sum;
}
/// #return - half of Billion
int noIfs()
{
int sum = 0;
for ( int i = 0; i < Billion; ++i )
{
// 1, 1, 0, 0 ...
sum += ( i & Pattern_0_0_2_2 ) == 0;
}
return sum;
}
int main()
{
clock_t volatile start;
clock_t volatile stop;
int volatile sum;
printf( "Puzzling measurements:\n" );
start = clock();
sum = unpredictableIfs();
stop = clock();
printf( "Unpredictable ifs took %d msec; answer was %d\n"
, ms_elapsed(start, stop), sum );
start = clock();
sum = unpredictableIfs();
stop = clock();
printf( "Unpredictable ifs took %d msec; answer was %d\n"
, ms_elapsed(start, stop), sum );
start = clock();
sum = noIfs();
stop = clock();
printf( "Same without ifs took %d msec; answer was %d\n"
, ms_elapsed(start, stop), sum );
start = clock();
sum = unpredictableIfs();
stop = clock();
printf( "Unpredictable ifs took %d msec; answer was %d\n"
, ms_elapsed(start, stop), sum );
}
Compiled with VS2010, /O2 optimizations; Intel Core 2, WinXP results:
Puzzling measurements:
Unpredictable ifs took 1344 msec; answer was 500000000
Unpredictable ifs took 1016 msec; answer was 500000000
Same without ifs took 1031 msec; answer was 500000000
Unpredictable ifs took 4797 msec; answer was 500000000
Edit: full compiler switches:
/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\Trying.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue
Another person posted these results. Compiled with MinGW, g++ 4.7.1, -O1 optimizations; Intel Core 2, WinXP:
Puzzling measurements:
Unpredictable ifs took 1656 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Same without ifs took 1969 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
He also posted these results for -O3 optimizations:
Puzzling measurements:
Unpredictable ifs took 1890 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Same without ifs took 1422 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Now I have a question: what is going on here?
More specifically ... How can the same function take such different amounts of time? Is there something wrong in my code? Is there something tricky with the Intel processor? Are the compilers doing something odd? Could it be because 32-bit code is being run on a 64-bit processor?
Thanks for your attention!
Edit:
I accept that g++ -O1 just reuses the returned value in the 2 other calls. I also accept that g++ -O2 and g++ -O3 have a defect that leaves that optimization out. The significant diversity of measured speeds (450% !!!) still seems mysterious.
I looked at the disassembly of the code produced by VS2010. It inlined unpredictableIfs 3 times. The inlined code was fairly similar; the loop was the same. It did not inline noIfs, but it did unroll it a bit: it takes 4 steps in one iteration. noIfs calculates just as written, while unpredictableIfs uses jne to jump over the increment.
With -O1, gcc-4.7.1 calls unpredictableIfs only once and reuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did, verified by looking at the generated assembly.)
With higher optimisation level, the functions are inlined, and the compiler doesn't recognize that it's the same code anymore, so it is run each time a function call appears in the source.
Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs you reported for -O3.
With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping). Note that it is branchless: the cmpl $1 / adcl $0 pair adds the carry flag to the sum, so one is added exactly when the masked value is zero:
.L12:
movl %eax, %ecx
andl $1073741826, %ecx
cmpl $1, %ecx
adcl $0, %edx
addl $1, %eax
cmpl $1000000000, %eax
jne .L12
and for noIfs it's similar:
.L15:
xorl %ecx, %ecx
testl $1073741826, %eax
sete %cl
addl $1, %eax
addl %ecx, %edx
cmpl $1000000000, %eax
jne .L15
where it was
.L7:
testl $1073741826, %edx
sete %cl
movzbl %cl, %ecx
addl %ecx, %eax
addl $1, %edx
cmpl $1000000000, %edx
jne .L7
with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.
With -O3, the loop for unpredictableIfs becomes worse,
.L14:
leal 1(%rdx), %ecx
testl $1073741826, %eax
cmove %ecx, %edx
addl $1, %eax
cmpl $1000000000, %eax
jne .L14
and for noIfs (including the setup-code here), it becomes better:
pxor %xmm2, %xmm2
movq %rax, 32(%rsp)
movdqa .LC3(%rip), %xmm6
xorl %eax, %eax
movdqa .LC2(%rip), %xmm1
movdqa %xmm2, %xmm3
movdqa .LC4(%rip), %xmm5
movdqa .LC5(%rip), %xmm4
.p2align 4,,10
.p2align 3
.L18:
movdqa %xmm1, %xmm0
addl $1, %eax
paddd %xmm6, %xmm1
cmpl $250000000, %eax
pand %xmm5, %xmm0
pcmpeqd %xmm3, %xmm0
pand %xmm4, %xmm0
paddd %xmm0, %xmm2
jne .L18
.LC2:
.long 0
.long 1
.long 2
.long 3
.align 16
.LC3:
.long 4
.long 4
.long 4
.long 4
.align 16
.LC4:
.long 1073741826
.long 1073741826
.long 1073741826
.long 1073741826
.align 16
.LC5:
.long 1
.long 1
.long 1
.long 1
it computes four iterations at once, and accordingly, noIfs runs almost four times as fast then.
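Read as scalar code, the vectorised loop above does roughly the following (a sketch of the logic, not compiler output; the final horizontal add of the four lane sums happens after the loop and is not part of the excerpt):
// .LC2 provides the starting lanes {0,1,2,3}, .LC3 the step {4,4,4,4},
// .LC4 the broadcast mask 0x40000002 and .LC5 the per-lane increment of 1.
int lane_sum[4] = { 0, 0, 0, 0 };
for ( int base = 0; base < 1000000000; base += 4 )
{
    for ( int lane = 0; lane < 4; ++lane )
    {
        lane_sum[lane] += ( ( base + lane ) & 0x40000002 ) == 0;
    }
}
int sum = lane_sum[0] + lane_sum[1] + lane_sum[2] + lane_sum[3];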
Right, looking at the assembler code from gcc on 64-bit Linux, the first case, with -O1, the function UnpredictableIfs is indeed called only once, and the result reused.
With -O2 and -O3, the functions are inlined, and the time they take should be identical. There are also no actual branches in either bit of code, but the translation of the two bits of code is somewhat different. I've cut out the lines that update "sum" [in %edx in both cases]:
UnpredictableIfs:
movl %eax, %ecx
andl $1073741826, %ecx
cmpl $1, %ecx
adcl $0, %edx
addl $1, %eax
NoIfs:
xorl %ecx, %ecx
testl $1073741826, %eax
sete %cl
addl $1, %eax
addl %ecx, %edx
As you can see, it's not quite identical, but it does very similar things.
Regarding the range of results on Windows (from 1016 ms to 4797 ms): You should know that clock() in MSVC returns elapsed wall time. The standard says clock() should return an approximation of CPU time spent by the process, and other implementations do a better job of this.
Given that MSVC is giving wall time, if your process got pre-empted while running one iteration of the test, it could give a much larger result, even if the code ran in approximately the same amount of CPU time.
Also note that clock() on many Windows PCs has a pretty lousy resolution, often like 11-19 ms. You've done enough iterations that that's only about 1%, so I don't think it's part of the discrepancy, but it's good to be aware of when trying to write a benchmark. I understand you're going for portability, but if you needed a better measurement on Windows, you can use QueryPerformanceCounter which will almost certainly give you much better resolution, though it's still just elapsed wall time.
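For reference, a minimal sketch of that approach (Windows-only, so it gives up the portability goal; it still measures elapsed wall time, just with much finer resolution):
#include <windows.h>
#include <stdio.h>

/// #return - time difference between start and stop in milliseconds
double qpc_elapsed_ms( LARGE_INTEGER start, LARGE_INTEGER stop )
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency( &freq );   // counter ticks per second
    return 1000.0 * ( stop.QuadPart - start.QuadPart ) / freq.QuadPart;
}

int main( void )
{
    LARGE_INTEGER start, stop;
    QueryPerformanceCounter( &start );
    /* ... code under test, e.g. unpredictableIfs() ... */
    QueryPerformanceCounter( &stop );
    printf( "took %f msec\n", qpc_elapsed_ms( start, stop ) );
    return 0;
}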
UPDATE: After I learned that the long runtime on the one case was happening consistently, I fired up VS2010 and reproduced the results. I was typically getting something around 1000 ms for some runs, 750 ms for others, and 5000+ ms for the inexplicable ones.
Observations:
In all cases the unpredictableIfs() code was inlined.
Removing the noIfs() code had no impact (so the long time wasn't a side effect of that code).
Setting thread affinity to a single processor had no effect.
The 5000 ms times were invariably the later instances. I noted that the later instances had an extra instruction before the beginning of the loop: lea ecx,[ecx]. I don't see why that should make a 5x difference. Other than that the early and later instances were identical code.
Removing the volatile from the start and stop variables yielded fewer long runs, more of the 750 ms runs, and no 1000 ms runs. (The generated loop code looks exactly the same in all cases now; no leas.)
Removing the volatile from the sum variable (but keeping it for the clock timers), the long runs can happen in any position.
If you remove all of the volatile qualifiers, you get consistent, fast (750 ms) runs. (The code looks identical to the earlier ones, but edi was chosen for sum instead of ecx.)
I'm not sure what to conclude from all this, except that volatile has unpredictable performance consequences with MSVC, so you should apply it only when necessary.
UPDATE 2: I'm seeing consistent runtime differences tied to the use of volatile, even though the disassembly is almost identical.
With volatile:
Puzzling measurements:
Unpredictable ifs took 643 msec; answer was 500000000
Unpredictable ifs took 1248 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 4611 msec; answer was 500000000
Unpredictable ifs took 4706 msec; answer was 500000000
Unpredictable ifs took 4516 msec; answer was 500000000
Unpredictable ifs took 4382 msec; answer was 500000000
The disassembly for each instance looks like this:
start = clock();
010D1015 mov esi,dword ptr [__imp__clock (10D20A0h)]
010D101B add esp,4
010D101E call esi
010D1020 mov dword ptr [start],eax
sum = unpredictableIfs();
010D1023 xor ecx,ecx
010D1025 xor eax,eax
010D1027 test eax,40000002h
010D102C jne main+2Fh (10D102Fh)
010D102E inc ecx
010D102F inc eax
010D1030 cmp eax,3B9ACA00h
010D1035 jl main+27h (10D1027h)
010D1037 mov dword ptr [sum],ecx
stop = clock();
010D103A call esi
010D103C mov dword ptr [stop],eax
Without volatile:
Puzzling measurements:
Unpredictable ifs took 644 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
start = clock();
00321014 mov esi,dword ptr [__imp__clock (3220A0h)]
0032101A add esp,4
0032101D call esi
0032101F mov dword ptr [start],eax
sum = unpredictableIfs();
00321022 xor ebx,ebx
00321024 xor eax,eax
00321026 test eax,40000002h
0032102B jne main+2Eh (32102Eh)
0032102D inc ebx
0032102E inc eax
0032102F cmp eax,3B9ACA00h
00321034 jl main+26h (321026h)
stop = clock();
00321036 call esi
// The only optimization I see is here, where eax isn't explicitly stored
// in stop but is instead immediately used to compute the value for the
// printf that follows.
Other than register selection, I don't see a significant difference.
Related
I'm trying to profile the time it takes to compute a sqrt using the following simple C code, where readTSC() is a function to read the CPU's cycle counter.
double sum = 0.0;
int i;
tm = readTSC();
for ( i = 0; i < n; i++ )
sum += sqrt((double) i);
tm = readTSC() - tm;
printf("%lld clocks in total\n",tm);
printf("%15.6e\n",sum);
However, as I printed out the assembly code using
gcc -S timing.c -o timing.s
on an Intel machine, the result (shown below) was surprising.
Why are there two sqrts in the assembly code, one using the sqrtsd instruction and the other using a function call? Is it related to loop unrolling and trying to execute two sqrts in one iteration?
And how should I understand the line
ucomisd %xmm0, %xmm0
Why does it compare %xmm0 to itself?
//----------------start of for loop----------------
call readTSC
movq %rax, -32(%rbp)
movl $0, -4(%rbp)
jmp .L4
.L6:
cvtsi2sd -4(%rbp), %xmm1
// 1. use sqrtsd instruction
sqrtsd %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp .L8
je .L5
.L8:
movapd %xmm1, %xmm0
// 2. use C function call
call sqrt
.L5:
movsd -16(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -16(%rbp)
addl $1, -4(%rbp)
.L4:
movl -4(%rbp), %eax
cmpl -36(%rbp), %eax
jl .L6
//----------------end of for loop----------------
call readTSC
It's using the library sqrt function for error handling. See glibc's documentation: 20.5.4 Error Reporting by Mathematical Functions: math functions set errno for compatibility with systems that don't have IEEE754 exception flags. Related: glibc's math_error(7) man page.
As an optimization, it first tries to perform the square root by the inlined sqrtsd instruction, then checks the result against itself using the ucomisd instruction which sets the flags as follows:
CASE (RESULT) OF
UNORDERED: ZF,PF,CF 111;
GREATER_THAN: ZF,PF,CF 000;
LESS_THAN: ZF,PF,CF 001;
EQUAL: ZF,PF,CF 100;
ESAC;
In particular, comparing a QNaN to itself will return UNORDERED, which is what you will get if you try to take the square root of a negative number. This is covered by the jp branch. The je check is just paranoia, checking for exact equality.
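The same unordered-compare trick shows up at the source level too; a NaN is the only value that compares unequal to itself, which is essentially what the ucomisd/jp pair is checking. A minimal illustration of the idiom (not taken from the code above):
int probably_nan( double x )
{
    return x != x;   /* true only for NaN: the comparison is unordered */
}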
Also note that gcc has a -fno-math-errno option which will sacrifice this error handling for speed. This option is part of -ffast-math, but can be used on its own without enabling any result-changing optimizations.
sqrtsd on its own correctly produces NaN for negative and NaN inputs, and sets the IEEE754 Invalid flag. The check and branch is only to preserve the errno-setting semantics which most code doesn't rely on.
-fno-math-errno is the default on Darwin (OS X), where the math library never sets errno, so functions can be inlined without this check.
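To see the behaviour that check preserves, here is a minimal sketch (the file name and flags are only for illustration): with glibc's default settings, sqrt of a negative number sets errno to EDOM, while -fno-math-errno lets gcc use sqrtsd directly and drop that guarantee.
#include <errno.h>
#include <math.h>
#include <stdio.h>

int main( int argc, char **argv )
{
    double x = -argc;            /* negative, but not a compile-time constant */
    errno = 0;
    double r = sqrt( x );        /* domain error: result is NaN */
    printf( "result = %f, errno = %d\n", r, errno );
    return 0;
}
/* gcc -O2 demo.c                  -> errno == EDOM (library sqrt reached)    */
/* gcc -O2 -fno-math-errno demo.c  -> sqrtsd may be inlined; errno may stay 0 */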
I want to vectorize the fortran below with SIMD directives
!DIR$ SIMD
DO IELEM = 1 , NELEM
X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
ENDDO
And I used the instruction avx2. The program is compiled by
ifort main_vec.f -simd -g -pg -O2 -vec-report6 -o vec.out -xcore-avx2 -align array32byte
Then I'd like to add a VECTORLENGTH(n) clause after SIMD.
If there is no such clause, or n = 2 or 4, the report gives no information about the unroll factor;
if n = 8 or 16, it says: vectorization support: unroll factor set to 2.
I've read Intel's article about "vectorization support: unroll factor set to xxxx", so I guess the loop is unrolled to something like:
DO IELEM = 1 , NELEM, 2
X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
X(IKLE(IELEM+1)) = X(IKLE(IELEM+1)) + W(IELEM+1)
ENDDO
Then 2 X values go into one vector register, 2 W values go into another, and the addition is done.
But how does the value of VECTORLENGTH work? Or maybe I don't really understand what the vector length means.
And since I use the avx2 instruction set, for the DOUBLE PRECISION type X, what's the maximum length that can be reached?
Here's part of the assembly of the loop with SSE2, VL=8; the compiler told me that the unroll factor is 2. However, it used 4 registers instead of 2.
.loc 1 114 is_stmt 1
movslq main_vec_$IKLE.0.1(,%rdx,4), %rsi #114.9
..LN202:
movslq 4+main_vec_$IKLE.0.1(,%rdx,4), %rdi #114.9
..LN203:
movslq 8+main_vec_$IKLE.0.1(,%rdx,4), %r8 #114.9
..LN204:
movslq 12+main_vec_$IKLE.0.1(,%rdx,4), %r9 #114.9
..LN205:
movsd -8+main_vec_$X.0.1(,%rsi,8), %xmm0 #114.26
..LN206:
movslq 16+main_vec_$IKLE.0.1(,%rdx,4), %r10 #114.9
..LN207:
movhpd -8+main_vec_$X.0.1(,%rdi,8), %xmm0 #114.26
..LN208:
movslq 20+main_vec_$IKLE.0.1(,%rdx,4), %r11 #114.9
..LN209:
movsd -8+main_vec_$X.0.1(,%r8,8), %xmm1 #114.26
..LN210:
movslq 24+main_vec_$IKLE.0.1(,%rdx,4), %r14 #114.9
..LN211:
addpd main_vec_$W.0.1(,%rdx,8), %xmm0 #114.9
..LN212:
movhpd -8+main_vec_$X.0.1(,%r9,8), %xmm1 #114.26
..LN213:
..LN214:
movslq 28+main_vec_$IKLE.0.1(,%rdx,4), %r15 #114.9
..LN215:
movsd -8+main_vec_$X.0.1(,%r10,8), %xmm2 #114.26
..LN216:
addpd 16+main_vec_$W.0.1(,%rdx,8), %xmm1 #114.9
..LN217:
movhpd -8+main_vec_$X.0.1(,%r11,8), %xmm2 #114.26
..LN218:
..LN219:
movsd -8+main_vec_$X.0.1(,%r14,8), %xmm3 #114.26
..LN220:
addpd 32+main_vec_$W.0.1(,%rdx,8), %xmm2 #114.9
..LN221:
movhpd -8+main_vec_$X.0.1(,%r15,8), %xmm3 #114.26
..LN222:
..LN223:
addpd 48+main_vec_$W.0.1(,%rdx,8), %xmm3 #114.9
..LN224:
movsd %xmm0, -8+main_vec_$X.0.1(,%rsi,8) #114.9
..LN225:
.loc 1 113 is_stmt 1
addq $8, %rdx #113.7
..LN226:
.loc 1 114 is_stmt 1
psrldq $8, %xmm0 #114.9
..LN227:
.loc 1 113 is_stmt 1
cmpq $26000, %rdx #113.7
..LN228:
.loc 1 114 is_stmt 1
movsd %xmm0, -8+main_vec_$X.0.1(,%rdi,8) #114.9
..LN229:
movsd %xmm1, -8+main_vec_$X.0.1(,%r8,8) #114.9
..LN230:
psrldq $8, %xmm1 #114.9
..LN231:
movsd %xmm1, -8+main_vec_$X.0.1(,%r9,8) #114.9
..LN232:
movsd %xmm2, -8+main_vec_$X.0.1(,%r10,8) #114.9
..LN233:
psrldq $8, %xmm2 #114.9
..LN234:
movsd %xmm2, -8+main_vec_$X.0.1(,%r11,8) #114.9
..LN235:
movsd %xmm3, -8+main_vec_$X.0.1(,%r14,8) #114.9
..LN236:
psrldq $8, %xmm3 #114.9
..LN237:
movsd %xmm3, -8+main_vec_$X.0.1(,%r15,8) #114.9
..LN238:
1) Vector Length N is the number of elements/iterations you can execute in parallel after "vectorizing" your loop (normally by putting N elements of array X into a single vector register and processing them all together with a vector instruction). For simplification, think of Vector Length as the value given by this formula:
Vector Length (abbreviated VL) = Vector Register Width / Sizeof (data type)
For AVX2 , Vector Register Width = 256 bit. Sizeof (double precision) = 8 bytes = 64 bits. Thus:
Vector Length (double FP, avx2) = 256 / 64 = 4
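The same formula can be sanity-checked for a few element types with a trivial program (C++ here, since the register widths are language-independent; the numbers are just the formula written out):
#include <stdio.h>

int main( void )
{
    const int avx2_bits = 256;   /* AVX2 vector register width in bits */
    printf( "VL(double precision) = %d\n", avx2_bits / ( 8 * (int)sizeof( double ) ) ); /* 4 */
    printf( "VL(single precision) = %d\n", avx2_bits / ( 8 * (int)sizeof( float ) ) );  /* 8 */
    return 0;
}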
!DIR$ SIMD VECTORLENGTH(N) basically forces the compiler to use the specified vector length (and to put N elements of array X into a single vector register). That's it.
2) Unrolling and vectorization relationship. For simplification, think of unrolling and vectorization as normally unrelated (somewhat "orthogonal") optimization techniques.
If your loop is unrolled by a factor of M (M could be 2, 4, ...), then it doesn't necessarily mean that vector registers were used at all, and it does not mean that your loop was parallelized in any sense. What it means instead is that M instances of the original loop iterations have been grouped together into a single iteration; and within a given new "unwound"/"unrolled" iteration the old ex-iterations are executed sequentially, one by one (so your guessed example is absolutely correct).
The purpose of unrolling is normally to make the loop more "micro-architecture/memory-friendly". In more detail: by making loop iterations more "fat" you normally improve the balance between the pressure on your CPU resources and the pressure on your memory/cache resources, especially since after unrolling you can normally reuse some data in registers more effectively.
3) Unrolling + vectorization. It's not uncommon that compilers simultaneously vectorize (with VL = N) and unroll (by M) certain loops. As a result, the number of iterations in the optimized loop is smaller than the number of iterations in the original loop by approximately a factor of N x M; however, the number of elements processed in parallel (simultaneously at a given moment in time) will only be N.
Thus, in your example, if loop is vectorized with VL=4, and unrolled by 2, then the pseudo-code for it might look like:
DO IELEM = 1 , NELEM, 8
[X(IKLE(IELEM)),X(IKLE(IELEM+2)), X(IKLE(IELEM+4)), X(IKLE(IELEM+6))] = ...
[X(IKLE(IELEM+1)),X(IKLE(IELEM+3)), X(IKLE(IELEM+5)), X(IKLE(IELEM+7))] = ...
ENDDO
where square brackets "correspond" to vector register content.
4) Vectorization against Unrolling :
for loops with a relatively small number of iterations (especially in C++) it may happen that unrolling is not desirable, since it partially blocks efficient vectorization (not enough iterations to execute in parallel) and (as you can see from my artificial example) may somehow impact the way the data has to be loaded from memory. Different compilers have different heuristics for balancing trip counts, VL and unrolling against each other; that's probably why unrolling was disabled in your case when VL was smaller than 8.
The runtime and compile-time trade-offs between trip counts, unrolling and vector length, as well as appropriate automatic suggestions (especially when using a recent Intel C++ or Fortran compiler), can be explored using the "Intel (Vectorization) Advisor".
5) P.S. There is a third dimension (I don't really like to talk about it).
When the vectorlength requested by the user is bigger than the possible Vector Length on the given hardware (say, specifying vectorlength(16) for double FP on an avx2 platform), or when you mix different types, then the compiler can (or cannot) start using the notion of a "virtual vector register" and start doing double-/quad-pumping. M-pumping is a kind of unrolling, but only for a single instruction (i.e. pumping leads to repeating the single instruction, while unrolling leads to repeating the whole loop body). You may try to read about m-pumping in recent OpenMP books like the given one. So in some cases you may end up with a superposition of a) vectorization, b) unrolling and c) double-pumping, but it's not a common case and I'd avoid enforcing vectorlength > 2*ISA_VectorLength.
While profiling my application I realized that a lot of time is spent on string comparisons. So I wrote a simple benchmark and I was surprised that '==' is much slower than string::compare and strcmp! Here is the code; can anyone explain why that is, or what's wrong with my code? Because according to the standard, '==' is just an operator overload that simply returns !lhs.compare(rhs).
#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr = 10000000000;//10 Billion
int len = 100;
int main() {
srand(time(0));
string s1(len,random()%128);
string s2(len,random()%128);
uint64_t a = 0;
Timer t;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(s1 == s2)
a = i;
}
t.end();
cout<<"== took:"<<t.elapsedMillis()<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(s1.compare(s2)==0)
a = i;
}
t.end();
cout<<".compare took:"<<t.elapsedMillis()<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(strcmp(s1.c_str(),s2.c_str()))
a = i;
}
t.end();
cout<<"strcmp took:"<<t.elapsedMillis()<<endl;
return a;
}
And here is the result:
== took:5986.74
.compare took:0.000349
strcmp took:0.000778
And my compile flags:
CXXFLAGS = -O3 -Wall -fmessage-length=0 -std=c++1y
I use gcc 4.9 on a x86_64 linux machine.
Obviously using -O3 does some optimizations which, I guess, optimize the last two loops away entirely; however, using -O2 the results are still weird:
for 1 billion iterations:
== took:19591
.compare took:8318.01
strcmp took:6480.35
P.S. Timer is just a wrapper class to measure spent time; I am absolutely sure about it :D
Code for Timer class:
#include <chrono>
#ifndef SRC_TIMER_H_
#define SRC_TIMER_H_
class Timer {
std::chrono::steady_clock::time_point start;
std::chrono::steady_clock::time_point stop;
public:
Timer(){
start = std::chrono::steady_clock::now();
stop = std::chrono::steady_clock::now();
}
virtual ~Timer() {}
inline void begin() {
start = std::chrono::steady_clock::now();
}
inline void end() {
stop = std::chrono::steady_clock::now();
}
inline double elapsedMillis() {
auto diff = stop - start;
return std::chrono::duration<double, std::milli> (diff).count();
}
inline double elapsedMicro() {
auto diff = stop - start;
return std::chrono::duration<double, std::micro> (diff).count();
}
inline double elapsedNano() {
auto diff = stop - start;
return std::chrono::duration<double, std::nano> (diff).count();
}
inline double elapsedSec() {
auto diff = stop - start;
return std::chrono::duration<double> (diff).count();
}
};
#endif /* SRC_TIMER_H_ */
UPDATE: output of improved benchmark at http://ideone.com/rGc36a
== took:21
.compare took:21
strcmp took:14
== took:21
.compare took:25
strcmp took:14
The thing that proved crucial to get it working meaningfully was "outwitting" the compiler's ability to predict the strings being compared at compile time:
// more strings that might be used...
string s[] = { {len,argc+'A'}, {len,argc+'A'}, {len, argc+'B'}, {len, argc+'B'} };
if(s[i&3].compare(s[(i+1)&3])==0) // trickier to optimise
a += i; // cumulative observable side effects
Note that in general, strcmp is not functionally equivalent to == or .compare when the text may embed NULs, as the former will get to "exit early". (That's not the reason it's "faster" above, but do read below for comments re possible variations with string length/content etc..)
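A small self-contained sketch of that difference (the string contents are made up for illustration):
#include <cassert>
#include <cstring>
#include <string>

int main()
{
    std::string a("ab\0cd", 5);   // five chars, including an embedded NUL
    std::string b("ab\0xy", 5);
    assert(a != b);                                   // == sees all five chars
    assert(a.compare(b) != 0);                        // so does compare()
    assert(std::strcmp(a.c_str(), b.c_str()) == 0);   // strcmp stops at the NUL
}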
Discussion / Earlier answer
Just have a look at your implementation - e.g.
echo '#include <string>' > stringE.cc
g++ -E stringE.cc | less
Search for the basic_string template, then for the operator== working on two string instances - mine is:
template<class _Elem,
class _Traits,
class _Alloc> inline
bool __cdecl operator==(
const basic_string<_Elem, _Traits, _Alloc>& _Left,
const basic_string<_Elem, _Traits, _Alloc>& _Right)
{
return (_Left.compare(_Right) == 0);
}
Notice that operator== is inline and simply calls compare. There's no way it's consistently significantly slower with normal optimisation levels enabled, though the optimiser might occasionally happen to optimise one loop better than another due to subtle side effects of surrounding code.
Your ostensible problem will have been caused by e.g. your code being optimised beyond the point of doing the intended work, for loops arbitrarily unrolled to different degrees, or other quirks or bugs in the optimisation or your timing. That's not unusual when you have unvarying inputs and loops that don't have any cumulative side-effects (i.e. the compiler can work out that intermediate values of a are not used, so only the last a = i need take effect).
So, learn to write better benchmarks. In this case, that's a bit tricky: having lots of distinct strings in memory ready to invoke the comparisons on, and selecting them in a way that the optimiser can't predict at compile time yet that's still fast enough not to overwhelm and obscure the impact of the string comparison code, is not an easy task. Further, beyond a point, comparing things spread across more memory makes cache effects more relevant to the benchmark, which further obscures the real comparison performance.
Still, if I were you I'd read some strings from a file - pushing each to a vector, then loop over the vector doing each of the three comparison operations between adjacent elements. Then the compiler can't possibly predict any pattern in the outcomes. You might find compare/== faster/slower than strcmp for strings often differing in the first character or three, but the other way around for long strings that are equal or only differing near the end, so make sure you try different kinds of input before you conclude you understand the performance profile.
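A rough sketch of that benchmark shape (the input file name is made up, and the timing is left out; in the real thing you would split the three comparisons into separate passes and wrap each with the Timer from the question):
#include <cstddef>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> words;
    std::ifstream in("words.txt");                 // hypothetical input file
    for (std::string line; std::getline(in, line); )
        words.push_back(line);

    std::size_t eq = 0, cmp = 0, str = 0;
    for (std::size_t i = 0; i + 1 < words.size(); ++i)
    {
        eq  += (words[i] == words[i + 1]);
        cmp += (words[i].compare(words[i + 1]) == 0);
        str += (std::strcmp(words[i].c_str(), words[i + 1].c_str()) == 0);
    }
    // Print the counts so none of the comparisons can be optimised away.
    std::cout << eq << ' ' << cmp << ' ' << str << '\n';
}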
Either your timings are screwy, or your compiler has optimised some of your code out of existence.
Think about it: ten billion operations in 0.000349 milliseconds (I'll use 0.000500 milliseconds, or half a microsecond, to make my calculations easier) means that you're performing twenty quadrillion operations per second.
Even if one operation could be done in a single clock cycle, that would be 20,000,000 GHz, a bit beyond the current crop of CPUs, even with their massively optimised pipelines and multiple cores.
And, given that the -O2 optimised figures are more on par with each other (== taking about double the time of compare), the "code optimised out of existence" possibility is looking far more likely.
The doubling of time could easily be explained as ten billion extra function calls, due to operator== needing to call compare to do its work.
As further support, examine the following table, showing figures in milliseconds (third column is simple divide-by-ten scale of second column so that both first and third columns are for a billion iterations):
-O2/1billion -O3/10billion -O3/1billion Improvement
(a) (b) (c = b / 10) (a / c)
============ ============= ============ ===========
oper== 19151 5987 599 32
compare 8319 0.0005 0.00005 166,380,000
It beggars belief that -O3 could speed up the == code by a factor of about 32 but manage to speed up the compare code by a factor of a few hundred million.
I strongly suggest you have a look at the assembler code generated by your compiler (such as with the gcc -S option) to verify that it's actually doing that work it's claiming to do.
The problem is that the compiler is making a lot of serious optimizations to your code.
Here's the modified code:
#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr = 500000000;//10 Billion
int len = 100;
int main() {
srand(time(0));
string s1(len,random()%128);
string s2(len,random()%128);
uint64_t a = 0;
Timer t;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(s1 == s2)
a += i;
}
t.end();
cout<<"== took:"<<t.elapsedMillis()<<",a="<<a<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(s1.compare(s2)==0)
a+=i;
}
t.end();
cout<<".compare took:"<<t.elapsedMillis()<<",a="<<a<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(strcmp(s1.c_str(),s2.c_str()) == 0)
a+=i;
}
t.end();
cout<<"strcmp took:"<<t.elapsedMillis()<<",a="<<a<< endl;
return a;
}
where I've added asm volatile("" : "+g"(s2)); to force the compiler to run the comparison. I've also added <<",a="<<a to the output to force the compiler to compute a.
The output is now:
== took:10221.5,a=0
.compare took:10739,a=0
strcmp took:9700,a=0
Can you explain why strcmp is faster than .compare, which in turn is slower than ==? The speed differences are marginal, but significant.
It actually makes sense! :p
The speed analysis below is wrong - thanks to Tony D for pointing out my error. The criticisms and advice for better benchmarks still apply though.
All the previous answers deal with the compiler optimisation issues in your benchmark, but don't answer why strcmp is still slightly faster.
strcmp is likely faster (in the corrected benchmarks) due to the strings sometimes containing zeros. Since strcmp uses C-strings it can exit when it comes across the string termination char '\0'. std::string::compare() treats '\0' as just another char and continues until the end of the string array.
Since you have non-deterministically seeded the RNG, and only generated two strings, your results will change with every run of the code. (I'd advise against this in benchmarks.) Given the numbers, 28 times out of 128, there ought to be no advantage. 10 times out of 128 you will get more than a 10-fold speed up. And so on.
As well as defeating the compiler's optimiser, I would suggest that, next time, you generate a new string for each comparison iteration, allowing you to average away such effects.
Compiled the code with gcc -O3 -S --std=c++1y. The result is here. gcc version is:
gcc (Ubuntu 4.9.1-16ubuntu6) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Looking at it, we can see that the first loop (operator ==) is like this (comments added by me):
movq itr(%rip), %rbp
movq %rax, %r12
movq %rax, 56(%rsp)
testq %rbp, %rbp
je .L25
movq 16(%rsp), %rdi
movq 32(%rsp), %rsi
xorl %ebx, %ebx
movq -24(%rsi), %rdx ; length of string1
cmpq -24(%rdi), %rdx ; compare lengths
je .L53 ; compare content only when length is the same
.L10:
; end of loop, print out follows
;....
.L53:
.cfi_restore_state
call memcmp ; compare content
xorl %edx, %edx ; zero loop count
.p2align 4,,10
.p2align 3
.L13:
testl %eax, %eax ; check result
cmove %rdx, %rbx ; a = i
addq $1, %rdx ; i++
cmpq %rbp, %rdx ; i < itr?
jne .L13
jmp .L10
; ....
.L25:
xorl %ebx, %ebx
jmp .L10
We can see that operator== is inlined; only a call to memcmp is there. And for operator==, if the lengths are different, the content is not compared.
Most importantly, the comparison is done only once. The loop body contains only i++;, a=i;, and i<itr;.
For the second loop (compare()):
movq itr(%rip), %r12
movq %rax, %r13
movq %rax, 56(%rsp)
testq %r12, %r12
je .L14
movq 16(%rsp), %rdi
movq 32(%rsp), %rsi
movq -24(%rdi), %rbp
movq -24(%rsi), %r14 ; read and compare length
movq %rbp, %rdx
cmpq %rbp, %r14
cmovbe %r14, %rdx ; save the shorter length of the two string to %rdx
subq %r14, %rbp ; length difference in %rbp
call memcmp ; content is always compared
movl $2147483648, %edx ; 0x80000000 sign extended
addq %rbp, %rdx ; revert the sign bit of %rbp (length difference) and save to %rdx
testl %eax, %eax ; memcmp returned 0?
jne .L14 ; no, string different
testl %ebp, %ebp ; memcmp returned 0. Are lengths the same (%ebp == 0)?
jne .L14 ; no, string different
movl $4294967295, %eax ; string compare equal
subq $1, %r12 ; itr - 1
cmpq %rax, %rdx
cmovbe %r12, %rbx ; a = itr - 1
.L14:
; output follows
There is no loop at all here.
In compare(), since it has to return positive, negative, or zero based on the comparison, the string content is always compared; memcmp is called once.
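Roughly, compare() behaves like the following sketch (not the actual library source): the common prefix is compared first, and the length difference only decides the result when the prefixes match, which is why memcmp is reached even when the lengths differ.
#include <algorithm>
#include <cstring>
#include <string>

int compare_sketch(const std::string& lhs, const std::string& rhs)
{
    std::size_t n = std::min(lhs.size(), rhs.size());
    int r = std::memcmp(lhs.data(), rhs.data(), n);    // content compared first
    if (r != 0)
        return r;
    if (lhs.size() < rhs.size()) return -1;            // equal prefix: shorter orders first
    if (lhs.size() > rhs.size()) return 1;
    return 0;
}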
For the third loop (strcmp()), the assembly is the most simple:
movq itr(%rip), %rbp ; itr to %rbp
movq %rax, %r12
movq %rax, 56(%rsp)
testq %rbp, %rbp
je .L16
movq 32(%rsp), %rsi
movq 16(%rsp), %rdi
subq $1, %rbp ; itr - 1 to %rbp
call strcmp
testl %eax, %eax ; test compare result
cmovne %rbp, %rbx ; if not equal, save itr - 1 to %rbx (a)
.L16:
There is also no loop at all. strcmp is called, and if the strings are not equal (as in your code), itr-1 is saved to a directly.
So your benchmark cannot test the running time of operator==, compare() or strcmp(). They are all called only once, which cannot show a running-time difference.
As to why operator== takes the most time: it is because for operator== the compiler for some reason did not eliminate the loop. The loop takes time (but the loop does not contain the string comparison at all).
And from the assembly shown, we may assume that operator== may be fastest, because it won't do the string comparison at all if the lengths of the two strings are different. (Of course, under gcc 4.9.1 -O3.)
I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what happens when setting the optimization flags on.
Jerry Coffin's answer explained what happens when setting the optimization flags for this example. What remains unanswered is why superCalculationA runs slower than superCalculationB, when B performs one extra memory reference and one addition for each iteration. Nemo's post shows the assembler output. I confirmed this compiling with the -S flag on my PC, 2.9GHz Sandy Bridge (i5-2310), running Ubuntu 12.04 64-bit, as suggested by Matteo Italia.
I was experimenting with for-loops performance when I stumbled upon the following case.
I have the following code that does the same computation in two different ways.
#include <cstdint>
#include <chrono>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init + todo; i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
double elapsed;
std::printf("=====================================================\n");
start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationA(111, 1000000111);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationB(111, 1000000000);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}
Compiling this code with
g++ main.cpp -o output --std=c++11
leads to the following result:
=====================================================
Elapsed time: 2.859 s | 2859.441 ms | 2859440.968 us
Elapsed time: 2.204 s | 2204.059 ms | 2204059.262 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
My first question is: why is the second loop running 23% faster than the first?
On the other hand, if I compile the code with
g++ main.cpp -o output --std=c++11 -O1
The results improve a lot,
=====================================================
Elapsed time: 0.318 s | 317.773 ms | 317773.142 us
Elapsed time: 0.314 s | 314.429 ms | 314429.393 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
and the difference in time almost disappears.
But I could not believe my eyes when I set the -O2 flag,
g++ main.cpp -o output --std=c++11 -O2
and got this:
=====================================================
Elapsed time: 0.000 s | 0.000 ms | 0.328 us
Elapsed time: 0.000 s | 0.000 ms | 0.208 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
So, my second question is: What is the compiler doing when I set -O1 and -O2 flags that leads to this gigantic performance improvement?
I checked Optimize Options - Using the GNU Compiler Collection (GCC), but that did not clarify things.
By the way, I am compiling this code with g++ (GCC) 4.9.1.
EDIT to confirm Basile Starynkevitch's assumption
I edited the code, now main looks like this:
int main(int argc, char **argv)
{
int start = atoi(argv[1]);
int end = atoi(argv[2]);
int delta = end - start + 1;
std::chrono::time_point<std::chrono::high_resolution_clock> t_start, t_end;
double elapsed;
std::printf("=====================================================\n");
t_start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationB(start, delta);
t_end = std::chrono::high_resolution_clock::now();
elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
t_start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationA(start, end);
t_end = std::chrono::high_resolution_clock::now();
elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
std::printf("Results were %s\n", (ret1 == ret2) ? "the same!" : "different!");
std::printf("=====================================================\n");
return 0;
}
These modifications really increased computation time, both for -O1 and -O2. Both are giving me around 620 ms now, which proves that -O2 was really doing some computation at compile time.
I still do not understand what these flags are doing to improve performance, and -Ofast does even better, at about 320 ms.
Also notice that I have changed the order in which functions A and B are called, to test Jerry Coffin's assumption. Compiling this code with no optimizer flags still gives me around 2.2 secs in B and 2.8 secs in A. So I figure that it is not a cache thing. Just reinforcing that I am not talking about optimization in the first case (the one with no flags), I just want to know what makes the second loop run faster than the first.
My immediate guess would be that the second is faster, not because of the changes you made to the loop, but because it's second, so the cache is already primed when it runs.
To test the theory, I re-arranged your code to reverse the order in which the two calculations were called:
#include <cstdint>
#include <chrono>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init + todo; i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
double elapsed;
std::printf("=====================================================\n");
start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationB(111, 1000000000);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationA(111, 1000000111);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}
The result I got was:
=====================================================
Elapsed time: 0.286 s | 286.000 ms | 286000.000 us
Elapsed time: 0.271 s | 271.000 ms | 271000.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
So, when version A runs first, it's slower. When version B runs first, it's slower.
To confirm, I added an extra call to superCalculationB before doing the timing on either version A or B. After that, I tried running the program three times. For those three runs, I'd judge the results a tie (version A was faster once and version B was faster twice, but neither won dependably nor by a wide enough margin to be meaningful).
That doesn't prove that it's actually a cache situation as such, but does give a pretty strong indication that it's a matter of the order in which the functions are called, not the difference in the code itself.
As far as what the compiler does to make the code faster: the main thing it does is unroll a few iterations of the loop. We can get pretty much the same effect if we unroll a few iterations by hand:
uint64_t superCalculationC(int init, int end)
{
int f_end = end - ((end - init) & 7);
int i;
uint64_t total = 0;
for (i = init; i < f_end; i += 8) {
total += i;
total += i + 1;
total += i + 2;
total += i + 3;
total += i + 4;
total += i + 5;
total += i + 6;
total += i + 7;
}
for (; i < end; i++)
total += i;
return total;
}
This has a property that some might find rather odd: it's actually faster when compiled with -O2 than with -O3. When compiled with -O2, it's also about five times faster than either of the other two are when compiled with -O3.
The primary reason for the ~5x speed gain compared to the compiler's loop unrolling is that we've unrolled the loop somewhat differently (and more intelligently, IMO) than the compiler does. We compute f_end to tell us how many times the unrolled loop should execute. We execute those iterations, then we execute a separate loop to "clean up" any odd iterations at the end.
The compiler instead generates code that's roughly equivalent to something like this:
for (i = init; i < end; i += 8) {
total += i;
if (i + 1 >= end) break;
total += i + 1;
if (i + 2 >= end) break;
total += i + 2;
// ...
}
Although this is quite a bit faster than when the loop hasn't been unrolled at all, it's quite a bit faster still to eliminate those extra checks from the main loop, and execute a separate loop for any odd iterations.
Given such a trivial loop body being executed such a large number of times, you can also improve speed (when compiled with -O2) still further by unrolling more iterations of the loop. With 16 iterations unrolled, it was about twice as fast as the code above with 8 iterations unrolled:
uint64_t superCalculationC(int init, int end)
{
int first_end = end - ((end - init) & 0xf);
int i;
uint64_t total = 0;
for (i = init; i < first_end; i += 16) {
total += i + 0;
total += i + 1;
total += i + 2;
// code for `i+3` through `i+13` goes here
total += i + 14;
total += i + 15;
}
for (; i < end; i++)
total += i;
return total;
}
I haven't tried to explore the limit of gains from unrolling this particular loop, but unrolling 32 iterations nearly doubles the speed again. Depending on the processor you're using, you might get some small gains by unrolling 64 iterations, but I'd guess we're starting to approach the limits--at some point, performance gains will probably level off, then (if you unroll still more iterations) probably drop off, quite possibly dramatically.
Summary: with -O3 the compiler unrolls a number of iterations of the loop. This is extremely effective in this case, primarily because we have many executions of nearly the most trivial possible loop body. Unrolling the loop by hand is even more effective than letting the compiler do it--we can unroll more intelligently, and we can simply unroll more iterations than the compiler does. The extra intelligence can give us an improvement of around 5:1, and the extra iterations another 4:1 or so [1] (at the expense of somewhat longer, slightly less readable code).
Final caveat: as always with optimization, your mileage may vary. Differences in compilers and/or processors mean you're likely to get at least somewhat different results than I did. I'd expect my hand-unrolled loop to be substantially faster than the other two in most cases, but exactly how much faster is likely to vary.
[1] But note that this is comparing the hand-unrolled loop with -O2 to the original loop with -O3. When compiled with -O3, the hand-unrolled loop runs much more slowly.
Checking the assembly output is really the only way to illuminate such things.
Compiler optimisations will do a great deal of things, including things that are not strictly "standard compliant" (although that is not the case with -O1 and -O2, to my knowledge) - for instance, check the -Ofast switch.
I have found this helpful: http://gcc.godbolt.org/, and with your demo code here
-O2
Explaining the -O2 result is easy: looking at the code from godbolt with the optimisation level changed to -O2,
main:
pushq %rbx
movl $.LC2, %edi
call puts
call std::chrono::_V2::system_clock::now()
movq %rax, %rbx
call std::chrono::_V2::system_clock::now()
pxor %xmm0, %xmm0
subq %rbx, %rax
movsd .LC4(%rip), %xmm2
movl $.LC6, %edi
movsd .LC5(%rip), %xmm1
cvtsi2sdq %rax, %xmm0
movl $3, %eax
mulsd .LC3(%rip), %xmm0
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm1
call printf
call std::chrono::_V2::system_clock::now()
movq %rax, %rbx
call std::chrono::_V2::system_clock::now()
pxor %xmm0, %xmm0
subq %rbx, %rax
movsd .LC4(%rip), %xmm2
movl $.LC6, %edi
movsd .LC5(%rip), %xmm1
cvtsi2sdq %rax, %xmm0
movl $3, %eax
mulsd .LC3(%rip), %xmm0
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm1
call printf
movl $.LC7, %edi
call puts
movl $.LC8, %edi
call puts
movl $.LC2, %edi
call puts
xorl %eax, %eax
popq %rbx
ret
There are no calls to the two functions; furthermore, there is no comparison of the results.
Now why can that be? It's of course the power of optimization; the program is too simple ...
First inlining is applied, after which the compiler can see that all the parameters are in fact literal values (111, 1000000111, 1000000000, 500000110500000000) and therefore constants.
It finds that init + todo is a loop invariant and replaces it with end, defining end before B's loop as end = init + todo = 111 + 1000000000 = 1000000111.
Both loops are now known to contain only compile-time values. Furthermore, they are completely the same:
uint64_t total = 0;
for (int i = 111; i < 1000000111; i++)
total += i;
return total;
The compiler sees that it is a summation: total is the accumulator and the stride is a constant 1, so it performs the ultimate loop unrolling, namely all of it, because it knows that this form has a closed-form sum.
Rewriting with Gauss's formula (s = n*(n+1)/2, generalised by pairing terms from both ends of the range):
111 + 1000000110 = 1000000221
112 + 1000000109 = 1000000221
...
1000000109 + 112 = 1000000221
1000000110 + 111 = 1000000221
terms = 1000000111 - 111 = 1E9
Halve it, since pairing both ends counts every term twice:
1000000221 * 1E9 / 2 = 500000110500000000
which is the result looked for, 500000110500000000.
Now that it has the result as a compile-time constant, it can compare it with the wanted result, note that the comparison is always true, and remove it.
The time noted is the minimum time for system_clock on your PC.
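In other words, the compiler effectively replaces the loop with the closed-form arithmetic-series sum; a sketch of that computation (a hypothetical helper, not the generated code):
#include <cstdint>

// Closed-form sum of i for init <= i < end, i.e. the loop the compiler removed.
std::uint64_t closedFormSum(std::uint64_t init, std::uint64_t end)
{
    std::uint64_t n = end - init;         // number of terms: 1E9 here
    return n * (init + end - 1) / 2;      // (first + last) * count / 2
}
// closedFormSum(111, 1000000111) == 500000110500000000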
-O0
The timing of the -O0 case is more difficult to explain and is most likely an artifact of missing alignment for functions and jumps; both the µop cache and the loop buffer like 32-byte alignment. You can test that by adding some
asm("nop");
in front of A's loop; 2-3 of them might do the trick.
Store-forwarding also likes its values naturally aligned.
EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.
It appears that the performance difference in the -O0 case is due to processor pipelining.
First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:
superCalculationA(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # end
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L7
.L8:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
jl .L8
movq -8(%rbp), %rax
popq %rbp
ret
superCalculationB(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # todo
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L11
.L12:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L11:
movl -20(%rbp), %edx # copy init to register rdx
movl -24(%rbp), %eax # copy todo to register rax
addl %edx, %eax # [rax] += [rdx] (so [rax] = init+todo)
cmpl -12(%rbp), %eax # i < [rax]
jg .L12
movq -8(%rbp), %rax
popq %rbp
ret
In both functions, the stack layout looks like this:
Addr Content
24 end/todo
20 init
16 <empty>
12 i
08 total
04
00 <base pointer>
(Note that total is a 64-bit int and so occupies two 4-byte slots.)
These are the key lines of superCalculationA():
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().
You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison i<end is performed by reading i from a register, while in superCalculationB(), the comparison i<init+todo is performed by reading i from the stack.
To demonstrate that this is just an artifact, let's replace
for (int i = init; i < end; i++)
with
for (int i = init; end > i; i++)
in superCalculateA(). The generated assembly then looks the same, with just the following change to the key lines:
addl $1, -12(%rbp) # i++
.L7:
movl -24(%rbp), %eax # copy end to register rax
cmpl -12(%rbp), %eax # i < [rax]
Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:
=====================================================
Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions in such a way so as to minimize pipeline blocks, and we don't need to worry about whether to write i<end or end>i.
(This is not exactly an answer, but it does include more data, including some that conflicts with Jerry Coffin's.)
The interesting question is why the unoptimized routines perform so differently and counter-intuitively. The -O2 and -O3 cases are relatively simple to explain, and others have done so.
For completeness, here is the assembly (thanks @Rutan Kax) for superCalculationA and superCalculationB produced by GCC 4.9.1:
superCalculationA(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movl %esi, -24(%rbp)
movq $0, -8(%rbp)
movl -20(%rbp), %eax
movl %eax, -12(%rbp)
jmp .L7
.L8:
movl -12(%rbp), %eax
cltq
addq %rax, -8(%rbp)
addl $1, -12(%rbp)
.L7:
movl -12(%rbp), %eax
cmpl -24(%rbp), %eax
jl .L8
movq -8(%rbp), %rax
popq %rbp
ret
superCalculationB(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movl %esi, -24(%rbp)
movq $0, -8(%rbp)
movl -20(%rbp), %eax
movl %eax, -12(%rbp)
jmp .L11
.L12:
movl -12(%rbp), %eax
cltq
addq %rax, -8(%rbp)
addl $1, -12(%rbp)
.L11:
movl -20(%rbp), %edx
movl -24(%rbp), %eax
addl %edx, %eax
cmpl -12(%rbp), %eax
jg .L12
movq -8(%rbp), %rax
popq %rbp
ret
It sure looks to me like B is doing more work.
My test platform is a 2.9GHz Sandy Bridge EP processor (E5-2690) running Red Hat Enterprise 6 Update 3. My compiler is GCC 4.9.1 and produces the assembly above.
To make sure Turbo Boost and related CPU-frequency-diddling technologies are not interfering with the measurement, I ran:
pkill cpuspeed # if you have it running
grep MHz /proc/cpuinfo # to see where you start
modprobe acpi_cpufreq # if you do not have it loaded
cd /sys/devices/system/cpu
for cpuN in cpu[0-9]* ; do
echo userspace > $cpuN/cpufreq/scaling_governor
echo 2000000 > $cpuN/cpufreq/scaling_setspeed
done
grep MHz /proc/cpuinfo # to see if it worked
This pins the CPU frequency to 2.0 GHz and disables Turbo Boost.
Jerry observed these two routines running faster or slower depending on the order in which he executed them. I could not reproduce that result. For me, superCalculationB consistently runs 25-30% faster than superCalculationA, regardless of the Turbo Boost or clock speed settings. That includes running them multiple times in arbitrary order. For example, at 2.0GHz superCalculationA consistently takes a little over 4500ms and superCalculationB consistently takes a little under 3600ms.
I have yet to see any theory that even begins to explain this.
Processors are complicated. Execution time depends on many things, many of which are outside your control. Just a few possibilities:
a. Your computer probably doesn't have a constant clock speed. The clock speed is usually kept rather low to avoid wasting energy / battery life / producing excessive heat; when your program starts running, the OS notices the load and raises it. To verify, change the order of the calls - if whichever loop runs second is consistently faster, this is the likely reason (a minimal warm-up sketch follows this list).
b. The exact execution speed, especially for a tight loop like yours, depends on how the instructions are aligned in memory. Some processors run a loop faster if it fits entirely within one cache line instead of two, or in two cache lines instead of three. Some compilers insert nop instructions to align loops on cache-line boundaries for this reason; most don't. It is quite possible that one of the loops happened to be aligned better by pure luck and therefore runs faster.
c. The exact execution speed may depend on the exact order in which instructions are dispatched. Slightly different code can run at different speeds because of subtle, processor-dependent scheduling effects that are hard for the compiler to account for.
d. There is some evidence that Intel processors can have problems with artificially short loops, which tend to occur only in artificial benchmarks. Your code is quite close to "artificial". There have been cases discussed in other threads where very short loops ran unexpectedly slowly, and adding instructions made them run faster.
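To illustrate the check suggested in (a), here is a minimal sketch of an untimed warm-up pass before the measured run; kernel here is just a placeholder for whichever routine you are timing, not a function from the question:
#include <cstdio>
#include <ctime>

// Placeholder workload standing in for the routine under test.
static long long kernel(int n)
{
    long long total = 0;
    for (int i = 0; i < n; ++i)
        total += i;
    return total;
}

int main()
{
    const int n = 1000000000;
    long long warm = kernel(n);           // untimed warm-up: lets the OS raise the clock first

    std::clock_t start = std::clock();
    long long result = kernel(n);         // measured run
    std::clock_t stop = std::clock();

    std::printf("warm-up %lld, result %lld, took %.0f ms\n",
                warm, result, 1000.0 * (stop - start) / CLOCKS_PER_SEC);
    return 0;
}
If the warm-up removes the slow first run, dynamic frequency scaling was the likely culprit; if not, look at the other possibilities.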
Answer to the first question:
1- The loops run faster after they have been executed once, but I am not certain; I am only commenting based on my experiment results (experiment 1: swap the names, B->A and A->B; experiment 2: run one function containing a for loop before the time checks; experiment 3: start one for loop before the time checks).
2- The first function should be faster, because the second function does two operations (an addition plus the comparison) in its loop condition where the first does only one.
I leave here updated code which explains my answer.
Answer to the second question:
I am not sure, but two possibilities come to mind:
The compiler might rewrite your function in some closed form and get rid of the loops entirely, which would destroy the difference (something like "return end-init" or "return todo" - I don't know, I'm not sure; a closed-form sketch follows the code below).
GCC has -fauto-inc-dec, and it might make a difference here, because these functions are all about increments and decrements.
I hope this helps.
#include <cstdint>
#include <ctime>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init+todo; i++)
total += i;
return total;
}
int add(int a1,int a2){printf("multiple times added\n");return a1+a2;}
uint64_t superCalculationC(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < add(init , todo); i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::clock_t start=clock();
double elapsed;
std::printf("=====================================================\n");
superCalculationA(111, 1000000111);
start = clock();
uint64_t ret1 = superCalculationA(111, 1000000111);
elapsed = ((std::clock()-start)*1.0/CLOCKS_PER_SEC);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = clock();
uint64_t ret2 = superCalculationB(111, 1000000000);
elapsed = ((std::clock()-start)*1.0/CLOCKS_PER_SEC);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}
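To make the loop-elimination idea above concrete, here is a minimal, hedged sketch of a closed-form variant (the name superCalculationClosedForm is mine, not from the question). It uses the arithmetic-series formula and assumes a non-negative init, as in the example inputs:
#include <cstdint>

// Hypothetical closed-form equivalent of superCalculationA(init, end):
// sums i over [init, end) using the arithmetic-series formula, no loop.
std::uint64_t superCalculationClosedForm(int init, int end)
{
    if (end <= init) return 0;
    std::uint64_t count = static_cast<std::uint64_t>(end - init);   // number of terms
    std::uint64_t first = static_cast<std::uint64_t>(init);
    std::uint64_t last  = static_cast<std::uint64_t>(end) - 1;
    return (first + last) * count / 2;   // (first + last) * count is always even
}
With init = 111 and end = 1000000111 this returns 500000110500000000, matching the answer constant in main above.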
I would like to ask if there is a quicker way to do my audio conversion than by iterating through all values one by one and dividing them by 32768.
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
for ( int i = 0; i <=uIntegers.size()-1;i++)
{
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
My approach works fine, but it could be quicker. However I did not find any way to speed it up.
Thank you for the help!
If your array is large enough it may be worthwhile to parallelize this for loop. OpenMP's parallel for statement is what I would use.
The function would then be:
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
#pragma omp parallel for
for (int i = 0; i < uIntegers.size(); i++)
{
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
With gcc you need to pass -fopenmp at compile time for the pragma to take effect; on MSVC it is /openmp. Since spawning threads has a noticeable overhead, this will only be faster if you are processing large arrays, YMMV.
For maximum speed you want to convert more than one value per loop iteration. The easiest way to do that is with SIMD. Here's roughly how you'd do it with SSE2:
#include <emmintrin.h>  // SSE2 intrinsics

void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
    __m128d scale = _mm_set_pd( 1.0 / 32768.0, 1.0 / 32768.0 );
    int i = 0;
    for ( ; i + 4 <= (int)uIntegers.size(); i += 4)
    {
        // Load four ints; the shuffle moves the upper pair into the low lanes
        // so that both halves can be converted with _mm_cvtepi32_pd.
        __m128i x = _mm_loadu_si128((const __m128i*)&uIntegers[i]);
        __m128i y = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
        __m128d dx = _mm_cvtepi32_pd(x);   // elements 0 and 1
        __m128d dy = _mm_cvtepi32_pd(y);   // elements 2 and 3
        dx = _mm_mul_pd(dx, scale);
        dy = _mm_mul_pd(dy, scale);
        _mm_storeu_pd(&uDoubles[i], dx);
        _mm_storeu_pd(&uDoubles[i + 2], dy);
    }
    // Finish off the last 0-3 elements the slow way
    for ( ; i < (int)uIntegers.size(); i++)
    {
        uDoubles[i] = uIntegers[i] / 32768.0;
    }
}
We process four integers per loop iteration. Since an SSE register only holds two doubles, the four ints have to be converted and multiplied in two halves, but the extra unrolling still helps performance unless the arrays are tiny.
Changing the data types to smaller ones (say short and float) should also help performance, because they cut down on memory bandwidth, and you can fit four floats in an SSE register. For audio data you shouldn't need the precision of a double.
Note that I've used unaligned loads and stores. Aligned ones will be slightly quicker if the data is actually aligned (which it won't be by default, and it's hard to make stuff aligned inside a std::vector).
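To illustrate the smaller-types suggestion, here is a hedged sketch of a float variant (the name ConvertToFloat and the float output vector are my assumptions, not part of the original code); it converts four ints to four floats per iteration:
#include <emmintrin.h>  // SSE2
#include <vector>
#include <cstddef>

// Hypothetical float variant: four int -> float conversions per iteration.
void ConvertToFloat(const std::vector<int>& uIntegers, std::vector<float>& uFloats)
{
    const __m128 scale = _mm_set1_ps(1.0f / 32768.0f);
    std::size_t i = 0;
    for ( ; i + 4 <= uIntegers.size(); i += 4)
    {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&uIntegers[i]));
        __m128  f = _mm_cvtepi32_ps(x);               // four ints to four floats at once
        _mm_storeu_ps(&uFloats[i], _mm_mul_ps(f, scale));
    }
    for ( ; i < uIntegers.size(); ++i)                // 0-3 leftovers
        uFloats[i] = uIntegers[i] / 32768.0f;
}
As above, aligned loads and stores would be slightly faster if the buffers happened to be 16-byte aligned.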
Your function is highly parallelizable. On modern Intel CPUs there are three independent ways to parallelize: instruction level parallelism (ILP), thread level parallelism (TLP), and SIMD. I was able to use all three to get big boosts in your function. The results are compiler dependent, though. The boost is much smaller with GCC since it already vectorizes the function. See the table of numbers below.
However, the main limiting factor in your function is that its time complexity is only O(n), so there is a drastic drop in efficiency when the size of the array you're running over crosses each cache-level boundary. If you look, for example, at large dense matrix multiplication (GEMM), it's an O(n^3) operation, so if one does things right (using e.g. loop tiling) the cache hierarchy is not a problem: you can get close to the maximum flops/s even for very large matrices (which suggests that GEMM is one of the things Intel has in mind when designing the CPU). The way to fix this in your case is to find a way to do your function on an L1-cache-sized block right after/before you do a more complex operation (for example one that goes as O(n^2)) and then move on to the next block (a rough sketch of this idea follows below). Of course I don't know what you're doing, so I can't do that for you.
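A minimal, hedged sketch of that blocking idea, where process_block is only a stand-in for whatever heavier work would normally follow the conversion:
#include <vector>
#include <cstddef>
#include <algorithm>

// Stand-in for the heavier operation that would normally consume the data.
static double process_block(const double* block, std::size_t count)
{
    double acc = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        acc += block[i] * block[i];
    return acc;
}

// Hypothetical blocked version: convert one roughly L1-sized chunk, then
// consume it immediately while it is still hot in cache, then move on.
double convert_blocked(const std::vector<int>& in, std::vector<double>& out)
{
    const std::size_t block = 2048;   // ~24 KB of int+double data per chunk (assumption)
    double acc = 0.0;
    for (std::size_t start = 0; start < in.size(); start += block)
    {
        const std::size_t end = std::min(start + block, in.size());
        for (std::size_t i = start; i < end; ++i)
            out[i] = in[i] / 32768.0;
        acc += process_block(&out[start], end - start);   // reuse the block while cached
    }
    return acc;
}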
ILP is partially done for you by the CPU hardware. However, loop-carried dependencies often limit the ILP, so it frequently helps to unroll loops to take full advantage of it. For TLP I use OpenMP, and for SIMD I used AVX (however, the code below works for SSE as well). I used 32-byte-aligned memory and made sure the array was a multiple of 8 so that no clean-up was necessary.
Here are the results from Visual Studio 2012, 64 bit, with AVX and OpenMP (release mode obviously), on a Sandy Bridge EP with 4 cores (8 HW threads) @ 3.6 GHz. The variable n is the number of items. I repeat the function several times as well, so the total time includes that. The function convert_vec4_unroll2_openmp gives the best results except in the L1 region. You can also clearly see that the efficiency drops significantly each time you move to a new cache level, but even for main memory it's still better.
l1 cache, n 2752, repeat 300000
convert time 1.34, error 0.000000
convert_vec4 time 0.16, error 0.000000
convert_vec4_unroll2 time 0.16, error 0.000000
convert_vec4_unroll2_openmp time 0.31, error 0.000000
l2 cache, n 21856, repeat 30000
convert time 1.14, error 0.000000
convert_vec4 time 0.24, error 0.000000
convert_vec4_unroll2 time 0.24, error 0.000000
convert_vec4_unroll2_openmp time 0.12, error 0.000000
l3 cache, n 699072, repeat 1000
convert time 1.23, error 0.000000
convert_vec4 time 0.44, error 0.000000
convert_vec4_unroll2 time 0.45, error 0.000000
convert_vec4_unroll2_openmp time 0.14, error 0.000000
main memory, n 8738144, repeat 100
convert time 1.56, error 0.000000
convert_vec4 time 0.95, error 0.000000
convert_vec4_unroll2 time 0.89, error 0.000000
convert_vec4_unroll2_openmp time 0.51, error 0.000000
Results with g++ foo.cpp -mavx -fopenmp -ffast-math -O3 on an i5-3317 (Ivy Bridge) @ 2.4 GHz, 2 cores (4 HW threads). GCC seems to vectorize this already, and the only benefit comes from OpenMP (which, however, gives a worse result in the L1 region).
l1 cache, n 2752, repeat 300000
convert time 0.26, error 0.000000
convert_vec4 time 0.22, error 0.000000
convert_vec4_unroll2 time 0.21, error 0.000000
convert_vec4_unroll2_openmp time 0.46, error 0.000000
l2 cache, n 21856, repeat 30000
convert time 0.28, error 0.000000
convert_vec4 time 0.27, error 0.000000
convert_vec4_unroll2 time 0.27, error 0.000000
convert_vec4_unroll2_openmp time 0.20, error 0.000000
l3 cache, n 699072, repeat 1000
convert time 0.80, error 0.000000
convert_vec4 time 0.80, error 0.000000
convert_vec4_unroll2 time 0.80, error 0.000000
convert_vec4_unroll2_openmp time 0.83, error 0.000000
main memory, n 8738144, repeat 100
convert time 1.10, error 0.000000
convert_vec4 time 1.10, error 0.000000
convert_vec4_unroll2 time 1.10, error 0.000000
convert_vec4_unroll2_openmp time 1.00, error 0.000000
Here is the code. I use the vectorclass http://www.agner.org/optimize/vectorclass.zip to do SIMD. This will use either AVX to write 4 doubles at once or SSE to write 2 doubles at once.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "vectorclass.h"
void convert(const int *uIntegers, double *uDoubles, const int n) {
for ( int i = 0; i<n; i++) {
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
void convert_vec4(const int *uIntegers, double *uDoubles, const int n) {
Vec4d div = 1.0/32768;
for ( int i = 0; i<n; i+=4) {
Vec4i u4i = Vec4i().load(&uIntegers[i]);
Vec4d u4d = to_double(u4i);
u4d*=div;
u4d.store(&uDoubles[i]);
}
}
void convert_vec4_unroll2(const int *uIntegers, double *uDoubles, const int n) {
Vec4d div = 1.0/32768;
for ( int i = 0; i<n; i+=8) {
Vec4i u4i_v1 = Vec4i().load(&uIntegers[i]);
Vec4d u4d_v1 = to_double(u4i_v1);
u4d_v1*=div;
u4d_v1.store(&uDoubles[i]);
Vec4i u4i_v2 = Vec4i().load(&uIntegers[i+4]);
Vec4d u4d_v2 = to_double(u4i_v2);
u4d_v2*=div;
u4d_v2.store(&uDoubles[i+4]);
}
}
void convert_vec4_openmp(const int *uIntegers, double *uDoubles, const int n) {
#pragma omp parallel for
for ( int i = 0; i<n; i+=4) {
Vec4i u4i = Vec4i().load(&uIntegers[i]);
Vec4d u4d = to_double(u4i);
u4d/=32768.0;
u4d.store(&uDoubles[i]);
}
}
void convert_vec4_unroll2_openmp(const int *uIntegers, double *uDoubles, const int n) {
Vec4d div = 1.0/32768;
#pragma omp parallel for
for ( int i = 0; i<n; i+=8) {
Vec4i u4i_v1 = Vec4i().load(&uIntegers[i]);
Vec4d u4d_v1 = to_double(u4i_v1);
u4d_v1*=div;
u4d_v1.store(&uDoubles[i]);
Vec4i u4i_v2 = Vec4i().load(&uIntegers[i+4]);
Vec4d u4d_v2 = to_double(u4i_v2);
u4d_v2*=div;
u4d_v2.store(&uDoubles[i+4]);
}
}
double compare(double *a, double *b, const int n) {
double diff = 0.0;
for(int i=0; i<n; i++) {
double tmp = a[i] - b[i];
//printf("%d %f %f \n", i, a[i], b[i]);
if(tmp<0) tmp*=-1;
diff += tmp;
}
return diff;
}
void loop(const int n, const int repeat, const int ifunc) {
void (*fp[4])(const int *uIntegers, double *uDoubles, const int n);
int *a = (int*)_mm_malloc(sizeof(int)* n, 32);
double *b1_cmp = (double*)_mm_malloc(sizeof(double)*n, 32);
double *b1 = (double*)_mm_malloc(sizeof(double)*n, 32);
double dtime;
const char *fp_str[] = {
"covert",
"convert_vec4",
"convert_vec4_unroll2",
"convert_vec4_unroll2_openmp",
};
for(int i=0; i<n; i++) {
a[i] = rand(); // arbitrary test data; must stay within int range
}
fp[0] = convert;
fp[1] = convert_vec4;
fp[2] = convert_vec4_unroll2;
fp[3] = convert_vec4_unroll2_openmp;
convert(a, b1_cmp, n);
dtime = omp_get_wtime();
for(int i=0; i<repeat; i++) {
fp[ifunc](a, b1, n);
}
dtime = omp_get_wtime() - dtime;
printf("\t%s time %.2f, error %f\n", fp_str[ifunc], dtime, compare(b1_cmp,b1,n));
_mm_free(a);
_mm_free(b1_cmp);
_mm_free(b1);
}
int main() {
double dtime;
int l1 = (32*1024)/(sizeof(int) + sizeof(double));
int l2 = (256*1024)/(sizeof(int) + sizeof(double));
int l3 = (8*1024*1024)/(sizeof(int) + sizeof(double));
int lx = (100*1024*1024)/(sizeof(int) + sizeof(double));
int n[] = {l1, l2, l3, lx};
int repeat[] = {300000, 30000, 1000, 100};
const char *cache_str[] = {"l1 cache", "l2 cache", "l3 cache", "main memory"};
for(int c=0; c<4; c++ ) {
int lda = ((n[c]+7) & -8); //make sure array is a multiple of 8
printf("%s chache, n %d\n", cache_str[c], lda);
for(int i=0; i<4; i++) {
loop(lda, repeat[c], i);
} printf("\n");
}
}
Lastly, for anyone who has read this far and feels like reminding me that my code looks more like C than C++: please read this first before you decide to comment: http://www.stroustrup.com/sibling_rivalry.pdf
You might also try:
uDoubles[i] = ldexp((double)uIntegers[i], -15);
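A minimal complete version of that idea might look like this (a sketch only, assuming the same vectors as in the question):
#include <cmath>
#include <cstddef>
#include <vector>

// Scales each sample by 2^-15 via ldexp instead of spelling the constant out.
void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
    for (std::size_t i = 0; i < uIntegers.size(); ++i)
        uDoubles[i] = std::ldexp(static_cast<double>(uIntegers[i]), -15);
}
Whether the compiler turns the ldexp call into a plain multiply depends on the compiler and flags, so it is worth checking the generated assembly, as the answer below does.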
Edit: See Adam's answer above for a version using SSE intrinsics. Better than what I had here ...
To make this more useful, let's look at compiler-generated code here. I'm using gcc 4.8.0 and yes, it is worth checking your specific compiler (version) as there are quite significant differences in output for, say, gcc 4.4, 4.8, clang 3.2 or Intel's icc.
Your original, using g++ -O8 -msse4.2 ... translates into the following loop:
.L2:
cvtsi2sd (%rcx,%rax,4), %xmm0
mulsd %xmm1, %xmm0
addl $1, %edx
movsd %xmm0, (%rsi,%rax,8)
movslq %edx, %rax
cmpq %rdi, %rax
jbe .L2
where %xmm1 holds 1.0/32768.0, so the compiler automatically turns the division into a multiplication by the reciprocal.
On the other hand, using g++ -msse4.2 -O8 -funroll-loops ..., the code created for the loop changes significantly:
[ ... ]
leaq -1(%rax), %rdi
movq %rdi, %r8
andl $7, %r8d
je .L3
[ ... insert a duff's device here, up to 6 * 2 conversions ... ]
jmp .L3
.p2align 4,,10
.p2align 3
.L39:
leaq 2(%rsi), %r11
cvtsi2sd (%rdx,%r10,4), %xmm9
mulsd %xmm0, %xmm9
leaq 5(%rsi), %r9
leaq 3(%rsi), %rax
leaq 4(%rsi), %r8
cvtsi2sd (%rdx,%r11,4), %xmm10
mulsd %xmm0, %xmm10
cvtsi2sd (%rdx,%rax,4), %xmm11
cvtsi2sd (%rdx,%r8,4), %xmm12
cvtsi2sd (%rdx,%r9,4), %xmm13
movsd %xmm9, (%rcx,%r10,8)
leaq 6(%rsi), %r10
mulsd %xmm0, %xmm11
mulsd %xmm0, %xmm12
movsd %xmm10, (%rcx,%r11,8)
leaq 7(%rsi), %r11
mulsd %xmm0, %xmm13
cvtsi2sd (%rdx,%r10,4), %xmm14
mulsd %xmm0, %xmm14
cvtsi2sd (%rdx,%r11,4), %xmm15
mulsd %xmm0, %xmm15
movsd %xmm11, (%rcx,%rax,8)
movsd %xmm12, (%rcx,%r8,8)
movsd %xmm13, (%rcx,%r9,8)
leaq 8(%rsi), %r9
movsd %xmm14, (%rcx,%r10,8)
movsd %xmm15, (%rcx,%r11,8)
movq %r9, %rsi
.L3:
cvtsi2sd (%rdx,%r9,4), %xmm8
mulsd %xmm0, %xmm8
leaq 1(%rsi), %r10
cmpq %rdi, %r10
movsd %xmm8, (%rcx,%r9,8)
jbe .L39
[ ... out ... ]
So it blocks the operations up, but still converts one-value-at-a-time.
If you change your original loop to operate on a few elements per iteration:
size_t i;
for (i = 0; i < uIntegers.size() - 3; i += 4)
{
uDoubles[i] = uIntegers[i] / 32768.0;
uDoubles[i+1] = uIntegers[i+1] / 32768.0;
uDoubles[i+2] = uIntegers[i+2] / 32768.0;
uDoubles[i+3] = uIntegers[i+3] / 32768.0;
}
for (; i < uIntegers.size(); i++)
uDoubles[i] = uIntegers[i] / 32768.0;
the compiler, gcc -msse4.2 -O8 ... (i.e. even without requesting unrolling), identifies the potential to use CVTDQ2PD/MULPD and the core of the loop becomes:
.p2align 4,,10
.p2align 3
.L4:
movdqu (%rcx), %xmm0
addq $16, %rcx
cvtdq2pd %xmm0, %xmm1
pshufd $238, %xmm0, %xmm0
mulpd %xmm2, %xmm1
cvtdq2pd %xmm0, %xmm0
mulpd %xmm2, %xmm0
movlpd %xmm1, (%rdx,%rax,8)
movhpd %xmm1, 8(%rdx,%rax,8)
movlpd %xmm0, 16(%rdx,%rax,8)
movhpd %xmm0, 24(%rdx,%rax,8)
addq $4, %rax
cmpq %r8, %rax
jb .L4
cmpq %rdi, %rax
jae .L29
[ ... duff's device style for the "tail" ... ]
.L29:
rep ret
I.e. now the compiler recognizes the opportunity to put two doubles per SSE register, and does the multiply/conversion in parallel. This is pretty close to the code that Adam's SSE intrinsics version would generate.
The code in total (I've shown only about 1/6th of it) is much more complex than the "direct" intrinsics, due to the fact that, as mentioned, the compiler tries to prepend/append unaligned / not-block-multiple "heads" and "tails" to the loop. It largely depends on the average/expected sizes of your vectors whether this will be beneficial or not; for the "generic" case (vectors more than twice the size of the block processed by the "innermost" loop), it'll help.
The result of this exercise is, largely ... that, if you coerce (by compiler options/optimization) or hint (by slightly rearranging the code) your compiler to do the right thing, then for this specific kind of copy/convert loop, it comes up with code that's not going to be much behind hand-written intrinsics.
Final experiment ... make the code:
static double c(int x) { return x / 32768.0; }
void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
std::transform(uIntegers.begin(), uIntegers.end(), uDoubles.begin(), c);
}
and (for the nicest-to-read assembly output, this time using gcc 4.4 with gcc -O8 -msse4.2 ...) the generated assembly core loop (again, there's a pre/post bit) becomes:
.p2align 4,,10
.p2align 3
.L8:
movdqu (%r9,%rax), %xmm0
addq $1, %rcx
cvtdq2pd %xmm0, %xmm1
pshufd $238, %xmm0, %xmm0
mulpd %xmm2, %xmm1
cvtdq2pd %xmm0, %xmm0
mulpd %xmm2, %xmm0
movapd %xmm1, (%rsi,%rax,2)
movapd %xmm0, 16(%rsi,%rax,2)
addq $16, %rax
cmpq %rcx, %rdi
ja .L8
cmpq %rbx, %rbp
leaq (%r11,%rbx,4), %r11
leaq (%rdx,%rbx,8), %rdx
je .L10
[ ... ]
.L10:
[ ... ]
ret
With that, what do we learn? If you want to use C++, really use C++ ;-)
Let me try another way:
If multiplying really is better at the assembly-instruction level, then writing the multiplication explicitly should guarantee that a multiply is what you get.
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
for ( int i = 0; i <=uIntegers.size()-1;i++)
{
uDoubles[i] = uIntegers[i] * 0.000030517578125;
}
}
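As a quick sanity check (a standalone sketch, not part of the answer above): 32768 is 2^15, so its reciprocal 0.000030517578125 is exactly representable as a double, and multiplying by it gives the same result as dividing by 32768.0:
#include <cassert>

int main()
{
    // 1/32768 = 2^-15 is an exact double; multiplying or dividing a finite
    // double by an exact power of two rounds the same exact value, so the
    // two forms produce identical results.
    assert(1.0 / 32768.0 == 0.000030517578125);
    assert(12345.0 / 32768.0 == 12345.0 * 0.000030517578125);
    return 0;
}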