Add+Mul become slower with Intrinsics - where am I wrong? - c++

Having this array:
alignas(16) double c[voiceSize][blockSize];
This is the function I'm trying to optimize:
inline void Process(int voiceIndex, int blockSize) {
double *pC = c[voiceIndex];
double value = start + step * delta;
double deltaValue = rate * delta;
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pC[sampleIndex] = value + deltaValue * sampleIndex;
}
}
And this is my intrinsics (SSE2) attempt:
inline void Process(int voiceIndex, int blockSize) {
double *pC = c[voiceIndex];
double value = start + step * delta;
double deltaValue = rate * delta;
__m128d value_add = _mm_set1_pd(value);
__m128d deltaValue_mul = _mm_set1_pd(deltaValue);
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
__m128d result_mul = _mm_setr_pd(sampleIndex, sampleIndex + 1);
result_mul = _mm_mul_pd(result_mul, deltaValue_mul);
result_mul = _mm_add_pd(result_mul, value_add);
_mm_store_pd(pC + sampleIndex, result_mul);
}
}
Which is slower than the "scalar" (even if auto-optimized) original code, unfortunately :)
Where's the bottleneck in your opinion? Where am I wrong?
I'm using MSVC, Release/x86, /O2 optimization flag (Favor fast code).
EDIT: doing this (suggested by @wim), it seems that performance becomes better than the C version:
inline void Process(int voiceIndex, int blockSize) {
double *pC = c[voiceIndex];
double value = start + step * delta;
double deltaValue = rate * delta;
__m128d value_add = _mm_set1_pd(value);
__m128d deltaValue_mul = _mm_set1_pd(deltaValue);
__m128d sampleIndex_acc = _mm_set_pd(-1.0, -2.0);
__m128d sampleIndex_add = _mm_set1_pd(2.0);
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add);
__m128d result_mul = _mm_mul_pd(sampleIndex_acc, deltaValue_mul);
result_mul = _mm_add_pd(result_mul, value_add);
_mm_store_pd(pC + sampleIndex, result_mul);
}
}
Why? Is _mm_setr_pd expensive?

Why? Is _mm_setr_pd expensive?
Somewhat; it takes at least a shuffle. More importantly in this case, computing each scalar operand is expensive, and as @spectras' answer shows, gcc at least fails to auto-vectorize that into paddd / cvtdq2pd. Instead it re-computes each operand from a scalar integer, doing the int->double conversion separately, then shuffles those together.
This is the function I'm trying to optimize:
You're simply filling an array with a linear function. You're re-multiplying every time inside the loop. That avoids a loop-carried dependency on anything except the integer loop counter, but you run into throughput bottlenecks from doing so much work inside the loop.
i.e. you're computing a[i] = c + i*scale separately for every step. But instead you can strength-reduce that to a[i+n] = a[i] + (n*scale). So you only have one addpd instruction per vector of results.
This will introduce some rounding error that accumulates vs. redoing the computation from scratch, but double is probably overkill for what you're doing anyway.
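As a minimal scalar sketch of that strength reduction (assuming the same c/start/step/rate/delta members as the question; not code from the original answer):
inline void ProcessStrengthReduced(int voiceIndex, int blockSize) {
    double *pC = c[voiceIndex];
    double value = start + step * delta;      // a[0]
    double deltaValue = rate * delta;         // scale
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        pC[sampleIndex] = value;              // store the running value...
        value += deltaValue;                  // ...then one FP add instead of int->double, mul, add
    }
}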
It also comes at the cost of introducing a serial dependency on an FP add instead of integer. But you already have a loop-carried FP add dependency chain in your "optimized" version that uses sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add); inside the loop, using FP += 2.0 instead of re-converting from integer.
So you'll want to unroll with multiple vectors to hide that FP latency, and keep at least 3 or 4 FP additions in flight at once. (Haswell: 3 cycle latency, one per clock throughput. Skylake: 4 cycle latency, 2 per clock throughput.) See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about unrolling with multiple accumulators for a similar problem with loop-carried dependencies (a dot product).
void Process(int voiceIndex, int blockSize) {
double *pC = c[voiceIndex];
double val0 = start + step * delta;
double deltaValue = rate * delta;
__m128d vdelta2 = _mm_set1_pd(2 * deltaValue);
__m128d vdelta4 = _mm_add_pd(vdelta2, vdelta2);
__m128d v0 = _mm_setr_pd(val0, val0 + deltaValue);
__m128d v1 = _mm_add_pd(v0, vdelta2);
__m128d v2 = _mm_add_pd(v0, vdelta4);
__m128d v3 = _mm_add_pd(v1, vdelta4);
__m128d vdelta8 = _mm_mul_pd(vdelta2, _mm_set1_pd(4.0));
double *endp = pC + blockSize - 7; // stop if there's only room for 7 or fewer doubles
// or use -8 and have your cleanup handle lengths of 1..8
// since the inner loop always calculates results for next iteration
for (; pC < endp ; pC += 8) {
_mm_store_pd(pC, v0);
v0 = _mm_add_pd(v0, vdelta8);
_mm_store_pd(pC+2, v1);
v1 = _mm_add_pd(v1, vdelta8);
_mm_store_pd(pC+4, v2);
v2 = _mm_add_pd(v2, vdelta8);
_mm_store_pd(pC+6, v3);
v3 = _mm_add_pd(v3, vdelta8);
}
// if (blockSize % 8 != 0) ... store final vectors
}
The choice of whether to add or multiply when building up vdelta4 / vdelta8 is not very significant; I tried to avoid too long a dependency chain before the first stores can happen. Since v0 through v3 need to be calculated as well, it seemed to make sense to create a vdelta4 instead of just making a chain of v2 = v1+vdelta2. Maybe it would have been better to create vdelta4 with a multiply from 4.0*delta, and double it to get vdelta8. This could be relevant for very small block size, especially if you cache-block your code by only generating small chunks of this array as needed, right before it will be read.
Anyway, this compiles to a very efficient inner loop with gcc and MSVC (on the Godbolt compiler explorer).
;; MSVC -O2
$LL4@Process: ; do {
movups XMMWORD PTR [rax], xmm5
movups XMMWORD PTR [rax+16], xmm0
movups XMMWORD PTR [rax+32], xmm1
movups XMMWORD PTR [rax+48], xmm2
add rax, 64 ; 00000040H
addpd xmm5, xmm3 ; v0 += vdelta8
addpd xmm0, xmm3 ; v1 += vdelta8
addpd xmm1, xmm3 ; v2 += vdelta8
addpd xmm2, xmm3 ; v3 += vdelta8
cmp rax, rcx
jb SHORT $LL4@Process ; }while(pC < endp)
This has 4 separate dependency chains, through xmm0, 1, 2, and 5. So there's enough instruction-level parallelism to keep 4 addpd instructions in flight. This is more than enough for Haswell, but half of what Skylake can sustain.
Still, with a store throughput of 1 vector per clock, more than 1 addpd per clock isn't useful. In theory this can run at about 16 bytes per clock cycle, and saturate store throughput. i.e. 1 vector / 2 doubles per clock.
AVX with wider vectors (4 doubles) could still go at 1 vector per clock on Haswell and later, i.e. 32 bytes per clock. (Assuming the output array is hot in L1d cache or possibly even L2.)
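A hedged sketch of the same pattern with AVX (4 doubles per vector; assumes the same val0/deltaValue as above, needs immintrin.h, and uses unaligned stores because the array is only declared alignas(16)):
__m256d v = _mm256_setr_pd(val0, val0 + deltaValue,
                           val0 + 2 * deltaValue, val0 + 3 * deltaValue);
__m256d vdelta4 = _mm256_set1_pd(4 * deltaValue);
for (int i = 0; i < blockSize; i += 4) {
    _mm256_storeu_pd(pC + i, v);    // unaligned store; use _mm256_store_pd with alignas(32)
    v = _mm256_add_pd(v, vdelta4);  // still just one FP add per stored vector
}
// As above, unroll with several accumulators to hide the FP-add latency.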
Even better: don't store this data in memory at all; re-generate on the fly.
Generate it on the fly when it's needed, if the code consuming it only reads it a few times, and is also manually vectorized.
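For instance, a hedged sketch of folding the ramp into a hypothetical consumer loop (the input/output buffers are assumptions, not from the question):
__m128d v = _mm_setr_pd(value, value + deltaValue);   // current pair of ramp values
__m128d vstep = _mm_set1_pd(2 * deltaValue);
for (int i = 0; i < blockSize; i += 2) {
    __m128d in = _mm_load_pd(input + i);              // hypothetical input block
    _mm_store_pd(output + i, _mm_mul_pd(in, v));      // apply the ramp directly, no c[][] written
    v = _mm_add_pd(v, vstep);
}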

On my system, g++ test.cpp -march=native -O2 -c -o test
This will output for the normal version (loop body extract):
30: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
34: c5 fb 2a c0 vcvtsi2sd %eax,%xmm0,%xmm0
38: c4 e2 f1 99 c2 vfmadd132sd %xmm2,%xmm1,%xmm0
3d: c5 fb 11 04 c2 vmovsd %xmm0,(%rdx,%rax,8)
42: 48 83 c0 01 add $0x1,%rax
46: 48 39 c8 cmp %rcx,%rax
49: 75 e5 jne 30 <_Z11ProcessAutoii+0x30>
And for the intrinsics version:
88: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
8c: 8d 50 01 lea 0x1(%rax),%edx
8f: c5 f1 57 c9 vxorpd %xmm1,%xmm1,%xmm1
93: c5 fb 2a c0 vcvtsi2sd %eax,%xmm0,%xmm0
97: c5 f3 2a ca vcvtsi2sd %edx,%xmm1,%xmm1
9b: c5 f9 14 c1 vunpcklpd %xmm1,%xmm0,%xmm0
9f: c4 e2 e9 98 c3 vfmadd132pd %xmm3,%xmm2,%xmm0
a4: c5 f8 29 04 c1 vmovaps %xmm0,(%rcx,%rax,8)
a9: 48 83 c0 02 add $0x2,%rax
ad: 48 39 f0 cmp %rsi,%rax
b0: 75 d6 jne 88 <_Z11ProcessSSE2ii+0x38>
So in short: the compiler automatically generates AVX code from the C version.
Edit: after playing a bit more with flags to restrict both versions to SSE2 only:
g++ test.cpp -msse2 -O2 -c -o test
The compiler still does something different from what you generate with intrinsics. Compiler version:
30: 66 0f ef c0 pxor %xmm0,%xmm0
34: f2 0f 2a c0 cvtsi2sd %eax,%xmm0
38: f2 0f 59 c2 mulsd %xmm2,%xmm0
3c: f2 0f 58 c1 addsd %xmm1,%xmm0
40: f2 0f 11 04 c2 movsd %xmm0,(%rdx,%rax,8)
45: 48 83 c0 01 add $0x1,%rax
49: 48 39 c8 cmp %rcx,%rax
4c: 75 e2 jne 30 <_Z11ProcessAutoii+0x30>
Intrinsics version:
88: 66 0f ef c0 pxor %xmm0,%xmm0
8c: 8d 50 01 lea 0x1(%rax),%edx
8f: 66 0f ef c9 pxor %xmm1,%xmm1
93: f2 0f 2a c0 cvtsi2sd %eax,%xmm0
97: f2 0f 2a ca cvtsi2sd %edx,%xmm1
9b: 66 0f 14 c1 unpcklpd %xmm1,%xmm0
9f: 66 0f 59 c3 mulpd %xmm3,%xmm0
a3: 66 0f 58 c2 addpd %xmm2,%xmm0
a7: 0f 29 04 c1 movaps %xmm0,(%rcx,%rax,8)
ab: 48 83 c0 02 add $0x2,%rax
af: 48 39 f0 cmp %rsi,%rax
b2: 75 d4 jne 88 <_Z11ProcessSSE2ii+0x38>
The compiler does not unroll the loop here. That might be better or worse depending on many things; you might want to benchmark both versions.

Related

Assuming a is double, is 2.0*a faster than 2*a?

A long time ago, in a book about ancient FORTRAN, I saw the claim that using an integer constant with a floating-point variable is slower, because the constant needs to be converted to floating-point form first:
double a = ..;
double b = a*2; // 2 -> 2.0 first
double c = a*2.0;
Is it still beneficial to write 2.0 rather than 2 in modern C++? If not, the "integer version" should probably be preferred, as 2.0 is longer and makes no difference to a human reader.
I work with complex, long expressions where these ".0"s would make a difference in either performance or readability, if either applies.
First, to cover the other answers: no, 2 vs 2.0 will not cause a performance difference; the conversion is done at compile time to create the correct value. However, to answer the question:
Is it still beneficial to write 2.0 rather than 2 in the modern C++?
Absolutely.
But it's not because of performance; it's because of readability and bugs. Imagine the following operation:
double a = (2 / someOtherNumber) * someFloat;
What is the type of someOtherNumber? If it is an integer type, you are in trouble because of integer division. 2.0 or 2.0f has two distinct advantages (illustrated below):
Tells the reader of the code exactly what you intended.
Avoids mistakes from integer division where you didn't intend it.
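A minimal illustration of that trap (hypothetical values):
int someOtherNumber = 3;
double oops = (2 / someOtherNumber) * 1.5;    // 2/3 is integer division -> 0, so oops == 0.0
double fine = (2.0 / someOtherNumber) * 1.5;  // 2.0/3 is done in double -> ~0.667, fine ~= 1.0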
Original question:
Let's compare the assembly output.
double foo(double a)
{
return a * 2;
}
double bar(double a)
{
return a * 2.0f;
}
double baz(double a)
{
return a * 2.0;
}
results in
0000000000000000 <foo>: //double x int
0: f2 0f 58 c0 addsd %xmm0,%xmm0 // add with itself
4: c3 retq // return (quad)
5: 90 nop // padding
6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) // padding
d: 00 00 00
0000000000000010 <bar>: //double x float
10: f2 0f 58 c0 addsd %xmm0,%xmm0 // add with itself
14: c3 retq // return (quad)
15: 90 nop // padding
16: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) // padding
1d: 00 00 00
0000000000000020 <baz>: //double x double
20: f2 0f 58 c0 addsd %xmm0,%xmm0 // add with itself
24: c3 retq // return (quad)
As you can see, they are all equal and do not perform a multiplication at all.
Even when doing a real multiplication (a*5), they are all equal and compile down to
0: f2 0f 59 05 00 00 00 mulsd 0x0(%rip),%xmm0 # 8 <foo+0x8>
7: 00
8: c3 retq
Addendum:
@Goswin-Von-Brederlow remarks that using a non-constant expression will lead to different assembly. Let's test this like the example above, but with the following signature:
double foo(double a, int b); //int, float, double for foo/bar/baz
which leads to the output:
0000000000000000 <foo>: //double x int
0: 66 0f ef c9 pxor %xmm1,%xmm1 // clear xmm1
4: f2 0f 2a cf cvtsi2sd %edi,%xmm1 // convert edi (second argument) to double
8: f2 0f 59 c1 mulsd %xmm1,%xmm0 // mul xmm1 with xmm0
c: c3 retq // return
d: 0f 1f 00 nopl (%rax) // padding
0000000000000010 <bar>: //double x float
10: f3 0f 5a c9 cvtss2sd %xmm1,%xmm1 // convert float to double
14: f2 0f 59 c1 mulsd %xmm1,%xmm0 // mul
18: c3 retq // return
19: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) // padding
0000000000000020 <baz>: //double x double
20: f2 0f 59 c1 mulsd %xmm1,%xmm0 // mul directly
24: c3 retq // return
Here you can see the (runtime) conversion from the other types to a double, which of course leads to (runtime) overhead.
No.
The following code:
double f1(double a) {
double b = a*2;
return b;
}
double f2(double a) {
double c = a*2.0;
return c;
}
... when compiled on gcc.godbolt.org with Clang, produces the following assembly:
f1(double): # #f1(double)
addsd xmm0, xmm0
ret
f2(double): # #f2(double)
addsd xmm0, xmm0
ret
You can see that both functions are perfectly identical, and the compiler even replaced the multiplication by an addition. I'd expect the same for any C++ compiler from this millennium -- trust them, they're pretty smart.
No, it's not faster. Why would a compiler wait until runtime to convert an integer to a floating point number, if it knew what that number was going to be? I suppose it's possible that you might convince some exceedingly pedantic compiler to do that, if you disabled optimization completely, but all the compilers I'm aware of would do that optimization as a matter of course.
Now, if you were doing a*b with a being a floating-point type and b an integer type, and neither one was a compile-time literal, on some architectures that could cause a significant performance hit (particularly if you'd calculated b very recently). But in the case of literals, the compiler already has your back.
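If that non-literal case ever does show up in a hot loop, hoisting the conversion is the usual fix; a hedged sketch with hypothetical names (compilers will often do this for you when b is loop-invariant):
void scaleBy(const double* a, double* out, int n, int b) {
    double scale = static_cast<double>(b);  // convert the integer once, outside the loop
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * scale;              // pure double math per element, no repeated cvtsi2sd
}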

Compiler choice of not using REP MOVSB instruction for a byte array move

I'm checking the Release build of my project, built with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to compile the following code snippet:
//ncbSzBuffDataUsed of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
pDst[i] = pSrc[i];
}
as such:
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
Edit: first up, there's an intrinsic for rep movsb, which Peter Cordes tells us would be much faster here, and I believe him (I guess I already did). If you want to force the compiler to do things this way, see __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.
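A hedged sketch of that intrinsic, assuming the same pDst/pSrc/ncbSzBuffDataUsed as in the question:
#include <intrin.h>
// ...
__movsb(pDst, pSrc, (size_t)ncbSzBuffDataUsed);  // MSVC emits rep movsb for this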
As to why the compiler didn't do this for you, in the absence of any other ideas the answer might be register pressure. To use rep movsb, the compiler would have to:
set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb
So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function it will, and (on initial entry to the method, at least) rcx holds the this pointer.
Also, with the code that we see the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
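For completeness, if the buffers are known not to overlap (an assumption on my part), the simplest way to get a well-optimized copy is to state that intent with memcpy and let the compiler/CRT choose the strategy:
#include <cstring>
// ...
memcpy(pDst, pSrc, (size_t)ncbSzBuffDataUsed);  // legal only if the regions don't overlap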
This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)
What also makes a difference is how you structure your loops. For instance:
Assuming the following struct definitions:
#define PCALLBACK ULONG64
#pragma pack(push)
#pragma pack(1)
typedef struct {
ULONG64 ui0;
USHORT w0;
USHORT w1;
//Followed by:
// PCALLBACK[] 'array' - variable size array
}DPE;
#pragma pack(pop)
(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
{
pDstClbks[i] = info.callbackFuncs[i];
}
As was mentioned in the answer on this page, the compiler was clearly starved of registers to have produced the following monstrosity (see how it reuses rax for the loop end limit, or the movzx eax,word ptr [r13] instruction that clearly could have been hoisted out of the loop):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30 add rcx,30h
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00 cmp bx,word ptr [r13]
00007FF7029327D8 73 1F jae 07FF7029327F9h
00007FF7029327DA 4C 8B C1 mov r8,rcx
00007FF7029327DD 4C 2B F1 sub r14,rcx
{
pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08 mov rax,qword ptr [r14+r8+8]
00007FF7029327E5 48 FF C3 inc rbx
00007FF7029327E8 49 89 00 mov qword ptr [r8],rax
00007FF7029327EB 4D 8D 40 08 lea r8,[r8+8]
00007FF7029327EF 41 0F B7 45 00 movzx eax,word ptr [r13]
00007FF7029327F4 48 3B D8 cmp rbx,rax
00007FF7029327F7 72 E7 jb 07FF7029327E0h
}
00007FF7029327F9 45 0F B7 C7 movzx r8d,r15w
(2) So if I re-write it into a less familiar C pattern:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs;
pDstClbks < pEndDstClbks;
pScrClbks++, pDstClbks++)
{
*pDstClbks = *pScrClbks;
}
this produces more sensible machine code (on the same compiler, in the same function, in the same project):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30 add rcx,30h
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx eax,word ptr [rsi+88h]
00007FF71D7E27CD 48 8D 14 C1 lea rdx,[rcx+rax*8]
for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA cmp rcx,rdx
00007FF71D7E27D4 76 14 jbe 07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1 sub rsi,rcx
{
*pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08 mov rax,qword ptr [rsi+rcx+8]
00007FF71D7E27DE 48 89 01 mov qword ptr [rcx],rax
00007FF71D7E27E1 48 83 C1 08 add rcx,8
00007FF71D7E27E5 48 3B CA cmp rcx,rdx
00007FF71D7E27E8 77 EF jb 07FF71D7E27D9h
}
00007FF71D7E27EA 45 0F B7 C6 movzx r8d,r14w

Reimplementing std::swap() with static tmp variable for simple types C++

I decided to benchmark an implementation of a swap function for simple types (like int, or a struct, or a class that uses only simple types in its fields) with a static tmp variable in it to prevent memory allocation on each swap call. So I wrote this simple test program:
#include <iostream>
#include <chrono>
#include <utility>
#include <vector>
template<typename T>
void mySwap(T& a, T& b) //Like std::swap - just for tests
{
T tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
template<typename T>
void mySwapStatic(T& a, T& b) //Here with static tmp
{
static T tmp;
tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
class Test1 { //Simple class with some simple types
int foo;
float bar;
char bazz;
};
class Test2 { //Class with std::vector in it
int foo;
float bar;
char bazz;
std::vector<int> bizz;
public:
Test2()
{
bizz = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
}
};
#define Test Test1 //choosing class
const static unsigned int NUM_TESTS = 100000000;
static Test a, b; //made static to prevent the compiler from optimizing them away
template<typename T, typename F>
auto test(unsigned int numTests, T& a, T& b, const F swapFunction ) //test function
{
std::chrono::system_clock::time_point t1, t2;
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
swapFunction(a, b);
}
t2 = std::chrono::system_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}
int main()
{
std::chrono::system_clock::time_point t1, t2;
std::cout << "Test 1. MySwap Result:\t\t" << test(NUM_TESTS, a, b, mySwap<Test>) << " nanoseconds\n"; //caling test function
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwap<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
std::cout << "Test 2. MySwap2 Result:\t\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //This result slightly better then 1. why?!
std::cout << "Test 3. MySwapStatic Result:\t" << test(NUM_TESTS, a, b, mySwapStatic<Test>) << " nanoseconds\n"; //test function with mySwapStatic
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwapStatic<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
std::cout << "Test 4. MySwapStatic2 Result:\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //And again - it's better then 3...
std::cout << "Test 5. std::swap Result:\t" << test(NUM_TESTS, a, b, std::swap<Test>) << " nanoseconds\n"; //calling test function with std::swap for comparsion. Mostly similar to 1...
return 0;
}
Some results with Test defined as Test1 (g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 called as g++ main.cpp -O3 -std=c++11):
Test 1. MySwap Result: 625,105,480 nanoseconds
Test 2. MySwap2 Result: 528,701,547 nanoseconds
Test 3. MySwapStatic Result: 338,484,180 nanoseconds
Test 4. MySwapStatic2 Result: 228,228,156 nanoseconds
Test 5. std::swap Result: 564,863,184 nanoseconds
My main question: is it good to use this implementation for swapping simple types? I know that if you use it for swapping types that contain vectors, for example, then std::swap is better, and you can see that just by changing the Test define to Test2.
Second question: why are the results in tests 1, 2, 3, and 4 so different? What am I doing wrong in the test function implementation?
An answer to your second question first: in tests 2 and 4, the compiler is inlining the functions, which gives better performance (there is even more going on in test 4, but I will cover that later).
Overall, it is probably a bad idea to use a static temp variable.
Why? First, it should be noted that in x86 assembly there is no instruction to copy from memory to memory. This means that when you swap, there are not one but two temporary variables in CPU registers. And these temporary variables MUST be in CPU registers (you can't copy mem to mem), so a static variable adds a third memory location to transfer to and from.
One problem with your static temp is that it will hinder inlining. Imagine the variables you swap are already in CPU registers. In this case the compiler can inline the swapping and never copy anything to memory, which is much faster. Now, if you force a static temp, either the compiler removes it (making it useless) or it is forced to add a memory copy. That's what happens in test 4, in which GCC removed all the reads of the static variable; it just pointlessly writes updated values to it because you told it to do so. The read removal explains the good performance gain, but it could be even faster.
Your test cases are flawed because they don't show this point.
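To illustrate the point with a hedged sketch (not code from the benchmark): once inlined, a plain local temporary can live entirely in registers, whereas a static temporary pins a real memory slot that the stores must keep up to date:
void swapLocals(int& a, int& b) {
    int tmp = a;       // after inlining, tmp is just a register
    a = b;
    b = tmp;           // no extra memory location is touched
}

static int g_tmp;      // observable static storage: its stores cannot be dropped
void swapStatic(int& a, int& b) {
    g_tmp = a;         // the compiler may forward the value and skip the reload,
    a = b;             // but it still has to write g_tmp out to memory
    b = g_tmp;
}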
Now you may ask: then why do my static functions perform better? I have no idea. (Answer at the end.)
I was curious, so I compiled your code with MSVC, and it turns out MSVC is doing it right and GCC is doing it weird. At the O2 optimization level, MSVC detects that two swaps are a no-op and shortcuts them, but even at O1 the non-inlined generated code is faster than all the GCC test cases at O3. (EDIT: actually, MSVC is not doing it right either; see the explanation at the end.)
The assembly generated by MSVC does indeed look better, but when comparing the static and non-static assembly generated by GCC, I don't know why the static version performs better.
Anyway, I think that even if GCC is generating weird code, the inlining issue alone makes it worth using std::swap, because with bigger types the additional memory copy could be costly, and smaller types inline better.
Here is the assembly produced by all the test cases, in case someone has an idea of why the GCC static version performs better than the non-static one, despite being longer and using more memory moves. EDIT: answer at the end.
GCC non-static (perf 570ms):
00402F90 44 8B 01 mov r8d,dword ptr [rcx]
00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4]
00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F9C 4C 8B 0A mov r9,qword ptr [rdx]
00402F9F 4C 89 09 mov qword ptr [rcx],r9
00402FA2 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8]
00402FA7 44 88 49 08 mov byte ptr [rcx+8],r9b
00402FAB 44 89 02 mov dword ptr [rdx],r8d
00402FAE F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0
00402FB3 88 42 08 mov byte ptr [rdx+8],al
GCC static and MSVC static (perf 275ms):
00402F10 48 8B 01 mov rax,qword ptr [rcx]
00402F13 48 89 05 66 11 00 00 mov qword ptr [404080h],rax
00402F1A 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F1E 88 05 64 11 00 00 mov byte ptr [404088h],al
00402F24 48 8B 02 mov rax,qword ptr [rdx]
00402F27 48 89 01 mov qword ptr [rcx],rax
00402F2A 0F B6 42 08 movzx eax,byte ptr [rdx+8]
00402F2E 88 41 08 mov byte ptr [rcx+8],al
00402F31 48 8B 05 48 11 00 00 mov rax,qword ptr [404080h]
00402F38 48 89 02 mov qword ptr [rdx],rax
00402F3B 0F B6 05 46 11 00 00 movzx eax,byte ptr [404088h]
00402F42 88 42 08 mov byte ptr [rdx+8],al
MSVC non-static (perf 215ms):
00000 f2 0f 10 02 movsdx xmm0, QWORD PTR [rdx]
00004 f2 0f 10 09 movsdx xmm1, QWORD PTR [rcx]
00008 44 8b 41 08 mov r8d, DWORD PTR [rcx+8]
0000c f2 0f 11 01 movsdx QWORD PTR [rcx], xmm0
00010 8b 42 08 mov eax, DWORD PTR [rdx+8]
00013 89 41 08 mov DWORD PTR [rcx+8], eax
00016 f2 0f 11 0a movsdx QWORD PTR [rdx], xmm1
0001a 44 89 42 08 mov DWORD PTR [rdx+8], r8d
std::swap versions are all identical to the non-static versions.
After having some fun investigating, I found the likely reason for the bad performance of the GCC non-static version. Modern processors have a feature called store-to-load forwarding. This feature kicks in when a memory load matches a previous memory store, and shortcuts the memory operation to use the value already known. In this case GCC somehow uses an asymmetric load/store for parameters A and B: A is copied using 4+4+1 bytes, and B is copied using 8+1 bytes. This means the first 8 bytes of the class won't be matched by store-to-load forwarding, losing a precious CPU optimization. To check this, I manually replaced the 8+1 copy with a 4+4+1 copy, and the performance went up as expected (code below). In the end, GCC is at fault for not considering this.
GCC patched code, longer but taking advantage of store forwarding (perf 220ms):
00402F90 44 8B 01 mov r8d,dword ptr [rcx]
00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4]
00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F9C 44 8B 0A mov r9d,dword ptr [rdx]
00402F9F 44 89 09 mov dword ptr [rcx],r9d
00402FA2 44 8B 4A 04 mov r9d,dword ptr [rdx+4]
00402FA6 44 89 49 04 mov dword ptr [rcx+4],r9d
00402FAA 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8]
00402FAF 44 88 49 08 mov byte ptr [rcx+8],r9b
00402FB3 44 89 02 mov dword ptr [rdx],r8d
00402FB6 F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0
00402FBB 88 42 08 mov byte ptr [rdx+8],al
Actually, this copy sequence (symmetric 4+4+1) is the right way to do it. In these tests we are only doing copies, in which case the MSVC version is without doubt the best. The problem is that in a real case the class members will be accessed individually, generating 4-byte reads and writes. MSVC's 8-byte batched copy (also generated by GCC for one argument) will then prevent store-to-load forwarding for individual members. A new test I did with member operations alongside the copies shows that the patched 4+4+1 version indeed outperforms all the others, by a factor of nearly 2x. Sadly, no modern compiler generates this code.

Local variable vs. array access

Which of these would be more computationally efficient, and why?
A) Repeated array access:
for(i=0; i<numbers.length; i++) {
result[i] = numbers[i] * numbers[i] * numbers[i];
}
B) Setting a local variable:
for(i=0; i<numbers.length; i++) {
int n = numbers[i];
result[i] = n * n * n;
}
Wouldn't the repeated array access have to be recalculated each time (using pointer arithmetic), making the first option slower because it is effectively doing the following?
for(i=0; i<numbers.length; i++) {
result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
}
Any sufficiently sophisticated compiler will generate the same code for all three solutions. I turned your three versions into a small C program (with a minor adjustment: I changed the access to numbers.length into a macro invocation that gives the length of an array):
#include <stddef.h>
size_t i;
static const int numbers[] = { 0, 1, 2, 4, 5, 6, 7, 8, 9 };
#define ARRAYLEN(x) (sizeof((x)) / sizeof(*(x)))
static int result[ARRAYLEN(numbers)];
void versionA(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
result[i] = numbers[i] * numbers[i] * numbers[i];
}
}
void versionB(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
int n = numbers[i];
result[i] = n * n * n;
}
}
void versionC(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
}
}
I then compiled it using optimizations (and debug symbols, for prettier disassembly) with Visual Studio 2012:
C:\Temp>cl /Zi /O2 /Wall /c so19244189.c
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
so19244189.c
Finally, here's the disassembly:
C:\Temp>dumpbin /disasm so19244189.obj
[..]
_versionA:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionB:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionC:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
Note how the assembly is exactly the same in all cases. So the correct answer to your question
Which of these would be more computationally efficient, and why?
for this compiler is: mu. Your question cannot be answered because it's based on incorrect assumptions. None of the three versions is faster than any other.
The theoretical answer:
A reasonably good optimizing compiler should convert version A to version B, and perform only one load from memory. There should be no performance difference if optimization is enabled.
If optimization is disabled, version A will be slower, because the address must be computed 3 times and there are 3 memory loads (2 of them are cached and very fast, but it's still slower than reusing a register).
In practice, the answer will depend on your compiler, and you should check this by benchmarking.
It depends on the compiler, but all of them should generate the same code.
First, let's look at case B: a smart compiler will generate code to load the value into a register only once, so it doesn't matter whether you use an additional variable or not; the compiler emits the mov instruction and has the value in a register. So B is the same as A.
Now let's compare A and C. We should look at how operator[] works: a[b] is actually *(a + b), so *(numbers + i) is the same as numbers[i], which means cases A and C are the same.
So we have (A==B) && (A==C); all in all, (A==B==C). If you know what I mean :)

What is the difference between bit shifting and arithmetical operations?

int aNumber;
aNumber = aValue / 2;
aNumber = aValue >> 1;
aNumber = aValue * 2;
aNumber = aValue << 1;
aNumber = aValue / 4;
aNumber = aValue >> 2;
aNumber = aValue * 8;
aNumber = aValue << 3;
// etc.
Whats is the "best" way to do operations? When is better to use bit shifting?
The two are functionally equivalent in the examples you gave (except for the final one, which ought to read aValue * 8 == aValue << 3), if you are using positive integers. This is only the case when multiplying or dividing by powers of 2.
Bit shifting is never slower than arithmetic. Depending on your compiler, the arithmetic version may be compiled down to the bit-shifting version, in which case both will be equally efficient. Otherwise, bit-shifting should be significantly faster than arithmetic.
The arithmetic version is often more readable, however. Consequently, I use the arithmetic version in almost all cases, and only use bit shifting if profiling reveals that the statement is in a bottleneck:
Programs should be written for people to read, and only incidentally for machines to execute.
The difference is that arithmetic operations have clearly defined results (unless they run into signed overflow that is). Shift operations don't have defined results in many cases. They are clearly defined for unsigned types in both C and C++, but with signed types things quickly get tricky.
In C++, the arithmetic meaning of left-shift << for signed types is not defined. It just shifts bits, filling with zeros on the right. What it means in the arithmetic sense depends on the signed representation used by the platform. Virtually the same is true for the right-shift >> operator: right-shifting negative values leads to implementation-defined results.
In C, things are defined slightly differently: left-shifting negative values leads to undefined behavior, while right-shifting negative values leads to implementation-defined results.
On most practical implementations, each single right-shift performs division by 2 with rounding towards negative infinity. This, BTW, is notably different from arithmetic division / by 2, which typically (and always in C99) rounds towards 0.
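A worked example of that rounding difference (assumes two's complement and an arithmetic right shift, which is only implementation-defined for negative values):
int x = -5;
int byDiv   = x / 2;   // -2: division truncates toward zero
int byShift = x >> 1;  // typically -3: the shift rounds toward negative infinity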
As for when you should use bit-shifting... Bit-shifting is for operations that work on bits. Bit-shifting operators are very rarely used as a replacement for arithmetic operators (for example, you should never use shifts to perform multiplication/division by constant).
Bit shifting is a 'close to the metal' operation that most of the time doesn't contain any information on what you really want to achieve.
If you want to divide a number by two, by all means, write x/2. It happens to be achieved by x >> 1, but the latter conceals the intent.
When that turns out to become a bottleneck, revise the code.
Whats is the "best" way to do operations?
Use arithmetic operations when dealing with numbers. Use bit operations when dealing with bits. Period. This is common sense. I doubt anyone would ever think using bit shift operations for ints or doubles as a regular day-to-day thing is a good idea.
When is better to use bit shifting?
When dealing with bits?
Additional question: do they behave the same in case of arithmetic overflow?
Yes. Appropriate arithmetic operations are (often, but not always) simplified to their bit shift counterparts by most modern compilers.
Edit: Answer was accepted, but I just want to add that there's a ton of bad advice in this question. You should never (read: almost never) use bit shift operations when dealing with ints. It's horrible practice.
When your goal is to multiply some numbers, using arithmetic operators makes sense.
When your goals is to actually logically shift the bits, then use the shift operators.
For instance, say you are splitting the RGB components from an RGB word, this code makes sense:
int r,g,b;
short rgb = 0x74f5;
b = rgb & 0x001f;
g = (rgb & 0x07e0) >> 5;
r = (rgb & 0xf800) >> 11;
on the other hand when you want to multiply some value with 4 you should really code your intent, and not do shifts.
As long as you are multiplying or dividing by powers of 2, it is faster to operate with a shift, because it is a single operation (it needs only one processor cycle).
One gets used to reading << 1 as *2 and >> 2 as /4 quite quickly, so I do not agree that readability suffers when using shifts, but this is up to each person.
If you want to know more details about how and why, maybe Wikipedia can help, or, if you want to go through the pain, learn assembly ;-)
As an example of the differences, this is x86 assembly created using gcc 4.4 with -O3
int arithmetic0 ( int aValue )
{
return aValue / 2;
}
00000000 <arithmetic0>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 5d pop %ebp
7: 89 c2 mov %eax,%edx
9: c1 ea 1f shr $0x1f,%edx
c: 8d 04 02 lea (%edx,%eax,1),%eax
f: d1 f8 sar %eax
11: c3 ret
int arithmetic1 ( int aValue )
{
return aValue >> 1;
}
00000020 <arithmetic1>:
20: 55 push %ebp
21: 89 e5 mov %esp,%ebp
23: 8b 45 08 mov 0x8(%ebp),%eax
26: 5d pop %ebp
27: d1 f8 sar %eax
29: c3 ret
int arithmetic2 ( int aValue )
{
return aValue * 2;
}
00000030 <arithmetic2>:
30: 55 push %ebp
31: 89 e5 mov %esp,%ebp
33: 8b 45 08 mov 0x8(%ebp),%eax
36: 5d pop %ebp
37: 01 c0 add %eax,%eax
39: c3 ret
int arithmetic3 ( int aValue )
{
return aValue << 1;
}
00000040 <arithmetic3>:
40: 55 push %ebp
41: 89 e5 mov %esp,%ebp
43: 8b 45 08 mov 0x8(%ebp),%eax
46: 5d pop %ebp
47: 01 c0 add %eax,%eax
49: c3 ret
int arithmetic4 ( int aValue )
{
return aValue / 4;
}
00000050 <arithmetic4>:
50: 55 push %ebp
51: 89 e5 mov %esp,%ebp
53: 8b 55 08 mov 0x8(%ebp),%edx
56: 5d pop %ebp
57: 89 d0 mov %edx,%eax
59: c1 f8 1f sar $0x1f,%eax
5c: c1 e8 1e shr $0x1e,%eax
5f: 01 d0 add %edx,%eax
61: c1 f8 02 sar $0x2,%eax
64: c3 ret
int arithmetic5 ( int aValue )
{
return aValue >> 2;
}
00000070 <arithmetic5>:
70: 55 push %ebp
71: 89 e5 mov %esp,%ebp
73: 8b 45 08 mov 0x8(%ebp),%eax
76: 5d pop %ebp
77: c1 f8 02 sar $0x2,%eax
7a: c3 ret
int arithmetic6 ( int aValue )
{
return aValue * 8;
}
00000080 <arithmetic6>:
80: 55 push %ebp
81: 89 e5 mov %esp,%ebp
83: 8b 45 08 mov 0x8(%ebp),%eax
86: 5d pop %ebp
87: c1 e0 03 shl $0x3,%eax
8a: c3 ret
int arithmetic7 ( int aValue )
{
return aValue << 4;
}
00000090 <arithmetic7>:
90: 55 push %ebp
91: 89 e5 mov %esp,%ebp
93: 8b 45 08 mov 0x8(%ebp),%eax
96: 5d pop %ebp
97: c1 e0 04 shl $0x4,%eax
9a: c3 ret
The divisions are different - with a two's complement representation, shifting a negative odd number right one results in a different value to dividing it by two. But the compiler still optimises the division to a sequence of shifts and additions.
The most obvious difference, though, is that this pair doesn't do the same thing - shifting by four is equivalent to multiplying by sixteen, not eight! You probably would not get a bug like this if you let the compiler sweat the small optimisations for you.
aNumber = aValue * 8;
aNumber = aValue << 4;
If you have big calculations in a tight-loop kind of environment where calculation speed has an impact, use bit operations (they are considered faster than arithmetic operations).
When it's about powers of 2 (2^x), it's better to use shifts - it just 'pushes' the bits (1 assembly operation instead of 2 for a division).
Is there any language whose compiler does this optimization?
int i = -11;
std::cout << (i / 2) << '\n'; // prints -5 (well defined by the standard)
std::cout << (i >> 1) << '\n'; // prints -6 (may differ on other platform)
Depending on the desired rounding behavior, you may prefer one over the other.