What is the difference between bit shifting and arithmetical operations? - c++

int aNumber;
aNumber = aValue / 2;
aNumber = aValue >> 1;
aNumber = aValue * 2;
aNumber = aValue << 1;
aNumber = aValue / 4;
aNumber = aValue >> 2;
aNumber = aValue * 8;
aNumber = aValue << 3;
// etc.
What is the "best" way to do these operations? When is it better to use bit shifting?

The two are functionally equivalent in the examples you gave (except for the final pair as originally posted, which used aValue << 4 where it ought to read aValue * 8 == aValue << 3), if you are using positive integers. The equivalence only holds when multiplying or dividing by powers of 2.
Bit shifting is never slower than arithmetic. Depending on your compiler, the arithmetic version may be compiled down to the bit-shifting version, in which case the two will be equally efficient. Otherwise, bit shifting should be significantly faster than arithmetic.
The arithmetic version is often more readable, however. Consequently, I use the arithmetic version in almost all cases, and only use bit shifting if profiling reveals that the statement is in a bottleneck:
Programs should be written for people to read, and only incidentally for machines to execute.

The difference is that arithmetic operations have clearly defined results (unless they run into signed overflow that is). Shift operations don't have defined results in many cases. They are clearly defined for unsigned types in both C and C++, but with signed types things quickly get tricky.
In C++ language the arithmetical meaning of left-shift << for signed types is not defined. It just shifts bits, filling with zeros on the right. What it means in arithmetical sense depends on the signed representation used by the platform. Virtually the same is true for right-shift >> operator. Right-shifting negative values leads to implementation-defined results.
In C language things are defined slightly differently. Left-shifting negative values is impossible: it leads to undefined behavior. Right-shifting negative values leads to implementation-defined results.
On most practical implementations each single right-shift performs division by 2 with rounding towards negative infinity. This, BTW, is notably different from arithmetic division / by 2, which typically (and always since C99) rounds towards 0.
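A minimal illustration of that rounding difference (a sketch, assuming the common two's complement, arithmetic-right-shift behaviour):

#include <iostream>

int main() {
    int x = -7;
    std::cout << (x / 2)  << '\n';  // -3: integer division rounds toward zero
    std::cout << (x >> 1) << '\n';  // typically -4: arithmetic shift rounds toward negative infinity
}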
As for when you should use bit-shifting... Bit-shifting is for operations that work on bits. Bit-shifting operators are very rarely used as a replacement for arithmetic operators (for example, you should never use shifts to perform multiplication/division by constant).

Bit shifting is a 'close to the metal' operation that most of the time doesn't contain any information on what you really want to achieve.
If you want to divide a number by two, by all means, write x/2. It happens to be achieved by x >> 1, but the latter conceals the intent.
When that turns out to become a bottleneck, revise the code.

What is the "best" way to do these operations?
Use arithmetic operations when dealing with numbers. Use bit operations when dealing with bits. Period. This is common sense. I doubt anyone would ever think using bit shift operations for ints or doubles as a regular day-to-day thing is a good idea.
When is it better to use bit shifting?
When dealing with bits?
Additional question: do they behave the same in case of arithmetic overflow?
Yes. Appropriate arithmetic operations are (often, but not always) simplified to their bit shift counterparts by most modern compilers.
Edit: Answer was accepted, but I just want to add that there's a ton of bad advice in this question. You should never (read: almost never) use bit shift operations when dealing with ints. It's horrible practice.

When your goal is to multiply some numbers, using arithmetic operators makes sense.
When your goal is to actually shift bits, then use the shift operators.
For instance, say you are splitting the RGB components out of a packed RGB word; this code makes sense:
int r, g, b;
short rgb = 0x74f5;        // packed 5-6-5 RGB value
b = rgb & 0x001f;          // blue:  low 5 bits
g = (rgb & 0x07e0) >> 5;   // green: middle 6 bits
r = (rgb & 0xf800) >> 11;  // red:   high 5 bits
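For contrast, a sketch of the inverse packing operation, where shifts are again the natural way to state which bits go where (assuming r fits in 5 bits, g in 6 and b in 5; the function name is made up):

unsigned short pack_rgb565(unsigned r, unsigned g, unsigned b)
{
    // place each component at its bit position in the 5-6-5 layout
    return (unsigned short)((r << 11) | (g << 5) | b);
}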
On the other hand, when you want to multiply some value by 4, you should really code your intent and not do shifts.

As long as you are multiplying or dividing by powers of 2, it is faster to operate with a shift because it is a single operation (it needs only one processor cycle).
One gets used to reading << 1 as *2 and >> 2 as /4 quite quickly, so I do not agree that readability goes away when using shifting, but this is up to each person.
If you want to know more details about how and why, Wikipedia can help, or, if you want to go through the pain, learn assembly ;-)

As an example of the differences, this is x86 assembly created using gcc 4.4 with -O3
int arithmetic0 ( int aValue )
{
return aValue / 2;
}
00000000 <arithmetic0>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 5d pop %ebp
7: 89 c2 mov %eax,%edx
9: c1 ea 1f shr $0x1f,%edx
c: 8d 04 02 lea (%edx,%eax,1),%eax
f: d1 f8 sar %eax
11: c3 ret
int arithmetic1 ( int aValue )
{
return aValue >> 1;
}
00000020 <arithmetic1>:
20: 55 push %ebp
21: 89 e5 mov %esp,%ebp
23: 8b 45 08 mov 0x8(%ebp),%eax
26: 5d pop %ebp
27: d1 f8 sar %eax
29: c3 ret
int arithmetic2 ( int aValue )
{
return aValue * 2;
}
00000030 <arithmetic2>:
30: 55 push %ebp
31: 89 e5 mov %esp,%ebp
33: 8b 45 08 mov 0x8(%ebp),%eax
36: 5d pop %ebp
37: 01 c0 add %eax,%eax
39: c3 ret
int arithmetic3 ( int aValue )
{
return aValue << 1;
}
00000040 <arithmetic3>:
40: 55 push %ebp
41: 89 e5 mov %esp,%ebp
43: 8b 45 08 mov 0x8(%ebp),%eax
46: 5d pop %ebp
47: 01 c0 add %eax,%eax
49: c3 ret
int arithmetic4 ( int aValue )
{
return aValue / 4;
}
00000050 <arithmetic4>:
50: 55 push %ebp
51: 89 e5 mov %esp,%ebp
53: 8b 55 08 mov 0x8(%ebp),%edx
56: 5d pop %ebp
57: 89 d0 mov %edx,%eax
59: c1 f8 1f sar $0x1f,%eax
5c: c1 e8 1e shr $0x1e,%eax
5f: 01 d0 add %edx,%eax
61: c1 f8 02 sar $0x2,%eax
64: c3 ret
int arithmetic5 ( int aValue )
{
return aValue >> 2;
}
00000070 <arithmetic5>:
70: 55 push %ebp
71: 89 e5 mov %esp,%ebp
73: 8b 45 08 mov 0x8(%ebp),%eax
76: 5d pop %ebp
77: c1 f8 02 sar $0x2,%eax
7a: c3 ret
int arithmetic6 ( int aValue )
{
return aValue * 8;
}
00000080 <arithmetic6>:
80: 55 push %ebp
81: 89 e5 mov %esp,%ebp
83: 8b 45 08 mov 0x8(%ebp),%eax
86: 5d pop %ebp
87: c1 e0 03 shl $0x3,%eax
8a: c3 ret
int arithmetic7 ( int aValue )
{
return aValue << 4;
}
00000090 <arithmetic7>:
90: 55 push %ebp
91: 89 e5 mov %esp,%ebp
93: 8b 45 08 mov 0x8(%ebp),%eax
96: 5d pop %ebp
97: c1 e0 04 shl $0x4,%eax
9a: c3 ret
The divisions are different - with a two's complement representation, shifting a negative odd number right one results in a different value to dividing it by two. But the compiler still optimises the division to a sequence of shifts and additions.
The most obvious difference, though, is that this pair doesn't do the same thing - shifting by four is equivalent to multiplying by sixteen, not eight! You probably would not have got a bug like this if you had let the compiler sweat the small optimisations for you.
aNumber = aValue * 8;
aNumber = aValue << 4;
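If in doubt, a quick compile-time check (just a sketch) makes the relationship explicit:

static_assert((5 << 3) == 5 * 8,  "x << 3 multiplies by 8");
static_assert((5 << 4) == 5 * 16, "x << 4 multiplies by 16, not 8");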

If you have big calculations in a tight loop where calculation speed has an impact, use bit operations (they are considered faster than arithmetic operations).

When it's about powers of 2 (2^x), it's better to use shifts - it just 'pushes' the bits (one assembly operation instead of two for a division).
Is there any language whose compiler does this optimization?

int i = -11;
std::cout << (i / 2) << '\n'; // prints -5 (well defined by the standard)
std::cout << (i >> 1) << '\n'; // prints -6 (may differ on other platform)
Depending on the desired rounding behavior, you may prefer one over the other.
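If the floor-style result is what you actually want, here is a minimal portable sketch that does not rely on the implementation-defined behaviour of >> for negative values:

int floor_div2(int x)
{
    int q = x / 2;                              // rounds toward zero
    return (x < 0 && x % 2 != 0) ? q - 1 : q;   // push negative odd results down by one
}
// floor_div2(-11) == -6, matching the typical result of i >> 1 above.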

Related

Add+Mul become slower with Intrinsics - where am I wrong?

Having this array:
alignas(16) double c[voiceSize][blockSize];
This is the function I'm trying to optimize:
inline void Process(int voiceIndex, int blockSize) {
double *pC = c[voiceIndex];
double value = start + step * delta;
double deltaValue = rate * delta;
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pC[sampleIndex] = value + deltaValue * sampleIndex;
}
}
And this is my intrinsics (SSE2) attempt:
inline void Process(int voiceIndex, int blockSize) {
double *pC = c[voiceIndex];
double value = start + step * delta;
double deltaValue = rate * delta;
__m128d value_add = _mm_set1_pd(value);
__m128d deltaValue_mul = _mm_set1_pd(deltaValue);
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
__m128d result_mul = _mm_setr_pd(sampleIndex, sampleIndex + 1);
result_mul = _mm_mul_pd(result_mul, deltaValue_mul);
result_mul = _mm_add_pd(result_mul, value_add);
_mm_store_pd(pC + sampleIndex, result_mul);
}
}
Which is slower than "scalar" (even if auto-optimized) original code, unfortunately :)
Where's the bottleneck in your opinion? Where am I wrong?
I'm using MSVC, Release/x86, /O2 optimization flag (Favor fast code).
EDIT: doing this (suggested by @wim), it seems that performance becomes better than the C version:
inline void Process(int voiceIndex, int blockSize) {
double *pC = c[voiceIndex];
double value = start + step * delta;
double deltaValue = rate * delta;
__m128d value_add = _mm_set1_pd(value);
__m128d deltaValue_mul = _mm_set1_pd(deltaValue);
__m128d sampleIndex_acc = _mm_set_pd(-1.0, -2.0);
__m128d sampleIndex_add = _mm_set1_pd(2.0);
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add);
__m128d result_mul = _mm_mul_pd(sampleIndex_acc, deltaValue_mul);
result_mul = _mm_add_pd(result_mul, value_add);
_mm_store_pd(pC + sampleIndex, result_mul);
}
}
Why? Is _mm_setr_pd expensive?
Why? Is _mm_setr_pd expensive?
Somewhat; it takes at least a shuffle. More importantly in this case, computing each scalar operand is expensive, and as @spectras' answer shows, gcc at least fails to auto-vectorize that into paddd / cvtdq2pd. Instead it re-computes each operand from a scalar integer, doing the int->double conversion separately, then shuffles those together.
This is the function I'm trying to optimize:
You're simply filling an array with a linear function. You're re-multiplying every time inside the loop. That avoids a loop-carried dependency on anything except the integer loop counter, but you run into throughput bottlenecks from doing so much work inside the loop.
i.e. you're computing a[i] = c + i*scale separately for every step. But instead you can strength-reduce that to a[i+n] = a[i] + (n*scale). So you only have one addpd instruction per vector of results.
This will introduce some rounding error that accumulates vs. redoing the computation from scratch, but double is probably overkill for what you're doing anyway.
It also comes at the cost of introducing a serial dependency on an FP add instead of integer. But you already have a loop-carried FP add dependency chain in your "optimized" version that uses sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add); inside the loop, using FP += 2.0 instead of re-converting from integer.
So you'll want to unroll with multiple vectors to hide that FP latency, and keep at least 3 or 4 FP additions in flight at once. (Haswell: 3 cycle latency, one per clock throughput. Skylake: 4 cycle latency, 2 per clock throughput.) See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about unrolling with multiple accumulators for a similar problem with loop-carried dependencies (a dot product).
void Process(int voiceIndex, int blockSize) {
double *pC = c[voiceIndex];
double val0 = start + step * delta;
double deltaValue = rate * delta;
__m128d vdelta2 = _mm_set1_pd(2 * deltaValue);
__m128d vdelta4 = _mm_add_pd(vdelta2, vdelta2);
__m128d v0 = _mm_setr_pd(val0, val0 + deltaValue);
__m128d v1 = _mm_add_pd(v0, vdelta2);
__m128d v2 = _mm_add_pd(v0, vdelta4);
__m128d v3 = _mm_add_pd(v1, vdelta4);
__m128d vdelta8 = _mm_mul_pd(vdelta2, _mm_set1_pd(4.0));
double *endp = pC + blockSize - 7; // stop if there's only room for 7 or fewer doubles
// or use -8 and have your cleanup handle lengths of 1..8
// since the inner loop always calculates results for next iteration
for (; pC < endp ; pC += 8) {
_mm_store_pd(pC, v0);
v0 = _mm_add_pd(v0, vdelta8);
_mm_store_pd(pC+2, v1);
v1 = _mm_add_pd(v1, vdelta8);
_mm_store_pd(pC+4, v2);
v2 = _mm_add_pd(v2, vdelta8);
_mm_store_pd(pC+6, v3);
v3 = _mm_add_pd(v3, vdelta8);
}
// if (blockSize % 8 != 0) ... store final vectors
}
The choice of whether to add or multiply when building up vdelta4 / vdelta8 is not very significant; I tried to avoid too long a dependency chain before the first stores can happen. Since v0 through v3 need to be calculated as well, it seemed to make sense to create a vdelta4 instead of just making a chain of v2 = v1+vdelta2. Maybe it would have been better to create vdelta4 with a multiply from 4.0*delta, and double it to get vdelta8. This could be relevant for very small block size, especially if you cache-block your code by only generating small chunks of this array as needed, right before it will be read.
Anyway, this compiles to a very efficient inner loop with gcc and MSVC (on the Godbolt compiler explorer).
;; MSVC -O2
$LL4@Process: ; do {
movups XMMWORD PTR [rax], xmm5
movups XMMWORD PTR [rax+16], xmm0
movups XMMWORD PTR [rax+32], xmm1
movups XMMWORD PTR [rax+48], xmm2
add rax, 64 ; 00000040H
addpd xmm5, xmm3 ; v0 += vdelta8
addpd xmm0, xmm3 ; v1 += vdelta8
addpd xmm1, xmm3 ; v2 += vdelta8
addpd xmm2, xmm3 ; v3 += vdelta8
cmp rax, rcx
jb SHORT $LL4@Process ; }while(pC < endp)
This has 4 separate dependency chains, through xmm0, 1, 2, and 5. So there's enough instruction-level parallelism to keep 4 addpd instructions in flight. This is more than enough for Haswell, but half of what Skylake can sustain.
Still, with a store throughput of 1 vector per clock, more than 1 addpd per clock isn't useful. In theory this can run at about 16 bytes per clock cycle, and saturate store throughput. i.e. 1 vector / 2 doubles per clock.
AVX with wider vectors (4 doubles) could still go at 1 vector per clock on Haswell and later, i.e. 32 bytes per clock. (Assuming the output array is hot in L1d cache or possibly even L2.)
Even better: don't store this data in memory at all; re-generate on the fly.
Generate it on the fly when it's needed, if the code consuming it only reads it a few times, and is also manually vectorized.
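A rough sketch of that idea, under the assumption that the consumer is the only reader: compute each ramp value where it is consumed instead of storing it into c[][] first. start, step, delta and rate are the same variables Process() uses; the "use sample" line is hypothetical and stands in for whatever currently reads the array.

inline void Consume(int blockSize) {
    double value = start + step * delta;
    double deltaValue = rate * delta;
    for (int sampleIndex = 0; sampleIndex < blockSize; ++sampleIndex) {
        double sample = value + deltaValue * sampleIndex;  // regenerated on the fly, never stored
        // ... use sample here instead of reading pC[sampleIndex] ...
    }
}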
On my system, g++ test.cpp -march=native -O2 -c -o test
This will output for the normal version (loop body extract):
30: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
34: c5 fb 2a c0 vcvtsi2sd %eax,%xmm0,%xmm0
38: c4 e2 f1 99 c2 vfmadd132sd %xmm2,%xmm1,%xmm0
3d: c5 fb 11 04 c2 vmovsd %xmm0,(%rdx,%rax,8)
42: 48 83 c0 01 add $0x1,%rax
46: 48 39 c8 cmp %rcx,%rax
49: 75 e5 jne 30 <_Z11ProcessAutoii+0x30>
And for the intrinsics version:
88: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
8c: 8d 50 01 lea 0x1(%rax),%edx
8f: c5 f1 57 c9 vxorpd %xmm1,%xmm1,%xmm1
93: c5 fb 2a c0 vcvtsi2sd %eax,%xmm0,%xmm0
97: c5 f3 2a ca vcvtsi2sd %edx,%xmm1,%xmm1
9b: c5 f9 14 c1 vunpcklpd %xmm1,%xmm0,%xmm0
9f: c4 e2 e9 98 c3 vfmadd132pd %xmm3,%xmm2,%xmm0
a4: c5 f8 29 04 c1 vmovaps %xmm0,(%rcx,%rax,8)
a9: 48 83 c0 02 add $0x2,%rax
ad: 48 39 f0 cmp %rsi,%rax
b0: 75 d6 jne 88 <_Z11ProcessSSE2ii+0x38>
So in short: the compiler automatically generates AVX code from the C version.
Edit after playing a bit more with flags to have SSE2 only in both cases:
g++ test.cpp -msse2 -O2 -c -o test
The compiler still does something different from what you generate with intrinsics. Compiler version:
30: 66 0f ef c0 pxor %xmm0,%xmm0
34: f2 0f 2a c0 cvtsi2sd %eax,%xmm0
38: f2 0f 59 c2 mulsd %xmm2,%xmm0
3c: f2 0f 58 c1 addsd %xmm1,%xmm0
40: f2 0f 11 04 c2 movsd %xmm0,(%rdx,%rax,8)
45: 48 83 c0 01 add $0x1,%rax
49: 48 39 c8 cmp %rcx,%rax
4c: 75 e2 jne 30 <_Z11ProcessAutoii+0x30>
Intrinsics version:
88: 66 0f ef c0 pxor %xmm0,%xmm0
8c: 8d 50 01 lea 0x1(%rax),%edx
8f: 66 0f ef c9 pxor %xmm1,%xmm1
93: f2 0f 2a c0 cvtsi2sd %eax,%xmm0
97: f2 0f 2a ca cvtsi2sd %edx,%xmm1
9b: 66 0f 14 c1 unpcklpd %xmm1,%xmm0
9f: 66 0f 59 c3 mulpd %xmm3,%xmm0
a3: 66 0f 58 c2 addpd %xmm2,%xmm0
a7: 0f 29 04 c1 movaps %xmm0,(%rcx,%rax,8)
ab: 48 83 c0 02 add $0x2,%rax
af: 48 39 f0 cmp %rsi,%rax
b2: 75 d4 jne 88 <_Z11ProcessSSE2ii+0x38>
The compiler does not unroll the loop here. That might be better or worse depending on many things; you might want to benchmark both versions.

Assuming a is double, is 2.0*a faster than 2*a?

A long time ago, in some book about ancient FORTRAN, I saw the claim that using an integer constant with a floating-point variable is slower, as the constant needs to be converted to floating-point form first:
double a = ..;
double b = a*2; // 2 -> 2.0 first
double c = a*2.0;
Is it still beneficial to write 2.0 rather than 2 in modern C++? If not, the "integer version" should probably be preferred, since 2.0 is longer and makes no difference for a human reader.
I work with long, complex expressions where these ".0"s would make a difference in either performance or readability, if either applies.
First, to cover the other answers: no, 2 vs 2.0 will not cause a performance difference; the conversion is done at compile time to create the correct value. However, to answer the question:
Is it still beneficial to write 2.0 rather than 2 in the modern C++?
Absolutely.
But it's not because of performance, but readability and bugs. Imagine the following operation:
double a = (2 / someOtherNumber) * someFloat;
What is the type of someOtherNumber? Because if it is an integer type, then you are in trouble because of integer division. 2.0 or 2.0f has two distinct advantages:
Tells the reader of the code exactly what you intended.
Avoids mistakes from integer division where you didn't intend it.
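A concrete sketch of that pitfall (the variable names are made up):

double someFloat = 1.5;
double a = (2 / 3)   * someFloat;   // 2 / 3 is integer division: it yields 0, so a == 0.0
double b = (2.0 / 3) * someFloat;   // 2.0 / 3 is floating-point division: b is roughly 1.0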
Original question:
Let's compare the assembly output.
double foo(double a)
{
return a * 2;
}
double bar(double a)
{
return a * 2.0f;
}
double baz(double a)
{
return a * 2.0;
}
results in
0000000000000000 <foo>: //double x int
0: f2 0f 58 c0 addsd %xmm0,%xmm0 // add with itself
4: c3 retq // return (quad)
5: 90 nop // padding
6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) // padding
d: 00 00 00
0000000000000010 <bar>: //double x float
10: f2 0f 58 c0 addsd %xmm0,%xmm0 // add with itself
14: c3 retq // return (quad)
15: 90 nop // padding
16: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) // padding
1d: 00 00 00
0000000000000020 <baz>: //double x double
20: f2 0f 58 c0 addsd %xmm0,%xmm0 // add with itself
24: c3 retq // return (quad)
As you can see, they are all equal and do not perform a multiplication at all.
Even when doing a real multiplication (a*5), they are all equal and compile down to
0: f2 0f 59 05 00 00 00 mulsd 0x0(%rip),%xmm0 # 8 <foo+0x8>
7: 00
8: c3 retq
Addition:
@Goswin-Von-Brederlow remarks that using a non-constant expression will lead to different assembly. Let's test this like the one above, but with the following signature.
double foo(double a, int b); //int, float, double for foo/bar/baz
which leads to the output:
0000000000000000 <foo>: //double x int
0: 66 0f ef c9 pxor %xmm1,%xmm1 // clear xmm1
4: f2 0f 2a cf cvtsi2sd %edi,%xmm1 // convert edi (second argument) to double
8: f2 0f 59 c1 mulsd %xmm1,%xmm0 // mul xmm1 with xmm0
c: c3 retq // return
d: 0f 1f 00 nopl (%rax) // padding
0000000000000010 <bar>: //double x float
10: f3 0f 5a c9 cvtss2sd %xmm1,%xmm1 // convert float to double
14: f2 0f 59 c1 mulsd %xmm1,%xmm0 // mul
18: c3 retq // return
19: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) // padding
0000000000000020 <baz>: //double x double
20: f2 0f 59 c1 mulsd %xmm1,%xmm0 // mul directly
24: c3 retq // return
Here you can see the (runtime) conversion from the types to a double, which leads of course to (runtime) overhead.
No.
The following code:
double f1(double a) {
double b = a*2;
return b;
}
double f2(double a) {
double c = a*2.0;
return c;
}
... when compiled on gcc.godbolt.org with Clang, produces the following assembly:
f1(double): # #f1(double)
addsd xmm0, xmm0
ret
f2(double): # #f2(double)
addsd xmm0, xmm0
ret
You can see that both functions are perfectly identical, and the compiler even replaced the multiplication by an addition. I'd expect the same for any C++ compiler from this millennium -- trust them, they're pretty smart.
No, it's not faster. Why would a compiler wait until runtime to convert an integer to a floating point number, if it knew what that number was going to be? I suppose it's possible that you might convince some exceedingly pedantic compiler to do that, if you disabled optimization completely, but all the compilers I'm aware of would do that optimization as a matter of course.
Now, if you were doing a*b where a is a floating-point type and b is an integer type, and neither one is a compile-time literal, then on some architectures that could cause a significant performance hit (particularly if you'd calculated b very recently). But in the case of literals, the compiler already has your back.

Vectorizing indirect access through avx instructions

I've recently been introduced to Vector Instructions (theoretically) and am excited about how I can use them to speed up my applications.
One area I'd like to improve is a very hot loop:
__declspec(noinline) int pleaseVectorize(int* arr, int* someGlobalArray, int* output)
{
for (int i = 0; i < 16; ++i)
{
auto someIndex = arr[i];
output[i] = someGlobalArray[someIndex];
}
for (int i = 0; i < 16; ++i)
{
if (output[i] == 1)
{
return i;
}
}
return -1;
}
But of course, all 3 major compilers (msvc, gcc, clang) refuse to vectorize this. I can sort of understand why, but I wanted to get a confirmation.
If I had to vectorize this by hand, it would be:
(1) VectorLoad "arr", this brings in 16 4-byte integers let's say into zmm0
(2) 16 memory loads from the address pointed to by zmm0[0..3] into zmm1[0..3], load from address pointed into by zmm0[4..7] into zmm1[4..7] so on and so forth
(3) compare zmm0 and zmm1
(4) vector popcnt into the output to find out the most significant bit and basically divide that by 8 to get the index that matched
First of all, can vector instructions do these things? Like can they do this "gathering" operation, i.e. do a load from address pointing to zmm0?
Here is what clang generates:
0000000000400530 <_Z5superPiS_S_>:
400530: 48 63 07 movslq (%rdi),%rax
400533: 8b 04 86 mov (%rsi,%rax,4),%eax
400536: 89 02 mov %eax,(%rdx)
400538: 48 63 47 04 movslq 0x4(%rdi),%rax
40053c: 8b 04 86 mov (%rsi,%rax,4),%eax
40053f: 89 42 04 mov %eax,0x4(%rdx)
400542: 48 63 47 08 movslq 0x8(%rdi),%rax
400546: 8b 04 86 mov (%rsi,%rax,4),%eax
400549: 89 42 08 mov %eax,0x8(%rdx)
40054c: 48 63 47 0c movslq 0xc(%rdi),%rax
400550: 8b 04 86 mov (%rsi,%rax,4),%eax
400553: 89 42 0c mov %eax,0xc(%rdx)
400556: 48 63 47 10 movslq 0x10(%rdi),%rax
40055a: 8b 04 86 mov (%rsi,%rax,4),%eax
40055d: 89 42 10 mov %eax,0x10(%rdx)
400560: 48 63 47 14 movslq 0x14(%rdi),%rax
400564: 8b 04 86 mov (%rsi,%rax,4),%eax
400567: 89 42 14 mov %eax,0x14(%rdx)
40056a: 48 63 47 18 movslq 0x18(%rdi),%rax
40056e: 8b 04 86 mov (%rsi,%rax,4),%eax
400571: 89 42 18 mov %eax,0x18(%rdx)
400574: 48 63 47 1c movslq 0x1c(%rdi),%rax
400578: 8b 04 86 mov (%rsi,%rax,4),%eax
40057b: 89 42 1c mov %eax,0x1c(%rdx)
40057e: 48 63 47 20 movslq 0x20(%rdi),%rax
400582: 8b 04 86 mov (%rsi,%rax,4),%eax
400585: 89 42 20 mov %eax,0x20(%rdx)
400588: 48 63 47 24 movslq 0x24(%rdi),%rax
40058c: 8b 04 86 mov (%rsi,%rax,4),%eax
40058f: 89 42 24 mov %eax,0x24(%rdx)
400592: 48 63 47 28 movslq 0x28(%rdi),%rax
400596: 8b 04 86 mov (%rsi,%rax,4),%eax
400599: 89 42 28 mov %eax,0x28(%rdx)
40059c: 48 63 47 2c movslq 0x2c(%rdi),%rax
4005a0: 8b 04 86 mov (%rsi,%rax,4),%eax
4005a3: 89 42 2c mov %eax,0x2c(%rdx)
4005a6: 48 63 47 30 movslq 0x30(%rdi),%rax
4005aa: 8b 04 86 mov (%rsi,%rax,4),%eax
4005ad: 89 42 30 mov %eax,0x30(%rdx)
4005b0: 48 63 47 34 movslq 0x34(%rdi),%rax
4005b4: 8b 04 86 mov (%rsi,%rax,4),%eax
4005b7: 89 42 34 mov %eax,0x34(%rdx)
4005ba: 48 63 47 38 movslq 0x38(%rdi),%rax
4005be: 8b 04 86 mov (%rsi,%rax,4),%eax
4005c1: 89 42 38 mov %eax,0x38(%rdx)
4005c4: 48 63 47 3c movslq 0x3c(%rdi),%rax
4005c8: 8b 04 86 mov (%rsi,%rax,4),%eax
4005cb: 89 42 3c mov %eax,0x3c(%rdx)
4005ce: c3 retq
4005cf: 90 nop
Your idea of how it could work is close, except that you want a bit-scan / find-first-set-bit (x86 BSF or TZCNT) of the compare bitmap, not population-count (number of bits set).
AVX2 / AVX512 have vpgatherdd which does use a vector of signed 32-bit scaled indices. It's barely worth using on Haswell, improved on Broadwell, and very good on Skylake. (http://agner.org/optimize/, and see other links in the x86 tag wiki, such as Intel's optimization manual which has a section on gather performance). The SIMD compare and bitscan are very cheap by comparison; single uop and fully pipelined.
gcc8.1 can auto-vectorize your gather, if it can prove that your inputs don't overlap your output function arg. Sometimes possible after inlining, but for the non-inline version, you can promise this with int * __restrict output. Or if you make output a local temporary instead of a function arg. (General rule: storing through a non-_restrict pointer will often inhibit auto-vectorization, especially if it's a char* that can alias anything.)
gcc and clang never vectorize search loops; only loops where the trip-count can be calculated before entering the loop. But ICC can; it does a scalar gather and stores the result (even if output[] is a local so it doesn't have to do that as a side-effect of running the function), then uses SIMD packed-compare + bit-scan.
Compiler output for a __restrict version. Notice that gcc8.1 and ICC avoid 512-bit vectors by default when tuning for Skylake-AVX512. 512-bit vectors can limit the max-turbo, and always shut down the vector ALU on port 1 while they're in the pipeline, so it can make sense to use AVX512 or AVX2 with 256-bit vectors in case this function is only a small part of a big program. (Compilers don't know that this function is super-hot in your program.)
If output[] is a local, a better code-gen strategy would probably be to compare while gathering, so an early hit skips the rest of the loads. The compilers that go fully scalar (clang and MSVC) both miss this optimization. In fact, they even store to the local array even though clang mostly doesn't re-read it (keeping results in registers). Writing the source with the compare inside the first loop would work to get better scalar code. (Depending on cache misses from the gather vs. branch mispredicts from non-SIMD searching, scalar could be a good strategy. Especially if hits in the first few elements are common. Current gather hardware can't take advantage of multiple elements coming from the same cache line, so the hard limit is still 2 elements loaded per clock cycle.
But using a wide vector load for the indices to feed a gather reduces load-port / cache access pressure significantly if your data was mostly hot in cache.)
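For reference, a sketch of the "compare inside the first loop" source rewrite mentioned above, which lets a scalar build stop at the first hit instead of filling output[] and re-scanning it (assuming output[] is not otherwise needed):

int find_first_1_scalar(const int *arr, const int *someGlobalArray)
{
    for (int i = 0; i < 16; ++i) {
        if (someGlobalArray[arr[i]] == 1)
            return i;                 // early exit on the first match
    }
    return -1;
}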
A compiler could have auto-vectorized the __restrict version of your code to something like this. (gcc manages the gather part, ICC manages the SIMD compare part)
;; Windows x64 calling convention: rcx,rdx, r8,r9
; but of course you'd actually inline this
; only uses ZMM16..31, so vzeroupper not required
vmovdqu32 zmm16, [rcx/arr] ; You def. want to reach an alignment boundary if you can for ZMM loads, vmovdqa32 will enforce that
kxnorw k1, k0,k0 ; k1 = -1. k0 false dep is likely not a problem.
; optional: vpxord xmm17, xmm17, xmm17 ; break merge-masking false dep
vpgatherdd zmm17{k1}, [rdx + zmm16 * 4] ; GlobalArray + scaled-vector-index
; sets k1 = 0 when done
vmovdqu32 [r8/output], zmm17
vpcmpd k1, zmm17, zmm31, 0 ; 0->EQ. Outside the loop, do zmm31=set1_epi32(1)
; k1 = compare bitmap
kortestw k1, k1
jz .not_found ; early check for not-found
kmovw edx, k1
; tzcnt doesn't have a false dep on the output on Skylake
; so no AVX512 CPUs need to worry about that HSW/BDW issue
tzcnt eax, edx ; bit-scan for the first (lowest-address) set element
; input=0 produces output=32
; or avoid the branch and let 32 be the not-found return value.
; or do a branchless kortestw / cmov if -1 is directly useful without branching
ret
.not_found:
mov eax, -1
ret
You can do this yourself with intrinsics:
Intel's instruction-set reference manual (HTML extract at http://felixcloutier.com/x86/index.html) includes C/C++ intrinsic names for each instruction, or search for them in https://software.intel.com/sites/landingpage/IntrinsicsGuide/
I changed the output type to __m512i. You could change it back to an array if you aren't manually vectorizing the caller. You definitely want this function to inline.
#include <immintrin.h>
//__declspec(noinline) // I *hope* this was just to see the stand-alone asm version
// but it means the output array can't optimize away at all
//static inline
int find_first_1(const int *__restrict arr, const int *__restrict someGlobalArray, __m512i *__restrict output)
{
__m512i vindex = _mm512_load_si512(arr);
__m512i gather = _mm512_i32gather_epi32(vindex, someGlobalArray, 4); // indexing by 4-byte int
*output = gather;
__mmask16 cmp = _mm512_cmpeq_epi32_mask(gather, _mm512_set1_epi32(1));
// Intrinsics make masks freely convert to integer
// even though it costs a `kmov` instruction either way.
int onepos = _tzcnt_u32(cmp);
if (onepos >= 16){
return -1;
}
return onepos;
}
All 4 x86 compilers produce similar asm to what I suggested (see it on the Godbolt compiler explorer), but of course they have to actually materialize the set1_epi32(1) vector constant, or use a (broadcast) memory operand. Clang actually uses a {1to16} broadcast-load from a constant for the compare: vpcmpeqd k0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}. (Of course they will make different choices when inlined into a loop.) Others use mov eax,1 / vpbroadcastd zmm0, eax.
gcc8.1 -O3 -march=skylake-avx512 has two redundant mov eax, -1 instructions: one to feed a kmov for the gather, the other for the return-value stuff. Silly compiler should keep it around and use a different register for the 1.
All of them use zmm0..15 and thus can't avoid a vzeroupper. (xmm16..31 are not accessible with legacy-SSE, so the SSE/AVX transition penalty problem that vzeroupper solves doesn't exist if the only wide vector registers you use are y/zmm16..31). There may still be tiny possible advantages to vzeroupper, like cheaper context switches when the upper halves of ymm or zmm regs are known to be zero (Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?). If you're going to use it anyway, no reason to avoid xmm0..15.
Oh, and in the Windows calling convention, xmm6..15 are call-preserved. (Not ymm/zmm, just the low 128 bits), so zmm16..31 are a good choice if you run out of xmm0..5 regs.

gcc '-m32' option changes floating-point rounding when not running valgrind

I am getting different floating-point rounding under different build/execute scenarios. Notice the 2498 in the second run below...
#include <iostream>
#include <cassert>
#include <typeinfo>
using std::cerr;
void domath( int n, double c, double & q1, double & q2 )
{
q1=n*c;
q2=int(n*c);
}
int main()
{
int n=2550;
double c=0.98, q1, q2;
domath( n, c, q1, q2 );
cerr<<"sizeof(int)="<<sizeof(int)<<", sizeof(double)="<<sizeof(double)<<", sizeof(n*c)="<<sizeof(n*c)<<"\n";
cerr<<"n="<<n<<", int(q1)="<<int(q1)<<", int(q2)="<<int(q2)<<"\n";
assert( typeid(q1) == typeid(n*c) );
}
Running as a 64-bit executable...
$ g++ -m64 -Wall rounding_test.cpp -o rounding_test && ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2499
Running as a 32-bit executable...
$ g++ -m32 -Wall rounding_test.cpp -o rounding_test && ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2498
Running as a 32-bit executable under valgrind...
$ g++ -m32 -Wall rounding_test.cpp -o rounding_test && valgrind --quiet ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2499
Why am I seeing different results when compiling with -m32, and why are the results different again when running valgrind?
My system is Ubuntu 14.04.1 LTS x86_64, and my gcc is version 4.8.2.
EDIT:
In response to the request for disassembly, I have refactored the code a bit so that I could isolate the relevant portion. The approach taken between -m64 and -m32 is clearly much different. I'm not too concerned about why these give a different rounding result since I can fix that by applying the round() function. The most interesting question is: why does valgrind change the result?
rounding_test: file format elf64-x86-64
<
000000000040090d <_Z6domathidRdS_>: <
40090d: 55 push %rbp <
40090e: 48 89 e5 mov %rsp,%rbp <
400911: 89 7d fc mov %edi,-0x4(%rbp <
400914: f2 0f 11 45 f0 movsd %xmm0,-0x10(%r <
400919: 48 89 75 e8 mov %rsi,-0x18(%rb <
40091d: 48 89 55 e0 mov %rdx,-0x20(%rb <
400921: f2 0f 2a 45 fc cvtsi2sdl -0x4(%rbp), <
400926: f2 0f 59 45 f0 mulsd -0x10(%rbp),%x <
40092b: 48 8b 45 e8 mov -0x18(%rbp),%r <
40092f: f2 0f 11 00 movsd %xmm0,(%rax) <
400933: f2 0f 2a 45 fc cvtsi2sdl -0x4(%rbp), <
400938: f2 0f 59 45 f0 mulsd -0x10(%rbp),%x <
40093d: f2 0f 2c c0 cvttsd2si %xmm0,%eax <
400941: f2 0f 2a c0 cvtsi2sd %eax,%xmm0 <
400945: 48 8b 45 e0 mov -0x20(%rbp),%r <
400949: f2 0f 11 00 movsd %xmm0,(%rax) <
40094d: 5d pop %rbp <
40094e: c3 retq <
| rounding_test: file format elf32-i386
> 0804871d <_Z6domathidRdS_>:
> 804871d: 55 push %ebp
> 804871e: 89 e5 mov %esp,%ebp
> 8048720: 83 ec 10 sub $0x10,%esp
> 8048723: 8b 45 0c mov 0xc(%ebp),%eax
> 8048726: 89 45 f8 mov %eax,-0x8(%ebp
> 8048729: 8b 45 10 mov 0x10(%ebp),%ea
> 804872c: 89 45 fc mov %eax,-0x4(%ebp
> 804872f: db 45 08 fildl 0x8(%ebp)
> 8048732: dc 4d f8 fmull -0x8(%ebp)
> 8048735: 8b 45 14 mov 0x14(%ebp),%ea
> 8048738: dd 18 fstpl (%eax)
> 804873a: db 45 08 fildl 0x8(%ebp)
> 804873d: dc 4d f8 fmull -0x8(%ebp)
> 8048740: d9 7d f6 fnstcw -0xa(%ebp)
> 8048743: 0f b7 45 f6 movzwl -0xa(%ebp),%ea
> 8048747: b4 0c mov $0xc,%ah
> 8048749: 66 89 45 f4 mov %ax,-0xc(%ebp)
> 804874d: d9 6d f4 fldcw -0xc(%ebp)
> 8048750: db 5d f0 fistpl -0x10(%ebp)
> 8048753: d9 6d f6 fldcw -0xa(%ebp)
> 8048756: 8b 45 f0 mov -0x10(%ebp),%e
> 8048759: 89 45 f0 mov %eax,-0x10(%eb
> 804875c: db 45 f0 fildl -0x10(%ebp)
> 804875f: 8b 45 18 mov 0x18(%ebp),%ea
> 8048762: dd 18 fstpl (%eax)
> 8048764: c9 leave
> 8048765: c3 ret
Edit: It would seem that, at least a long time back, valgrind's floating point calculations weren't quite as accurate as the "real" calculations. In other words, this MAY explain why you get different results. See this question and answer on the valgrind mailing list.
Edit2: And the current valgrind.org documentation has it in its "core limitations" section here - so I would expect that it is indeed "still valid". In other words, the documentation for valgrind says to expect differences between valgrind and x87 FPU calculations. "You have been warned!" (And as we can see, using SSE instructions to do the same math produces the same result as valgrind, confirming that it's a "rounding from 80 bits to 64 bits" difference.)
Floating point calculations WILL differ slightly depending on exactly how the calculation is performed. I'm not sure exactly what you want to have an answer to, so here's a long rambling "answer of a sort".
Valgrind DOES indeed change the exact behaviour of your program in various ways (it emulates certain instructions, rather than actually executing the real instructions - which may include saving the intermediate results of calculations). Also, floating point calculations are well known to "not be precise" - it's just a matter of luck/bad luck if the calculation comes out precise or not. 0.98 is one of many, many numbers that can't be described precisely in floating point format [at least not the common IEEE formats].
By adding:
cerr<<"c="<<std::setprecision(30)<<c <<"\n";
we see that the output is c=0.979999999999999982236431605997 (yes, the actual stored value is 0.979999...99982 or some such; the remaining digits are just the residual value - since 0.98 is not an "even" binary number, there is always going to be something left over).
This is the n = 2550, c = 0.98 and q = n * c part of the code as generated by gcc:
movl $2550, -28(%ebp) ; n
fldl .LC0
fstpl -40(%ebp) ; c
fildl -28(%ebp)
fmull -40(%ebp)
fstpl -48(%ebp) ; q - note that this is stored as a rounded 64-bit value.
This is the int(q) and int(n*c) part of the code:
fildl -28(%ebp) ; n
fmull -40(%ebp) ; c
fnstcw -58(%ebp) ; Save control word
movzwl -58(%ebp), %eax
movb $12, %ah
movw %ax, -60(%ebp) ; Save float control word.
fldcw -60(%ebp)
fistpl -64(%ebp) ; Store as integer (directly from 80-bit result)
fldcw -58(%ebp) ; restore float control word.
movl -64(%ebp), %ebx ; result of int(n * c)
fldl -48(%ebp) ; q
fldcw -60(%ebp) ; Load float control word as saved above.
fistpl -64(%ebp) ; Store as integer.
fldcw -58(%ebp) ; Restore control word.
movl -64(%ebp), %esi ; result of int(q)
Now, if the intermediate result is stored (and thus rounded) from the internal 80-bit precision in the middle of one of those calculations, the result may be subtly different from the result if the calculation happens without saving intermediate values.
I get identical results from both g++ 4.9.2 and clang++ -mno-sse - but if I enable sse in the clang case, it gives the same result as 64-bit build. Using gcc -msse2 -m32 gives the 2499 answer everywhere. This indicates that the error is caused by "storing intermediate results" in some way or another.
Likewise, optimising in gcc to -O1 will give the 2499 in all places - but this is a coincidence, not a result of some "clever thinking". If you want correctly rounded integer values of your calculations, you're much better off rounding yourself, because sooner or later int(someDoubleValue) will come up "one short".
Edit3: And finally, using g++ -mno-sse -m64 will also produce the same 2498 answer, whereas using valgrind on the same binary produces the 2499 answer.
The 32-bit version uses X87 floating point instructions. X87 internally uses 80-bit floating point numbers, which will cause trouble when numbers are converted to and from other precisions. In your case the 64-bit precision approximation for 0.98 is slightly less than the true value. When the CPU converts it to an 80-bit value you get the exact same numerical value, which is an equally bad approximation - having more bits doesn't get you a better approximation. The FPU then multiplies that number by 2550, and gets a figure that's slightly less than 2499. If the CPU used 64-bit numbers all the way it should compute exactly 2499, like it does in the 64-bit version.
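A minimal sketch of the "round it yourself" advice above: rounding to the nearest integer instead of truncating makes the conversion insensitive to whether the product was computed with 64-bit or 80-bit intermediates, at least for values this close to an integer (the helper name is made up):

#include <cmath>

int to_int(double x)
{
    return static_cast<int>(std::lround(x));  // round to nearest, ties away from zero
}
// to_int(2550 * 0.98) gives 2499 with both the x87 and the SSE code paths.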

How could I implement logical implication with bitwise or other efficient code in C?

I want to implement a logical operation that works as efficiently as possible. I need this truth table:
p  q  p → q
T  T    T
T  F    F
F  T    T
F  F    T
This, according to Wikipedia, is called "logical implication".
I've long been trying to figure out how to do this with bitwise operations in C without using conditionals. Maybe someone has some thoughts about it.
Thanks
!p || q
is plenty fast. seriously, don't worry about it.
~p | q
For visualization:
perl -e'printf "%x\n", (~0x1100 | 0x1010) & 0x1111'
1011
In tight code, this should be faster than "!p || q" because the latter has a branch, which might cause a stall in the CPU due to a branch prediction error. The bitwise version is deterministic and, as a bonus, can do 32 times as much work in a 32-bit integer as the boolean version!
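A sketch of that "32 times as much work" point, assuming the flags are packed one per bit in p_bits and q_bits (a hypothetical layout):

#include <stdint.h>

uint32_t implies_bits(uint32_t p_bits, uint32_t q_bits)
{
    return ~p_bits | q_bits;   /* bit i is set exactly when p_i implies q_i */
}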
FYI, with gcc-4.3.3:
int foo(int a, int b) { return !a || b; }
int bar(int a, int b) { return ~a | b; }
Gives (from objdump -d):
0000000000000000 <foo>:
0: 85 ff test %edi,%edi
2: 0f 94 c2 sete %dl
5: 85 f6 test %esi,%esi
7: 0f 95 c0 setne %al
a: 09 d0 or %edx,%eax
c: 83 e0 01 and $0x1,%eax
f: c3 retq
0000000000000010 <bar>:
10: f7 d7 not %edi
12: 09 fe or %edi,%esi
14: 89 f0 mov %esi,%eax
16: c3 retq
So, no branches, but twice as many instructions.
And even better, with _Bool (thanks @litb):
_Bool baz(_Bool a, _Bool b) { return !a || b; }
0000000000000020 <baz>:
20: 40 84 ff test %dil,%dil
23: b8 01 00 00 00 mov $0x1,%eax
28: 0f 45 c6 cmovne %esi,%eax
2b: c3 retq
So, using _Bool instead of int is a good idea.
Since I'm updating today, I've confirmed gcc 8.2.0 produces similar, though not identical, results for _Bool:
0000000000000020 <baz>:
20: 83 f7 01 xor $0x1,%edi
23: 89 f8 mov %edi,%eax
25: 09 f0 or %esi,%eax
27: c3 retq
You can read up on deriving boolean expressions from truth tables (also see canonical form) to learn how any truth table can be expressed as a combination of boolean primitives or functions.
Another solution for C booleans (a bit dirty, but works):
((unsigned int)(p) <= (unsigned int)(q))
It works because, by the C standard, 0 represents false and any other value represents true (the boolean operators return 1, of type int, for true).
The "dirtiness" is that I use the booleans (p and q) as integers, which contradicts some strong typing policies (such as MISRA); but then, this is an optimization question. You may always #define it as a macro to hide the dirty stuff.
For proper booleans p and q (each holding either 0 or 1) it works. Otherwise T->T might fail to produce T if p and q use arbitrary nonzero values to represent true.
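For example, a sketch of the macro wrapper mentioned above (the name IMPLIES is made up); it is only valid when p and q are proper booleans holding 0 or 1:

#define IMPLIES(p, q) ((unsigned int)(p) <= (unsigned int)(q))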
If you need to store the result only, since the Pentium II, there is the cmovcc (Conditional Move) instruction (as shown in Derobert's answer). For booleans, however even the 386 had a branchless option, the setcc instruction, which produces 0 or 1 in a result byte location (byte register or memory). You can also see that in Derobert's answer, and this solution also compiles to a result involving a setcc (setbe: Set if below or equal).
Derobert and Chris Dolan's ~p | q variant should be the fastest for processing large quantities of data since it can process the implication on all bits of p and q individually.
Notice that not even the !p || q solution compiles to branching code on the x86: it uses setcc instructions. That's the best solution if p or q may contain arbitrary nonzero values representing true. If you use the _Bool type, it will generate very few instructions.
I got the following figures when compiling for the x86:
__attribute__((fastcall)) int imp1(int a, int b)
{
return ((unsigned int)(a) <= (unsigned int)(b));
}
__attribute__((fastcall)) int imp2(int a, int b)
{
return (!a || b);
}
__attribute__((fastcall)) _Bool imp3(_Bool a, _Bool b)
{
return (!a || b);
}
__attribute__((fastcall)) int imp4(int a, int b)
{
return (~a | b);
}
Assembly result:
00000000 <imp1>:
0: 31 c0 xor %eax,%eax
2: 39 d1 cmp %edx,%ecx
4: 0f 96 c0 setbe %al
7: c3 ret
00000010 <imp2>:
10: 85 d2 test %edx,%edx
12: 0f 95 c0 setne %al
15: 85 c9 test %ecx,%ecx
17: 0f 94 c2 sete %dl
1a: 09 d0 or %edx,%eax
1c: 0f b6 c0 movzbl %al,%eax
1f: c3 ret
00000020 <imp3>:
20: 89 c8 mov %ecx,%eax
22: 83 f0 01 xor $0x1,%eax
25: 09 d0 or %edx,%eax
27: c3 ret
00000030 <imp4>:
30: 89 d0 mov %edx,%eax
32: f7 d1 not %ecx
34: 09 c8 or %ecx,%eax
36: c3 ret
When using the _Bool type, the compiler clearly exploits that it only has two possible values (0 for false and 1 for true), producing a very similar result to the ~a | b solution (the only difference being that the latter performs a complement on all bits instead of just the lowest bit).
Compiling for 64 bits gives just about the same results.
Anyway, it is clear, the method doesn't really matter from the point of avoiding producing conditionals.
You can replace the implication with a less-than-or-equal comparison; it works:
p <= q