AVX2 - method is 14x slower over a classic version [duplicate]

AVX2 - method is 14x slower over a classic version [duplicate] - c++

I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented out. I've tested Sandy Bridge and Ivy Bridge CPUs and both versions run at the same speed, with or without VZEROUPPER.
Now I have a fairly good idea of what VZEROUPPER does and I think it should not matter at all to this code when there are no VEX coded instructions and no calls to any function which might contain them. The fact that it does not on other AVX capable CPUs appears to support this. So does table 11-2 in the Intel® 64 and IA-32 Architectures Optimization Reference Manual
So what is going on?
The only theory I have left is that there's a bug in the CPU and it's incorrectly triggering the "save the upper half of the AVX registers" procedure where it shouldn't. Or something else just as strange.
This is main.cpp:
#include <immintrin.h>
int slow_function( double i_a, double i_b, double i_c );
int main()
{
/* DAZ and FTZ, does not change anything here. */
_mm_setcsr( _mm_getcsr() | 0x8040 );
/* This instruction fixes performance. */
__asm__ __volatile__ ( "vzeroupper" : : : );
int r = 0;
for( unsigned j = 0; j < 100000000; ++j )
{
r |= slow_function(
0.84445079384884236262,
-6.1000481519580951328,
5.0302160279288017364 );
}
return r;
}
and this is slow_function.cpp:
#include <immintrin.h>
int slow_function( double i_a, double i_b, double i_c )
{
__m128d sign_bit = _mm_set_sd( -0.0 );
__m128d q_a = _mm_set_sd( i_a );
__m128d q_b = _mm_set_sd( i_b );
__m128d q_c = _mm_set_sd( i_c );
int vmask;
const __m128d zero = _mm_setzero_pd();
__m128d q_abc = _mm_add_sd( _mm_add_sd( q_a, q_b ), q_c );
if( _mm_comigt_sd( q_c, zero ) && _mm_comigt_sd( q_abc, zero ) )
{
return 7;
}
__m128d discr = _mm_sub_sd(
_mm_mul_sd( q_b, q_b ),
_mm_mul_sd( _mm_mul_sd( q_a, q_c ), _mm_set_sd( 4.0 ) ) );
__m128d sqrt_discr = _mm_sqrt_sd( discr, discr );
__m128d q = sqrt_discr;
__m128d v = _mm_div_pd(
_mm_shuffle_pd( q, q_c, _MM_SHUFFLE2( 0, 0 ) ),
_mm_shuffle_pd( q_a, q, _MM_SHUFFLE2( 0, 0 ) ) );
vmask = _mm_movemask_pd(
_mm_and_pd(
_mm_cmplt_pd( zero, v ),
_mm_cmple_pd( v, _mm_set1_pd( 1.0 ) ) ) );
return vmask + 1;
}
The function compiles down to this with clang:
0: f3 0f 7e e2 movq %xmm2,%xmm4
4: 66 0f 57 db xorpd %xmm3,%xmm3
8: 66 0f 2f e3 comisd %xmm3,%xmm4
c: 76 17 jbe 25 <_Z13slow_functionddd+0x25>
e: 66 0f 28 e9 movapd %xmm1,%xmm5
12: f2 0f 58 e8 addsd %xmm0,%xmm5
16: f2 0f 58 ea addsd %xmm2,%xmm5
1a: 66 0f 2f eb comisd %xmm3,%xmm5
1e: b8 07 00 00 00 mov $0x7,%eax
23: 77 48 ja 6d <_Z13slow_functionddd+0x6d>
25: f2 0f 59 c9 mulsd %xmm1,%xmm1
29: 66 0f 28 e8 movapd %xmm0,%xmm5
2d: f2 0f 59 2d 00 00 00 mulsd 0x0(%rip),%xmm5 # 35 <_Z13slow_functionddd+0x35>
34: 00
35: f2 0f 59 ea mulsd %xmm2,%xmm5
39: f2 0f 58 e9 addsd %xmm1,%xmm5
3d: f3 0f 7e cd movq %xmm5,%xmm1
41: f2 0f 51 c9 sqrtsd %xmm1,%xmm1
45: f3 0f 7e c9 movq %xmm1,%xmm1
49: 66 0f 14 c1 unpcklpd %xmm1,%xmm0
4d: 66 0f 14 cc unpcklpd %xmm4,%xmm1
51: 66 0f 5e c8 divpd %xmm0,%xmm1
55: 66 0f c2 d9 01 cmpltpd %xmm1,%xmm3
5a: 66 0f c2 0d 00 00 00 cmplepd 0x0(%rip),%xmm1 # 63 <_Z13slow_functionddd+0x63>
61: 00 02
63: 66 0f 54 cb andpd %xmm3,%xmm1
67: 66 0f 50 c1 movmskpd %xmm1,%eax
6b: ff c0 inc %eax
6d: c3 retq
The generated code is different with gcc but it shows the same problem. An older version of the intel compiler generates yet another variation of the function which shows the problem too but only if main.cpp is not built with the intel compiler as it inserts calls to initialize some of its own libraries which probably end up doing VZEROUPPER somewhere.
And of course, if the whole thing is built with AVX support so the intrinsics are turned into VEX coded instructions, there is no problem either.
I've tried profiling the code with perf on linux and most of the runtime usually lands on 1-2 instructions but not always the same ones depending on which version of the code I profile (gcc, clang, intel). Shortening the function appears to make the performance difference gradually go away so it looks like several instructions are causing the problem.
EDIT: Here's a pure assembly version, for linux. Comments below.
.text
.p2align 4, 0x90
.globl _start
_start:
#vmovaps %ymm0, %ymm1 # This makes SSE code crawl.
#vzeroupper # This makes it fast again.
movl $100000000, %ebp
.p2align 4, 0x90
.LBB0_1:
xorpd %xmm0, %xmm0
xorpd %xmm1, %xmm1
xorpd %xmm2, %xmm2
movq %xmm2, %xmm4
xorpd %xmm3, %xmm3
movapd %xmm1, %xmm5
addsd %xmm0, %xmm5
addsd %xmm2, %xmm5
mulsd %xmm1, %xmm1
movapd %xmm0, %xmm5
mulsd %xmm2, %xmm5
addsd %xmm1, %xmm5
movq %xmm5, %xmm1
sqrtsd %xmm1, %xmm1
movq %xmm1, %xmm1
unpcklpd %xmm1, %xmm0
unpcklpd %xmm4, %xmm1
decl %ebp
jne .LBB0_1
mov $0x1, %eax
int $0x80
Ok, so as suspected in comments, using VEX coded instructions causes the slowdown. Using VZEROUPPER clears it up. But that still does not explain why.
As I understand it, not using VZEROUPPER is supposed to involve a cost to transition to old SSE instructions but not a permanent slowdown of them. Especially not such a large one. Taking loop overhead into account, the ratio is at least 10x, perhaps more.
I have tried messing with the assembly a little and float instructions are just as bad as double ones. I could not pinpoint the problem to a single instruction either.

You are experiencing a penalty for "mixing" non-VEX SSE and VEX-encoded instructions - even though your entire visible application doesn't obviously use any AVX instructions!
Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used vex to code that didn't, or vice-versa. That is, you never paid an ongoing penalty for whatever happened in the past unless you were actively mixing VEX and non-VEX. In Skylake, however, there is a state where non-VEX SSE instructions pay a high ongoing execution penalty, even without further mixing.
Straight from the horse's mouth, here's Figure 11-1 1 - the old (pre-Skylake) transition diagram:
As you can see, all of the penalties (red arrows), bring you to a new state, at which point there is no longer a penalty for repeating that action. For example, if you get to the dirty upper state by executing some 256-bit AVX, an you then execute legacy SSE, you pay a one-time penalty to transition to the preserved non-INIT upper state, but you don't pay any penalties after that.
In Skylake, everything is different per Figure 11-2:
There are fewer penalties overall, but critically for your case, one of them is a self-loop: the penalty for executing a legacy SSE (Penalty A in the Figure 11-2) instruction in the dirty upper state keeps you in that state. That's what happens to you - any AVX instruction puts you in the dirty upper state, which slows all further SSE execution down.
Here's what Intel says (section 11.3) about the new penalty:
The Skylake microarchitecture implements a different state machine
than prior generations to manage the YMM state transition associated
with mixing SSE and AVX instructions. It no longer saves the entire
upper YMM state when executing an SSE instruction when in “Modified
and Unsaved” state, but saves the upper bits of individual register.
As a result, mixing SSE and AVX instructions will experience a penalty
associated with partial register dependency of the destination
registers being used and additional blend operation on the upper bits
of the destination registers.
So the penalty is apparently quite large - it has to blend the top bits all the time to preserve them, and it also makes instructions which are apparently independently become dependent, since there is a dependency on the hidden upper bits. For example xorpd xmm0, xmm0 no longer breaks the dependence on the previous value of xmm0, since the result is actually dependent on the hidden upper bits from ymm0 which aren't cleared by the xorpd. That latter effect is probably what kills your performance since you'll now have very long dependency chains that wouldn't expect from the usual analysis.
This is among the worst type of performance pitfall: where the behavior/best practice for the prior architecture is essentially opposite of the current architecture. Presumably the hardware architects had a good reason for making the change, but it does just add another "gotcha" to the list of subtle performance issues.
I would file a bug against the compiler or runtime that inserted that AVX instruction and didn't follow up with a VZEROUPPER.
Update: Per the OP's comment below, the offending (AVX) code was inserted by the runtime linker ld and a bug already exists.
1 From Intel's optimization manual.

I just made some experiments (on a Haswell). The transition between clean and dirty states is not expensive, but the dirty state makes every non-VEX vector operation dependent on the previous value of the destination register. In your case, for example movapd %xmm1, %xmm5 will have a false dependency on ymm5 which prevents out-of-order execution. This explains why vzeroupper is needed after AVX code.

Related

Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with (long)float type punning)

I'm trying to benchmark the fast inverse square root. The full code is here:
#include <benchmark/benchmark.h>
#include <math.h>
float number = 30942;
static void BM_FastInverseSqrRoot(benchmark::State &state) {
for (auto _ : state) {
// from wikipedia:
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) );
// y = y * ( threehalfs - ( x2 * y * y ) );
float result = y;
benchmark::DoNotOptimize(result);
}
}
static void BM_InverseSqrRoot(benchmark::State &state) {
for (auto _ : state) {
float result = 1 / sqrt(number);
benchmark::DoNotOptimize(result);
}
}
BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);
and here is the code in quick-bench if you want to run it yourself.
Compiling with GCC 11.2 and -O3, the BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). This is a 10x speed difference.
Here is the relevant Assembly (taken from quick-bench).
With GCC:
push %rbp
mov %rdi,%rbp
push %rbx
sub $0x18,%rsp
cmpb $0x0,0x1a(%rdi)
je 408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
callq 40a770 <benchmark::State::StartKeepRunning()>
408c84 add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
nopw 0x0(%rax,%rax,1)
408c98 mov 0x10(%rdi),%rbx
callq 40a770 <benchmark::State::StartKeepRunning()>
test %rbx,%rbx
je 408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
movss 0x1b386(%rip),%xmm4 # 424034 <_IO_stdin_used+0x34>
movss 0x1b382(%rip),%xmm3 # 424038 <_IO_stdin_used+0x38>
mov $0x5f3759df,%edx
nopl 0x0(%rax,%rax,1)
408cc0 movss 0x237a8(%rip),%xmm0 # 42c470 <number>
mov %edx,%ecx
movaps %xmm3,%xmm1
2.91% movss %xmm0,0xc(%rsp)
mulss %xmm4,%xmm0
mov 0xc(%rsp),%rax
44.70% sar %rax
3.27% sub %eax,%ecx
3.24% movd %ecx,%xmm2
3.27% mulss %xmm2,%xmm0
9.58% mulss %xmm2,%xmm0
10.00% subss %xmm0,%xmm1
10.03% mulss %xmm2,%xmm1
9.64% movss %xmm1,0x8(%rsp)
3.33% sub $0x1,%rbx
jne 408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
408d0a jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
With Clang:
push %rbp
push %r14
push %rbx
sub $0x10,%rsp
mov %rdi,%r14
mov 0x1a(%rdi),%bpl
mov 0x10(%rdi),%rbx
call 213a80 <benchmark::State::StartKeepRunning()>
test %bpl,%bpl
jne 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
test %rbx,%rbx
je 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
movss -0xf12e(%rip),%xmm0 # 203cec <_IO_stdin_used+0x8>
movss -0xf13a(%rip),%xmm1 # 203ce8 <_IO_stdin_used+0x4>
cs nopw 0x0(%rax,%rax,1)
nopl 0x0(%rax)
212e30 2.46% movd 0x3c308(%rip),%xmm2 # 24f140 <number>
4.83% movd %xmm2,%eax
8.07% mulss %xmm0,%xmm2
12.35% shr %eax
2.60% mov $0x5f3759df,%ecx
5.15% sub %eax,%ecx
8.02% movd %ecx,%xmm3
11.53% mulss %xmm3,%xmm2
3.16% mulss %xmm3,%xmm2
5.71% addss %xmm1,%xmm2
8.19% mulss %xmm3,%xmm2
16.44% movss %xmm2,0xc(%rsp)
11.50% add $0xffffffffffffffff,%rbx
jne 212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
212e69 mov %r14,%rdi
call 213af0 <benchmark::State::FinishKeepRunning()>
add $0x10,%rsp
pop %rbx
pop %r14
pop %rbp
212e79 ret
They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that is supposedly taking 46%? (But I think it's just mislabelled and it's the mulss, mov, sar that together take 46%). Anyway, I'm not familiar enough with Assembly to really tell what is causing such a huge performance difference.
Anyone know?

Just FYI, Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64? - no, obsoleted by SSE1 rsqrtss which you can use with or without a Newton iteration.
As people pointed out in comments, you're using 64-bit long (since this is x86-64 on a non-Windows system), pointing it at a 32-bit float. So as well as a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.
Your perf report output confirms it; GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store forwarding stall costing much extra latency. And a throughput penalty, since actually you're testing throughput, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)
The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to figure out one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just one after a slow instruction even without a data dependency, I forget.)
Clang chooses to assume that the upper 32 bits are zero, thus movd %xmm0, %eax to just copy the register with an ALU uop, and the shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi so that isn't Windows clang.)
Bugfixed version: GCC and clang make similar asm
Fixing the code on the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows GCC and clang compile similarly with -Ofast, although not identical. e.g. GCC loads number twice, once into an integer register, once into XMM0. Clang loads once and uses movd eax, xmm2 to get it.
On QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), now GCC's BM_FastInverseSqrRoot is faster by a factor of 2 than the naive version, without -ffast-math
And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ inferring sqrtf from sqrt(float). It does check for the number being >=0 every time, since quick-bench doesn't allow compiling with -fno-math-errno to omit that check to maybe call the libm function. But that branch predicts perfectly so the loop should still easily just bottleneck on port 0 throughput (div/sqrt unit).
Quick-bench does allow -Ofast, which is equivalent to -O3 -ffast-math, which uses rsqrtss and a Newton iteration. (Would be even faster with FMA available, but quick-bench doesn't allow -march=native or anything. I guess one could use __attribute__((target("avx,fma"))).
Quick-bench is now giving Error or timeout whether I use that or not, with Permission error mapping pages. and suggesting a smaller -m/--mmap_pages so I can't test on that system.
rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster or similar to Quake's fast invsqrt, but with about 23 bits of precision.

Vectorizing indirect access through avx instructions

I've recently been introduced to Vector Instructions (theoretically) and am excited about how I can use them to speed up my applications.
One area I'd like to improve is a very hot loop:
__declspec(noinline) void pleaseVectorize(int* arr, int* someGlobalArray, int* output)
{
for (int i = 0; i < 16; ++i)
{
auto someIndex = arr[i];
output[i] = someGlobalArray[someIndex];
}
for (int i = 0; i < 16; ++i)
{
if (output[i] == 1)
{
return i;
}
}
return -1;
}
But of course, all 3 major compilers (msvc, gcc, clang) refuse to vectorize this. I can sort of understand why, but I wanted to get a confirmation.
If I had to vectorize this by hand, it would be:
(1) VectorLoad "arr", this brings in 16 4-byte integers let's say into zmm0
(2) 16 memory loads from the address pointed to by zmm0[0..3] into zmm1[0..3], load from address pointed into by zmm0[4..7] into zmm1[4..7] so on and so forth
(3) compare zmm0 and zmm1
(4) vector popcnt into the output to find out the most significant bit and basically divide that by 8 to get the index that matched
First of all, can vector instructions do these things? Like can they do this "gathering" operation, i.e. do a load from address pointing to zmm0?
Here is what clang generates:
0000000000400530 <_Z5superPiS_S_>:
400530: 48 63 07 movslq (%rdi),%rax
400533: 8b 04 86 mov (%rsi,%rax,4),%eax
400536: 89 02 mov %eax,(%rdx)
400538: 48 63 47 04 movslq 0x4(%rdi),%rax
40053c: 8b 04 86 mov (%rsi,%rax,4),%eax
40053f: 89 42 04 mov %eax,0x4(%rdx)
400542: 48 63 47 08 movslq 0x8(%rdi),%rax
400546: 8b 04 86 mov (%rsi,%rax,4),%eax
400549: 89 42 08 mov %eax,0x8(%rdx)
40054c: 48 63 47 0c movslq 0xc(%rdi),%rax
400550: 8b 04 86 mov (%rsi,%rax,4),%eax
400553: 89 42 0c mov %eax,0xc(%rdx)
400556: 48 63 47 10 movslq 0x10(%rdi),%rax
40055a: 8b 04 86 mov (%rsi,%rax,4),%eax
40055d: 89 42 10 mov %eax,0x10(%rdx)
400560: 48 63 47 14 movslq 0x14(%rdi),%rax
400564: 8b 04 86 mov (%rsi,%rax,4),%eax
400567: 89 42 14 mov %eax,0x14(%rdx)
40056a: 48 63 47 18 movslq 0x18(%rdi),%rax
40056e: 8b 04 86 mov (%rsi,%rax,4),%eax
400571: 89 42 18 mov %eax,0x18(%rdx)
400574: 48 63 47 1c movslq 0x1c(%rdi),%rax
400578: 8b 04 86 mov (%rsi,%rax,4),%eax
40057b: 89 42 1c mov %eax,0x1c(%rdx)
40057e: 48 63 47 20 movslq 0x20(%rdi),%rax
400582: 8b 04 86 mov (%rsi,%rax,4),%eax
400585: 89 42 20 mov %eax,0x20(%rdx)
400588: 48 63 47 24 movslq 0x24(%rdi),%rax
40058c: 8b 04 86 mov (%rsi,%rax,4),%eax
40058f: 89 42 24 mov %eax,0x24(%rdx)
400592: 48 63 47 28 movslq 0x28(%rdi),%rax
400596: 8b 04 86 mov (%rsi,%rax,4),%eax
400599: 89 42 28 mov %eax,0x28(%rdx)
40059c: 48 63 47 2c movslq 0x2c(%rdi),%rax
4005a0: 8b 04 86 mov (%rsi,%rax,4),%eax
4005a3: 89 42 2c mov %eax,0x2c(%rdx)
4005a6: 48 63 47 30 movslq 0x30(%rdi),%rax
4005aa: 8b 04 86 mov (%rsi,%rax,4),%eax
4005ad: 89 42 30 mov %eax,0x30(%rdx)
4005b0: 48 63 47 34 movslq 0x34(%rdi),%rax
4005b4: 8b 04 86 mov (%rsi,%rax,4),%eax
4005b7: 89 42 34 mov %eax,0x34(%rdx)
4005ba: 48 63 47 38 movslq 0x38(%rdi),%rax
4005be: 8b 04 86 mov (%rsi,%rax,4),%eax
4005c1: 89 42 38 mov %eax,0x38(%rdx)
4005c4: 48 63 47 3c movslq 0x3c(%rdi),%rax
4005c8: 8b 04 86 mov (%rsi,%rax,4),%eax
4005cb: 89 42 3c mov %eax,0x3c(%rdx)
4005ce: c3 retq
4005cf: 90 nop

Your idea of how it could work is close, except that you want a bit-scan / find-first-set-bit (x86 BSF or TZCNT) of the compare bitmap, not population-count (number of bits set).
AVX2 / AVX512 have vpgatherdd which does use a vector of signed 32-bit scaled indices. It's barely worth using on Haswell, improved on Broadwell, and very good on Skylake. (http://agner.org/optimize/, and see other links in the x86 tag wiki, such as Intel's optimization manual which has a section on gather performance). The SIMD compare and bitscan are very cheap by comparison; single uop and fully pipelined.
gcc8.1 can auto-vectorize your gather, if it can prove that your inputs don't overlap your output function arg. Sometimes possible after inlining, but for the non-inline version, you can promise this with int * __restrict output. Or if you make output a local temporary instead of a function arg. (General rule: storing through a non-_restrict pointer will often inhibit auto-vectorization, especially if it's a char* that can alias anything.)
gcc and clang never vectorize search loops; only loops where the trip-count can be calculated before entering the loop. But ICC can; it does a scalar gather and stores the result (even if output[] is a local so it doesn't have to do that as a side-effect of running the function), then uses SIMD packed-compare + bit-scan.
Compiler output for a __restrict version. Notice that gcc8.1 and ICC avoid 512-bit vectors by default when tuning for Skylake-AVX512. 512-bit vectors can limit the max-turbo, and always shut down the vector ALU on port 1 while they're in the pipeline, so it can make sense to use AVX512 or AVX2 with 256-bit vectors in case this function is only a small part of a big program. (Compilers don't know that this function is super-hot in your program.)
If output[] is a local, a better code-gen strategy would probably be to compare while gathering, so an early hit skips the rest of the loads. The compilers that go fully scalar (clang and MSVC) both miss this optimization. In fact, they even store to the local array even though clang mostly doesn't re-read it (keeping results in registers). Writing the source with the compare inside the first loop would work to get better scalar code. (Depending on cache misses from the gather vs. branch mispredicts from non-SIMD searching, scalar could be a good strategy. Especially if hits in the first few elements are common. Current gather hardware can't take advantage of multiple elements coming from the same cache line, so the hard limit is still 2 elements loaded per clock cycle.
But using a wide vector load for the indices to feed a gather reduces load-port / cache access pressure significantly if your data was mostly hot in cache.)
A compiler could have auto-vectorized the __restrict version of your code to something like this. (gcc manages the gather part, ICC manages the SIMD compare part)
;; Windows x64 calling convention: rcx,rdx, r8,r9
; but of course you'd actually inline this
; only uses ZMM16..31, so vzeroupper not required
vmovdqu32 zmm16, [rcx/arr] ; You def. want to reach an alignment boundary if you can for ZMM loads, vmovdqa32 will enforce that
kxnorw k1, k0,k0 ; k1 = -1. k0 false dep is likely not a problem.
; optional: vpxord xmm17, xmm17, xmm17 ; break merge-masking false dep
vpgatherdd zmm17{k1}, [rdx + zmm16 * 4] ; GlobalArray + scaled-vector-index
; sets k1 = 0 when done
vmovdqu32 [r8/output], zmm17
vpcmpd k1, zmm17, zmm31, 0 ; 0->EQ. Outside the loop, do zmm31=set1_epi32(1)
; k1 = compare bitmap
kortestw k1, k1
jz .not_found ; early check for not-found
kmovw edx, k1
; tzcnt doesn't have a false dep on the output on Skylake
; so no AVX512 CPUs need to worry about that HSW/BDW issue
tzcnt eax, edx ; bit-scan for the first (lowest-address) set element
; input=0 produces output=32
; or avoid the branch and let 32 be the not-found return value.
; or do a branchless kortestw / cmov if -1 is directly useful without branching
ret
.not_found:
mov eax, -1
ret
You can do this yourself with intrinsics:
Intel's instruction-set reference manual (HTML extract at http://felixcloutier.com/x86/index.html) includes C/C++ intrinsic names for each instruction, or search for them in https://software.intel.com/sites/landingpage/IntrinsicsGuide/
I changed the output type to __m512i. You could change it back to an array if you aren't manually vectorizing the caller. You definitely want this function to inline.
#include <immintrin.h>
//__declspec(noinline) // I *hope* this was just to see the stand-alone asm version
// but it means the output array can't optimize away at all
//static inline
int find_first_1(const int *__restrict arr, const int *__restrict someGlobalArray, __m512i *__restrict output)
{
__m512i vindex = _mm512_load_si512(arr);
__m512i gather = _mm512_i32gather_epi32(vindex, someGlobalArray, 4); // indexing by 4-byte int
*output = gather;
__mmask16 cmp = _mm512_cmpeq_epi32_mask(gather, _mm512_set1_epi32(1));
// Intrinsics make masks freely convert to integer
// even though it costs a `kmov` instruction either way.
int onepos = _tzcnt_u32(cmp);
if (onepos >= 16){
return -1;
}
return onepos;
}
All 4 x86 compilers produce similar asm to what I suggested (see it on the Godbolt compiler explorer), but of course they have to actually materialize the set1_epi32(1) vector constant, or use a (broadcast) memory operand. Clang actually uses a {1to16} broadcast-load from a constant for the compare: vpcmpeqd k0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}. (Of course they will make different choices whe inlined into a loop.) Others use mov eax,1 / vpbroadcastd zmm0, eax.
gcc8.1 -O3 -march=skylake-avx512 has two redundant mov eax, -1 instructions: one to feed a kmov for the gather, the other for the return-value stuff. Silly compiler should keep it around and use a different register for the 1.
All of them use zmm0..15 and thus can't avoid a vzeroupper. (xmm16.31 are not accessible with legacy-SSE, so the SSE/AVX transition penalty problem that vzeroupper solves doesn't exist if the only wide vector registers you use are y/zmm16..31). There may still be tiny possible advantages to vzeroupper, like cheaper context switches when the upper halves of ymm or zmm regs are known to be zero (Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?). If you're going to use it anyway, no reason to avoid xmm0..15.
Oh, and in the Windows calling convention, xmm6..15 are call-preserved. (Not ymm/zmm, just the low 128 bits), so zmm16..31 are a good choice if you run out of xmm0..5 regs.

gcc '-m32' option changes floating-point rounding when not running valgrind

I am getting different floating-point rounding under different build/execute scenarios. Notice the 2498 in the second run below...
#include <iostream>
#include <cassert>
#include <typeinfo>
using std::cerr;
void domath( int n, double c, double & q1, double & q2 )
{
q1=n*c;
q2=int(n*c);
}
int main()
{
int n=2550;
double c=0.98, q1, q2;
domath( n, c, q1, q2 );
cerr<<"sizeof(int)="<<sizeof(int)<<", sizeof(double)="<<sizeof(double)<<", sizeof(n*c)="<<sizeof(n*c)<<"\n";
cerr<<"n="<<n<<", int(q1)="<<int(q1)<<", int(q2)="<<int(q2)<<"\n";
assert( typeid(q1) == typeid(n*c) );
}
Running as a 64-bit executable...
$ g++ -m64 -Wall rounding_test.cpp -o rounding_test && ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2499
Running as a 32-bit executable...
$ g++ -m32 -Wall rounding_test.cpp -o rounding_test && ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2498
Running as a 32-bit executable under valgrind...
$ g++ -m32 -Wall rounding_test.cpp -o rounding_test && valgrind --quiet ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2499
Why am I seeing different results when compiling with -m32, and why are the results different again when running valgrind?
My system is Ubuntu 14.04.1 LTS x86_64, and my gcc is version 4.8.2.
EDIT:
In response to the request for disassembly, I have refactored the code a bit so that I could isolate the relevant portion. The approach taken between -m64 and -m32 is clearly much different. I'm not too concerned about why these give a different rounding result since I can fix that by applying the round() function. The most interesting question is: why does valgrind change the result?
rounding_test: file format elf64-x86-64
<
000000000040090d <_Z6domathidRdS_>: <
40090d: 55 push %rbp <
40090e: 48 89 e5 mov %rsp,%rbp <
400911: 89 7d fc mov %edi,-0x4(%rbp <
400914: f2 0f 11 45 f0 movsd %xmm0,-0x10(%r <
400919: 48 89 75 e8 mov %rsi,-0x18(%rb <
40091d: 48 89 55 e0 mov %rdx,-0x20(%rb <
400921: f2 0f 2a 45 fc cvtsi2sdl -0x4(%rbp), <
400926: f2 0f 59 45 f0 mulsd -0x10(%rbp),%x <
40092b: 48 8b 45 e8 mov -0x18(%rbp),%r <
40092f: f2 0f 11 00 movsd %xmm0,(%rax) <
400933: f2 0f 2a 45 fc cvtsi2sdl -0x4(%rbp), <
400938: f2 0f 59 45 f0 mulsd -0x10(%rbp),%x <
40093d: f2 0f 2c c0 cvttsd2si %xmm0,%eax <
400941: f2 0f 2a c0 cvtsi2sd %eax,%xmm0 <
400945: 48 8b 45 e0 mov -0x20(%rbp),%r <
400949: f2 0f 11 00 movsd %xmm0,(%rax) <
40094d: 5d pop %rbp <
40094e: c3 retq <
| rounding_test: file format elf32-i386
> 0804871d <_Z6domathidRdS_>:
> 804871d: 55 push %ebp
> 804871e: 89 e5 mov %esp,%ebp
> 8048720: 83 ec 10 sub $0x10,%esp
> 8048723: 8b 45 0c mov 0xc(%ebp),%eax
> 8048726: 89 45 f8 mov %eax,-0x8(%ebp
> 8048729: 8b 45 10 mov 0x10(%ebp),%ea
> 804872c: 89 45 fc mov %eax,-0x4(%ebp
> 804872f: db 45 08 fildl 0x8(%ebp)
> 8048732: dc 4d f8 fmull -0x8(%ebp)
> 8048735: 8b 45 14 mov 0x14(%ebp),%ea
> 8048738: dd 18 fstpl (%eax)
> 804873a: db 45 08 fildl 0x8(%ebp)
> 804873d: dc 4d f8 fmull -0x8(%ebp)
> 8048740: d9 7d f6 fnstcw -0xa(%ebp)
> 8048743: 0f b7 45 f6 movzwl -0xa(%ebp),%ea
> 8048747: b4 0c mov $0xc,%ah
> 8048749: 66 89 45 f4 mov %ax,-0xc(%ebp)
> 804874d: d9 6d f4 fldcw -0xc(%ebp)
> 8048750: db 5d f0 fistpl -0x10(%ebp)
> 8048753: d9 6d f6 fldcw -0xa(%ebp)
> 8048756: 8b 45 f0 mov -0x10(%ebp),%e
> 8048759: 89 45 f0 mov %eax,-0x10(%eb
> 804875c: db 45 f0 fildl -0x10(%ebp)
> 804875f: 8b 45 18 mov 0x18(%ebp),%ea
> 8048762: dd 18 fstpl (%eax)
> 8048764: c9 leave
> 8048765: c3 ret

Edit: It would seem that, at least a long time back, valgrind's floating point calculations wheren't quite as accurate as the "real" calculations. In other words, this MAY explain why you get different results. See this question and answer on the valgrind mailing list.
Edit2: And the current valgrind.org documentation has it in it's "core limitations" section here - so I would expect that it is indeed "still valid". In other words the documentation for valgrind says to expect differences between valgrind and x87 FPU calculations. "You have been warned!" (And as we can see, using sse instructions to do the same math produces the same result as valgrind, confirming that it's a "rounding from 80 bits to 64 bits" difference)
Floating point calculations WILL differ slightly depending on exactly how the calculation is performed. I'm not sure exactly what you want to have an answer to, so here's a long rambling "answer of a sort".
Valgrind DOES indeed change the exact behaviour of your program in various ways (it emulates certain instructions, rather than actually executing the real instructions - which may include saving the intermediate results of calculations). Also, floating point calculations are well known to "not be precise" - it's just a matter of luck/bad luck if the calculation comes out precise or not. 0.98 is one of many, many numbers that can't be described precisely in floating point format [at least not the common IEEE formats].
By adding:
cerr<<"c="<<std::setprecision(30)<<c <<"\n";
we see that the output is c=0.979999999999999982236431605997 (yes, the actual value is 0.979999...99982 or some such, remaining digits is just the residual value, since it's not an "even" binary number, there's always going to be something left.
This is the n = 2550;, c = 0.98 and q = n * c part of the code as generated by gcc:
movl $2550, -28(%ebp) ; n
fldl .LC0
fstpl -40(%ebp) ; c
fildl -28(%ebp)
fmull -40(%ebp)
fstpl -48(%ebp) ; q - note that this is stored as a rouned 64-bit value.
This is the int(q) and int(n*c) part of the code:
fildl -28(%ebp) ; n
fmull -40(%ebp) ; c
fnstcw -58(%ebp) ; Save control word
movzwl -58(%ebp), %eax
movb $12, %ah
movw %ax, -60(%ebp) ; Save float control word.
fldcw -60(%ebp)
fistpl -64(%ebp) ; Store as integer (directly from 80-bit result)
fldcw -58(%ebp) ; restore float control word.
movl -64(%ebp), %ebx ; result of int(n * c)
fldl -48(%ebp) ; q
fldcw -60(%ebp) ; Load float control word as saved above.
fistpl -64(%ebp) ; Store as integer.
fldcw -58(%ebp) ; Restore control word.
movl -64(%ebp), %esi ; result of int(q)
Now, if the intermediate result is stored (and thus rounded) from the internal 80-bit precision in the middle of one of those calculations, the result may be subtly different from the result if the calculation happens without saving intermediate values.
I get identical results from both g++ 4.9.2 and clang++ -mno-sse - but if I enable sse in the clang case, it gives the same result as 64-bit build. Using gcc -msse2 -m32 gives the 2499 answer everywhere. This indicates that the error is caused by "storing intermediate results" in some way or another.
Likewise, optimising in gcc to -O1 will give the 2499 in all places - but this is a coincidence, not a result of some "clever thinking". If you want correctly rounded integer values of your calculations, you're much better off rounding yourself, because sooner or later int(someDoubleValue) will come up "one short".
Edit3: And finally, using g++ -mno-sse -m64 will also produce the same 2498 answer, where using valgrind on the same binary produces the 2499 answer.

The 32-bit version uses X87 floating point instructions. X87 internally uses 80-bit floating point numbers, which will cause trouble when numbers are converted to and from other precisions. In your case the 64-bit precision approximation for 0.98 is slightly less than the true value. When the CPU converts it to an 80-bit value you get the exact same numerical value, which is an equally bad approximation - having more bits doesn't get you a better approximation. The FPU then multiplies that number by 2550, and gets a figure that's slightly less than 2499. If the CPU used 64-bit numbers all the way it should compute exactly 2499, like it does in the 64-bit version.

why does vs c++ 2010 compiler produce a different assembly code for similar function

So recently i was thinking about strcpy and back to K&R where they show the implementation as
while (*dst++ = *src++) ;
However I mistakenly transcribed it as:
while (*dst = *src)
{
src++; //technically could be ++src on these lines
dst++;
}
In any case that got me thinking about whether the compiler would actually produce different code for these two. My initial thought is they should be near identical, since src and dst are being incremented but never used I thought the compiler would know not to try to acually preserve them as "variables" in the produced machine code.
Using windows7 with VS 2010 C++ SP1 building in 32 bit Release mode (/O2), I got the dis-assembly code for both of the above incarnations. To prevent the function itself from referencing the input directly and being inlined i made a dll with each of the functions. I have omitted the prologue and epilogue of the produced ASM.
while (*dst++ = *src++)
6EBB1003 8B 55 08 mov edx,dword ptr [src]
6EBB1006 8B 45 0C mov eax,dword ptr [dst]
6EBB1009 2B D0 sub edx,eax //prepare edx so that edx + eax always points to src
6EBB100B EB 03 jmp docopy+10h (6EBB1010h)
6EBB100D 8D 49 00 lea ecx,[ecx] //looks like align padding, never hit this line
6EBB1010 8A 0C 02 mov cl,byte ptr [edx+eax] //ptr [edx+ eax] points to char in src :loop begin
6EBB1013 88 08 mov byte ptr [eax],cl //copy char to dst
6EBB1015 40 inc eax //inc src ptr
6EBB1016 84 C9 test cl,cl // check for 0 (null terminator)
6EBB1018 75 F6 jne docopy+10h (6EBB1010h) //if not goto :loop begin
;
Above I have annotated the code, essentially a single loop , only 1 check for null and 1 memory copy.
Now lets look at my mistake version:
while (*dst = *src)
6EBB1003 8B 55 08 mov edx,dword ptr [src]
6EBB1006 8A 0A mov cl,byte ptr [edx]
6EBB1008 8B 45 0C mov eax,dword ptr [dst]
6EBB100B 88 08 mov byte ptr [eax],cl //copy 0th char to dst
6EBB100D 84 C9 test cl,cl //check for 0
6EBB100F 74 0D je docopy+1Eh (6EBB101Eh) // return if we encounter null terminator
6EBB1011 2B D0 sub edx,eax
6EBB1013 8A 4C 02 01 mov cl,byte ptr [edx+eax+1] //get +1th char :loop begin
{
src++;
dst++;
6EBB1017 40 inc eax
6EBB1018 88 08 mov byte ptr [eax],cl //copy above char to dst
6EBB101A 84 C9 test cl,cl //check for 0
6EBB101C 75 F5 jne docopy+13h (6EBB1013h) // if not goto :loop begin
}
In my version, I see that it first copies the 0th char to the destination, then checks for null , and then finally enters the loop where it checks for null again. So the loop remains largely the same but now it handles the 0th character before the loop. This of course is going to be sub-optimal compared with the first case.
I am wondering if anyone knows why the compiler is being prevented from making the same (or near same) code as the first example. Is this a ms compiler specific issue or possibly with my compiler/linker settings?
here is the full code, 2 files (1 function replaces the other).
// in first dll project
__declspec(dllexport) void docopy(const char* src, char* dst)
{
while (*dst++ = *src++);
}
__declspec(dllexport) void docopy(const char* src, char* dst)
{
while (*dst = *src)
{
++src;
++dst;
}
}
//seprate main.cpp file calls docopy
void docopy(const char* src, char* dst);
char* source ="source";
char destination[100];
int main()
{
docopy(source, destination);
}

Because in the first example, the post-increment happens always, even if src starts out pointing to a null character. In the same starting situation, the second example would not increment the pointers.

Of course the compiler has other options. The "copy first byte then enter the loop if not 0" is what gcc-4.5.1 produces with -O1. With -O2 and -O3, it produces
.LFB0:
.cfi_startproc
jmp .L6 // jump to copy
.p2align 4,,10
.p2align 3
.L4:
addq $1, %rdi // increment pointers
addq $1, %rsi
.L6: // copy
movzbl (%rdi), %eax // get source byte
testb %al, %al // check for 0
movb %al, (%rsi) // move to dest
jne .L4 // loop if nonzero
rep
ret
.cfi_endproc
which is quite similar to what it produces for the K&R loop. Whether that's actually better I can't say, but it looks nicer.
Apart from the jump into the loop, the instructions for the K&R loop are exactly the same, just ordered differently:
.LFB0:
.cfi_startproc
.p2align 4,,10
.p2align 3
.L2:
movzbl (%rdi), %eax // get source byte
addq $1, %rdi // increment source pointer
movb %al, (%rsi) // move byte to dest
addq $1, %rsi // increment dest pointer
testb %al, %al // check for 0
jne .L2 // loop if nonzero
rep
ret
.cfi_endproc

Your second code doesn't "check for null again". In your second version the cycle body works with the characters at edx+eax+1 address (note the +1 part), which would be characters number 1, 2, 3 and so on. The prologue code works with character number 0. That means that the code never checks the same character twice, as you seem to believe. There's no "again" there.
The second code is a bot more convoluted (the first iteration of the cycle is effectively pulled out of it) since, as it has already been explained, its functionality is different. The final values of the pointers differ between your fist and your second version.

Benefit of using short instead of int in for... loop

is there any benefit to using short instead of int in a for loop?
i.e.
for(short j = 0; j < 5; j++) {
99% of my loops involve numbers below 3000, so I was thinking ints would be a waste of bytes. Thanks!

No, there is no benefit. The short will probably end up taking a full register (which is 32 bits, an int) anyway.
You will lose hours typing the extra two letters in the IDE, too. (That was a joke).

No. The loop variable will likely be allocated to a register, so it will end up taking up the same amount of space regardless.

Look at the generated assembler code and you would probably see that using int generates cleaner code.
c-code:
#include <stdio.h>
int main(void) {
int j;
for(j = 0; j < 5; j++) {
printf("%d", j);
}
}
using short:
080483c4 <main>:
80483c4: 55 push %ebp
80483c5: 89 e5 mov %esp,%ebp
80483c7: 83 e4 f0 and $0xfffffff0,%esp
80483ca: 83 ec 20 sub $0x20,%esp
80483cd: 66 c7 44 24 1e 00 00 movw $0x0,0x1e(%esp)
80483d4: eb 1c jmp 80483f2 <main+0x2e>
80483d6: 0f bf 54 24 1e movswl 0x1e(%esp),%edx
80483db: b8 c0 84 04 08 mov $0x80484c0,%eax
80483e0: 89 54 24 04 mov %edx,0x4(%esp)
80483e4: 89 04 24 mov %eax,(%esp)
80483e7: e8 08 ff ff ff call 80482f4 <printf#plt>
80483ec: 66 83 44 24 1e 01 addw $0x1,0x1e(%esp)
80483f2: 66 83 7c 24 1e 04 cmpw $0x4,0x1e(%esp)
80483f8: 7e dc jle 80483d6 <main+0x12>
80483fa: c9 leave
80483fb: c3 ret
using int:
080483c4 <main>:
80483c4: 55 push %ebp
80483c5: 89 e5 mov %esp,%ebp
80483c7: 83 e4 f0 and $0xfffffff0,%esp
80483ca: 83 ec 20 sub $0x20,%esp
80483cd: c7 44 24 1c 00 00 00 movl $0x0,0x1c(%esp)
80483d4: 00
80483d5: eb 1a jmp 80483f1 <main+0x2d>
80483d7: b8 c0 84 04 08 mov $0x80484c0,%eax
80483dc: 8b 54 24 1c mov 0x1c(%esp),%edx
80483e0: 89 54 24 04 mov %edx,0x4(%esp)
80483e4: 89 04 24 mov %eax,(%esp)
80483e7: e8 08 ff ff ff call 80482f4 <printf#plt>
80483ec: 83 44 24 1c 01 addl $0x1,0x1c(%esp)
80483f1: 83 7c 24 1c 04 cmpl $0x4,0x1c(%esp)
80483f6: 7e df jle 80483d7 <main+0x13>
80483f8: c9 leave
80483f9: c3 ret

More often than not, trying to optimize for this will just exacerbate bugs when someone doesn't notice (or forgets) that it's a narrow data type. For instance, check out this bcrypt problem I looked into...pretty typical:
BCrypt says long, similar passwords are equivalent - problem with me, the gem, or the field of cryptography?
Yet the problem is still there as int is a finite size as well. Better to spend your time making sure your program is correct and not creating hazards or security problems from numeric underflows and overflows.
Some of what I talk about w/numeric_limits here might be informative or interesting, if you haven't encountered that yet:
http://hostilefork.com/2009/03/31/modern_cpp_or_modern_art/

Nope. Chances are your counter will end up in a register anyway, and they are typically at least the same size as int

I think there isn't much difference. Your compiler will probably use an entire 32-bit register for the counter variable (in 32-bit mode). You'll waste just two bytes from the stack, at most, in the worst case (when not used a register)

One potential improvement over int as loop counter is unsigned int (or std::size_t where applicable) if the loop index is never going to be negative. Using short instead of int makes no difference in most compilers, here's the ones I have.
Code:
volatile int n;
int main()
{
for(short j = 0; j < 50; j++) // replaced with int in test2
n = j;
}
g++ 4.5.2 -march=native -O3 on x86_64 linux
// using short j // using int j
.L2: .L2:
movl %eax, n(%rip) movl %eax, n(%rip)
incl %eax incl %eax
cmpl $50, %eax cmpl $50, %eax
jne .L2 jne .L2
clang++ 2.9 -march=native -O3 on x86_64 linux
// using short j // using int j
.LBB0_1: .LBB0_1:
movl %eax, n(%rip) movl %eax, n(%rip)
incl %eax incl %eax
cmpl $50, %eax cmpl $50, %eax
jne .LBB0_1 jne .LBB0_1
Intel C++ 11.1 -fast on x86_64 linux
// using short j // using int j
..B1.2: ..B1.2:
movl %eax, n(%rip) movl %eax, n(%rip)
incl %edx incl %eax
movswq %dx, %rax cmpl $50, %eax
cmpl $50, %eax jl ..B1.2
jl ..B1.2
Sun C++ 5.8 -xO5 on sparc
// using short j // using int j
.L900000105: .L900000105:
st %o4,[%o5+%lo(n)] st %o4,[%o5+%lo(n)]
add %o4,1,%o4 add %o4,1,%o4
cmp %o4,49 cmp %o4,49
ble,pt %icc,.L900000105 ble,pt %icc,.L900000105
So of the four compilers I have, only one even had any difference in the result, and, it actually used less bytes in case of int.

As most others have said, computationally there is no advantage and might be worse. However, if the loop variable is used in a computation requiring a short, then it might be justified:
for(short j = 0; j < 5; j++)
{
// void myfunc(short arg1);
myfunc(j);
}
All this really does is prevent a warning message as the value passed would be promoted to an int (depending on compiler, platform, and C++ dialect). But it looks cleaner, IMHO.
Certainly not worth obsessing over. If you are looking to optimize, remember the rules (forget who came up with these):
Don't
Failing Step 1, first Measure
Make a change
If bored, exit, else go to Step 2.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js