Performance of single/double precision SpMV on CPUs - C++

The sparse matrix-vector product is a memory-bound operation due to its very low arithmetic intensity. Since the float storage format requires 4+4=8 bytes per non-zero (value and column index) compared to 4+8=12 bytes for doubles, one should be able to expect about 33% faster execution when switching to floats. I constructed a benchmark which assembles a 1000000x1000000 matrix with 200 non-zeros per row, and then takes the minimum time over 20 multiplications. Source code on github here.
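For illustration, the kind of CSR kernel being benchmarked looks something like the sketch below (names and signature are illustrative, not the actual code from the linked repository):
#include <cstddef>
#include <vector>
// Minimal CSR SpMV sketch: MatT is the matrix value type (float or double),
// VecT is the vector value type.
template <typename MatT, typename VecT>
void spmv_csr(const std::vector<int>  &row_ptr,   // size nrows+1
              const std::vector<int>  &col_idx,   // size nnz
              const std::vector<MatT> &val,       // size nnz
              const std::vector<VecT> &x,         // input vector
              std::vector<VecT>       &y)         // output vector
{
    const std::ptrdiff_t nrows = static_cast<std::ptrdiff_t>(row_ptr.size()) - 1;
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < nrows; ++i) {
        VecT sum = 0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   // one value load, one index load, one indexed vector load
        y[i] = sum;
    }
}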
The results are roughly what I expected. When I run the benchmark on my Intel Core i7-2620M, I see something like a 30% faster execution. The small difference can be seen in the bandwidth drop from about 19.0 GB/s (doubles) to about 18.0 GB/s (floats) out of the 21.3 GB/s in the spec.
Now, since the data for the matrix is almost 1000 times larger than that for the vectors, one would expect that the faster performance should also be obtained for the case where only the matrix is in single precision while the vectors remain doubles. I went on to try this, and then made sure to use the lower precision for the computations. When I run it, however, effective bandwidth usage suddenly drops to about 14.4 GB/s, giving only a 12% faster execution than the full double version. How can one understand this?
I'm using Ubuntu 14.04 with GCC 4.9.3.
Run times:
// double(mat)-double(vec)
Wall time: 0.127577 s
Bandwidth: 18.968 GB/s
Compute: 3.12736 Gflop/s
// float(mat)-float(vec)
Wall time: 0.089386 s
Bandwidth: 18.0333 GB/s
Compute: 4.46356 Gflop/s
// float(mat)-double(vec)
Wall time: 0.112134 s
Bandwidth: 14.4463 GB/s
Compute: 3.55807 Gflop/s
Update
See the answer by Peter Cordes below. In short, dependencies between loop iterations from the float-to-double conversion are responsible for the overhead. By unrolling the loop (see the unroll-loop branch on github), full bandwidth usage is regained for both the float-double and the float-float versions!
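For reference, an unrolled-by-two version of the float-matrix / double-vector row kernel might look like the following sketch (the actual unroll-loop branch may differ; names follow the CSR sketch above):
#include <vector>
// Hedged sketch: process two non-zeros per iteration, with two accumulators.
double spmv_row_unrolled(const std::vector<int>    &col_idx,
                         const std::vector<float>  &val,
                         const std::vector<double> &x,
                         int row_begin, int row_end)
{
    double sum0 = 0.0, sum1 = 0.0;
    int j = row_begin;
    for (; j + 1 < row_end; j += 2) {
        sum0 += static_cast<double>(val[j])     * x[col_idx[j]];
        sum1 += static_cast<double>(val[j + 1]) * x[col_idx[j + 1]];
    }
    if (j < row_end)                           // odd trailing element
        sum0 += static_cast<double>(val[j]) * x[col_idx[j]];
    return sum0 + sum1;
}
As the answer below explains, the main win from unrolling is that each element needs fewer front-end uops; the second accumulator additionally shortens the floating-point add chain.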
New run times:
// float(mat)-float(vec)
Wall time: 0.084455 s
Bandwidth: 19.0861 GB/s
Compute: 4.72417 Gflop/s
// float(mat)-double(vec)
Wall time: 0.0865598 s
Bandwidth: 18.7145 GB/s
Compute: 4.6093 Gflop/s

The double-float loop that has to convert on the fly can't issue quite as fast. With some loop unrolling, gcc would probably do a better job.
Your i7-2620M is a dual-core with hyperthreading. Hyperthreading doesn't help when the bottleneck is CPU uop throughput, rather than branch mispredicts, cache misses, or even just long latency chains. Saturating your memory bandwidth with just scalar operations isn't easy.
From the asm output for your code on the Godbolt Compiler Explorer: gcc 5.3 makes about the same inner loop, BTW, so you're not losing out on much in this case by using an old gcc version.
double-double version inner loop (gcc 4.9.3 -O3 -march=sandybridge -fopenmp):
## inner loop of <double,double>mult() with fused-domain uop counts
.L7:
mov edx, eax # 1 uop
add eax, 1 # 1 uop
mov ecx, DWORD PTR [r9+rdx*4] # 1 uop
vmovsd xmm0, QWORD PTR [r10+rdx*8] # 1 uop
vmulsd xmm0, xmm0, QWORD PTR [r8+rcx*8] # 2 uops
vaddsd xmm1, xmm1, xmm0 # 1 uop
cmp eax, esi # (macro-fused)
jne .L7 # 1 uop
total: 8 fused-domain uops, can issue at one iter per two clocks. It can also execute that fast: Three of the uops are loads, and SnB can do 4 loads per 2 clocks. 5 ALU uops are left (since SnB can't eliminate reg-reg moves in the rename stage, that was introduced with IvB). Anyway, there are no obvious bottlenecks on a single execution port. SnB's three ALU ports could handle up to six ALU uops per two cycles.
There's no micro-fusion because of using two-register addressing modes.
double-float version inner loop:
## inner loop of <double,float>mult() with fused-domain uop counts
.L7:
mov edx, eax # 1 uop
vxorpd xmm0, xmm0, xmm0 # 1 uop (no execution unit needed).
add eax, 1 # 1 uop
vcvtss2sd xmm0, xmm0, DWORD PTR [r9+rdx*4] # 2 uops
mov edx, DWORD PTR [r8+rdx*4] # 1 uop
vmulsd xmm0, xmm0, QWORD PTR [rsi+rdx*8] # 2 uops
vaddsd xmm1, xmm1, xmm0 # 1 uop
cmp eax, ecx # (macro-fused)
jne .L7 # 1 uop
gcc uses the xorpd to break the loop-carried dependency chain. cvtss2sd has a false dependency on the old value of xmm0, because it's badly designed and doesn't zero the top half of the register. (movsd when used as a load does zero, but not when used as a reg-reg move. In that case, use movaps unless you want merging.)
So, 10 fused-domain uops: can issue at one iteration per three clocks. I assume this is the only bottleneck, since it's just one extra ALU uop that needs an execution port. (SnB handles zeroing-idioms in the rename stage, so xorpd doesn't need one). cvtss2sd is a 2 uop instruction that apparently can't micro-fuse even if gcc used a one-register addressing mode. It has a throughput of one per clock. (On Haswell, it's a 2 uop instruction when used with a register src and dest, and on Skylake the throughput is reduced to one per 2 clocks, according to Agner Fog's testing.) That still wouldn't be a bottleneck for this loop on Skylake, though. It's still 10 fused-domain uops on Haswell / Skylake, and that's still the bottleneck.
-funroll-loops should help with gcc 4.9.3
gcc does a moderately good job, with code like
mov edx, DWORD PTR [rsi+r14*4] # D.56355, *_40
lea r14d, [rax+2] # D.56355,
vcvtss2sd xmm6, xmm4, DWORD PTR [r8+r14*4] # D.56358, D.56358, *_36
vmulsd xmm2, xmm1, QWORD PTR [rcx+rdx*8] # D.56358, D.56358, *_45
vaddsd xmm14, xmm0, xmm13 # tmp, tmp, D.56358
vxorpd xmm1, xmm1, xmm1 # D.56358
mov edx, DWORD PTR [rsi+r14*4] # D.56355, *_40
lea r14d, [rax+3] # D.56355,
vcvtss2sd xmm10, xmm9, DWORD PTR [r8+r14*4] # D.56358, D.56358, *_36
vmulsd xmm7, xmm6, QWORD PTR [rcx+rdx*8] # D.56358, D.56358, *_45
vaddsd xmm3, xmm14, xmm2 # tmp, tmp, D.56358
vxorpd xmm6, xmm6, xmm6 # D.56358
Without the loop overhead, the work for each element is down to 8 fused-domain uops, and it's not a tiny loop that suffers from only issuing 2 uops every 3rd cycle (because 10 isn't a multiple of 4).
It could save the lea instructions by using displacements, e.g. [r8+rax*4 + 12]. IDK why gcc chooses not to.
Not even -ffast-math gets it to vectorize at all. There's probably no point, since the cost of gathering the scattered elements of the dense source vector would outweigh the benefit of loading 4 or 8 contiguous values from the sparse matrix's value array. (insertps from memory is a 2-uop instruction that can't micro-fuse even with one-register addressing modes.)
On Broadwell or Skylake, vgatherdps may be fast enough to give a speedup over the scalar version. Probably a big speedup on Skylake. (It can gather 8 single-precision floats with a throughput of 8 floats per 5 clocks; vgatherqpd can gather 4 doubles with a throughput of 4 doubles per 4 clocks.) This sets you up for a 256b vector FMA.
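For illustration, a gathered FMA inner loop for the all-float case could look like this sketch (not from the benchmark; it assumes the row length is a multiple of 8 and uses the same array names as the CSR sketch near the top):
#include <immintrin.h>
// Hedged sketch: AVX2 gather + FMA over one CSR row of single-precision data.
float spmv_row_avx2(const float *val, const int *col_idx, const float *x, int nnz_row)
{
    __m256 acc = _mm256_setzero_ps();
    for (int j = 0; j < nnz_row; j += 8) {
        __m256i idx = _mm256_loadu_si256((const __m256i *)(col_idx + j)); // 8 column indices
        __m256  xv  = _mm256_i32gather_ps(x, idx, 4);                     // vgatherdps
        __m256  mv  = _mm256_loadu_ps(val + j);                           // 8 contiguous matrix values
        acc = _mm256_fmadd_ps(mv, xv, acc);                               // 256-bit FMA
    }
    // horizontal sum of the 8 partial sums
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
    lo = _mm_add_ss(lo, _mm_movehdup_ps(lo));
    return _mm_cvtss_f32(lo);
}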

Related

Load and duplicate 4 single precision float numbers into a packed __m256 variable with fewest instructions

I have a float array containing 4 float numbers A,B,C,D, and I wish to load them into a __m256 variable like AABBCCDD. What's the best way to do this?
I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions. Thanks.
If your data was the result of another vector calculation (and in a __m128), you'd want AVX2 vpermps (_mm256_permutexvar_ps) with a control vector of _mm256_set_epi32(3,3, 2,2, 1,1, 0,0).
vpermps ymm is 1 uop on Intel, but 2 uops on Zen2 (with 2 cycle throughput). And 3 uops on Zen1 with one per 4 clock throughput. (https://uops.info/)
If it was the result of separate scalar calculations, you might want to shuffle them together with _mm_set_ps(d,d, c,c) (1x vshufps) to set up for a vinsertf128.
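A hedged sketch of that separate-scalars case (function name is mine; the compiler will typically use one shuffle such as vshufps or vunpcklps per half once the scalars are in registers):
#include <immintrin.h>
__m256 dup4_from_scalars(float a, float b, float c, float d)
{
    __m128 lo = _mm_set_ps(b, b, a, a);   // {a,a,b,b}
    __m128 hi = _mm_set_ps(d, d, c, c);   // {c,c,d,d}
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);  // vinsertf128
}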
But with data in memory, I think your best bet is a 128-bit broadcast-load, then an in-lane shuffle. It only requires AVX1, and on modern CPUs it's 1 load + 1 shuffle uop on Zen2 and Haswell and later. It's also efficient on Zen1: the only lane-crossing shuffle being the 128-bit broadcast-load.
Using an in-lane shuffle is lower-latency than lane-crossing on both Intel and Zen2 (256-bit shuffle execution units). This still requires a 32-byte shuffle control vector constant, but if you need to do this frequently it will typically / hopefully stay hot in cache.
__m256 duplicate4floats(void *p) {
    __m256 v = _mm256_broadcast_ps((const __m128 *) p);                 // vbroadcastf128
    v = _mm256_permutevar_ps(v, _mm256_set_epi32(3,3, 2,2, 1,1, 0,0));  // vpermilps
    return v;
}
Modern CPUs handle broadcast-loads right in the load port, no shuffle uop needed. (Sandybridge does need a port 5 shuffle uop for vbroadcastf128, unlike narrower broadcasts, but Haswell and later are purely port 2/3. But SnB doesn't support AVX2 so a lane-crossing shuffle with granularity less than 128-bit wasn't an option.)
So even if AVX2 is available, I think AVX1 instructions are more efficient here. On Zen1, vbroadcastf128 is 2 uops, vs. 1 for a 128-bit vmovups, but vpermps (lane-crossing) is 3 uops vs. 2 for vpermilps.
Unfortunately, clang pessimizes this into a vmovups load and a vpermps ymm, but GCC compiles it as written. (Godbolt)
If you wanted to avoid using a shuffle-control vector constant, vpmovzxdq ymm, [mem] (2 uops on Intel) could get the elements set up for vmovsldup (1 uop in-lane shuffle). Or broadcast-load and vunpckl/hps then blend?
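A hedged sketch of that first constant-free alternative (function name is mine; the unpck + blend idea is not shown):
#include <immintrin.h>
__m256 duplicate4floats_no_constant(const float *p)
{
    // vpmovzxdq: spread the 4 dwords (float bit patterns) into qword slots -> A 0 B 0 C 0 D 0
    __m256i spread = _mm256_cvtepu32_epi64(_mm_loadu_si128((const __m128i *)p));
    // vmovsldup: duplicate the low dword of each pair -> A A B B C C D D
    return _mm256_moveldup_ps(_mm256_castsi256_ps(spread));
}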
I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions.
Get a better compiler, then! (Or remember to enable optimization.)
__m256 duplicate4floats_naive(const float *p) {
    return _mm256_set_ps(p[3],p[3], p[2],p[2], p[1],p[1], p[0],p[0]);
}
compiles with gcc (https://godbolt.org/z/dMzh3fezE) into
duplicate4floats_naive(float const*):
vmovups xmm1, XMMWORD PTR [rdi]
vpermilps xmm0, xmm1, 80
vpermilps xmm1, xmm1, 250
vinsertf128 ymm0, ymm0, xmm1, 0x1
ret
So 3 shuffle uops, not great. And it could have used vshufps instead of vpermilps to save code-size and let it run on more ports on Ice Lake. But still vastly better than 8 instructions.
clang's shuffle optimizer makes the same asm as with my optimized intrinsics, because that's how clang is. It's pretty decent optimization, just not quite optimal.
duplicate4floats_naive(float const*):
vmovups xmm0, xmmword ptr [rdi]
vmovaps ymm1, ymmword ptr [rip + .LCPI1_0] # ymm1 = [0,0,1,1,2,2,3,3]
vpermps ymm0, ymm1, ymm0
ret

Clang generates worse code for 7 comparisons than for 8 comparisons

I was intrigued by clang's ability to convert many == comparisons of small integers into one big SIMD instruction, but then I noticed something strange.
Clang generated "worse" code (in my amateur evaluation) when I had 7 comparisons than when I had 8 comparisons.
bool f1(short x){
    return (x==-1) | (x == 150) |
           (x==5)  | (x==64) |
           (x==15) | (x==223) |
           (x==42) | (x==47);
}
bool f2(short x){
    return (x==-1) | (x == 150) |
           (x==5)  | (x==64) |
           (x==15) | (x==223) |
           (x==42);
}
My question is: is this a small performance bug, or does clang have a very good reason for not wanting to introduce a dummy comparison (i.e. pretend that there is one extra comparison with one of the 7 values) and use one more constant in the code to achieve it?
godbolt link here:
# clang(trunk) -O2 -march=haswell
f1(short):
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0 # set1(x)
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0] # 16 bytes = 8 shorts
vpacksswb xmm0, xmm0, xmm0
vpmovmskb eax, xmm0
test al, al
setne al # booleanize the parallel-compare bitmask
ret
vs.
f2(short):
cmp di, -1
sete r8b
cmp edi, 150
sete dl
cmp di, 5 # scalar checks of 3 conditions
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI1_0] # low 8 bytes = 4 shorts
sete al
vpmovsxwd xmm0, xmm0
vmovmskps esi, xmm0
test sil, sil
setne cl # SIMD check of the other 4
or al, r8b
or al, dl
or al, cl # and combine.
ret
quickbench does not seem to work because IDK how to provide the -mavx2 flag to it. (Editor's note: simply counting uops for front-end cost shows this is obviously worse for throughput, and also latency.)
It looks like clang's optimizer didn't think of duplicating an element to bring it up to a SIMD-convenient number of comparisons. But you're right, that would be better than doing extra scalar work. Clearly a missed optimization which should get reported as a clang/LLVM optimizer bug. https://bugs.llvm.org/
The asm for f1() is clearly better than f2(): vpacksswb xmm has the same cost as vpmovsxwd xmm on mainstream Intel and AMD CPUs, like other single-uop shuffles. And if anything vpmovsx -> vmovmskps could have bypass latency between integer and FP domains (footnote 1).
Footnote 1: Probably no extra bypass latency on mainstream Intel CPUs with AVX2 (Sandybridge-family); integer shuffles between FP ops are typically fine, IIRC. (https://agner.org/optimize/). But for an SSE4.1 version on Nehalem, yes there might be an extra penalty the integer version wouldn't have.
You don't need AVX2, but word-broadcast in one instruction without a pshufb control vector does make it more efficient. And clang chooses pshuflw -> pshufd for -march=nehalem.
Of course, both versions are sub-optimal. There's no need to shuffle to compress the compare result before movemask.
Instead of test al, al, it's possible to select which bits you want to check with test sil, 0b00001010 for example, to check bits 1 and 3 but ignore non-zero bits in other positions.
pcmpeqw sets both bytes the same inside a word element so it's fine to pmovmskb that result and get an integer with pairs of bits.
There's also zero benefit to using a byte register instead of a dword register: test sil,sil should avoid the REX prefix and use test esi,esi.
So even without duplicating one of the conditions, f2() could be:
f2:
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0 # set1(x)
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpmovmskb eax, xmm0
test eax, 0b011111111111111 # (1<<14) - 1 = low 14 bits set
setne al
ret
That test will set ZF according to the low 14 bits of the pmovmskb result, because the higher bits are cleared in the TEST mask. TEST = AND that doesn't write its output. Often useful for selecting parts of a compare mask.
But since we need a 16-byte constant in memory in the first place, yes we should duplicate one of the elements to pad it up to 8 elements. Then we can use test eax,eax like a normal person. Compressing the mask to fit in 8-bit AL is a total waste of time and code-size. test r32, r32 is just as fast as test r8,r8 and doesn't need a REX prefix for SIL, DIL, or BPL.
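A hedged intrinsics sketch of that padded 8-element idea (function name is mine; one of the 7 values is simply repeated to fill the vector):
#include <immintrin.h>
bool f2_padded_simd(short x)
{
    __m128i bx  = _mm_set1_epi16(x);                                // vmovd + broadcast
    __m128i tab = _mm_setr_epi16(-1, 150, 5, 64, 15, 223, 42, 42);  // 42 duplicated as padding
    __m128i eq  = _mm_cmpeq_epi16(bx, tab);                         // vpcmpeqw
    return _mm_movemask_epi8(eq) != 0;                              // vpmovmskb + test/setne
}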
Fun fact: AVX512BW+VL would let us use vpbroadcastw xmm0, edi to combine the movd and broadcast.
Or to compare only 4 elements, instead of extra shuffling for movmskps, we only need SSE2 here. And using a mask truly is useful.
test_4_possibilities_SSE2:
movd xmm0, edi
pshufd xmm0, xmm0, 0 # set1_epi32(x)
pcmpeqw xmm0, [const] # == set_epi32(a, b, c, d)
pmovmskb eax, xmm0
test eax, 0b0001000100010001 # the low bit of each group of 4
setne al
ret
We do a dword broadcast and ignore the compare result in the high 16 bits of each 32-bit element. Using a mask for test lets us do that more cheaply than any extra instruction would.
Without AVX2, a SIMD dword broadcast with pshufd is cheaper than needing a word broadcast.
Another option is to imul with 0x00010001 to broadcast a word into a 32-bit register, but that has 3 cycle latency so it's potentially worse than punpcklwd -> pshufd.
Inside a loop, though, it would be worth loading a control vector for pshufb (SSSE3) instead of using 2 shuffles or an imul.

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get sum of values stored in __m256d variable? I have this code.
acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
// acc at this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm256_hadd_pd(acc, acc);
result[i] = ((double*)&acc)[0] + ((double*)&acc)[2];
This code works, but I want to replace it with SSE/AVX instructions.
It appears that you're doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all.
For a dot-product of an array larger than one vector, sum vertically (into multiple accumulators), only hsumming once at the end.
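For illustration, the "sum vertically, hsum once at the end" pattern for a plain dot product could look like the sketch below (assumes n is a multiple of 8; uses the hsum_double_avx helper defined further down):
#include <immintrin.h>
#include <cstddef>
double hsum_double_avx(__m256d v);   // defined below in this answer
double dot_avx(const double *a, const double *b, std::size_t n)
{
    __m256d acc0 = _mm256_setzero_pd();   // two accumulators hide the add latency
    __m256d acc1 = _mm256_setzero_pd();
    for (std::size_t i = 0; i < n; i += 8) {
        acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_loadu_pd(a + i),
                                                 _mm256_loadu_pd(b + i)));
        acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_loadu_pd(a + i + 4),
                                                 _mm256_loadu_pd(b + i + 4)));
    }
    return hsum_double_avx(_mm256_add_pd(acc0, acc1));  // one horizontal sum at the very end
}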
For horizontal reductions in general, see Fastest way to do horizontal SSE vector sum (or other reduction) - extract the high half and add to the low half. Repeat until you're down to 1 element.
If you're using this inside an inner loop, you definitely don't want to be using hadd(same,same). That costs 2 shuffle uops instead of 1, unless your compiler saves you from yourself. (And gcc/clang don't.) hadd is good for code-size but pretty much nothing else when you only have 1 vector. It can be useful and efficient with two different inputs.
For AVX, this means the only 256-bit operation we need is an extract, which is fast on AMD and Intel. Then the rest is all 128-bit:
#include <immintrin.h>
inline
double hsum_double_avx(__m256d v) {
    __m128d vlow   = _mm256_castpd256_pd128(v);
    __m128d vhigh  = _mm256_extractf128_pd(v, 1); // high 128
            vlow   = _mm_add_pd(vlow, vhigh);     // reduce down to 128
    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);
    return _mm_cvtsd_f64(_mm_add_sd(vlow, high64)); // reduce to scalar
}
If you wanted the result broadcast to every element of a __m256d, you'd use vshufpd and vperm2f128 to swap high/low halves (if tuning for Intel). And use 256-bit FP add the whole time. If you cared about early Ryzen at all, you might reduce to 128, use _mm_shuffle_pd to swap, then vinsertf128 to get a 256-bit vector. Or with AVX2, vbroadcastsd on the final result of this. But that would be slower on Intel than staying 256-bit the whole time while still avoiding vhaddpd.
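A hedged sketch of that broadcast-result variant, tuning for Intel and staying 256-bit the whole time (function name is mine):
#include <immintrin.h>
__m256d hsum_broadcast_avx(__m256d v)
{
    // swap the two doubles within each 128-bit lane (vshufpd / vpermilpd) and add
    __m256d t = _mm256_add_pd(v, _mm256_shuffle_pd(v, v, 0x5));
    // swap the 128-bit halves (vperm2f128) and add: every element now holds the total
    return _mm256_add_pd(t, _mm256_permute2f128_pd(t, t, 0x01));
}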
hsum_double_avx compiled with gcc7.3 -O3 -march=haswell on the Godbolt compiler explorer:
vmovapd xmm1, xmm0 # silly compiler, vextract to xmm1 instead
vextractf128 xmm0, ymm0, 0x1
vaddpd xmm0, xmm1, xmm0
vunpckhpd xmm1, xmm0, xmm0 # no wasted code bytes on an immediate for vpermilpd or vshufpd or anything
vaddsd xmm0, xmm0, xmm1 # scalar means we never raise FP exceptions for results we don't use
vzeroupper
ret
After inlining (which you definitely want it to), vzeroupper sinks to the bottom of the whole function, and hopefully the vmovapd optimizes away, with vextractf128 into a different register instead of destroying xmm0 which holds the _mm256_castpd256_pd128 result.
On first-gen Ryzen (Zen 1 / 1+), according to Agner Fog's instruction tables, vextractf128 is 1 uop with 1c latency, and 0.33c throughput.
Paul R's version (the next answer below) is unfortunately terrible on AMD before Zen 2; it's like something you might find in an Intel library or compiler output as a "cripple AMD" function. (I don't think Paul did that on purpose, I'm just pointing out how ignoring AMD CPUs can lead to code that runs slower on them.)
On Zen 1, vperm2f128 is 8 uops, 3c latency, and one per 3c throughput. vhaddpd ymm is 8 uops (vs. the 6 you might expect), 7c latency, one per 3c throughput. Agner says it's a "mixed domain" instruction. And 256-bit ops always take at least 2 uops.
# Paul's version # Ryzen # Skylake
vhaddpd ymm0, ymm0, ymm0 # 8 uops # 3 uops
vperm2f128 ymm1, ymm0, ymm0, 49 # 8 uops # 1 uop
vaddpd ymm0, ymm0, ymm1 # 2 uops # 1 uop
# total uops: # 18 # 5
vs.
# my version with vmovapd optimized out: extract to a different reg
vextractf128 xmm1, ymm0, 0x1 # 1 uop # 1 uop
vaddpd xmm0, xmm1, xmm0 # 1 uop # 1 uop
vunpckhpd xmm1, xmm0, xmm0 # 1 uop # 1 uop
vaddsd xmm0, xmm0, xmm1 # 1 uop # 1 uop
# total uops: # 4 # 4
Total uop throughput is often the bottleneck in code with a mix of loads, stores, and ALU, so I expect the 4-uop version is likely to be at least a little better on Intel, as well as much better on AMD. It should also make slightly less heat, and thus allow slightly higher turbo / use less battery power. (But hopefully this hsum is a small enough part of your total loop that this is negligible!)
The latency is not worse, either, so there's really no reason to use an inefficient hadd / vperm2f128 version.
Zen 2 and later have 256-bit wide vector registers and execution units (including shuffle). They don't have to split lane-crossing shuffles into many uops, but conversely vextractf128 is no longer about as cheap as vmovdqa xmm. Zen 2 is a lot closer to Intel's cost model for 256-bit vectors.
You can do it like this:
acc = _mm256_hadd_pd(acc, acc); // horizontal add top lane and bottom lane
acc = _mm256_add_pd(acc, _mm256_permute2f128_pd(acc, acc, 0x31)); // add lanes
result[i] = _mm256_cvtsd_f64(acc); // extract double
Note: if this is in a "hot" (i.e. performance-critical) part of your code (especially if running on an AMD CPU) then you might instead want to look at Peter Cordes's answer regarding more efficient implementations.
In gcc and clang SIMD types are built-in vector types. E.g.:
# avxintrin.h
typedef double __m256d __attribute__((__vector_size__(32), __aligned__(32)));
These built-in vectors support indexing, so you can write it conveniently and leave it up to the compiler to make good code:
double hsum_double_avx2(__m256d v) {
    return v[0] + v[1] + v[2] + v[3];
}
clang-14 -O3 -march=znver3 -ffast-math generates the same assembly as it does for Peter Cordes's intrinsics:
# clang -O3 -ffast-math
hsum_double_avx2:
vextractf128 xmm1, ymm0, 1
vaddpd xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddsd xmm0, xmm0, xmm1
vzeroupper
ret
Unfortunately gcc does much worse, generating sub-optimal instructions: it doesn't take advantage of the freedom to re-associate the 3 + operations, and it uses vhaddpd xmm to do the v[0] + v[1] part, which costs 4 uops on Zen 3. (Or 3 uops on Intel CPUs, 2 shuffles + an add.)
-ffast-math is of course necessary for the compiler to be able to do a good job, unless you write it as (v[0]+v[2]) + (v[1]+v[3]). With that, clang still makes the same asm with -O3 -march=icelake-server without -ffast-math.
Ideally, I want to write plain code as I did above and let the compiler use a CPU-specific cost model to emit optimal instructions in right order for that specific CPU.
One reason is that a labour-intensive hand-coded optimal version for Haswell may well be suboptimal for Zen 3. For this problem specifically, that's not really the case: starting by narrowing to 128-bit with vextractf128 + vaddpd is optimal everywhere. There are minor variations in shuffle throughput on different CPUs; for example Ice Lake and later Intel can run vshufps on port 1 or 5, but some shuffles like vpermilps/pd or vunpckhpd still only on port 5. Zen 3 (like Zen 2 and 4) has good throughput for either of those shuffles so clang's asm happens to be good there. But it's unfortunate that clang -march=icelake-server still uses vpermilpd.
A frequent use-case nowadays is computing in the cloud with diverse CPU models and generations, compiling the code on that host with -march=native -mtune=native for best performance.
In theory, if compilers were smarter, this would optimize short sequences like this to ideal asm, as well as making generally good choices for heuristics like inlining and unrolling. It's usually the best choice for a binary that will run on only one machine, but as GCC demonstrates here, the results are often far from optimal. Fortunately modern AMD and Intel aren't too different most of the time, having different throughputs for some instructions but usually being single-uop for the same instructions.

Why does ICC unroll this loop in this way and use lea for arithmetic?

Looking at the ICC 17 generated code for iterating over a std::unordered_map<> (using https://godbolt.org) left me very confused.
I distilled down the example to this:
long count(void** x)
{
    long i = 0;
    while (*x)
    {
        ++i;
        x = (void**)*x;
    }
    return i;
}
Compiling this with ICC 17, with the -O3 flag, leads to the following disassembly:
count(void**):
xor eax, eax #6.10
mov rcx, QWORD PTR [rdi] #7.11
test rcx, rcx #7.11
je ..B1.6 # Prob 1% #7.11
mov rdx, rax #7.3
..B1.3: # Preds ..B1.4 ..B1.2
inc rdx #7.3
mov rcx, QWORD PTR [rcx] #7.11
lea rsi, QWORD PTR [rdx+rdx] #9.7
lea rax, QWORD PTR [-1+rdx*2] #9.7
test rcx, rcx #7.11
je ..B1.6 # Prob 18% #7.11
mov rcx, QWORD PTR [rcx] #7.11
mov rax, rsi #9.7
test rcx, rcx #7.11
jne ..B1.3 # Prob 82% #7.11
..B1.6: # Preds ..B1.3 ..B1.4 ..B1.1
ret #12.10
Compared to the obvious implementation (which gcc and clang use, even for -O3), it seems to do a few things differently:
It unrolls the loop, with two pointer dereferences (x = *x) before looping back - however, there is a conditional jump in the middle of it all.
It uses lea for some of the arithmetic
It keeps a counter (inc rdx) for every two iterations of the while loop, and immediately computes the corresponding counters for every iteration (into rax and rsi)
What are the potential benefits to doing all this? I assume it may have something to do with scheduling?
Just for comparison, this is the code generated by gcc 6.2:
count(void**):
mov rdx, QWORD PTR [rdi]
xor eax, eax
test rdx, rdx
je .L4
.L3:
mov rdx, QWORD PTR [rdx]
add rax, 1
test rdx, rdx
jne .L3
rep ret
.L4:
rep ret
This isn't a great example because the loop trivially bottlenecks on pointer-chasing latency, not uop throughput or any other kind of loop-overhead. But there can be cases where having fewer uops helps an out-of-order CPU see farther ahead, maybe. Or we can just talk about the optimizations to the loop structure and pretend they matter, e.g. for a loop that did something else.
Unrolling is potentially useful in general, even when the loop trip-count is not computable ahead of time. (e.g. in a search loop like this one, which stops when it finds a sentinel). A not-taken conditional branch is different from a taken branch, since it doesn't have any negative impact on the front-end (when it predicts correctly).
Basically ICC just did a bad job unrolling this loop. The way it uses LEA and MOV to handle i is pretty braindead, since it used more uops than two inc rax instructions. (Although it does make the critical path shorter, on IvB and later which have zero-latency mov r64, r64, so out-of-order execution can get ahead on running those uops).
Of course, since this particular loop bottlenecks on the latency of pointer-chasing, you're getting at best a long-chain throughput of one per 4 clocks (L1 load-use latency on Skylake, for integer registers), or one per 5 clocks on most other Intel microarchitectures. (I didn't double-check these latencies; don't trust those specific numbers, but they're about right).
IDK if ICC analyses loop-carried dependency chains to decide how to optimize. If so, it should probably have just not unrolled at all, if it knew it was doing a poor job when it did try to unroll.
For a short chain, out-of-order execution might be able to get started on running something after the loop, if the loop-exit branch predicts correctly. In that case, it is useful to have the loop optimized.
Unrolling also throws more branch-predictor entries at the problem. Instead of one loop-exit branch with a long pattern (e.g. not-taken after 15 taken), you have two branches. For the same example, one that's never taken, and one that's taken 7 times then not-taken the 8th time.
Here's what a hand-written unrolled-by-two implementation looks like:
Fix up i in the loop-exit path for one of the exit points, so you can handle it cheaply inside the loop.
count(void**):
xor eax, eax # counter
mov rcx, QWORD PTR [rdi] # *x
test rcx, rcx
je ..B1.6
.p2align 4 # mostly to make it more likely that the previous test/je doesn't decode in the same block at the following test/je, so it doesn't interfere with macro-fusion on pre-HSW
.loop:
mov rcx, QWORD PTR [rcx]
test rcx, rcx
jz .plus1
mov rcx, QWORD PTR [rcx]
add rax, 2
test rcx, rcx
jnz .loop
..B1.6:
ret
.plus1: # exit path for odd counts
inc rax
ret
This makes the loop body 5 fused-domain uops if both TEST/JCC pairs macro-fuse. Haswell can make two fusions in a single decode group, but earlier CPUs can't.
gcc's implementation is only 3 uops, which is less than the issue width of the CPU. See this Q&A about small loops issuing from the loop buffer. No CPU can actually execute/retire more than one taken branch per clock, so it's not easily possible to test how CPUs issue loops with less than 4 uops, but apparently Haswell can issue a 5-uop loop at one per 1.25 cycles. Earlier CPUs might only issue it at one per 2 cycles.
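For reference, a hedged C++ rendering of the same unroll-by-two idea (not part of the original answer), with the count fix-up on the early-exit path:
long count_unrolled(void **x)
{
    long i = 0;
    void *p = *x;
    while (p) {
        p = *(void **)p;          // first link of the pair
        if (!p)
            return i + 1;         // odd chain length: fix up the count on this exit path
        p = *(void **)p;          // second link of the pair
        i += 2;
    }
    return i;
}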
There's no definite answer as to why it does this, as it is a proprietary compiler; only Intel knows why. That said, the Intel compiler is often more aggressive in loop optimization. That does not mean it is better. I have seen situations where Intel's aggressive inlining led to worse performance than clang/gcc. In those cases, I had to explicitly forbid inlining at some call sites. Similarly, it is sometimes necessary to forbid unrolling via pragmas in Intel C++ to get better performance.
lea is a particularly useful instruction. It allows one shift, two additions, and one move, all in just one instruction. It is much faster than doing these operations separately. However, it does not always make a difference, and if lea is used only for an addition or a move, it may or may not be better. So you see in 7.11 it uses a move, while in the next two lines lea is used to do an addition plus a move, and an addition, a shift, plus a move.
I don't see that there's an obvious benefit here.

Is it possible/efficient to put FPU exceptions or inf to work?

I have code like this:
loop 10 M:
    if (fz != 0.0)
    {
        fhx += hx / fz;
    }
This is called 10M times in a loop and needs to be very fast. I only need to check that fz is not zero, so as not to cause a division-by-zero error, but it is a very rare case;
indeed, out of 10M cases it might be zero once, twice, or never.
Can I somehow get rid of these 10M ifs and use NaN/inf, or maybe catch the exception and continue? (If fz is zero I need fhx += 0.0, i.e. nothing, just continue.)
Is it possible/efficient to put FPU exceptions or inf to work?
(I'm using C++ / mingw32.)
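A hedged C++ rendering of the structure being described (where hx and fz come from is not specified in the question; arrays are assumed here purely for illustration):
#include <cstddef>
double accumulate(const double *hx, const double *fz, std::size_t n)   // n ~ 10M
{
    double fhx = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (fz[i] != 0.0)          // the per-iteration test the question wants to avoid
            fhx += hx[i] / fz[i];
    }
    return fhx;
}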
You can, but it's probably not that useful. Masking won't be useful either under the circumstances.
Exceptions are extremely slow when they happen, first a lot of microcoded complex stuff has to happen before the CPU even enters the kernel level exception handler, and then it has to hand it off to your process in a complicated and slow way too. On the other hand, they don't cost anything when they don't happen.
But a comparison and a branch don't really cost anything either, as long as the branch is predictable, which a branch that is essentially never taken is. Of course it costs a little throughput to make them happen at all, but they're not on the critical path... but even if they were, the real problem here is a division in every iteration.
The throughput of that division is 1 per 14 cycles anyway (on Haswell - worse on other µarchs), unless fz is particularly "nice", and even then it's 1 per 8 cycles (again on Haswell). On Core2 it was more like 19 and 5, on P4 it was more like (in typical P4 fashion) one division per 71 cycles no matter what.
A well-predicted branch and a comparison just disappear into that. On my 4770K, the difference between having a comparison and branch there or not disappeared into the noise (maybe if I run it enough times I will eventually obtain a statistically significant difference, but it will be tiny), with both of them winning randomly about half the time. The code I used for this benchmark was
global bench
proc_frame bench
push r11
[endprolog]
xor ecx, ecx
mov rax, rcx
mov ecx, -10000000
vxorps xmm1, xmm1
vxorps xmm2, xmm2
vmovapd xmm3, [rel doubleone]
_bench_loop:
imul eax, ecx, -0xAAAAAAAB ; distribute zeroes somewhat randomly
shr eax, 1 ; increase to make more zeroes
vxorps xmm0, xmm0
vcvtsi2sd xmm0, eax
vcomisd xmm0, xmm1 ; #
jz _skip ; #
vdivsd xmm0, xmm3, xmm0
vaddsd xmm2, xmm0
_skip:
add ecx, 1
jnz _bench_loop
vmovapd xmm0, xmm2
pop r11
ret
endproc_frame
The other function was the same but with the two lines marked with a # commented out.
The version that eventually consistently wins when the number of zeroes is increased is the one with the branch, indicating that division by zero is significantly slower than a branch misprediction. That's without even using the exception mechanism to create a programmer-visible exception, it's just from the cost of the micro-coded "weird case fix-up" thing running. But you don't have that many zeroes, so,
TL;DR there isn't really a difference.