How to hide SHLD delay? - c++

I have a simple bit reader which uses the SHLD instruction (__shiftleft128) to read a bit stream.
This works great. However, I have been doing some profiling and I notice that whatever instruction comes after the SHLD instruction takes a lot of time.
Assembly                          CPU Time     Instructions Retired
add r10b, r9b                     19.000ms      92,000,000
cmp r10b, 0x40                    58.000ms     180,000,000
jb 0x140016fa6 <Block 24>
Block 23:
and r10b, 0x3f                    43.000ms     204,000,000
mov r15, r11                      30.000ms      52,000,000
mov qword ptr [rbp+0x20], r11
add rbx, 0x8                      16.000ms      78,000,000
mov qword ptr [rbp+0x10], rbx
mov r11, qword ptr [rbx]           6.000ms      44,000,000
bswap r11                          2.000ms
mov qword ptr [rbp+0x28], r11      8.000ms      20,000,000
Block 24:
mov rdx, r15                      61.000ms     208,000,000
movzx ecx, r10b                    1.000ms       6,000,000
shld rdx, r11, cl                 24.000ms      58,000,000
inc edi                          127.000ms     470,000,000
As you can see in the table above, the inc instruction after the shld instruction takes a lot of time (8% of CPU time).
I would like to know a bit more about why this is the case and how I can avoid it. Are there any instructions that can run in parallel with shld at the CPU level?
I remember reading about shld in some AMD optimization manual but I can't find it again.
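For reference, the profiled block corresponds roughly to a refill-and-peek step like the following (a simplified sketch reconstructed from the disassembly above; only __shiftleft128 and _byteswap_uint64 are real MSVC intrinsics from <intrin.h>/<stdlib.h>, every other name is made up):
#include <intrin.h>
#include <stdlib.h>

struct BitReader {
    const unsigned char* ptr;   // rbx:  current read position in the stream
    unsigned __int64 cur;       // r11:  most recently loaded (byte-swapped) qword
    unsigned __int64 prev;      // r15:  previously loaded qword
    unsigned bitPos;            // r10b: bit position consumed so far
};

unsigned __int64 PeekBits(BitReader& s, unsigned count)
{
    s.bitPos += count;                                  // add r10b, r9b
    if (s.bitPos >= 64) {                               // cmp r10b, 0x40 / jb
        s.bitPos &= 63;                                 // and r10b, 0x3f
        s.prev = s.cur;                                 // mov r15, r11
        s.ptr += 8;                                     // add rbx, 8
        s.cur = _byteswap_uint64(                       // mov r11, [rbx] / bswap r11
            *reinterpret_cast<const unsigned __int64*>(s.ptr));
    }
    // shld rdx, r11, cl: shift prev left by bitPos, filling from the top bits of cur
    return __shiftleft128(s.cur, s.prev, static_cast<unsigned char>(s.bitPos));
}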

Hard to tell, but it seems like the delay is the result of some exception-handling routine.
Behavior
The Intel manual specifies a few cases for shld where the behavior is undefined:
The destination operand can be a register or a memory location; the
source operand is a register. The count operand is an unsigned integer
that can be stored in an immediate byte or in the CL register. If the
count operand is CL, the shift count is the logical AND of CL and a
count mask. In non-64-bit modes and default 64-bit mode; only bits 0
through 4 of the count are used. This masks the count to a value
between 0 and 31. If a count is greater than the operand size, the
result is undefined. If the count is 1 or greater, the CF
flag is filled with the last bit shifted out of the destination
operand and the SF, ZF, and PF flags are set according to the value of
the result. For a 1-bit shift, the OF flag is set if a sign change
occurred; otherwise, it is cleared. For shifts greater than 1 bit, the
OF flag is undefined. If a shift occurs, the AF flag is undefined. If
the count operand is 0, the flags are not affected. If the count is
greater than the operand size, the flags are undefined.
Exceptions for shld:
In protected mode: #GP(0), #SS(0), #PF(fault-code), #AC(0), #UD
UPDATE: Gotcha:
First, the definition:
Instructions Retired — Event select C0H, Umask 00H. This event counts the number of instructions at retirement. For instructions that
consist of multiple micro-ops, this event counts the retirement of the
last micro-op of the instruction. An instruction with a REP prefix
counts as one instruction (not per iteration). Faults before the
retirement of the last micro-op of a multi-micro-op instruction are not
counted. This event does not increment under VM-exit conditions.
Counters continue counting during hardware interrupts, traps, and
inside interrupt handlers.
inc edi 127.000ms 470,000,000 (instructions retired)
From the above definition it's quite clear that either this instruction breaks into too many micro-ops or some interrupt handler is running at the same time.

Related

Any advantage of XOR AL,AL + MOVZX EAX, AL over XOR EAX,EAX?

I have some unknown C++ code that was compiled in Release build, so it's optimized. The point I'm struggling with is:
xor al, al
add esp, 8
cmp byte ptr [ebp+userinput], 31h
movzx eax, al
This is my understanding:
xor al, al ; set eax to 0x??????00 (clear last byte)
add esp, 8 ; for some unclear reason, set the stack pointer higher
cmp byte ptr [ebp+userinput], 31h ; set zero flag if user input was "1"
movzx eax, al ; set eax to AL and extend with zeros, so eax = 0x000000??
I don't care about lines 2 and 3. They might be there in this order for pipelining reasons and IMHO have nothing to do with EAX.
However, I don't understand why I would clear AL first, just to clear the rest of EAX later. The result will IMHO always be EAX = 0, so this could also be
xor eax, eax
instead. What is the advantage or "optimization" of that piece of code?
Some background info:
I will get the source code later. It's a short C++ console demo program, maybe 20 lines of code only, so nothing that I would call "complex" code. IDA shows a single loop in that program, but not around this piece. The Stud_PE signature scan didn't find anything, but it's likely the Visual Studio 2013 or 2015 compiler.
xor al,al is already slower than xor eax,eax on most CPUs. e.g. on Haswell/Skylake it needs an ALU uop and doesn't break the dependency on the old value of eax/rax. It's equally bad on AMD CPUs, or Atom/Silvermont. (Well, maybe not equally because AMD doesn't eliminate xor eax,eax at issue/rename, but it still has a false dependency which could serialize the new dependency chain with whatever used eax last).
On CPUs that do rename al separately from the rest of the register (Intel pre-IvyBridge), the xor al,al may still be recognized as a zeroing idiom, but unless you actively want to preserve the upper bytes of the register, the best way to zero al is xor eax,eax.
Doing movzx on top of that just makes it even worse.
I'm guessing your compiler somehow got confused and decided it needed a 1-byte zero, but then realized it needed to promote it to 32 bits. xor sets flags, so it couldn't xor-zero after the cmp, and it failed to notice that it could have just xor-zeroed eax before the cmp.
Either that or it's something like Jester's suggestion, where the movzx is a branch target. Even if that's the case, xor eax,eax would still have been better because zero-extending into eax follows unconditionally on this code path.
I'm curious what compiler produced this from what source.

how to force the use of cmov in gcc and VS

I have this simple binary search member function, where lastIndex, nIter and xi are class members:
uint32 scalar(float z) const
{
    uint32 lo = 0;
    uint32 hi = lastIndex;
    uint32 n = nIter;
    while (n--) {
        int mid = (hi + lo) >> 1;
        // defining this if-else assignment as below causes VS2015
        // to generate two cmov instructions instead of a branch
        if (z < xi[mid])
            hi = mid;
        if (!(z < xi[mid]))
            lo = mid;
    }
    return lo;
}
Both gcc and VS 2015 translate the inner loop with a code flow branch:
000000013F0AA778 movss xmm0,dword ptr [r9+rax*4]
000000013F0AA77E comiss xmm0,xmm1
000000013F0AA781 jbe Tester::run+28h (013F0AA788h)
000000013F0AA783 mov r8d,ecx
000000013F0AA786 jmp Tester::run+2Ah (013F0AA78Ah)
000000013F0AA788 mov edx,ecx
000000013F0AA78A mov ecx,r8d
Is there a way, without writing assembler inline, to convince them to use exactly 1 comiss instruction and 2 cmov instructions?
If not, can anybody suggest how to write a gcc assembler template for this?
Please note that I am aware that there are variations of the binary search algorithm where it is easy for the compiler to generate branch free code, but this is beside the question.
Thanks
As Matteo Italia already noted, this avoidance of conditional-move instructions is a quirk of GCC version 6. What he didn't notice, though, is that it applies only when optimizing for Intel processors.
With GCC 6.3, when targeting AMD processors (i.e., -march= any of k8, k10, opteron, amdfam10, btver1, bdver1, btver2, bdver2, bdver3, bdver4, znver1, and possibly others), you get exactly the code you want:
mov esi, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
jmp .L2
.L7:
lea edx, [rax+rsi]
mov r8, QWORD PTR [rdi+8]
shr edx
mov r9d, edx
movss xmm1, DWORD PTR [r8+r9*4]
ucomiss xmm1, xmm0
cmovbe eax, edx
cmova esi, edx
.L2:
dec ecx
cmp ecx, -1
jne .L7
rep ret
When optimizing for any generation of Intel processor, GCC 6.3 avoids conditional moves, preferring an explicit branch:
mov r9d, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
.L2:
sub ecx, 1
cmp ecx, -1
je .L6
.L8:
lea edx, [rax+r9]
mov rsi, QWORD PTR [rdi+8]
shr edx
mov r8d, edx
vmovss xmm1, DWORD PTR [rsi+r8*4]
vucomiss xmm1, xmm0
ja .L4
sub ecx, 1
mov eax, edx
cmp ecx, -1
jne .L8
.L6:
ret
.L4:
mov r9d, edx
jmp .L2
The likely justification for this optimization decision is that conditional moves are fairly inefficient on Intel processors. CMOV has a latency of 2 clock cycles on Intel processors compared to a 1-cycle latency on AMD. Additionally, while CMOV instructions are decoded into multiple µops (at least two, with no opportunity for µop fusion) on Intel processors because of the requirement that a single µop has no more than two input dependencies (a conditional move has at least three: the two operands and the condition flag), AMD processors can implement a CMOV with a single macro-operation since their design has no such limits on the input dependencies of a single macro-op. As such, the GCC optimizer is replacing branches with conditional moves only on AMD processors, where it might be a performance win—not on Intel processors and not when tuning for generic x86.
(Or, maybe the GCC devs just read Linus's infamous rant. :-)
Intriguingly, though, when you tell GCC to tune for the Pentium 4 processor (and you can't do this for 64-bit builds for some reason—GCC tells you that this architecture doesn't support 64-bit, even though there were definitely P4 processors that implemented EM64T), you do get conditional moves:
push edi
push esi
push ebx
mov esi, DWORD PTR [esp+16]
fld DWORD PTR [esp+20]
mov ebx, DWORD PTR [esi]
mov ecx, DWORD PTR [esi+4]
xor eax, eax
jmp .L2
.L8:
lea edx, [eax+ebx]
shr edx
mov edi, DWORD PTR [esi+8]
fld DWORD PTR [edi+edx*4]
fucomip st, st(1)
cmovbe eax, edx
cmova ebx, edx
.L2:
sub ecx, 1
cmp ecx, -1
jne .L8
fstp st(0)
pop ebx
pop esi
pop edi
ret
I suspect this is because branch misprediction is so expensive on Pentium 4, due to its extremely long pipeline, that the possibility of a single mispredicted branch outweighs any minor gains you might get from breaking loop-carried dependencies and the tiny amount of increased latency from CMOV. Put another way: mispredicted branches got a lot slower on P4, but the latency of CMOV didn't change, so this biases the equation in favor of conditional moves.
Tuning for later architectures, from Nocona to Haswell, GCC 6.3 goes back to its strategy of preferring branches over conditional moves.
So, although this looks like a major pessimization in the context of a tight inner loop (and it would look that way to me, too), don't be so quick to dismiss it out of hand without a benchmark to back up your assumptions. Sometimes, the optimizer is not as dumb as it looks. Remember, the advantage of a conditional move is that it avoids the penalty of branch mispredictions; the disadvantage of a conditional move is that it increases the length of a dependency chain and may require additional overhead because, on x86, only register→register or memory→register conditional moves are allowed (no constant→register). In this case, everything is already enregistered, but there is still the length of the dependency chain to consider. Agner Fog, in his Optimizing Subroutines in Assembly Language, gives us the following rule of thumb:
[W]e can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation ... when the other operand is chosen.
The second part of that doesn't apply here, but the first does. There is definitely a loop-carried dependency chain here, and unless you get into a really pathological case that disrupts branch prediction (which normally has a >90% accuracy), branching may actually be faster. In fact, Agner Fog continues:
Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves. For example, [this code]
// Example 12.16a. Calculate pow(x,n) where n is a positive integer
double x, xp, power;
unsigned int n, i;
xp = x;  power = 1.0;
for (i = n; i != 0; i >>= 1) {
    if (i & 1) power *= xp;
    xp *= xp;
}
works more efficiently with a branch inside the loop than with a conditional move, even if the branch is poorly predicted. This is because the floating point conditional move adds to the loop-carried dependency chain and because the implementation with a conditional move has to calculate all the power*xp values, even when they are not used.
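To see what that costs, here is the branchless counterpart of that loop (a sketch; whether the select actually compiles down to a cmov or a blend is up to the compiler). The select sits directly on the loop-carried dependency chain through power, and the multiply executes every iteration whether or not its result is kept:
// Branchless variant of Example 12.16a (sketch)
double pow_branchless(double x, unsigned int n)
{
    double xp = x, power = 1.0;
    for (unsigned int i = n; i != 0; i >>= 1) {
        double updated = power * xp;         // computed even when it is thrown away
        power = (i & 1) ? updated : power;   // hopefully a cmov/blend, not a branch
        xp *= xp;
    }
    return power;
}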
Another example of a loop-carried dependency chain is a binary search in a sorted list. If the items to search for are randomly distributed over the entire list then the branch prediction rate will be close to 50% and it will be faster to use conditional moves. But if the items are often close to each other so that the prediction rate will be better, then it is more efficient to use conditional jumps than conditional moves because the dependency chain is broken every time a correct branch prediction is made.
If the items in your list are actually random or close to random, then you'll be the victim of repeated branch-prediction failure, and conditional moves will be faster. Otherwise, in what is probably the more common case, branch prediction will succeed >75% of the time, such that you will experience a performance win from branching, as opposed to a conditional move that would extend the dependency chain.
It's hard to reason about this theoretically, and it's even harder to guess correctly, so you need to actually benchmark it with real-world numbers.
If your benchmarks confirm that conditional moves really would be faster, you have a couple of options:
Upgrade to a later version of GCC, like 7.1, that generates conditional moves in 64-bit builds even when targeting Intel processors.
Tell GCC 6.3 to optimize your code for AMD processors. (Maybe even just having it optimize one particular code module, so as to minimize the global effects.)
Get really creative (and ugly and potentially non-portable), writing some bit-twiddling code in C that does the comparison-and-set operation branchlessly. This might get the compiler to emit a conditional-move instruction, or it might get the compiler to emit a series of bit-twiddling instructions. You'd have to check the output to be sure, but if your goal is really just to avoid branch misprediction penalties, then either will work.
For example, something like this:
inline uint32 ConditionalSelect(bool condition, uint32 value1, uint32 value2)
{
    const uint32 mask = condition ? static_cast<uint32>(-1) : 0;
    uint32 result = (value1 ^ value2);  // get bits that differ between the two values
    result &= mask;                     // select based on condition
    result ^= value2;                   // condition ? value1 : value2
    return result;
}
which you would then call inside of your inner loop like so:
hi = ConditionalSelect(z < xi[mid], mid, hi);
lo = ConditionalSelect(z < xi[mid], lo, mid);
GCC 6.3 produces the following code for this when targeting x86-64:
mov rdx, QWORD PTR [rdi+8]
mov esi, DWORD PTR [rdi]
test edx, edx
mov eax, edx
lea r8d, [rdx-1]
je .L1
mov r9, QWORD PTR [rdi+16]
xor eax, eax
.L3:
lea edx, [rax+rsi]
shr edx
mov ecx, edx
mov edi, edx
movss xmm1, DWORD PTR [r9+rcx*4]
xor ecx, ecx
ucomiss xmm1, xmm0
seta cl // <-- begin our bit-twiddling code
xor edi, esi
xor eax, edx
neg ecx
sub r8d, 1 // this one's not part of our bit-twiddling code!
and edi, ecx
and eax, ecx
xor esi, edi
xor eax, edx // <-- end our bit-twiddling code
cmp r8d, -1
jne .L3
.L1:
rep ret
Notice that the inner loop is entirely branchless, which is exactly what you wanted. It may not be quite as efficient as two CMOV instructions, but it will be faster than chronically mispredicted branches. (It goes without saying that GCC and any other compiler will be smart enough to inline the ConditionalSelect function, which allows us to write it out-of-line for readability purposes.)
However, what I would definitely not recommend is that you rewrite any part of the loop using inline assembly. All of the standard reasons apply for avoiding inline assembly, but in this instance, even the desire for increased performance isn't a compelling reason to use it. You're more likely to confuse the compiler's optimizer if you try to throw inline assembly into the middle of that loop, resulting in sub-par code worse than what you would have gotten otherwise if you'd just left the compiler to its own devices. You'd probably have to write the entire function in inline assembly to get good results, and even then, there could be spill-over effects from this when GCC's optimizer tried to inline the function.
What about MSVC? Well, different compilers have different optimizers and therefore different code-generation strategies. Things can start to get really ugly really quickly if you have your heart set on cajoling all target compilers to emit a particular sequence of assembly code.
On MSVC 19 (VS 2015), when targeting 32-bit, you can write the code the way you did to get conditional-move instructions. But this doesn't work when building a 64-bit binary: you get branches instead, just like with GCC 6.3 targeting Intel.
There is a nice solution, though, that works well: use the conditional operator. In other words, if you write the code like this:
hi = (z < xi[mid]) ? mid : hi;
lo = (z < xi[mid]) ? lo : mid;
then VS 2013 and 2015 will always emit CMOV instructions, whether you're building a 32-bit or 64-bit binary, whether you're optimizing for size (/O1) or speed (/O2), and whether you're optimizing for Intel (/favor:Intel64) or AMD (/favor:AMD64).
This does fail to result in CMOV instructions back on VS 2010, but only when building 64-bit binaries. If you needed to ensure that this scenario also generated branchless code, then you could use the above ConditionalSelect function.
As said in the comments, there's no easy way to force what you are asking, although it seems that recent (>4.4) versions of gcc already optimize it like you said. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov.
The usual __builtin_expect probably cannot do much into pushing gcc to use cmov, given that cmov is generally convenient when it's difficult to predict the result of a comparison, while __builtin_expect tells the compiler what is the likely outcome - so you would be just pushing it in the wrong direction.
Still, if you find that this optimization is extremely important, your compiler version typically gets it wrong and for some reason you cannot help it with PGO, the relevant gcc assembly template should be something like:
__asm__ (
"comiss %[xi_mid],%[z]\n"
"cmovb %[mid],%[hi]\n"
"cmovae %[mid],%[lo]\n"
: [hi] "+r"(hi), [lo] "+r"(lo)
: [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
: "cc"
);
The used constraints are:
hi and lo are in the "write" variables list, with the +r constraint, as cmov can only work with registers as target operands, and we are conditionally overwriting just one of them (we cannot use =, as it implies that the value is always overwritten, so the compiler would be free to give us a different target register than the current one, and use it to refer to that variable after our asm block);
mid is in the "read" list, rm as cmov can take either a register or a memory operand as input value;
xi[mid] and z are in the "read" list;
z has the special x constraint that means "any SSE register" (comiss needs its non-memory operand to be an XMM register, and that's where z goes);
xi[mid] has xm, as the other comiss operand allows a memory operand; given the choice between z and xi[mid], I chose the latter as the better candidate for being taken directly from memory, given that z is already in a register (due to the System V calling convention - and is going to be cached between iterations anyway) and xi[mid] is used just in this comparison;
cc (the FLAGS register) is in the "clobber" list - we do clobber the flags and nothing else.
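Dropped into the original function, the whole thing would look something like this (an untested sketch, reusing the member names from the question's scalar()):
uint32 scalar(float z) const
{
    uint32 lo = 0;
    uint32 hi = lastIndex;
    uint32 n = nIter;
    while (n--) {
        uint32 mid = (hi + lo) >> 1;
        __asm__ (
            "comiss %[xi_mid],%[z]\n"
            "cmovb  %[mid],%[hi]\n"
            "cmovae %[mid],%[lo]\n"
            : [hi] "+r"(hi), [lo] "+r"(lo)
            : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
            : "cc"
        );
    }
    return lo;
}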

AND operator + addition faster than a subtraction

I've measured the execution time of following codes:
volatile int r = 768;
r -= 511;
volatile int r = 768;
r = (r & ~512) + 1;
assembly:
mov eax, DWORD PTR [rbp-4]
sub eax, 511
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
and ah, 253
add eax, 1
mov DWORD PTR [rbp-4], eax
the results:
Subtraction time: 141ns
AND + addition: 53ns
I've run the snippet multiple times with consistent results.
Can someone explain why this is the case, even though there is one more assembly instruction in the AND + addition version?
Your assertion that one snippet is faster than the other is mistaken.
If you look at the code:
mov eax, DWORD PTR [rbp-4]
....
mov DWORD PTR [rbp-4], eax
You'll see that the running time is dominated by the load/store to memory.
Even on Skylake this will take 2+2 = 4 cycles minimum.
The 1 cycle that the sub takes, or the 3*) cycles that the and bytereg / add full-reg pair takes, simply disappears into the memory access time.
On older processors such as Core2 it takes 5 cycles minimum to do a load/store pair to the same address.
It is difficult to time such short sequences of code and care should be taken to ensure you have the correct methodology.
You also need to remember that rdtsc is not accurate on Intel processors and runs out of order to boot.
If you use proper timing code like:
.... x 100,000 //stress the cpu using integer code in a 100,000x loop to ensure it's running at 100%
cpuid //serialize instruction to make sure rdtscp does not run early.
rdtscp //use the serializing version to ensure it does not run late
push eax
push edx
mov reg1,1000*1000 //time a minimum of 1,000,000 runs to ensure accuracy
loop:
... //insert code to time here
sub reg1,1 //don't use dec, it causes a partial register stall on the flags.
jnz loop //loop
//kernel mode only!
//mov eax,cr0 //reading and writing to cr0 serializes as well.
//mov cr0,eax
cpuid //serialization in user mode.
rdtscp //make sure to use the 'p' version of rdtsc.
push eax
push edx
pop 4x //retrieve the start and end times from the stack.
Run the timing code 100 times and take the lowest cycle count.
Now you'll have an accurate count to within 1 or 2 cycles.
You'll want to time an empty loop as well and subtract its time, so that you can see the net time spent executing the instructions of interest.
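If you'd rather stay in C++, the same pattern looks roughly like this with compiler intrinsics (a sketch of the methodology above, assuming GCC/Clang headers; not a polished benchmark harness):
#include <cstdint>
#include <cpuid.h>       // __get_cpuid (GCC/Clang)
#include <x86intrin.h>   // __rdtscp

static inline uint64_t serialized_tsc()
{
    unsigned a, b, c, d, aux;
    __get_cpuid(0, &a, &b, &c, &d);  // cpuid serializes so the timestamp can't be read early
    return __rdtscp(&aux);           // the 'p' variant waits for earlier instructions
}

uint64_t time_snippet()
{
    uint64_t best = ~0ull;
    for (int rep = 0; rep < 100; ++rep) {           // run 100x, keep the lowest count
        uint64_t t0 = serialized_tsc();
        for (int i = 0; i < 1000 * 1000; ++i) {     // a minimum of 1,000,000 runs
            // ... insert the code to time here ...
            __asm__ volatile("" ::: "memory");      // keep the empty loop from being removed
        }
        uint64_t t1 = serialized_tsc();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;   // still subtract the cost of an empty loop measured the same way
}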
If you do this you'll discover that add and sub run at exactly the same speed, just like they do (and did) in every x86/x64 CPU since the 8086.
This, of course, is also what Agner Fog, the Intel CPU manuals, the AMD cpu manuals, and just about any other source available say.
*) and ah,value takes 1 cycle, then the CPU stalls for 1 cycle due to the partial register write, and the add eax,value takes another cycle.
Optimized code
sub DWORD PTR [rbp-4],511
This might be faster if you don't need to reuse the value elsewhere; the latency is still a slow 5 cycles, but the reciprocal throughput is 1 cycle, which is much better than either of your versions.
The full machine code is
8b 45 fc mov eax,DWORD PTR [rbp-0x4]
2d ff 01 00 00 sub eax,0x1ff
89 45 fc mov DWORD PTR [rbp-0x4],eax
vs
8b 45 fc mov eax,DWORD PTR [rbp-0x4]
80 e4 fd and ah,0xfd
83 c0 01 add eax,0x1
89 45 fc mov DWORD PTR [rbp-0x4],eax
This means the code for the second operation is in fact only one byte longer (12 bytes vs 11). Most likely the CPU fetches code in larger units than bytes, so fetching isn't much slower. It can also decode multiple instructions at the same time, so the first sample doesn't have an advantage there either. Executing a single add, and, or sub each takes a single ALU pass, so each takes only one clock on a single execution unit. That's a 1 ns advantage for your sub on a 1 GHz CPU.
So basically both operations are more or less the same. The difference may be attributed to other factors. Maybe the memory cell at rbp-0x4 is still in the L1 cache before you run the second code snippet. Or the instructions for the first snippet are located somewhere less favourable in memory. Or the CPU was able to run the second snippet speculatively before you started measuring, etc.; you would need to know how you measured the speed to decide that.

Why does ICC unroll this loop in this way and use lea for arithmetic?

Looking at the ICC 17 generated code for iterating over a std::unordered_map<> (using https://godbolt.org) left me very confused.
I distilled down the example to this:
long count(void** x)
{
    long i = 0;
    while (*x)
    {
        ++i;
        x = (void**)*x;
    }
    return i;
}
Compiling this with ICC 17, with the -O3 flag, leads to the following disassembly:
count(void**):
xor eax, eax #6.10
mov rcx, QWORD PTR [rdi] #7.11
test rcx, rcx #7.11
je ..B1.6 # Prob 1% #7.11
mov rdx, rax #7.3
..B1.3: # Preds ..B1.4 ..B1.2
inc rdx #7.3
mov rcx, QWORD PTR [rcx] #7.11
lea rsi, QWORD PTR [rdx+rdx] #9.7
lea rax, QWORD PTR [-1+rdx*2] #9.7
test rcx, rcx #7.11
je ..B1.6 # Prob 18% #7.11
mov rcx, QWORD PTR [rcx] #7.11
mov rax, rsi #9.7
test rcx, rcx #7.11
jne ..B1.3 # Prob 82% #7.11
..B1.6: # Preds ..B1.3 ..B1.4 ..B1.1
ret #12.10
Compared to the obvious implementation (which gcc and clang use, even for -O3), it seems to do a few things differently:
It unrolls the loop by two before looping back - however, there is a conditional jump in the middle of it all.
It uses lea for some of the arithmetic
It keeps a counter (inc rdx) for every two iterations of the while loop, and immediately computes the corresponding counters for every iteration (into rax and rsi)
What are the potential benefits to doing all this? I assume it may have something to do with scheduling?
Just for comparison, this is the code generated by gcc 6.2:
count(void**):
mov rdx, QWORD PTR [rdi]
xor eax, eax
test rdx, rdx
je .L4
.L3:
mov rdx, QWORD PTR [rdx]
add rax, 1
test rdx, rdx
jne .L3
rep ret
.L4:
rep ret
This isn't a great example because the loop trivially bottlenecks on pointer-chasing latency, not uop throughput or any other kind of loop-overhead. But there can be cases where having fewer uops helps an out-of-order CPU see farther ahead, maybe. Or we can just talk about the optimizations to the loop structure and pretend they matter, e.g. for a loop that did something else.
Unrolling is potentially useful in general, even when the loop trip-count is not computable ahead of time. (e.g. in a search loop like this one, which stops when it finds a sentinel). A not-taken conditional branch is different from a taken branch, since it doesn't have any negative impact on the front-end (when it predicts correctly).
Basically ICC just did a bad job unrolling this loop. The way it uses LEA and MOV to handle i is pretty braindead, since it used more uops than two inc rax instructions. (Although it does make the critical path shorter, on IvB and later which have zero-latency mov r64, r64, so out-of-order execution can get ahead on running those uops).
Of course, since this particular loop bottlenecks on the latency of pointer-chasing, you're getting at best a long-chain throughput of one per 4 clocks (L1 load-use latency on Skylake, for integer registers), or one per 5 clocks on most other Intel microarchitectures. (I didn't double-check these latencies; don't trust those specific numbers, but they're about right).
IDK if ICC analyses loop-carried dependency chains to decide how to optimize. If so, it should probably have just not unrolled at all, if it knew it was doing a poor job when it did try to unroll.
For a short chain, out-of-order execution might be able to get started on running something after the loop, if the loop-exit branch predicts correctly. In that case, it is useful to have the loop optimized.
Unrolling also throws more branch-predictor entries at the problem. Instead of one loop-exit branch with a long pattern (e.g. not-taken after 15 taken), you have two branches. For the same example, one that's never taken, and one that's taken 7 times then not-taken the 8th time.
Here's what a hand-written unrolled-by-two implementation looks like:
Fix up i in the loop-exit path for one of the exit points, so you can handle it cheaply inside the loop.
count(void**):
xor eax, eax # counter
mov rcx, QWORD PTR [rdi] # *x
test rcx, rcx
je ..B1.6
.p2align 4 # mostly to make it more likely that the previous test/je doesn't decode in the same block at the following test/je, so it doesn't interfere with macro-fusion on pre-HSW
.loop:
mov rcx, QWORD PTR [rcx]
test rcx, rcx
jz .plus1
mov rcx, QWORD PTR [rcx]
add rax, 2
test rcx, rcx
jnz .loop
..B1.6:
ret
.plus1: # exit path for odd counts
inc rax
ret
This makes the loop body 5 fused-domain uops if both TEST/JCC pairs macro-fuse. Haswell can make two fusions in a single decode group, but earlier CPUs can't.
gcc's implementation is only 3 uops, which is less than the issue width of the CPU. See this Q&A about small loops issuing from the loop buffer. No CPU can actually execute/retire more than one taken branch per clock, so it's not easily possible to test how CPUs issue loops with less than 4 uops, but apparently Haswell can issue a 5-uop loop at one per 1.25 cycles. Earlier CPUs might only issue it at one per 2 cycles.
There's no definite answer to why it does it, as it is a proprietary compiler. Only Intel knows why. That said, the Intel compiler is often more aggressive in loop optimization. That does not mean it is better. I have seen situations where Intel's aggressive inlining led to worse performance than clang/gcc. In those cases, I had to explicitly forbid inlining at some call sites. Similarly, sometimes it is necessary to forbid unrolling via pragmas in Intel C++ to get better performance.
lea is a particularly useful instruction. It allows one shift, two additions, and one move, all in just one instruction. It is much faster than doing those operations separately. However, it does not always make a difference, and if lea is used only for an addition or a move, it may or may not be better. So you can see that at #7.11 it uses a mov, while in the next two lines (#9.7) lea is used to do an addition plus a move, and an addition, a shift, plus a move.
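In this particular loop the two lea instructions tagged #9.7 are doing exactly that: they materialize both possible values of i from the per-trip counter in rdx. Roughly, as a sketch with the registers spelled out as variables:
// What the two lea instructions compute each trip around the unrolled loop:
//   lea rsi, QWORD PTR [rdx+rdx]    -> rsi = rdx * 2      (the count if the loop exits at its second test)
//   lea rax, QWORD PTR [-1+rdx*2]   -> rax = rdx * 2 - 1  (the count if the loop exits at its first test)
void compute_counters(long rdx, long& rsi, long& rax)
{
    rsi = rdx + rdx;      // an addition and a move in one instruction, without touching flags
    rax = rdx * 2 - 1;    // a shift, an addition and a move, again in a single instruction
}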
Still, I don't see an obvious benefit to it here.

Why does this function push RAX to the stack as the first operation?

In the assembly of the C++ source below, why is RAX pushed to the stack?
RAX, as I understand it from the ABI, could contain anything from the calling function. But we save it here, and then later move the stack back by 8 bytes. So the RAX on the stack is, I think, only relevant for the std::__throw_bad_function_call() operation ... ?
The code:
#include <functional>
void f(std::function<void()> a)
{
    a();
}
Output, from gcc.godbolt.org, using Clang 3.7.1 -O3:
f(std::function<void ()>): # #f(std::function<void ()>)
push rax
cmp qword ptr [rdi + 16], 0
je .LBB0_1
add rsp, 8
jmp qword ptr [rdi + 24] # TAILCALL
.LBB0_1:
call std::__throw_bad_function_call()
I'm sure the reason is obvious, but I'm struggling to figure it out.
Here's a tailcall without the std::function<void()> wrapper for comparison:
void g(void(*a)())
{
    a();
}
The trivial result:
g(void (*)()): # #g(void (*)())
jmp rdi # TAILCALL
The 64-bit ABI requires that the stack is aligned to 16 bytes before a call instruction.
call pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call.
(The ABI design choice of requiring alignment before a call instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)
Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine. (See the comments).
The reason push rax is there is to align the stack back to a 16-byte boundary to conform to the 64-bit System V ABI in the case where je .LBB0_1 branch is taken. The value placed on the stack isn't relevant. Another way would have been subtracting 8 from RSP with sub rsp, 8. The ABI states the alignment this way:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is
passed on stack) byte boundary. In other words, the value (%rsp + 8) is always
a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.
Prior to the call to function f the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f, the return address was placed on the stack, misaligning the stack by 8. push rax is a simple way of subtracting 8 from RSP and realigning it again. If the branch is taken to call std::__throw_bad_function_call(), the stack will be properly aligned for that call to work.
In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8 instruction is executed. The return address of the CALLER to function f will now be back at the top of the stack and the stack will be misaligned by 8 again. This is what we want because a TAIL CALL is being made with jmp qword ptr [rdi + 24] to transfer control to the function a. This will JMP to the function not CALL it. When function a does a RET it will return directly back to the function that called f.
At a higher optimization level I would have expected that the compiler should be smart enough to do the comparison, and let it fall through directly to the JMP. What is at label .LBB0_1 could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call() works properly.
As #CodyGray pointed out, if you use GCC (not CLANG) with optimization level of -O2 or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:
f(std::function<void ()>):
cmp QWORD PTR [rdi+16], 0 # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B],
je .L7 #,
jmp [QWORD PTR [rdi+24]] # MEM[(const struct function *)a_2(D)]._M_invoker
.L7:
sub rsp, 8 #,
call std::__throw_bad_function_call() #
This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer may handle this code generation better than CLANG.
In other cases, clang typically fixes up the stack before returning with a pop rcx.
Using push has an upside for efficiency in code-size (push is only 1 byte vs. 4 bytes for sub rsp, 8), and also in uops on Intel CPUs. (No need for a stack-sync uop, which you'd get if you access rsp directly because the call that brought us to the top of the current function makes the stack engine "dirty").
This long and rambling answer discusses the worst-case performance risks of using push rax / pop rcx for aligning the stack, and whether or not rax and rcx are good choices of register. (Sorry for making this so long.)
(TL:DR: looks good, the possible downside is usually small and the upside in the common case makes this worth it. Partial-register stalls could be a problem on Core2/Nehalem if al or ax are "dirty", though. No other 64-bit capable CPU has big problems (because they don't rename partial regs, or merge efficiently), and 32-bit code needs more than 1 extra push to align the stack by 16 for another call unless it was already saving/restoring some call-preserved regs for its own use.)
Using push rax instead of sub rsp, 8 introduces a dependency on the old value of rax, so you'd think it might slow things down if the value of rax is the result of a long-latency dependency chain (and/or a cache miss).
e.g. the caller might have done something slow with rax that's unrelated to the function args, like var = table[ x % y ]; var2 = foo(x);
# example caller that leaves RAX not-ready for a long time
mov rdi, rax ; prepare function arg
div rbx ; very high latency
mov rax, [table + rdx] ; rax = table[ value % something ], may miss in cache
mov [rsp + 24], rax ; spill the result.
call foo ; foo uses push rax to align the stack
Fortunately out-of-order execution will do a good job here.
The push doesn't make the value of rsp dependent on rax. (It's either handled by the stack engine, or on very old CPUs push decodes to multiple uops, one of which updates rsp independently of the uops that store rax. Micro-fusion of the store-address and store-data uops let push be a single fused-domain uop, even though stores always take 2 unfused-domain uops.)
As long as nothing depends on the output of push rax / pop rcx, it's not a problem for out-of-order execution. If push rax has to wait because rax isn't ready, it won't cause the ROB (ReOrder Buffer) to fill up and eventually block the execution of later independent instructions. The ROB would fill up even without the push because the instruction that's slow to produce rax, and whatever instruction in the caller consumes rax before the call, are even older, and can't retire either until rax is ready. Retirement has to happen in-order in case of exceptions / interrupts.
(I don't think a cache-miss load can retire before the load completes, leaving just a load-buffer entry. But even if it could, it wouldn't make sense to produce a result in a call-clobbered register without reading it with another instruction before making a call. The caller's instruction that consumes rax definitely can't execute/retire until our push can do the same.)
When rax does become ready, push can execute and retire in a couple cycles, allowing later instructions (which were already executed out of order) to also retire. The store-address uop will have already executed, and I assume the store-data uop can complete in a cycle or two after being dispatched to the store port. Stores can retire as soon as the data is written to the store buffer. Commit to L1D happens after retirement, when the store is known to be non-speculative.
So even in the worst case, where the instruction that produces rax was so slow that it led to the ROB filling up with independent instructions that are mostly already executed and ready to retire, having to execute push rax only causes a couple extra cycles of delay before independent instructions after it can retire. (And some of the caller's instructions will retire first, making a bit of room in the ROB even before our push retires.)
A push rax that has to wait will tie up some other microarchitectural resources, leaving one fewer entry for finding parallelism between other later instructions. (An add rsp,8 that could execute would only be consuming a ROB entry, and not much else.)
It will use up one entry in the out-of-order scheduler (aka Reservation Station / RS). The store-address uop can execute as soon as there's a free cycle, so only the store-data uop will be left. The pop rcx uop's load address is ready, so it should dispatch to a load port and execute. (When the pop load executes, it finds that its address matches the incomplete push store in the store buffer (aka memory order buffer), so it sets up the store-forwarding which will happen after the store-data uop executes. This probably consumes a load buffer entry.)
Even an old CPU like Nehalem has a 36-entry RS, vs. 54 in Sandybridge, or 97 in Skylake. Keeping 1 entry occupied for longer than usual in rare cases is nothing to worry about. The alternative of executing two uops (stack-sync + sub) is worse.
(off topic)
The ROB is larger than the RS, 128 (Nehalem), 168 (Sandybridge), 224 (Skylake). (It holds fused-domain uops from issue to retirement, vs. the RS holding unfused-domain uops from issue to execution). At 4 uops per clock max frontend throughput, that's over 50 cycles of delay-hiding on Skylake. (Older uarches are less likely to sustain 4 uops per clock for as long...)
ROB size determines the out-of-order window for hiding a slow independent operation. (Unless register-file size limits are a smaller limit). RS size determines the out-of-order window for finding parallelism between two separate dependency chains. (e.g. consider a 200 uop loop body where every iteration is independent, but within each iteration it's one long dependency chain without much instruction-level parallelism (e.g. a[i] = complex_function(b[i])). Skylake's ROB can hold more than 1 iteration, but we can't get uops from the next iteration into the RS until we're within 97 uops of the end of the current one. If the dep chain wasn't so much larger than RS size, uops from 2 iterations could be in flight most of the time.)
There are cases where push rax / pop rcx can be more dangerous:
The caller of this function knows that rcx is call-clobbered, so won't read the value. But it might have a false dependency on rcx after we return, like bsf rcx, rax / jnz or test eax,eax / setz cl. Recent Intel CPUs don't rename low8 partial registers anymore, so setcc cl has a false dep on rcx. bsf actually leaves its destination unmodified if the source is 0, even though Intel documents it as an undefined value. AMD documents leave-unmodified behaviour.
The false dependency could create a loop-carried dep chain. On the other hand, a false dependency can do that anyway, if our function wrote rcx with instructions dependent on its inputs.
It would be worse to use push rbx/pop rbx to save/restore a call-preserved register that we weren't going to use. The caller likely would read it after we return, and we'd have introduced a store-forwarding latency into the caller's dependency chain for that register. (Also, it's maybe more likely that rbx would be written right before the call, since anything the caller wanted to keep across the call would be moved to call-preserved registers like rbx and rbp.)
On CPUs with partial-register stalls (Intel pre-Sandybridge), reading rax with push could cause a stall of 2-3 cycles on Core2 / Nehalem if the caller had done something like setcc al before the call. Sandybridge doesn't stall while inserting a merging uop, and Haswell and later don't rename low8 registers separately from rax at all.
It would be nice to push a register that was less likely to have had its low8 used. If compilers tried to avoid REX prefixes for code-size reasons, they'd avoid dil and sil, so rdi and rsi would be less likely to have partial-register issues. But unfortunately gcc and clang don't seem to favour using dl or cl as 8-bit scratch registers, using dil or sil even in tiny functions where nothing else is using rdx or rcx. (Although lack of low8 renaming in some CPUs means that setcc cl has a false dependency on the old rcx, so setcc dil is safer if the flag-setting was dependent on the function arg in rdi.)
pop rcx at the end "cleans" rcx of any partial-register stuff. Since cl is used for shift counts, and functions do sometimes write just cl even when they could have written ecx instead. (IIRC I've seen clang do this. gcc more strongly favours 32-bit and 64-bit operand sizes to avoid partial-register issues.)
push rdi would probably be a good choice in a lot of cases, since the rest of the function also reads rdi, so introducing another instruction dependent on it wouldn't hurt. It does stop out-of-order execution from getting the push out of the way if rax is ready before rdi, though.
Another potential downside is using cycles on the load/store ports. But they are unlikely to be saturated, and the alternative is uops for the ALU ports. With the extra stack-sync uop on Intel CPUs that you'd get from sub rsp, 8, that would be 2 ALU uops at the top of the function.