imul then mov vs mov then imul - any difference?

If I compile the following C++ program:
int baz(int x) { return x * x; }
in clang 15, I get:
baz(int):
mov eax, edi
imul eax, edi
ret
while gcc 12.2 gives me:
baz(int):
imul edi, edi
mov eax, edi
ret
(See this on GodBolt)
Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.

Do mov then imul because it's better with mov-elimination, and not worse anywhere for any other reason.
This is true in general for mov/and, mov/sub, etc., as long as you don't need the original value afterwards. If you do need it, it can sometimes be better to mov a copy and then modify the original, which keeps the mov latency off the critical path on CPUs without mov-elimination. (mov/add or a small shift should normally be a single lea.)
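As a rough illustration of the patterns mentioned above, here is a minimal sketch of C++ functions that can be pasted into a compiler explorer. The function names are made up for this example, and the instruction sequences named in the comments are what GCC and clang typically emit at -O2 for x86-64, not something guaranteed.

```cpp
// Hypothetical example functions; inspect their asm on a compiler explorer.

// The original x is not needed afterwards: the result can be produced by
// copying to the return register and overwriting the copy (mov then imul).
int square(int x) { return x * x; }

// Copy-and-modify with a bitwise op: typically a mov followed by and.
int mask(int x, int y) { return x & y; }

// Copy-and-add and copy-and-small-shift: compilers normally use a single
// lea, which writes a different destination without a separate mov.
int add(int x, int y) { return x + y; }
int times4(int x) { return x * 4; }
```

Comparing the output for square and mask against add and times4 shows where a register copy is actually needed.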
CPU with mov-elimination
mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See How do *move elimination* slots work in Intel CPU?
All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)
It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.
To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.
CPU without mov-elimination
mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.
(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)
But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favour mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.
Things that don't matter
On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop.
Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) rename low-8 registers like DIL and DL separate from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.
Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (Why doesn't GCC use partial registers?)
But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

Related

Strange uses of movzx by Clang and GCC

I know that movzx can be used for dependency breaking, but I stumbled on some movzx uses by both Clang and GCC that I really can't see what good they are for. Here's a simple example I tried on Godbolt compiler explorer:
#include <stdint.h>
int add2bytes(uint8_t* a, uint8_t* b) {
return uint8_t(*a + *b);
}
with GCC 12 -O3:
add2bytes(unsigned char*, unsigned char*):
movzx eax, BYTE PTR [rsi]
add al, BYTE PTR [rdi]
movzx eax, al
ret
If I understand correctly, the first movzx here breaks dependency on previous eax value, but what is the second movzx doing? I don't think there's any dependency it can break, and it shouldn't affect the result either.
with clang 14 -O3, it's even more weird:
add2bytes(unsigned char*, unsigned char*): # #add2bytes(unsigned char*, unsigned char*)
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al
ret
It uses mov where movzx seems more reasonable, and then zero extends al to eax, but wouldn't it be much better to do movzx at the start?
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1
GCC generates both sensible and strange movzx, and Clang's use of mov r8 m and movzx just makes no sense to me. I also tried adding -march=skylake to make sure this isn't a feature for really old architectures, but the generated assembly looks more or less the same.
The closest post I have found is https://stackoverflow.com/a/64915219/14730360 where they showed similar movzx uses that seem useless and/or out of place.
Do the compilers really use movzx poorly here, or am I missing something?
Edit: I have opened bug reports for Clang and GCC:
https://github.com/llvm/llvm-project/issues/56498
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277
Temporary workarounds using inline assembly:
https://godbolt.org/z/7qob8G3j7
#define addb(a, b) asm (\
"addb %1, %b0"\
: "+r"(a) : "mi"(b))
int add2bytes(uint8_t* a, uint8_t* b) {
int ret = *a;
addb(ret, *b);
return ret;
}
Now Clang -O3 produces:
add2bytes(unsigned char*, unsigned char*): # #add2bytes(unsigned char*, unsigned char*)
movzx eax, byte ptr [rdi]
add al, byte ptr [rsi]
ret
Both compilers are doing a poor job here, but clang's code is especially bad: it has no real upside anywhere, and an easily avoidable downside on everything except decade-old Intel CPUs (which rename low-8 partial registers).
The optimal asm is what you suggest, movzx load, then byte add, leaving a uint8_t result in the low byte, correctly zero-extended to int as required by the C semantics. (Thanks for reporting it upstream: https://github.com/llvm/llvm-project/issues/56498 - I commented there about movzx being a good idea for byte loads in general, even when LLVM doesn't need the result zero-extended.)
A movzx is necessary somewhere, but it can be in the initial load. (A movzx is generally a good idea for a byte load anyway, to avoid a false dependency on the old RAX; clang's choice to save 1 byte is probably not a good one even when it doesn't end up needing a separate movzx right after.)
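To see why a zero-extension has to happen somewhere, it may help to spell out the C++ semantics of the question's function step by step. This is only a sketch; the name add2bytes_steps and the const qualifiers are assumptions made for illustration.

```cpp
#include <cstdint>

// Same computation as the question's add2bytes, split into explicit steps.
int add2bytes_steps(const std::uint8_t* a, const std::uint8_t* b) {
    int wide = *a + *b;                                  // both operands promote to int: 0..510
    std::uint8_t low = static_cast<std::uint8_t>(wide);  // keep only the low 8 bits (wraps mod 256)
    return low;                                          // converts back to int: value is 0..255
}
```

For inputs 200 and 100 the sum 300 wraps to 44, so the upper 24 bits of the returned int must be zero. A byte add into a register whose upper bits are already zero (from a movzx load) satisfies that without a second movzx.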
There are basically three relevant behaviours here, among x86-64 CPUs.
Core 2 / Nehalem (the 64-bit capable members of the P6 family): AL renamed separately from RAX if you write AL. A later read of EAX will stall the front-end for about 3 cycles while inserting a merge uop. Less bad than earlier P6-family, but still a significant penalty to avoid. But these CPUs are pretty obsolete, and not something GCC's -mtune=generic should put much weight on for the latest GCC. (Especially given that current nightly GCC's behaviour won't get baked into widely used binary packages by most stable-release distros for another year or more, probably.)
Returning an int when the last instruction wrote al will likely lead to a penalty when the caller reads EAX. But mov al, [rdi] can run without any false dependency or merging cost.
Sandybridge and maybe Ivy Bridge: AL still renamed separately, but a merging uop can be inserted without any stalling, in a cycle with other uops.
mov al, [rdi] still has no false dep or merging uop. But a later read of EAX that triggers a merging uop (to merge the add al result with the high bytes of RAX from movzx eax, [rdi]) will get inserted just as cheaply as if we'd put a movzx eax, al in the machine code. (If the upper bytes of RAX are all zero, merge or extend are equivalent.)
Haswell and later (and maybe IvB), and all other x86 vendors, and low-power CPUs from Intel like Silvermont-family: no partial register renaming at all. (Except for AH/BH/CH/DH on Intel SnB-family). The last CPU not in this category is nearly a decade old, and the last CPU with major penalties (P6-family) is over a decade old.
mov al, [rdi] sucks: false dependency and costs an ALU uop in the back-end to merge. So it's extra load latency in the critical path through whatever stored the memory operand.
Reading EAX after writing AL has zero penalty; that's not a special case at all; the merging happened when you wrote AL.
GCC's code is a sensible tradeoff between Core2 / Nehalem vs. modern CPUs: load with movzx to avoid a false dep writing a partial reg. And a final movzx to avoid a partial-register stall in the caller.
But if it's going to do that, it could hurt modern Intel less by picking EDX or ECX as the temporary, since Intel can do zero-latency mov-elimination on movzx r32, r8, but not within the same register. It still costs a front-end uop so it's not free for throughput, only latency and back-end ports. This is a persistent missed-optimization; I don't think GCC or clang know to look for that; they commonly zero-extend 32->64 with mov esi,esi on a function arg, for example.
movzx edx, byte ptr [rdi]
add dl, [rsi]
movzx eax, dl # mov-elimination possible on IvB and later (except Ice Lake with updated microcode which breaks mov-elim).
If optimizing specifically for Core2 / Nehalem, you'd do this:
xor eax, eax # off the critical path, avoids partial-reg stalls for later reads of EAX after writing AL
mov al, [rdi]
add al, [rsi]
That's not bad on later CPUs, although the mov al, [rdi] would still be a micro-fused load+ALU uop, so it has extra load latency, and takes an extra slot in the scheduler and a cycle on a back-end execution port. So 3 back-end uops, up from 2 in IvB and later with eliminated movzx if you pick different registers.
GCC's choice to use movzx because of Core2/Nehalem is highly conservative at this point; probably -mtune=generic in GCC12 shouldn't care about P6-family partial-register stalls since those CPUs are well over a decade old. Especially in 64-bit code where the worst case is Core2/Nehalem, not the even longer stalls with no merging uop on earlier P6-family. (And 64-bit code is more likely to be run on newer CPUs; one of the use-cases for -m32 is to make code for old 32-bit-only CPUs.)
It might well be an intentional tuning choice that needs updating. It's definitely a missed optimization with -march / -mtune= k8 through znver3, or silvermont-family, or sandybridge or newer.
(Also note that some choices which should differ based on -mtune setting actually don't. GCC just has one way it always does some things, and adding hooks to make it differ based on a tuning flag hasn't been done. Clang is the same way. e.g. -mtune=core2 still doesn't know to avoid partial-register stalls!)
Clang normally lives dangerously writing partial registers and otherwise ignoring false dependencies when they're not visibly loop-carried within a single function (which can bite it in the ass). This can save a whole instruction when it skips xor-zeroing, but saving just 1 byte doesn't seem worth it in general. It's a false dependency and means the mov load decodes to load + ALU merge uops (to merge a new low byte into the existing 64-bit register).
Looks like clang just did its usual thing of loading 8-bit values into 8-bit registers ignoring movzx, then realized at the end it needed to zero-extend the result.
An optimization pass looking for a chance to fold zero-extension (after narrow math) into an earlier load would be useful. And/or otherwise look for ways to prove that values are already zero-extended, if it doesn't do that.
Probably in general better to start doing narrow loads with movzx so that's more normally the case.
You might want to report a missed-optimization bug, especially for clang. Their code-gen is already a huge middle finger toward P6-family most of the time with partial-register usage, so they'd probably be interested in trying to generate the 2-instruction version. https://github.com/llvm/llvm-project/issues
Also https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (use the keyword missed-optimization for GCC bugs. Feel free to link this stack overflow post, and/or quote any of my comments if you want, as well as a Godbolt link. GCC devs prefer AT&T syntax for x86 discussion / bugs.)
See also:
Why doesn't GCC use partial registers?
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
https://agner.org/optimize/ (especially his microarch guide re: partial-register details for P6-family CPUs. Last I looked, the guide incorrectly said Haswell doesn't have zero-latency movzx eax, dl, and that AH-merging was free; see my Q&A about HSW/SKL. But Agner's guide is accurate AFAIK for earlier CPUs.)
https://uops.info/ (front-end vs. back-end vs. latency costs for different instructions)
What is the best way to set a register to zero in x86 assembly: xor, mov or and? - including the part about avoiding partial-register stalls on P6, how xor eax,eax sets some kind of internal EAX=AL flag.
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1 GCC generates both sensible and strange movzx, and Clang's use of mov and movzx just makes no sense to me.
clang's mov ecx, edx zero-extension from 32 to 64 instead of from 8 to 64 is because it depends on an unofficial extension to the x86-64 SysV calling convention, that narrow args are extended to 32-bit. AMD Zen CPUs can do mov-elimination on mov ecx, edx but not for movzx-byte, so this is actually more efficient, as well as saving code-size.
(GCC and clang both make callers that respect this unofficial calling-convention feature, but only clang makes callees that depend on it. ICC doesn't do either so is not ABI-compatible with clang.)
Extension to intptr_t is of course necessary for all narrower args if you're going to index an array with one. (In abstract C terms, this is just part of using the value for pointer math). High garbage is allowed in at least the high 32 bits of the 64-bit register.
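The point about extending narrow args before indexing can be illustrated with a small function. This is a sketch with a hypothetical name (load_at); how the extension is done (movzx from 8 bits, or a 32-bit mov relying on the caller's extension) depends on the compiler, as described above.

```cpp
#include <cstdint>

// Hypothetical helper: the 8-bit index has to be widened to pointer width
// before it can take part in address arithmetic.
int load_at(const int* table, std::uint8_t idx) {
    return table[idx];  // idx is zero-extended, then scaled by sizeof(int)
}
```

Looking at the asm for this function with GCC and with clang is one way to check which compiler relies on the caller having already extended the argument to 32 bits.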
The clang bit actually seems reasonable. You get a partial register stall if you write to al and then read from eax. Using movzx breaks this partial register stall.
The initial mov to al has no dependencies on existing values of eax (due to register renaming), so the dependencies are just the unavoidable dependencies (wait for [rsi], wait for [rdi], wait for add to complete before zero-extending).
In other words, the top 24 bits must be zeroed and the lower 8 bits must be calculated, but the two actions can be done in either order. clang just chooses to add first, zero later.
[EDIT]
As for GCC, it seems a particularly bad choice. If it had chosen bl as the temporary register, that last movzx would be zero-latency on Haswell/SkyLake, but move elimination does not work on al to eax.
The final MOVZX is mandated by the fact that the function returns an int, extended from a byte. It must be there in the clang version, but with gcc one is extra.

Why are clang and GCC not using xchg to implement std::swap?

I have the following code:
char swap(char reg, char* mem) {
std::swap(reg, *mem);
return reg;
}
I expected this to compile down to:
swap(char, char*):
xchg dil, byte ptr [rsi]
mov al, dil
ret
But what it actually compiles to is (at -O3 -march=haswell -std=c++20):
swap(char, char*):
mov al, byte ptr [rsi]
mov byte ptr [rsi], dil
ret
See here for a live demo.
From the documentation of xchg, the first form should be perfectly possible:
XCHG - Exchange Register/Memory with Register
Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose registers or a register and a memory location.
So is there any particular reason why it's not possible for the compiler to use xchg here? I have tried other examples too, such as swapping pointers, swapping three operands, swapping types other than char but I never get an xchg in the compile output. How come?
TL:DR: because compilers optimize for speed, not for names that sound similar. There are lots of other terrible ways they also could have implemented it, but chose not to.
xchg with mem has an implicit lock prefix (on 386 and later) so it's horribly slow. You always want to avoid it unless you need an atomic exchange, or are optimizing completely for code-size without caring at all for performance, in cases where you do want the result in the same register as the original value. Sometimes seen in naive (performance oblivious) or code-golfed hand-written Bubble Sort as part of swapping 2 memory locations.
Possibly clang -Oz could go that crazy, IDK, but hopefully wouldn't in this case because your xchg way is larger code size, needing a REX prefix on both instructions to access DIL, vs. the 2-mov way being a 2-byte and a 3-byte instruction. clang -Oz does do stuff like push 1 / pop rax instead of mov eax, 1 to save 2 bytes of code size.
GCC -Os won't use xchg for swaps that don't need to be atomic because -Os still cares some about speed.
Also, IDK why you would think xchg + dependent mov would be faster or a better choice than two independent mov instructions that can run in parallel. (The store buffer makes sure that the store is correctly ordered after the load, regardless of which uop finds its execution port free first.)
See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
Seriously, I just don't see any plausible reason why you'd think a compiler might want to use xchg, especially given that the calling convention doesn't pass an arg in RAX so you still need 2 instructions. Even for registers, xchg reg,reg on Intel CPUs is 3 uops, and they're microcode uops that can't benefit from mov-elimination. (Some AMD CPUs have 2-uop xchg reg,reg. Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?)
I also guess you're looking at clang output; GCC will avoid partial register shenanigans (like false dependencies) by using a movzx eax, byte ptr [rsi] load even though the return value is only the low byte. Zero-extending loads are cheaper than merging into the old value of RAX. So that's another downside to xchg.
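To contrast the two cases discussed above, here is a minimal sketch with a plain swap next to an atomic exchange. The function names are made up for illustration; the atomic version is the situation where x86 compilers typically do emit xchg with a memory operand, because the read-modify-write has to be indivisible.

```cpp
#include <atomic>
#include <utility>

// Plain swap: no atomicity required, so a separate load and store suffice.
char swap_plain(char reg, char* mem) {
    std::swap(reg, *mem);
    return reg;
}

// Atomic exchange: the old value is returned and the new one stored as a
// single indivisible operation.
char swap_atomic(char reg, std::atomic<char>& mem) {
    return mem.exchange(reg);
}
```

Both functions return the previous contents of the memory location; only the second one promises that no other thread can observe or modify the location between the load and the store.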
So is there any particular reason why it's not possible for the compiler to use xchg here?
Because mov is faster than xchg and compilers optimize for speed.
See:
Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?
Why does GCC use mov/mfence instead of xchg to implement C11's atomic_store?
Use xchg for -Os
Bug 47949 - Missed optimization for -Os using xchg instead of mov

C/C++. Why could a simple integer addition on a volatile be translated to a different asm instruction on gcc and clang?

I wrote a simple loop:
int volatile value = 0;
void loop(int limit) {
for (int i = 0; i < limit; ++i) {
++value;
}
}
I compiled this with gcc and clang (-O3 -fno-unroll-loops) and got different outputs. They differ in the ++value part:
clang:
add dword ptr [rip + value], 1 # ++value
add edi, -1 # --limit
jne .LBB0_1 # if limit != 0 then continue looping
gcc:
mov eax, DWORD PTR value[rip] # copy value to a register
add edx, 1 # ++i
add eax, 1 # increment a copy of value
mov DWORD PTR value[rip], eax # store incremented copy to value, i. e. ++value
cmp edi, edx # compare i < limit
jne .L3 # if i < limit then continue looping
The C and C++ versions are the same on each compiler (https://gcc.godbolt.org/z/5x5jGP).
So my questions are:
1) Is gcc doing something wrong? What is the point of copying the value?
2) I have benchmarked that code and for some reason the profiler shows that in gcc's version 73% of time is wasted on instruction add edx, 1, 13% on mov DWORD PTR value[rip], eax and 13% on cmp edi, edx. Am I interpreting these results wrong? Why other addition and move instructions take less than 1% of the time?
3) Why can performance differ on gcc/clang in such a primitive code?
This is all because you used volatile and GCC doesn't optimize it as aggressively
Without volatile, e.g. for a single ++*int_ptr, you get a memory-destination add. (And hopefully not inc when tuning for Intel CPUs; inc reg is fine but inc mem costs an extra uop vs. add 1. Unfortunately gcc and clang both get this wrong and use inc mem with -march=skylake: https://godbolt.org/z/_1Ri20)
clang knows that it can fold the volatile read / write accesses into the load and store portions of a memory-destination add.
GCC does not know how to do this optimization for volatile. Using volatile in GCC typically results in separate mov loads and stores, avoiding x86's ability to save code-size by using CISC memory operands for ALU instructions. On a load/store machine (like any RISC) you'd need separate load and store instructions anyway so it would be non-issue.
TL:DR: different compiler internals around volatile, specifically a GCC missed-optimization.
This missed optimization barely matters because volatile is rarely used. But feel free to report it on GCC's bugzilla if you want.
Without volatile, the loop would of course optimize away. But you can see a single memory-destination add from GCC or clang for a function that just does ++*p.
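As a sketch of what the two compilers are working from, the increment of a volatile object can be written out as its separate accesses, next to the non-volatile ++*p case. The names counter, bump_volatile and bump_plain are assumptions made for this example.

```cpp
volatile int counter = 0;  // hypothetical global, standing in for value

// The volatile increment spelled out: exactly one read and one write.
// GCC's asm keeps these as separate instructions; clang folds them into
// the load and store halves of a memory-destination add.
void bump_volatile() {
    int tmp = counter;  // volatile read
    tmp += 1;           // arithmetic on the register copy
    counter = tmp;      // volatile write
}

// Non-volatile increment through a pointer: both compilers are free to use
// a single memory-destination add here.
void bump_plain(int* p) { ++*p; }
```

Either way the number of volatile accesses is the same, which is all that volatile requires.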
1) Is gcc doing something wrong? What is the point of copying the value?
It's only copying it to a register. We don't normally call this "copying", just bringing it into a register where it can operate on it.
Note that gcc and clang also differ in how they implement the loop condition, with clang optimizing to just dec/jnz (actually add -1, but it would use dec with -march=skylake or something with efficient dec, i.e. not Silvermont).
GCC spends an extra uop on the loop condition (on Intel CPUs where add/jnz can macro-fuse into a single uop). IDK why it compiles it naively like that.
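The difference in the loop condition can be mimicked at the source level. This is a hand-written sketch, not what either compiler does internally: loop_up keeps a separate counter that is compared against limit, while loop_down counts limit itself down to zero, assuming its original value is not needed afterwards.

```cpp
volatile int value = 0;

// Count-up form, as in the question: needs i, an increment, and a compare.
void loop_up(int limit) {
    for (int i = 0; i < limit; ++i) {
        value = value + 1;
    }
}

// Count-down form: the decrement itself produces the condition for the branch.
void loop_down(int limit) {
    if (limit <= 0) return;   // guard for the zero-iteration case
    do {
        value = value + 1;
    } while (--limit != 0);
}
```

Both functions perform the same number of increments for any limit; only the bookkeeping around the loop differs.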
73% of time is wasted on instruction add edx, 1
perf counters typically blame the instruction that's waiting for a slow result, not the instruction that's actually slow to produce it.
add edx,1 is waiting for the reload of value. With 4 to 5 cycle store-forwarding latency, this is the major bottleneck in your loop.
(Whether it's between the multiple uops of a memory-destination add or between separate instructions makes essentially no difference. There are no other memory accesses in your loop so none of the weird effects of store-forwarding latency being lower if you don't try too soon come into play:
Adding a redundant assignment speeds up code when compiled without optimization or Loop with function call faster than an empty loop )
Why other addition and move instructions take less than 1% of the time?
Because out-of-order execution hides them under the latency of the critical path. They are very rarely the instruction that gets blamed when statistical sampling has to pick one out of the many that are in flight at once in any given cycle.
3) Why can performance differ on gcc/clang in such a primitive code?
I'd expect both of those loops to run at the same speed. Did you just mean performance as in how well the compilers themselves performed in making code that's both fast and compact?

Why might a C++ compiler duplicate a function exit basic block?

Consider the following snippet of code:
int* find_ptr(int* mem, int sz, int val) {
for (int i = 0; i < sz; i++) {
if (mem[i] == val) {
return &mem[i];
}
}
return nullptr;
}
GCC on -O3 compiles this to:
find_ptr(int*, int, int):
mov rax, rdi
test esi, esi
jle .L4 # why not .L8?
lea ecx, [rsi-1]
lea rcx, [rdi+4+rcx*4]
jmp .L3
.L9:
add rax, 4
cmp rax, rcx
je .L8
.L3:
cmp DWORD PTR [rax], edx
jne .L9
ret
.L8:
xor eax, eax
ret
.L4:
xor eax, eax
ret
In this assembly, the blocks with labels .L4 and .L8 are identical. Would it not be better to rewrite jumps to .L4 to .L8 and drop .L4? I thought this might be a bug, but clang also duplicates the xor-ret sequence back to back. However, ICC and MSVC each take a pretty different approach.
Is this an optimization in this case and, if not, are there times when it would be? What is the rationale behind this behavior?
This is always a missed optimization. Having both return-0 paths use the same basic block would be a pure win on all microarchitectures that current compilers care about.
But unfortunately this missed-optimization is not rare with gcc. Often it's a separate bare ret that gcc conditionally branches to, instead of branching to a ret in another existing path. (x86 doesn't have a conditional ret, so simple functions that don't need any stack cleanup often just need to branch to a ret.
Often functions this small would get inlined in a complete program, so maybe it doesn't hurt a lot in real life?)
CPUs (since Pentium Pro if not earlier) have a return-address predictor stack that easily predicts the branch target for ret instructions, so there's not going to be an effect from one ret instruction more often returning to one caller and another ret more often returning to another caller. It doesn't help branch prediction to separate them and let them use different entries.
IDK about Pentium 4 and whether the traces in its trace cache follow call/ret. But fortunately that's not relevant anymore. The decoded-uop cache in SnB-family and Ryzen is not a trace cache; a line/way of uop cache holds uops for a contiguous block of x86 machine code, and unconditional jumps end a uop cache line. (https://agner.org/optimize/) So if anything, this could be worse for SnB-family because each return path needs a separate line of the uop cache even though they're each only 2 uops total (xor-zero and ret are both single-uop instructions).
Report this MCVE to gcc's bugzilla with keyword missed-optimization: https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc
(update: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90178 was reported by the OP. A fix was attempted, but reverted; for now it's still open. In this case it seems to be caused by -mavx, perhaps some interaction with return paths that need vzeroupper or not.)
Cause:
You can kind of see how it might arrive at 2 exit blocks: compilers normally transform for loops into if(sz>0) { do{}while(); } if there's a possibility of it needing to run 0 times, like gcc did here. So there's one branch that leaves the function without entering the loop at all. But the other exit is from fall through from the loop. Perhaps before optimizing away some stuff, there was some extra cleanup. Or just those paths got split up when the first branch was created.
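The loop transformation described above can be written out by hand. This is only a sketch of the shape, with a hypothetical function name; GCC's internal representation will differ in detail.

```cpp
// Hand-rotated version of find_ptr: a guard for the zero-trip case,
// followed by a do-while loop with the condition at the bottom.
int* find_ptr_rotated(int* mem, int sz, int val) {
    if (sz > 0) {                // first exit: the loop is never entered
        int* p = mem;
        int* end = mem + sz;
        do {
            if (*p == val) return p;
            ++p;
        } while (p != end);      // second exit: falling out of the loop
    }
    return nullptr;              // both not-found exits arrive here
}
```

In this form the two not-found exits are distinct control-flow edges that reach the same return statement, which is where the two identical xor/ret blocks in the asm come from.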
I don't know why gcc fails to notice and merge two identical basic blocks that end with ret.
Maybe it only looked for that in some GIMPLE or RTL pass where they weren't actually identical, and only became identical during final x86 code-gen. Maybe after optimizing away save/restore of a register to hold some temporary that it ended up not needing?
You could dig deeper if you look at GCC's GIMPLE or RTL with -fdump-tree-... options after certain optimization passes: Godbolt has UI for that, in the + dropdown -> tree / RTL output. https://godbolt.org/z/l9mVlE. But unless you're a gcc-internals expert and planning to work on a patch or idea to help gcc find this optimization, it's probably not worth your time.
Interesting discovery that it only happens with -mavx (enabled by -march=skylake or directly). GCC and clang don't know how to auto-vectorize loops where the trip count is not known before the first iteration. e.g. search loops like this or memchr or strlen. So IDK why AVX even makes a difference at all.
(Note that the C abstract machine never reads mem[i] beyond the search point, and those elements might not actually exist. e.g. there's no UB if you passed this function a pointer to the last int before an unmapped page, and sz=1000, as long as *mem == val. So to auto-vectorize without int mem[static sz] guaranteed object size, the compiler would have to align the pointer... Not that C11 int mem[static sz] would even help; even a static array of compile-time-constant size larger than the max possible trip count wouldn't get gcc to auto-vectorize.)
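A small sketch of the early-exit case from the note above: sz is much larger than the object the pointer refers to, but a match at index 0 means no other element is ever read. find_ptr is copied from the question; early_match_demo is a made-up name for this illustration.

```cpp
// find_ptr exactly as in the question.
int* find_ptr(int* mem, int sz, int val) {
    for (int i = 0; i < sz; i++) {
        if (mem[i] == val) {
            return &mem[i];
        }
    }
    return nullptr;
}

// Hypothetical demo: only one int exists, yet sz claims 1000. The function
// returns at i == 0, so mem[1] and beyond are never accessed.
bool early_match_demo() {
    int single = 42;
    return find_ptr(&single, 1000, 42) == &single;
}
```

A vectorized version that loaded several elements at once before checking for a match would have to make sure such extra loads cannot fault.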

Why does this function push RAX to the stack as the first operation?

In the assembly of the C++ source below. Why is RAX pushed to the stack?
RAX, as I understand it from the ABI could contain anything from the calling function. But we save it here, and then later move the stack back by 8 bytes. So the RAX on the stack is, I think only relevant for the std::__throw_bad_function_call() operation ... ?
The code:-
#include <functional>
void f(std::function<void()> a)
{
a();
}
Output, from gcc.godbolt.org, using Clang 3.7.1 -O3:
f(std::function<void ()>): # #f(std::function<void ()>)
push rax
cmp qword ptr [rdi + 16], 0
je .LBB0_1
add rsp, 8
jmp qword ptr [rdi + 24] # TAILCALL
.LBB0_1:
call std::__throw_bad_function_call()
I'm sure the reason is obvious, but I'm struggling to figure it out.
Here's a tailcall without the std::function<void()> wrapper for comparison:
void g(void(*a)())
{
a();
}
The trivial:
g(void (*)()): # #g(void (*)())
jmp rdi # TAILCALL
The 64-bit ABI requires that the stack is aligned to 16 bytes before a call instruction.
call pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call.
(The ABI design choice of requiring alignment before a call instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)
Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine. (See the comments).
The reason push rax is there is to align the stack back to a 16-byte boundary to conform to the 64-bit System V ABI in the case where je .LBB0_1 branch is taken. The value placed on the stack isn't relevant. Another way would have been subtracting 8 from RSP with sub rsp, 8. The ABI states the alignment this way:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is
passed on stack) byte boundary. In other words, the value (%rsp + 8) is always
a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.
Prior to the call to function f the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f the return address was placed on the stack misaligning the stack by 8. push rax is a simple way of subtracting 8 from RSP and realigning it again. If the branch is taken to call std::__throw_bad_function_call() the stack will be properly aligned for that call to work.
In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8 instruction is executed. The return address of the CALLER to function f will now be back at the top of the stack and the stack will be misaligned by 8 again. This is what we want because a TAIL CALL is being made with jmp qword ptr [rdi + 24] to transfer control to the function a. This will JMP to the function not CALL it. When function a does a RET it will return directly back to the function that called f.
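The alignment arithmetic in the two paths above can be checked with a toy model of RSP. This is just a sketch: the helper names and the starting address are arbitrary assumptions, and only the offsets matter.

```cpp
#include <cstdint>

// Toy model: each helper returns the new RSP after one instruction.
constexpr std::uint64_t after_call(std::uint64_t rsp) { return rsp - 8; }  // call pushes the return address
constexpr std::uint64_t after_push(std::uint64_t rsp) { return rsp - 8; }  // push rax
constexpr std::uint64_t after_add8(std::uint64_t rsp) { return rsp + 8; }  // add rsp, 8
constexpr bool aligned16(std::uint64_t rsp) { return rsp % 16 == 0; }

// Arbitrary 16-byte-aligned stack pointer in the caller, just before call f.
constexpr std::uint64_t kCallerRsp = 0x7ffc0000;
constexpr std::uint64_t kEntryRsp = after_call(kCallerRsp);

static_assert(aligned16(kCallerRsp), "aligned before the call");
static_assert(!aligned16(kEntryRsp), "off by 8 at function entry");
static_assert(aligned16(after_push(kEntryRsp)), "push rax realigns for the throw call");
static_assert(after_add8(after_push(kEntryRsp)) == kEntryRsp,
              "add rsp, 8 restores the entry state for the tail call");
```

The last assertion is the tail-call path: the jmp target sees exactly the stack it would have seen if the caller had called it directly.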
At a higher optimization level I would have expected that the compiler should be smart enough to do the comparison, and let it fall through directly to the JMP. What is at label .LBB0_1 could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call() works properly.
As @CodyGray pointed out, if you use GCC (not CLANG) with optimization level of -O2 or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:
f(std::function<void ()>):
cmp QWORD PTR [rdi+16], 0 # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B],
je .L7 #,
jmp [QWORD PTR [rdi+24]] # MEM[(const struct function *)a_2(D)]._M_invoker
.L7:
sub rsp, 8 #,
call std::__throw_bad_function_call() #
This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer may handle this code generation better than CLANG.
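For context, both listings are consistent with a function that simply invokes its std::function argument. The sketch below assumes that is what f looks like. The source is not shown here, so the body of f and the two helper functions are assumptions for illustration. It shows the two paths in the assembly: the empty case that ends up in the throw helper, and the non-empty case that reaches the stored callable.

```cpp
#include <cassert>
#include <functional>

// Assumed source for f (not taken from the text above).
void f(std::function<void()> a) { a(); }

// Hypothetical helper: true if calling f with an empty std::function
// throws std::bad_function_call (the path that needs an aligned stack).
bool empty_call_throws() {
    try {
        f(std::function<void()>{});
    } catch (const std::bad_function_call&) {
        return true;
    }
    return false;
}

// Hypothetical helper: true if f invokes a non-empty callable exactly once
// (the path that compilers can turn into a tail call).
int call_count = 0;
bool nonempty_call_runs() {
    const int before = call_count;
    f([] { ++call_count; });
    return call_count == before + 1;
}
```

Calling empty_call_throws() exercises the branch to the throw helper, and nonempty_call_runs() exercises the fall-through path.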
In other cases, clang typically fixes up the stack before returning with a pop rcx.
Using push has an upside for efficiency in code-size (push is only 1 byte vs. 4 bytes for sub rsp, 8), and also in uops on Intel CPUs. (No need for a stack-sync uop, which you'd get if you access rsp directly because the call that brought us to the top of the current function makes the stack engine "dirty").
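To make the code-size comparison concrete, here is a small sketch that writes out the 64-bit-mode machine-code bytes for the two ways of adjusting the stack and compares their sizes. The array names are just for illustration.

```cpp
#include <cstddef>

// Machine-code bytes in 64-bit mode.
constexpr unsigned char push_rax[]  = {0x50};                    // push rax
constexpr unsigned char pop_rcx[]   = {0x59};                    // pop rcx
constexpr unsigned char sub_rsp_8[] = {0x48, 0x83, 0xEC, 0x08};  // sub rsp, 8 (REX.W + 83 /5 ib)
constexpr unsigned char add_rsp_8[] = {0x48, 0x83, 0xC4, 0x08};  // add rsp, 8 (REX.W + 83 /0 ib)

// Total bytes for an align/restore pair using each approach.
constexpr std::size_t push_pop_bytes = sizeof(push_rax) + sizeof(pop_rcx);
constexpr std::size_t sub_add_bytes  = sizeof(sub_rsp_8) + sizeof(add_rsp_8);

static_assert(sizeof(push_rax) == 1, "push rax is 1 byte");
static_assert(sizeof(sub_rsp_8) == 4, "sub rsp, 8 is 4 bytes");
static_assert(sub_add_bytes - push_pop_bytes == 6, "the push/pop pair saves 6 bytes");
```

The saving is 3 bytes per instruction, so 6 bytes in a function that both aligns and restores the stack.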
This long and rambling answer discusses the worst-case performance risks of using push rax / pop rcx for aligning the stack, and whether or not rax and rcx are good choices of register. (Sorry for making this so long.)
(TL:DR: looks good, the possible downside is usually small and the upside in the common case makes this worth it. Partial-register stalls could be a problem on Core2/Nehalem if al or ax are "dirty", though. No other 64-bit capable CPU has big problems (because they don't rename partial regs, or merge efficiently), and 32-bit code needs more than 1 extra push to align the stack by 16 for another call unless it was already saving/restoring some call-preserved regs for its own use.)
Using push rax instead of sub rsp, 8 introduces a dependency on the old value of rax, so you'd think it might slow things down if the value of rax is the result of a long-latency dependency chain (and/or a cache miss).
e.g. the caller might have done something slow with rax that's unrelated to the function args, like var = table[ x % y ]; var2 = foo(x);
; example caller that leaves RAX not-ready for a long time
mov rdi, rax ; prepare function arg
div rbx ; very high latency
mov rax, [table + rdx] ; rax = table[ value % something ], may miss in cache
mov [rsp + 24], rax ; spill the result.
call foo ; foo uses push rax to align the stack
Fortunately out-of-order execution will do a good job here.
The push doesn't make the value of rsp dependent on rax. (It's either handled by the stack engine, or on very old CPUs push decodes to multiple uops, one of which updates rsp independently of the uops that store rax. Micro-fusion of the store-address and store-data uops let push be a single fused-domain uop, even though stores always take 2 unfused-domain uops.)
As long as nothing depends on the output of push rax / pop rcx, it's not a problem for out-of-order execution. If push rax has to wait because rax isn't ready, the push itself isn't what causes the ROB (ReOrder Buffer) to fill up and eventually block the execution of later independent instructions. The ROB would fill up even without the push, because the instruction that's slow to produce rax, and whatever instruction in the caller consumes rax before the call, are even older and can't retire until rax is ready either. Retirement has to happen in-order in case of exceptions / interrupts.
(I don't think a cache-miss load can retire before the load completes, leaving just a load-buffer entry. But even if it could, it wouldn't make sense to produce a result in a call-clobbered register without reading it with another instruction before making a call. The caller's instruction that consumes rax definitely can't execute/retire until our push can do the same.)
When rax does become ready, push can execute and retire in a couple cycles, allowing later instructions (which were already executed out of order) to also retire. The store-address uop will have already executed, and I assume the store-data uop can complete in a cycle or two after being dispatched to the store port. Stores can retire as soon as the data is written to the store buffer. Commit to L1D happens after retirement, when the store is known to be non-speculative.
So even in the worst case, where the instruction that produces rax was so slow that it led to the ROB filling up with independent instructions that are mostly already executed and ready to retire, having to execute push rax only causes a couple extra cycles of delay before independent instructions after it can retire. (And some of the caller's instructions will retire first, making a bit of room in the ROB even before our push retires.)
A push rax that has to wait will tie up some other microarchitectural resources, leaving one fewer entry for finding parallelism between other later instructions. (A sub rsp,8 that could execute would only be consuming a ROB entry, and not much else.)
It will use up one entry in the out-of-order scheduler (aka Reservation Station / RS). The store-address uop can execute as soon as there's a free cycle, so only the store-data uop will be left. The pop rcx uop's load address is ready, so it should dispatch to a load port and execute. (When the pop load executes, it finds that its address matches the incomplete push store in the store buffer (aka memory order buffer), so it sets up the store-forwarding which will happen after the store-data uop executes. This probably consumes a load buffer entry.)
Even an old CPU like Nehalem has a 36 entry RS, vs. 54 in Sandybridge, or 97 in Skylake. Keeping 1 entry occupied for longer than usual in rare cases is nothing to worry about. The alternative of executing two uops (stack-sync + sub) is worse.
(off topic)
The ROB is larger than the RS, 128 (Nehalem), 168 (Sandybridge), 224 (Skylake). (It holds fused-domain uops from issue to retirement, vs. the RS holding unfused-domain uops from issue to execution). At 4 uops per clock max frontend throughput, that's over 50 cycles of delay-hiding on Skylake. (Older uarches are less likely to sustain 4 uops per clock for as long...)
ROB size determines the out-of-order window for hiding a slow independent operation. (Unless register-file size limits are a smaller limit). RS size determines the out-of-order window for finding parallelism between two separate dependency chains. (e.g. consider a 200 uop loop body where every iteration is independent, but within each iteration it's one long dependency chain without much instruction-level parallelism (e.g. a[i] = complex_function(b[i])). Skylake's ROB can hold more than 1 iteration, but we can't get uops from the next iteration into the RS until we're within 97 uops of the end of the current one. If the dep chain wasn't so much larger than RS size, uops from 2 iterations could be in flight most of the time.)
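The window sizes above can be turned into rough numbers. This sketch only restates the figures quoted in this answer (RS and ROB entries, 4 uops per clock, a 200-uop loop body). The struct and function names are made up for illustration, and the result is a back-of-the-envelope estimate rather than a performance model.

```cpp
// RS and ROB sizes as quoted in the text above.
struct Uarch {
    int rs_entries;   // scheduler (RS) entries, unfused-domain uops
    int rob_entries;  // reorder buffer (ROB) entries, fused-domain uops
};

constexpr Uarch nehalem{36, 128};
constexpr Uarch sandybridge{54, 168};
constexpr Uarch skylake{97, 224};

constexpr int issue_width = 4;  // max front-end uops per clock

// Cycles of issue at full width before the ROB fills behind one stalled uop.
constexpr int rob_cycles(Uarch u) { return u.rob_entries / issue_width; }

static_assert(rob_cycles(nehalem) == 32, "128 / 4");
static_assert(rob_cycles(sandybridge) == 42, "168 / 4");
static_assert(rob_cycles(skylake) == 56, "224 / 4, i.e. over 50 cycles");

// The 200-uop loop body from the example above, on Skylake.
constexpr int loop_body_uops = 200;
static_assert(skylake.rob_entries > loop_body_uops, "the ROB holds more than one iteration");
static_assert(skylake.rs_entries < loop_body_uops, "the RS cannot hold a whole iteration");
```

The last two assertions are the point of the loop example: the ROB is big enough to see into the next iteration, but the RS is the tighter limit on overlapping two long dependency chains.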
There are cases where push rax / pop rcx can be more dangerous:
The caller of this function knows that rcx is call-clobbered, so won't read the value. But it might have a false dependency on rcx after we return, like bsf rcx, rax / jnz or test eax,eax / setz cl. Recent Intel CPUs don't rename low8 partial registers anymore, so setcc cl has a false dep on rcx. bsf actually leaves its destination unmodified if the source is 0, even though Intel documents it as an undefined value. AMD documents leave-unmodified behaviour.
The false dependency could create a loop-carried dep chain. On the other hand, a false dependency can do that anyway, if our function wrote rcx with instructions dependent on its inputs.
It would be worse to use push rbx/pop rbx to save/restore a call-preserved register that we weren't going to use. The caller likely would read it after we return, and we'd have introduced a store-forwarding latency into the caller's dependency chain for that register. (Also, it's maybe more likely that rbx would be written right before the call, since anything the caller wanted to keep across the call would be moved to call-preserved registers like rbx and rbp.)
On CPUs with partial-register stalls (Intel pre-Sandybridge), reading rax with push could cause a stall of 2-3 cycles on Core2 / Nehalem if the caller had done something like setcc al before the call. Sandybridge doesn't stall while inserting a merging uop, and Haswell and later don't rename low8 registers separately from rax at all.
It would be nice to push a register that was less likely to have had its low8 used. If compilers tried to avoid REX prefixes for code-size reasons, they'd avoid dil and sil, so rdi and rsi would be less likely to have partial-register issues. But unfortunately gcc and clang don't seem to favour using dl or cl as 8-bit scratch registers, using dil or sil even in tiny functions where nothing else is using rdx or rcx. (Although lack of low8 renaming in some CPUs means that setcc cl has a false dependency on the old rcx, so setcc dil is safer if the flag-setting was dependent on the function arg in rdi.)
pop rcx at the end "cleans" rcx of any partial-register stuff. That's useful because cl is used for shift counts, and functions do sometimes write just cl even when they could have written ecx instead. (IIRC I've seen clang do this. gcc more strongly favours 32-bit and 64-bit operand sizes to avoid partial-register issues.)
push rdi would probably be a good choice in a lot of cases, since the rest of the function also reads rdi, so introducing another instruction dependent on it wouldn't hurt. It does stop out-of-order execution from getting the push out of the way if rax is ready before rdi, though.
Another potential downside is using cycles on the load/store ports. But they are unlikely to be saturated, and the alternative is uops for the ALU ports. With the extra stack-sync uop on Intel CPUs that you'd get from sub rsp, 8, that would be 2 ALU uops at the top of the function.