Why might a C++ compiler duplicate a function exit basic block? - c++

Consider the following snippet of code:
int* find_ptr(int* mem, int sz, int val) {
    for (int i = 0; i < sz; i++) {
        if (mem[i] == val) {
            return &mem[i];
        }
    }
    return nullptr;
}
GCC on -O3 compiles this to:
find_ptr(int*, int, int):
        mov     rax, rdi
        test    esi, esi
        jle     .L4                     # why not .L8?
        lea     ecx, [rsi-1]
        lea     rcx, [rdi+4+rcx*4]
        jmp     .L3
.L9:
        add     rax, 4
        cmp     rax, rcx
        je      .L8
.L3:
        cmp     DWORD PTR [rax], edx
        jne     .L9
        ret
.L8:
        xor     eax, eax
        ret
.L4:
        xor     eax, eax
        ret
In this assembly, the blocks with labels .L4 and .L8 are identical. Would it not be better to rewrite jumps to .L4 to .L8 and drop .L4? I thought this might be a bug, but clang also duplicates the xor-ret sequence back to back. However, ICC and MSVC each take a pretty different approach.
Is this an optimization in this case and, if not, are there times when it would be? What is the rationale behind this behavior?

This is always a missed optimization. Having both return-0 paths use the same basic block would be pure win on all microarchitectures that current compilers care about.
But unfortunately this missed-optimization is not rare with gcc. Often it's a separate bare ret that gcc conditionally branches to, instead of branching to a ret in another existing path. (x86 doesn't have a conditional ret, so simple functions that don't need any stack cleanup often just need to branch to a ret.
Often functions this small would get inlined in a complete program, so maybe it doesn't hurt a lot in real life?)
CPUs (since Pentium Pro if not earlier) have a return-address predictor stack that easily predicts the branch target for ret instructions, so there's not going to be an effect from one ret instruction more often returning to one caller and another ret more often returning to another caller. It doesn't help branch prediction to separate them and let them use different entries.
IDK about Pentium 4 and whether the traces in its trace cache follow call/ret. But fortunately that's not relevant anymore. The decoded-uop cache in SnB-family and Ryzen is not a trace cache; a line/way of uop cache holds uops for a contiguous block of x86 machine code, and unconditional jumps end a uop cache line. (https://agner.org/optimize/) So if anything, this could be worse for SnB-family because each return path needs a separate line of the uop cache even though they're each only 2 uops total (xor-zero and ret are both single-uop instructions).
Report this MCVE to gcc's bugzilla with keyword missed-optimization: https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc
(update: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90178 was reported by the OP. A fix was attempted, but reverted; for now it's still open. In this case it seems to be caused by -mavx, perhaps some interaction with return paths that need vzeroupper or not.)
Cause:
You can kind of see how it might arrive at 2 exit blocks: compilers normally transform for loops into if(sz>0) { do{}while(); } if there's a possibility of it needing to run 0 times, like gcc did here. So there's one branch that leaves the function without entering the loop at all. But the other exit is from fall through from the loop. Perhaps before optimizing away some stuff, there was some extra cleanup. Or just those paths got split up when the first branch was created.
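As a rough C-level sketch of that rotated form (hand-written to mirror the asm above, not actual compiler output), note how it naturally ends up with two separate return nullptr statements:
int* find_ptr_rotated(int* mem, int sz, int val) {
    if (sz <= 0)
        return nullptr;          // exit without ever entering the loop (.L4)
    int* p = mem;
    int* end = mem + sz;         // what the lea into rcx computes
    do {
        if (*p == val)
            return p;            // the ret that falls out of .L3
    } while (++p != end);
    return nullptr;              // exit after the loop runs dry (.L8)
}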
I don't know why gcc fails to notice and merge two identical basic blocks that end with ret.
Maybe it only looked for that in some GIMPLE or RTL pass where they weren't actually identical, and they only became identical during final x86 code-gen. Maybe after optimizing away save/restore of a register to hold some temporary that it ended up not needing?
You could dig deeper if you look at GCC's GIMPLE or RTL with -fdump-tree-... options after certain optimization passes: Godbolt has UI for that, in the + dropdown -> tree / RTL output. https://godbolt.org/z/l9mVlE. But unless you're a gcc-internals expert and planning to work on a patch or idea to help gcc find this optimization, it's probably not worth your time.
Interesting discovery that it only happens with -mavx (enabled by -march=skylake or directly). GCC and clang don't know how to auto-vectorize loops where the trip count is not known before the first iteration. e.g. search loops like this or memchr or strlen. So IDK why AVX even makes a difference at all.
(Note that the C abstract machine never reads mem[i] beyond the search point, and those elements might not actually exist. e.g. there's no UB if you passed this function a pointer to the last int before an unmapped page, and sz=1000, as long as *mem == val. So to auto-vectorize without int mem[static sz] guaranteed object size, the compiler would have to align the pointer... Not that C11 int mem[static sz] would even help; even a static array of compile-time-constant size larger than the max possible trip count wouldn't get gcc to auto-vectorize.)
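Here's a minimal sketch of that scenario, assuming the find_ptr from the question; the point is that the abstract machine reads only buf[0] before returning, so a vectorized version that touched later elements could fault where this scalar version can't:
int* find_ptr(int* mem, int sz, int val);   // as defined above

int* caller(void) {
    static int buf[1] = { 42 };       // pretend the next int would be in an unmapped page
    return find_ptr(buf, 1000, 42);   // no UB: only buf[0] is read before the early return
}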

Related

imul then mov vs mov then imul - any difference?

If I compile the following C++ program:
int baz(int x) { return x * x; }
in clang 15, I get:
baz(int):
mov eax, edi
imul eax, edi
ret
while gcc 12.2 gives me:
baz(int):
imul edi, edi
mov eax, edi
ret
(See this on GodBolt)
Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.
Do mov then imul because it's better with mov-elimination, and not worse anywhere for any other reason.
This is true in general for mov/and, mov/sub, etc, as long as you don't have a use for the original value. If you do, then sometimes mov to make a copy and then modify the original to hide mov latency for CPUs without move elimination. (mov/add or small shift should normally be lea).
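For example, the mov/add case from that parenthetical folds into a single lea; a trivial sketch (the asm in the comment is typical -O2 output, shown for illustration rather than guaranteed):
// Copy-and-add in one instruction: compilers generally emit something like
//     lea eax, [rdi+4]
// instead of   mov eax, edi  /  add eax, 4
int add4(int x) { return x + 4; }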
CPU with mov-elimination
mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See How do *move elimination* slots work in Intel CPU?
All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)
It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.
To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.
CPU without mov-elimination
mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.
(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)
But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favour mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.
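To make the two-input case concrete, here's a minimal sketch; the asm in the comments shows the two orderings discussed above and is illustrative, not the guaranteed output of any particular compiler:
// x*y with two separate inputs (x in EDI, y in ESI under x86-64 SysV):
//   mov  eax, edi   then   imul eax, esi   -> the extra mov cycle lands only on x's path
//   imul edi, esi   then   mov  eax, edi   -> the mov comes after the multiply,
//                                             so it's always on the critical path to the result
int mul2(int x, int y) { return x * y; }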
Things that don't matter
On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop.
Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) rename low-8 registers like DIL and DL separate from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.
Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (Why doesn't GCC use partial registers?)
But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

Strange uses of movzx by Clang and GCC

I know that movzx can be used for dependency breaking, but I stumbled on some movzx uses by both Clang and GCC that I really can't see what good they are for. Here's a simple example I tried on Godbolt compiler explorer:
#include <stdint.h>
int add2bytes(uint8_t* a, uint8_t* b) {
    return uint8_t(*a + *b);
}
with GCC 12 -O3:
add2bytes(unsigned char*, unsigned char*):
movzx eax, BYTE PTR [rsi]
add al, BYTE PTR [rdi]
movzx eax, al
ret
If I understand correctly, the first movzx here breaks dependency on previous eax value, but what is the second movzx doing? I don't think there's any dependency it can break, and it shouldn't affect the result either.
with clang 14 -O3, it's even more weird:
add2bytes(unsigned char*, unsigned char*): # #add2bytes(unsigned char*, unsigned char*)
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al
ret
It uses mov where movzx seems more reasonable, and then zero extends al to eax, but wouldn't it be much better to do movzx at the start?
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1
GCC generates both sensible and strange movzx, and Clang's use of mov r8 m and movzx just makes no sense to me. I also tried adding -march=skylake to make sure this isn't a feature for really old architectures, but the generated assembly looks more or less the same.
The closest post I have found is https://stackoverflow.com/a/64915219/14730360 where they showed similar movzx uses that seem useless and/or out of place.
Do the compilers really use movzx poorly here, or am I missing something?
Edit: I have opened bug reports for Clang and GCC:
https://github.com/llvm/llvm-project/issues/56498
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277
Temporary workarounds using inline assembly:
https://godbolt.org/z/7qob8G3j7
#define addb(a, b) asm (\
    "addb %1, %b0"\
    : "+r"(a) : "mi"(b))

int add2bytes(uint8_t* a, uint8_t* b) {
    int ret = *a;
    addb(ret, *b);
    return ret;
}
Now Clang -O3 produces:
add2bytes(unsigned char*, unsigned char*): # #add2bytes(unsigned char*, unsigned char*)
movzx eax, byte ptr [rdi]
add al, byte ptr [rsi]
ret
Both compilers are doing a poor job here, but clang's code is especially bad and has no real upside anywhere. And an easily avoidable downside on everything except Intel CPUs a decade old (which rename low-8 partial registers).
The optimal asm is what you suggest, movzx load, then byte add, leaving a uint8_t result in the low byte, correctly zero-extended to int as required by the C semantics. (Thanks for reporting it upstream: https://github.com/llvm/llvm-project/issues/56498 - I commented there about movzx being a good idea for byte loads in general, even when LLVM doesn't need the result zero-extended.)
A movzx is necessary somewhere, but it can be in the initial load. (A movzx is generally a good idea for a byte load anyway, to avoid a false dependency on the old RAX; clang's choice to save 1 byte is probably not a good one even when it doesn't end up needing a separate movzx right after.)
There are basically three relevant behaviours here, among x86-64 CPUs.
Core 2 / Nehalem (the 64-bit capable members of the P6 family): AL renamed separately from RAX if you write AL. A later read of EAX will stall the front-end for about 3 cycles while inserting a merge uop. Less bad than earlier P6-family, but still a significant penalty to avoid. But these CPUs are pretty obsolete, and not something GCC's -mtune=generic should put much weight on for the latest GCC. (Especially given that current nightly GCC's behaviour won't get baked into widely used binary packages for another year or more, probably, by most stable-release distros.)
Returning an int when the last instruction wrote al will likely lead to a penalty when the caller reads EAX. But mov al, [rdi] can run without any false dependency or merging cost.
Sandybridge and maybe Ivy Bridge: AL still renamed separately, but a merging uop can be inserted without any stalling, in a cycle with other uops.
mov al, [rdi] still has no false dep or merging uop. But a later read of EAX that triggers a merging uop (to merge the add al result with the high bytes of RAX from movzx eax, [rdi]) will get inserted just as cheaply as if we'd put a movzx eax, al in the machine code. (If the upper bytes of RAX are all zero, merge or extend are equivalent.)
Haswell and later (and maybe IvB), and all other x86 vendors, and low-power CPUs from Intel like Silvermont-family: no partial register renaming at all. (Except for AH/BH/CH/DH on Intel SnB-family). The last CPU not in this category is nearly a decade old, and the last CPU with major penalties (P6-family) is over a decade old.
mov al, [rdi] sucks: false dependency and costs an ALU uop in the back-end to merge. So it's extra load latency in the critical path through whatever stored the memory operand.
Reading EAX after writing AL has zero penalty; that's not a special case at all; the merging happened when you wrote AL.
GCC's code is a sensible tradeoff between Core2 / Nehalem vs. modern CPUs: load with movzx to avoid a false dep writing a partial reg. And a final movzx to avoid a partial-register stall in the caller.
But if it's going to do that, it could hurt modern Intel less by picking EDX or ECX as the temporary, since Intel can do zero-latency mov-elimination on movzx r32, r8, but not within the same register. It still costs a front-end uop so it's not free for throughput, only latency and back-end ports. This is a persistent missed-optimization; I don't think GCC or clang know to look for that; they commonly zero-extend 32->64 with mov esi,esi on a function arg, for example.
movzx edx, byte ptr [rdi]
add dl, [rsi]
movzx eax, dl # mov-elimination possible on IvB and later (except Ice Lake with updated microcode which breaks mov-elim).
If optimizing specifically for Core2 / Nehalem, you'd do this:
xor eax, eax # off the critical path, avoids partial-reg stalls for later reads of EAX after writing AL
mov al, [rdi]
add al, [rsi]
That's not bad on later CPUs, although the mov al, [rdi] would still be a micro-fused load+ALU uop so it has extra load latency, and takes an extra slot in the scheduler and a cycle on a back-end execution port. So 3 back-end uops, up from 2 in IvB and later with eliminated movzx if you pick different registers.
GCC's choice to use movzx because of Core2/Nehalem is highly conservative at this point; probably -mtune=generic in GCC12 shouldn't care about P6-family partial-register stalls since those CPUs are well over a decade old. Especially in 64-bit code where the worst case is Core2/Nehalem, not the even longer stalls with no merging uop on earlier P6-family. (And 64-bit code is more likely to be run on newer CPUs; one of the use-cases for -m32 is to make code for old 32-bit-only CPUs.)
It might well be an intentional tuning choice that needs updating. It's definitely a missed optimization with -march / -mtune= k8 through znver3, or silvermont-family, or sandybridge or newer.
(Also note that some choices which should differ based on -mtune setting actually don't. GCC just has one way it always does some things, and adding hooks to make it differ based on a tuning flag hasn't been done. Clang is the same way. e.g. -mtune=core2 still doesn't know to avoid partial-register stalls!)
Clang normally lives dangerously writing partial registers and otherwise ignoring false dependencies when they're not visibly loop-carried within a single function (which can bite it in the ass). This can save a whole instruction when it skips xor-zeroing, but saving just 1 byte doesn't seem worth it in general. It's a false dependency and means the mov load decodes to load + ALU merge uops (to merge a new low byte into the existing 64-bit register).
Looks like clang just did its usual thing of loading 8-bit values into 8-bit registers ignoring movzx, then realized at the end it needed to zero-extend the result.
An optimization pass looking for a chance to fold zero-extension (after narrow math) into an earlier load would be useful. And/or otherwise look for ways to prove that values are already zero-extended, if it doesn't do that.
Probably in general better to start doing narrow loads with movzx so that's more normally the case.
You might want to report a missed-optimization bug, especially for clang. Their code-gen is already a huge middle finger toward P6-family most of the time with partial-register usage, so they'd probably be interested in trying to generate the 2-instruction version. https://github.com/llvm/llvm-project/issues
Also https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (use the keyword missed-optimization for GCC bugs. Feel free to link this stack overflow post, and/or quote any of my comments if you want, as well as a Godbolt link. GCC devs prefer AT&T syntax for x86 discussion / bugs.)
See also:
Why doesn't GCC use partial registers?
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
https://agner.org/optimize/ (especially his microarch guide re: partial-register details for P6-family CPUs. Last I looked, the guide incorrectly said Haswell doesn't have zero-latency movzx eax, dl, and that AH-merging was free; see my Q&A about HSW/SKL. But Agner's guide is accurate AFAIK for earlier CPUs.)
https://uops.info/ (front-end vs. back-end vs. latency costs for different instructions)
What is the best way to set a register to zero in x86 assembly: xor, mov or and? - including the part about avoiding partial-register stalls on P6, how xor eax,eax sets some kind of internal EAX=AL flag.
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1 GCC generates both sensible and strange movzx, and Clang's use of mov and movzx just makes no sense to me.
clang's mov ecx, edx zero-extension from 32 to 64 instead of from 8 to 64 is because it depends on an unofficial extension to the x86-64 SysV calling convention, that narrow args are extended to 32-bit. AMD Zen CPUs can do mov-elimination on mov ecx, edx but not for movzx-byte, so this is actually more efficient, as well as saving code-size.
(GCC and clang both make callers that respect this unofficial calling-convention feature, but only clang makes callees that depend on it. ICC doesn't do either so is not ABI-compatible with clang.)
Extension to intptr_t is of course necessary for all narrower args if you're going to index an array with one. (In abstract C terms, this is just part of using the value for pointer math). High garbage is allowed in at least the high 32 bits of the 64-bit register.
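As a hypothetical illustration of that convention (the function name and the described codegen are my assumptions based on the behaviour above, not taken from the linked Godbolt):
// A narrow arg arrives in DIL/EDI under x86-64 SysV.
int add_byte(unsigned char c, int x) {
    // A clang-built callee may use EDI directly, assuming the caller already
    // zero-extended c to 32 bits; a gcc-built callee typically re-extends it
    // itself (e.g. with movzx) because gcc doesn't rely on the unofficial
    // extension in callees.
    return c + x;
}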
The clang bit actually seems reasonable. You get a partial register stall if you write to al and then read from eax. Using movzx breaks this partial register stall.
The initial mov to al has no dependencies on existing values of eax (due to register renaming), so the dependencies are just the unavoidable dependencies (wait for [rsi], wait for [rdi], wait for add to complete before zero-extending).
In other words, the top 24 bits must be zeroed and the lower 8 bits must be calculated, but the two actions can be done in either order. clang just chooses to add first, zero later.
[EDIT]
As for GCC, it seems a particularly bad choice. If it had chosen bl as the temporary register, that last movzx would be zero-latency on Haswell/SkyLake, but move elimination does not work on al to eax.
The final MOVZX is mandated by the fact that the function returns an int, extended from a byte. It must be there in the clang version, but with gcc one is extra.

Why does assembler move register value on stack and then back to register? [duplicate]

I am disassembling this code on llvm clang Apple LLVM version 8.0.0 (clang-800.0.42.1):
#include <stdio.h>

int main() {
    float a=0.151234;
    float b=0.2;
    float c=a+b;
    printf("%f", c);
}
I compiled with no -O specifications, but I also tried with -O0 (gives the same) and -O2 (actually computes the value and stores it precomputed)
The resulting disassembly is the following (I removed the parts that are not relevant)
-> 0x100000f30 <+0>: pushq %rbp
0x100000f31 <+1>: movq %rsp, %rbp
0x100000f34 <+4>: subq $0x10, %rsp
0x100000f38 <+8>: leaq 0x6d(%rip), %rdi
0x100000f3f <+15>: movss 0x5d(%rip), %xmm0
0x100000f47 <+23>: movss 0x59(%rip), %xmm1
0x100000f4f <+31>: movss %xmm1, -0x4(%rbp)
0x100000f54 <+36>: movss %xmm0, -0x8(%rbp)
0x100000f59 <+41>: movss -0x4(%rbp), %xmm0
0x100000f5e <+46>: addss -0x8(%rbp), %xmm0
0x100000f63 <+51>: movss %xmm0, -0xc(%rbp)
...
Apparently it's doing the following:
loading the two floats onto registers xmm0 and xmm1
put them in the stack
load one value (not the one xmm0 had earlier) from the stack to xmm0
perform the addition.
store the result back to the stack.
I find it inefficient because:
Everything can be done in registry. I am not using a and b later, so it could just skip any operation involving the stack.
even if it wanted to use the stack, it could save reloading xmm0 from the stack if it did the operation with a different order.
Given that the compiler is always right, why did it choose this strategy?
-O0 (unoptimized) is the default. It tells the compiler you want it to compile fast (short compile times), not to take extra time compiling to make efficient code.
(-O0 isn't literally no optimization; e.g. gcc will still eliminate code inside if(1 == 2){ } blocks. Especially gcc more than most other compilers still does things like use multiplicative inverses for division at -O0, because it still transforms your C source through multiple internal representations of the logic before eventually emitting asm.)
Plus, "the compiler is always right" is an exaggeration even at -O3. Compilers are very good at a large scale, but minor missed-optimizations are still common within single loops. Often with very low impact, but wasted instructions (or uops) in a loop can eat up space in the out-of-order execution reordering window, and be less hyper-threading friendly when sharing a core with another thread. See C++ code for testing the Collatz conjecture faster than hand-written assembly - why? for more about beating the compiler in a simple specific case.
More importantly, -O0 also implies treating all variables similarly to volatile for consistent debugging: so you can set a breakpoint or single-step and modify the value of a C variable, then continue execution and have the program work the way you'd expect from your C source running on the C abstract machine. So the compiler can't do any constant-propagation or value-range simplification. (e.g. an integer that's known to be non-negative could simplify expressions using it, or make some if conditions always true or always false.)
(It's not quite as bad as volatile: multiple references to the same variable within one statement don't always result in multiple loads; at -O0 compilers will still optimize somewhat within a single expression.)
Compilers have to specifically anti-optimize for -O0 by storing/reloading all variables to their memory address between statements. (In C and C++, every variable has an address unless it was declared with the (now obsolete) register keyword and has never had its address taken. Optimizing away the address is possible according to the as-if rule for other variables, but isn't done at -O0)
Unfortunately, debug-info formats can't track the location of a variable through registers, so fully consistent debugging isn't possible without this slow-and-stupid code-gen.
If you don't need this, you can compile with -Og for light optimization, and without the anti-optimizations required for consistent debugging. The GCC manual recommends it for the usual edit/compile/run cycle, but you will get "optimized out" for many local variables with automatic storage when debugging. Globals and function args still usually have their actual values, at least at function boundaries.
Even worse, -O0 makes code that still works even if you use GDB's jump command to continue execution at a different source line. So each C statement has to be compiled into a fully independent block of instructions. (Is it possible to "jump"/"skip" in GDB debugger?)
for() loops can't be transformed into idiomatic (for asm) do{}while() loops, and other restrictions.
For all the above reasons, (micro-)benchmarking un-optimized code is a huge waste of time; the results depend on silly details of how you wrote the source that don't matter when you compile with normal optimization. -O0 vs. -O3 performance is not linearly related; some code will speed up much more than others.
The bottlenecks in -O0 code will often be different from those at -O3: often a loop counter that's kept in memory, creating a ~6-cycle loop-carried dependency chain. This can create interesting effects in the compiler-generated asm like Adding a redundant assignment speeds up code when compiled without optimization (which are interesting from an asm perspective, but not for C).
"My benchmark optimized away otherwise" is not a valid justification for looking at the performance of -O0 code.
See C loop optimization help for final assignment for an example and more details about the rabbit hole that tuning for -O0 is.
Getting interesting compiler output
If you want to see how the compiler adds 2 variables, write a function that takes args and returns a value. Remember you only want to look at the asm, not run it, so you don't need a main or any numeric literal values for anything that should be a runtime variable.
See also How to remove "noise" from GCC/clang assembly output? for more about this.
float foo(float a, float b) {
    float c=a+b;
    return c;
}
compiles with clang -O3 (on the Godbolt compiler explorer) to the expected
addss xmm0, xmm1
ret
But with -O0 it spills the args to stack memory. (Godbolt uses debug info emitted by the compiler to colour-code asm instructions according to which C statement they came from. I've added line breaks to show blocks for each statement, but you can see this with colour highlighting on the Godbolt link above. Often very handy for finding the interesting part of an inner loop in optimized compiler output.)
gcc -fverbose-asm will put comments on every line showing the operand names as C vars. In optimized code that's often an internal tmp name, but in un-optimized code it's usually an actual variable from the C source. I've manually commented the clang output because it doesn't do that.
# clang7.0 -O0 also on Godbolt
foo:
        push    rbp
        mov     rbp, rsp                    # make a traditional stack frame
        movss   DWORD PTR [rbp-20], xmm0    # spill the register args
        movss   DWORD PTR [rbp-24], xmm1    # into the red zone (below RSP)
        movss   xmm0, DWORD PTR [rbp-20]    # a
        addss   xmm0, DWORD PTR [rbp-24]    # +b
        movss   DWORD PTR [rbp-4], xmm0     # store c
        movss   xmm0, DWORD PTR [rbp-4]     # reload c as the return value
        pop     rbp                         # epilogue
        ret
Fun fact: using register float c = a+b;, the return value can stay in XMM0 between statements, instead of being spilled/reloaded. The variable has no address. (I included that version of the function in the Godbolt link.)
The register keyword has no effect in optimized code (except making it an error to take a variable's address, like how const on a local stops you from accidentally modifying something). I don't recommend using it, but it's interesting to see that it does actually affect un-optimized code.
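For reference, the register variant mentioned above looks something like this (reconstructed from the description; the exact version in the Godbolt link may differ). Note that register is valid C but was removed in C++17:
float foo_reg(float a, float b) {
    register float c = a + b;   // c has no address, so even at -O0 it can stay in XMM0
    return c;                   // no spill/reload of c between these two statements
}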
Related:
Complex compiler output for simple constructor - every copy of a variable when passing args typically results in extra copies in the asm.
Why is this C++ wrapper class not being inlined away? __attribute__((always_inline)) can force inlining, but doesn't optimize away the copying to create the function args, let alone optimize the function into the caller.

C/C++. Why could a simple integer addition on a volatile be translated to a different asm instruction on gcc and clang?

I wrote a simple loop:
int volatile value = 0;
void loop(int limit) {
    for (int i = 0; i < limit; ++i) {
        ++value;
    }
}
I compiled this with gcc and clang(-O3 -fno-unroll-loops) and got different outputs. They differ in ++value part:
clang:
add dword ptr [rip + value], 1 # ++value
add edi, -1 # --limit
jne .LBB0_1 # if limit > 0 then continue looping
gcc:
mov eax, DWORD PTR value[rip] # copy value to a register
add edx, 1 # ++i
add eax, 1 # increment a copy of value
mov DWORD PTR value[rip], eax # store incremented copy to value, i. e. ++value
cmp edi, edx # compare i < limit
jne .L3 # if i < limit then continue looping
C and C++ versions are the same on each compiler (https://gcc.godbolt.org/z/5x5jGP)
So my questions are:
1) Is gcc doing something wrong? What is the point of copying the value?
2) I have benchmarked that code and for some reason the profiler shows that in gcc's version 73% of time is wasted on instruction add edx, 1, 13% on mov DWORD PTR value[rip], eax, and 13% on cmp edi, edx. Am I interpreting these results wrong? Why do the other addition and move instructions take less than 1% of the time?
3) Why can performance differ on gcc/clang in such a primitive code?
This is all because you used volatile, and GCC doesn't optimize volatile accesses as aggressively.
Without volatile, e.g. for a single ++*int_ptr, you get a memory-destination add. (And hopefully not inc when tuning for Intel CPUs; inc reg is fine but inc mem costs an extra uop vs. add 1. Unfortunately gcc and clang both get this wrong and use inc mem with -march=skylake: https://godbolt.org/z/_1Ri20)
clang knows that it can fold the volatile read / write accesses into the load and store portions of a memory-destination add.
GCC does not know how to do this optimization for volatile. Using volatile in GCC typically results in separate mov loads and stores, avoiding x86's ability to save code-size by using CISC memory operands for ALU instructions. On a load/store machine (like any RISC) you'd need separate load and store instructions anyway, so it would be a non-issue.
TL:DR: different compiler internals around volatile, specifically a GCC missed-optimization.
This missed optimization barely matters because volatile is rarely used. But feel free to report it on GCC's bugzilla if you want.
Without volatile, the loop would of course optimize away. But you can see a single memory-destination add from GCC or clang for a function that just does ++*p.
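For example, a plain non-volatile increment through a pointer gets the single RMW instruction; the commented asm is typical -O2 output, shown for illustration:
// Both gcc and clang fold the load and store into one memory-destination add,
// something like:   add DWORD PTR [rdi], 1
void inc(int* p) { ++*p; }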
1) Is gcc doing something wrong? What is the point of copying the value?
It's only copying it to a register. We don't normally call this "copying", just bringing it into a register so the CPU can operate on it.
Note that gcc and clang also differ in how they implement the loop condition, with clang optimizing to just dec/jnz (actually add -1, but it would use dec with -march=skylake or something with efficient dec, i.e. not Silvermont).
GCC spends an extra uop on the loop condition (on Intel CPUs where add/jnz can macro-fuse into a single uop). IDK why it compiles it naively like that.
73% of time is wasted on instruction add edx, 1
perf counters typically blame the instruction that's waiting for a slow result, not the instruction that's actually slow to produce it.
add edx,1 is waiting for the reload of value. With 4 to 5 cycle store-forwarding latency, this is the major bottleneck in your loop.
(Whether it's between the multiple uops of a memory-destination add or between separate instructions makes essentially no difference. There are no other memory accesses in your loop, so none of the weird effects where store-forwarding latency can be lower if you don't try to reload too soon come into play: see Adding a redundant assignment speeds up code when compiled without optimization or Loop with function call faster than an empty loop.)
Why other addition and move instructions take less than 1% of the time?
Because out-of-order execution hides them under the latency of the critical path. They are very rarely the instruction that gets blamed when statistical sampling has to pick one out of the many that are in flight at once in any given cycle.
3) Why can performance differ on gcc/clang in such a primitive code?
I'd expect both those loops run at the same speed. Did you just mean performance as in how well the compilers themselves performed in making code that's both fast and compact?