Related
I know that movzx can be used for dependency breaking, but I stumbled on some movzx uses by both Clang and GCC that I really can't see what good they are for. Here's a simple example I tried on Godbolt compiler explorer:
#include <stdint.h>
int add2bytes(uint8_t* a, uint8_t* b) {
return uint8_t(*a + *b);
}
with GCC 12 -O3:
add2bytes(unsigned char*, unsigned char*):
movzx eax, BYTE PTR [rsi]
add al, BYTE PTR [rdi]
movzx eax, al
ret
If I understand correctly, the first movzx here breaks dependency on previous eax value, but what is the second movzx doing? I don't think there's any dependency it can break, and it shouldn't affect the result either.
with clang 14 -O3, it's even more weird:
add2bytes(unsigned char*, unsigned char*): # #add2bytes(unsigned char*, unsigned char*)
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al
ret
It uses mov where movzx seems more reasonable, and then zero extends al to eax, but wouldn't it be much better to do movzx at the start?
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1
GCC generates both sensible and strange movzx, and Clang's use of mov r8 m and movzx just makes no sense to me. I also tried adding -march=skylake to make sure this isn't a feature for really old architectures, but the generated assembly looks more or less the same.
The closest post I have found is https://stackoverflow.com/a/64915219/14730360 where they showed similar movzx uses that seem useless and/or out of place.
Do the compilers really use movzx poorly here, or am I missing something?
Edit: I have opened bug reports for Clang and GCC:
https://github.com/llvm/llvm-project/issues/56498
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277
Temporary workarounds using inline assembly:
https://godbolt.org/z/7qob8G3j7
#define addb(a, b) asm (\
"addb %1, %b0"\
: "+r"(a) : "mi"(b))
int add2bytes(uint8_t* a, uint8_t* b) {
int ret = *a;
addb(ret, *b);
return ret;
}
Now Clang -O3 produces:
add2bytes(unsigned char*, unsigned char*): # #add2bytes(unsigned char*, unsigned char*)
movzx eax, byte ptr [rdi]
add al, byte ptr [rsi]
ret
Both compilers are doing a poor job here, but clang's code is especially bad and has no real upside anywhere. And an easily avoidable downside on everything except Intel CPUs a decade old (which rename low-8 partial registers).
The optimal asm is what you suggest, movzx load, then byte add, leaving a uint8_t result in the low byte, correctly zero-extended to int as required by the C semantics. (Thanks for reporting it upstream: https://github.com/llvm/llvm-project/issues/56498 - I commented there about movzx being a good idea for byte loads in general, even when LLVM doesn't need the result zero-extended.)
A movzx is necessary somewhere, but it can be in the initial load. (A movzx is generally a good idea for a byte load anyway, to avoid a false dependency on the old RAX; clang's choice to save 1 byte is probably not a good one even when it doesn't end up needing a separate movzx right after.)
There are basically three relevant behaviours here, among x86-64 CPUs.
Core 2 / Nehalem (the 64-bit capable members of the P6 family): AL renamed separately from RAX if you write AL. A later read of EAX will stall the front-end for about 3 cycles while inserting a merge uop. Less bad than earlier P6-family, but still a significant penalty to avoid. But these CPUs are pretty obsolete, and not something GCC's -mtune=generic should put much weight on for the latest GCC. (Especially given that current nightly GCC's behaviour now won't be get baked into widely used binary packages for another year or more probably, by most stable-release distros.)
Returning an int when the last instruction wrote al will likely lead to a penalty when the caller reads EAX. But mov al, [rdi] can run without any false dependency or merging cost.
Sandybridge and maybe Ivy Bridge: AL still renamed separately, but a merging uop can be inserted without any stalling, in a cycle with other uops.
mov al, [rdi] still has no false dep or merging uop. But a later read of EAX that triggers a merging uop (to merge the add al result with the high bytes of RAX from movzx eax, [rdi]) will get inserted just as cheaply as if we'd put a movzx eax, al in the machine code. (If the upper bytes of RAX are all zero, merge or extend are equivalent.)
Haswell and later (and maybe IvB), and all other x86 vendors, and low-power CPUs from Intel like Silvermont-family: no partial register renaming at all. (Except for AH/BH/CH/DH on Intel SnB-family). The last CPU not in this category is nearly a decade old, and the last CPU with major penalties (P6-family) is over a decade old.
mov al, [rdi] sucks: false dependency and costs an ALU uop in the back-end to merge. So it's extra load latency in the critical path through whatever stored the memory operand.
Reading EAX after writing AL has zero penalty; that's not a special case at all; the merging happened when you wrote AL.
GCC's code is a sensible tradeoff between Core2 / Nehalem vs. modern CPUs: load with movzx to avoid a false dep writing a partial reg. And a final movzx to avoid a partial-register stall in the caller.
But if it's going to do that, it could hurt modern Intel less by picking EDX or ECX as the temporary, since Intel can do zero-latency mov-elimination on movzx r32, r8, but not within the same register. It still costs a front-end uop so it's not free for throughput, only latency and back-end ports. This is a persistent missed-optimization; I don't think GCC or clang know to look for that; they commonly zero-extend 32->64 with mov esi,esi on a function arg, for example.
movzx edx, byte ptr [rdi]
add dl, [rsi]
movzx eax, dl # mov-elimination possible on IvB and later (except Ice Lake with updated microcode which breaks mov-elim).
If optimizing specifically for Core2 / Nehalem, you'd do this:
xor eax, eax # off the critical path, avoids partial-reg stalls for later reads of EAX after writing AL
mov al, [rdi]
add al, [rsi]
That's not bad on later CPUs, although the mov al, [rdi] would still be a micro-fuse load+ALU uop so it has extra load latency, and takes an extra slot in the scheduler and a cycle on a back-end execution port. So 3 back-end uops, up from 2 in IvB and later with eliminated movzx if you pick different registers.
GCC's choice to use movzx because of Core2/Nehalem is highly conservative at this point; probably -mtune=generic in GCC12 shouldn't care about P6-family partial-register stalls since those CPUs are well over a decade old. Especially in 64-bit code where the worst case is Core2/Nehalem, not the even longer stalls with no merging uop on earlier P6-family. (And 64-bit code is more likely to be run on newer CPUs; one of the use-cases for -m32 is to make code for old 32-bit-only CPUs.)
It might well be an intentional tuning choice that needs updating. It's definitely a missed optimization with -march / -mtune= k8 through znver3, or silvermont-family, or sandybridge or newer.
(Also note that some choices which should differ based on -mtune setting actually don't. GCC just has one way it always does some things, and adding hooks to make it differ based on a tuning flag hasn't been done. Clang is the same way. e.g. -mtune=core2 still doesn't know to avoid partial-register stalls!)
Clang normally lives dangerously writing partial registers and otherwise ignoring false dependencies when they're not visibly loop-carried within a single function (which can bite it in the ass). This can save a whole instruction when it skips xor-zeroing, but saving just 1 byte doesn't seem worth it in general. It's a false dependency and means the mov load decodes to load + ALU merge uops (to merge a new low byte into the existing 64-bit register).
Looks like clang just did its usual thing of loading 8-bit values into 8-bit registers ignoring movzx, then realized at the end it needed to zero-extend the result.
An optimization pass looking for a chance to fold zero-extension (after narrow math) into an earlier load would be useful. And/or otherwise look for ways to prove that values are already zero-extended, if it doesn't do that.
Probably in general better to start doing narrow loads with movzx so that's more normally the case.
You might want to report a missed-optimization bug, especially for clang. Their code-gen is already a huge middle finger toward P6-family most of the time with partial-register usage, so they'd probably be interested in trying to generate the 2-instruction version. https://github.com/llvm/llvm-project/issues
Also https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (use the keyword missed-optimization for GCC bugs. Feel free to link this stack overflow post, and/or quote any of my comments if you want, as well as a Godbolt link. GCC devs prefer AT&T syntax for x86 discussion / bugs.)
See also:
Why doesn't GCC use partial registers?
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
https://agner.org/optimize/ (especially his microarch guide re: partial-register details for P6-family CPUs. Last I looked, the guide incorrectly said Haswell doesn't have zero-latency movzx eax, dl, and that AH-merging was free; see my Q&A about HSW/SKL. But Agner's guide is accurate AFAIK for earlier CPUs.)
https://uops.info/ (front-end vs. back-end vs. latency costs for different instructions)
What is the best way to set a register to zero in x86 assembly: xor, mov or and? - including the part about avoiding partial-register stalls on P6, how xor eax,eax sets some kind of internal EAX=AL flag.
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1 GCC generates both sensible and strange movzx, and Clang's use of mov and movzx just makes no sense to me.
clang's mov ecx, edx zero-extension from 32 to 64 instead of from 8 to 64 is because it depends on an unofficial extension to the x86-64 SysV calling convention, that narrow args are extended to 32-bit. AMD Zen CPUs can do mov-elimination on mov ecx, edx but not for movzx-byte, so this is actually more efficient, as well as saving code-size.
(GCC and clang both make callers that respect this unofficial calling-convention feature, but only clang makes callees that depend on it. ICC doesn't do either so is not ABI-compatible with clang.)
Extension to intptr_t is of course necessary for all narrower args if you're going to index an array with one. (In abstract C terms, this is just part of using the value for pointer math). High garbage is allowed in at least the high 32 bits of the 64-bit register.
The clang bit actually seems reasonable. You get a partial register stall if you write to al and then read from eax. Using movzx breaks this partial register stall.
The initial mov to al has no dependencies on existing values of eax (due to register renaming), so the dependencies are just the unavoidable dependencies (wait for [rsi], wait for [rdi], wait for add to complete before zero-extending).
In other words, the top 24 bits must be zeroed and the lower 8 bits must be calculated, but the two actions can be done in either order. clang just chooses to add first, zero later.
[EDIT]
As for GCC, it seems a particularly bad choice. If it had chosen bl as the temporary register, that last movzx would be zero-latency on Haswell/SkyLake, but move elimination does not work on al to eax.
The final MOVZX is mandated by the fact that the function returns an int, extended from a byte. It must be there in the clang version, but with gcc one is extra.
This experiment was done using GCC 6.3. There are two functions where the only difference is in the order we assign the i32 and i16 in the struct. We assumed that both functions should produce the same assembly. However this is not the case. The "bad" function produces more instructions. Can anyone explain why this happens?
#include <inttypes.h>
union pack {
struct {
int32_t i32;
int16_t i16;
};
void *ptr;
};
static_assert(sizeof(pack)==8, "what?");
void *bad(const int32_t i32, const int16_t i16) {
pack p;
p.i32 = i32;
p.i16 = i16;
return p.ptr;
}
void *good(const int32_t i32, const int16_t i16) {
pack p;
p.i16 = i16;
p.i32 = i32;
return p.ptr;
}
...
bad(int, short):
movzx eax, si
sal rax, 32
mov rsi, rax
mov eax, edi
or rax, rsi
ret
good(int, short):
movzx eax, si
mov edi, edi
sal rax, 32
or rax, rdi
ret
The compiler flags were -O3 -fno-rtti -std=c++14
This is/was a missed optimization in GCC10.2 and earlier. It seems to already be fixed in current nightly builds of GCC, so no need to report a missed-optimization bug on GCC's bugzilla. (https://gcc.gnu.org/bugzilla/). It looks like it first appeared as a regression from GCC4.8 to GCC4.9. (Godbolt)
# GCC11-dev nightly build
# actually *better* than "good", avoiding a mov-elimination missed opt.
bad(int, short):
movzx esi, si # mov-elimination never works for 16->32 movzx
mov eax, edi # mov-elimination works between different regs
sal rsi, 32
or rax, rsi
ret
Yes, you'd generally expect C++ that implements the same logic basically the same way to compile to the same asm, as long as optimization is enabled, or at least hope so1. And generally you can hope that there are no pointless missed optimizations that waste instructions for no apparent reason (rather than simply picking a different implementation strategy), but unfortunately that's not always true either.
Writing different parts of the same object and then reading the whole object is tricky for compilers in general so it's not a shock to see different asm when you write different parts of the full object in a different order.
Note that there's nothing "smart" about the bad asm, it's just doing a redundant mov instruction. Having to take input in fixed registers and produce output in another specific hard register to satisfy the calling convention is something GCC's register allocator isn't amazing at: wasted mov missed optimizations like this are more common in tiny functions than when part of a larger function.
If you're really curious, you could dig into the GIMPLE and RTL internal representations that GCC transformed through to get here. (Godbolt has a GCC tree-dump pane to help with this.)
Footnote 1: Or at least hope that, but missed-optimization bugs do happen in real life. Report them when you spot them, in case it's something that GCC or LLVM devs can easily teach the optimizer to avoid. Compilers are complex pieces of machinery with multiple passes; often a corner case for one part of the optimizer just didn't used to happen until some other optimization pass changed to doing something else, exposing a poor end result for a case the author of that code wasn't thinking about when writing / tweaking it to improve some other case.
Note that there's no Undefined Behaviour here despite the complaints in comments: The GNU dialect of C and C++ defines the behaviour of union type-punning in C89 and C++, not just in C99 and later like ISO C does. Implementations are free to define the behaviour of anything that ISO C++ leaves undefined.
Well technically there is a read-uninitialized because the upper 2 bytes of the void* object haven't been written yet in pack p. But fixing it with pack p = {.ptr=0}; doesn't help. (And doesn't change the asm; GCC happened to already zero the padding because that's convenient).
Also note, both versions in the question are less efficient than possible:
(The bad output from GCC4.8 or GCC11-trunk avoiding the wasted mov looks optimal for that choice of strategy.)
mov edi,edi defeats mov-elimination on both Intel and AMD, so that instruction has 1 cycle latency instead of 0, and costs a back-end µop. Picking a different register to zero-extend into would be cheaper. We could even pick RSI after reading SI, but any call-clobbered register would work.
hand_written:
movzx eax, si # 16->32 can't be eliminated, only 8->32 and 32->32 mov
shl rax, 32
mov ecx, edi # zero-extend into a different reg with 0 latency
or rax, rcx
ret
Or if optimizing for code-size or throughput on Intel (low µop count, not low latency), shld is an option: 1 µop / 3c latency on Intel, but 6 µops on Zen (also 3c latency, though). (https://uops.info/ and https://agner.org/optimize/)
minimal_uops_worse_latency: # also more uops on AMD.
movzx eax, si
shl rdi, 32 # int32 bits to the top of RDI
shld rax, rdi, 32 # shift the high 32 bits of RDI into RAX.
ret
If your struct was ordered the other way, with the padding in the middle, you could do something involving mov ax, si to merge into RAX. That could be efficient on non-Intel, and on Haswell and later which don't do partial-register renaming except for high-8 regs like AH.
Given the read-uninitialized UB, you could just compile it to literally anything, including ret or ud2. Or slightly less aggressive, you could compile it to just leave garbage for the padding part of the struct, the last 2 bytes.
high_garbage:
shl rsi, 32 # leaving high garbage = incoming high half of ESI
mov eax, edi # zero-extend into RAX
or rax, rsi
ret
Note that an unofficial extension to the x86-64 System V ABI (which clang actually depends on) is that narrow args are sign- or zero-extended to 32 bits. So instead of zeros, the high 2 bytes of the pointer would be copies of the sign bit. (Which would actually guarantee that it's a canonical 48-bit virtual address on x86-64!)
I have the following code:
char swap(char reg, char* mem) {
std::swap(reg, *mem);
return reg;
}
I expected this to compile down to:
swap(char, char*):
xchg dil, byte ptr [rsi]
mov al, dil
ret
But what it actually compiles to is (at -O3 -march=haswell -std=c++20):
swap(char, char*):
mov al, byte ptr [rsi]
mov byte ptr [rsi], dil
ret
See here for a live demo.
From the documentation of xchg, the first form should be perfectly possible:
XCHG - Exchange Register/Memory with Register
Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose registers or a register and a memory location.
So is there any particular reason why it's not possible for the compiler to use xchg here? I have tried other examples too, such as swapping pointers, swapping three operands, swapping types other than char but I never get an xchg in the compile output. How come?
TL:DR: because compilers optimize for speed, not for names that sound similar. There are lots of other terrible ways they also could have implemented it, but chose not to.
xchg with mem has an implicit lock prefix (on 386 and later) so it's horribly slow. You always want to avoid it unless you need an atomic exchange, or are optimizing completely for code-size without caring at all for performance, in cases where you do want the result in the same register as the original value. Sometimes seen in naive (performance oblivious) or code-golfed hand-written Bubble Sort as part of swapping 2 memory locations.
Possibly clang -Oz could go that crazy, IDK, but hopefully wouldn't in this case because your xchg way is larger code size, needing a REX prefix on both instructions to access DIL, vs. the 2-mov way being a 2-byte and a 3-byte instruction. clang -Oz does do stuff like push 1 / pop rax instead of mov eax, 1 to save 2 bytes of code size.
GCC -Os won't use xchg for swaps that don't need to be atomic because -Os still cares some about speed.
Also, IDK why would you think xchg + dependent mov would be faster or a better choice than two independent mov instructions that can run in parallel. (The store buffer makes sure that the store is correctly ordered after the load, regardless of which uop finds its execution port free first).
See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
Seriously, I just don't see any plausible reason why you'd think a compiler might want to use xchg, especially given that the calling convention doesn't pass an arg in RAX so you still need 2 instructions. Even for registers, xchg reg,reg on Intel CPUs is 3 uops, and they're microcode uops that can't benefit from mov-elimination. (Some AMD CPUs have 2-uop xchg reg,reg. Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?)
I also guess you're looking at clang output; GCC will avoid partial register shenanigans (like false dependencies) by using a movzx eax, byte ptr [rsi] load even though the return value is only the low byte. Zero-extending loads are cheaper than merging into the old value of RAX. So that's another downside to xchg.
So is there any particular reason why it's not possible for the compiler to use xchg here?
Because mov is faster than xchg and compilers optimize for speed.
See:
Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?
Why does GCC use mov/mfence instead of xchg to implement C11's atomic_store?
Use xchg for -Os
Bug 47949 - Missed optimization for -Os using xchg instead of mov
volatile int num = 0;
num = num + 10;
The above C++ Code seems to produce following code in intel assembly:
mov DWORD PTR [rbp-4], 0
mov eax, DWORD PTR [rbp-4]
add eax, 10
mov DWORD PTR [rbp-4], eax
If I change C++ code to
volatile int num = 0;
num = num + 0;
why will not compiler produce assembly code as:
mov DWORD PTR [rbp-4], 0
mov eax, DWORD PTR [rbp-4]
add eax, 0
mov DWORD PTR [rbp-4], eax
gcc7.2 -O0 leaves out the add eax, 0, but all the other instructions are the same (Godbolt).
At which part of compilation process does this kind trivial code gets removed. Is there any compiler flag which will make GCC compiler to not do these kind of optimizations.
clang will emit add eax, 0 at -O0, but none of gcc, ICC, nor MSVC will. See below.
gcc -O0 doesn't mean "no optimization". gcc doesn't have a "braindead literal translation" mode where it tries to transliterate every component of every C expression directly to an asm instruction.
GCC's -O0 is not intended to be totally un-optimized. It's intended to be "compile-fast" and make debugging give the expected results (even if you modify C variables with a debugger, or jump to a different line within the function). So it spills / reloads everything around every C statement, assuming that memory can be asynchronously modified by a debugger stopped before such a block. (Interesting example of the consequences, and a more detailed explanation: Why does integer division by -1 (negative one) result in FPE?)
There isn't much demand for gcc -O0 to make even slower code (e.g. forgetting that 0 is the additive identity), so nobody has implemented an option for that. And it might even make gcc slower if that behaviour was optional. (Or maybe there is such an option but it's on by default even at -O0, because it's fast, doesn't hurt debugging, and useful. Usually people like it when their debug builds run fast enough to be usable, especially for big or real-time projects.)
As #Basile Starynkevitch explains in Disable all optimization options in GCC, gcc always transforms through its internal representations on the way to making an executable. Just doing this at all results in some kinds of optimizations.
For example, even at -O0, gcc's "divide by a constant" algorithm uses a fixed-point multiplicative inverse or a shift (for powers of 2) instead of an idiv instruction. But clang -O0 will use idiv for x /= 2.
Clang's -O0 optimizes less than gcc's in this case, too:
void foo(void) {
volatile int num = 0;
num = num + 0;
}
asm output on Godbolt for x86-64
push rbp
mov rbp, rsp
# your asm block from the question, but with 0 instead of 10
mov dword ptr [rbp - 4], 0
mov eax, dword ptr [rbp - 4]
add eax, 0
mov dword ptr [rbp - 4], eax
pop rbp
ret
As you say, gcc leaves out the useless add eax,0. ICC17 stores/reloads multiple times. MSVC is usually extremely literal in debug mode, but even it avoids emitting add eax,0.
Clang is also the only one of the 4 x86 compilers on Godbolt that will use idiv for return x/2;. The others all SAR + CMOV or whatever to implement C's signed division semantics.
As per the "as if" rule in C++, an implementation is freely allowed to do whatever it wants, provided that the observable behaviour matches the standard. Specifically, in C++17, 4.6/1 (as one example):
... conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
This provision is sometimes called the "as-if" rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program.
For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.
As to how to control gcc, my first suggestion would be to turn off all optimisation by using the -O0 flag. You can get more fine-tuned control by using various -f<blah> options but -O0 should be a good start.
I need to compile some assembly code in both Visual Studio and an IDE using G++ 4.6.1. The -masm=intel flag works as long as I do not reference and address any variables, which however I need to do.
I considered using intrinsics, but the compiled assembly is not optimal at all (for instance I cannot define the sse-register to be used and thus no pipe optimization is possible).
Consider these parts of code (inte style assembly):
mov ecx, dword ptr [p_pXcoords]
mov edx, dword ptr [p_pYcoords]
movhpd xmm6, qword ptr [oAvgX]
movhpd xmm7, qword ptr [oAvgY]
movlpd xmm6, qword ptr [oAvgX]
movlpd xmm7, qword ptr [oAvgY]
where p_pXcoords and p_pYcoords are doublde* arrays and function parameters, oAvgX and oAvgY simpled double values.
Another line of code is this, which is in the middle of an assembly block:
movhpd xmm6, qword ptr [oAvgY]
in other words, I need to access variables and use them within specific sse registers in the middle of the code. How can I do this with AT & T syntax, best: can I do this with a g++ compiler using the -masm flag?
Is there any way at all using one assembly code for both VS and a g++ 4.6.1 based compiler
You can certainly tell GCC which SSE register to use for each variable:
register __m128i x asm("xmm6");
But I guess VS does not support that. (I am also a little surprised you need it for decent performance. Register assignment and instruction scheduling are two of the most basic things an optimizing compiler knows. You sure you enabled optimization :-) ?)
I would probably just write two functions, one using intrinsics, and one using asm for whichever compiler does not know how to schedule instructions properly.