How to retrieve an instruction's disassembly from a MachineInstr? - llvm

I need to debug a MachineFunctionPass I'm developing. I'm targeting the x86 architecture.
How do I retrieve the target disassembly from a MachineInstr instance?
Example MachineInstr representation
dead renamable $eax = MOV32rm $ebp, 1, $noreg, 12, $noreg :: (load 4 from %fixed-stack.1)
Expected disassembly (Intel syntax)
mov eax, DWORD PTR [ebp+0xc]

It depends. In general, no: some things are not finalized at the MI level (for example, it can contain virtual registers before register allocation, or frame-index/stack references like in your example before stack-slot allocation), so there is no final machine encoding to disassemble yet.
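What you can do from inside your MachineFunctionPass is print each instruction's MIR form, which is exactly the textual representation in your example, and then compare it by eye against the final assembly that llc emits (pass -x86-asm-syntax=intel for Intel syntax, or -print-after-all to dump the MIR after every pass). A minimal sketch, assuming a standard pass setup (the helper name is hypothetical; MachineInstr::print() and errs() are ordinary LLVM APIs):

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Hypothetical helper, called from runOnMachineFunction() of the pass being debugged.
static void dumpMachineInstrs(MachineFunction &MF) {
  for (MachineBasicBlock &MBB : MF) {
    for (MachineInstr &MI : MBB) {
      // Prints the MIR form of the instruction, e.g.
      //   dead renamable $eax = MOV32rm $ebp, 1, $noreg, 12, $noreg :: (load 4 ...)
      MI.print(errs());
    }
  }
}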

Related

VS2022 MASM giving error "'ADDR32' relocation to 'lut' invalid without /LARGEADDRESSAWARE:NO" [duplicate]

I am running this code on my Mac, using the command:
nasm -f macho64 -o max.a maximum.asm
This is the code I am attempting to run; it finds the largest number in an array.
section .data
data_items:
dd 3,67,34,222,45,75,54,34,44,33,22,11,66,0
section .text
global _start
_start:
mov edi, 0
mov eax, [data_items + edi*4]
mov ebx, eax
start_loop:
cmp eax, 0
je loop_exit
inc edi
mov eax, [data_items + edi*4]
cmp eax, ebx
jle start_loop
mov ebx, eax
jmp start_loop
loop_exit:
mov eax, 1
int 0x80
Error:
maximum.asm:14: error: Mach-O 64-bit format does not support 32-bit absolute addresses
maximum.asm:21: error: Mach-O 64-bit format does not support 32-bit absolute addresses
First of all, beware of NASM bugs with the macho64 output format with 64-bit absolute addressing (NASM 2.13.02+) and with RIP-relative in NASM 2.11.08. 64-bit absolute addressing is not recommended, so this answer should work even for buggy NASM 2.13.02 and higher. (The bugs don't cause this error, they lead to wrong addresses being used at runtime.)
[data_items + edi*4] is a 32-bit addressing mode. Even [data_items + rdi*4] can only use a 32-bit absolute displacement, so it wouldn't work either. Note that using an address as a 32-bit (sign-extended) immediate like cmp rdi, data_items is also a problem: only mov allows a 64-bit immediate.
64-bit code on OS X can't use 32-bit absolute addressing at all. Executables are loaded at a base address above 4GiB, so label addresses just plain don't fit in 32-bit integers, with zero- or sign-extension. RIP-relative addressing is the best / most efficient solution, whether you need it to be position-independent or not (see footnote 1).
In NASM, default rel at the top of your file will make all [] memory operands prefer RIP-relative addressing. See also Section 3.3 Effective Addresses in the NASM manual.
default rel ; near the top of file; affects all instructions
my_func:
...
mov ecx, [data_items] ; uses the default: RIP-relative
;mov ecx, [abs data_items] ; override to absolute [disp32], unusable
mov ecx, [rel data_items] ; explicitly RIP-relative
But RIP-relative is only possible when there are no other registers involved, so for indexing a static array you need to get the address in a register first. Use a RIP-relative lea rsi, [rel data_items].
lea rsi, [data_items] ; can be outside the loop
...
mov eax, [rsi + rdi*4]
Or you could add rsi, 4 inside the loop and use a simpler addressing mode like mov eax, [rsi].
Note that mov rsi, data_items will work for getting an address into a register, but you don't want that because it's less efficient.
Technically, any address within ±2GiB of your array will work, so if you have multiple arrays you can address the others relative to one common base address, only tying up one register with a pointer. e.g. lea rbx, [arr1] / ... / mov eax, [rbx + rdi*4 + arr2-arr1]. The question "Relative Addressing errors - Mac 10.10" mentions that Agner Fog's "Optimizing Assembly" guide has some examples of array addressing, including one using __mh_execute_header as a reference point. (The code in that question looks like another attempt to port this 32-bit Linux example from the PGU book to 64-bit OS X, at the same time as learning asm in the first place.)
Note that on Linux, position-dependent executables are loaded in the low 32 bits of virtual address space, so you will see code like mov eax, [array + rdi*4] or mov edi, symbol_name in Linux examples or compiler output on http://gcc.godbolt.org/. gcc -pie -fPIE will make position-independent executables on Linux, and is the default on many recent distros, but not Godbolt.
This doesn't help you on MacOS, but I mention it in case anyone's confused about code they've seen for other OSes, or why AMD64 architects bothered to allow [disp32] addressing modes at all on x86-64.
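For comparison, here is a minimal, hypothetical example you can feed to Godbolt to see that form: built with gcc -O2 as a position-dependent executable (no -fPIE) for x86-64 Linux, the global is typically indexed with a 32-bit absolute displacement, something like mov eax, DWORD PTR table[0+rdi*4], which is exactly the addressing mode macho64 rejects.

int table[16];          // global array with a static address

int lookup(long i) {    // indexed load from the global
    return table[i];
}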
And BTW, prefer using 64-bit addressing modes in 64-bit code. e.g. use [rsi + rdi*4], not [esi + edi*4]. You usually don't want to truncate pointers to 32-bit, and it costs an extra address-size prefix to encode.
Similarly, you should be using syscall to make 64-bit system calls, not int 0x80. See "What are the calling conventions for UNIX & Linux system calls on i386 and x86-64" for the differences in which registers args are passed in.
Footnote 1:
64-bit absolute addressing is supported on OS X, but only in position-dependent executables (non-PIE). The related question "x64 nasm: pushing memory addresses onto the stack & call function" includes an ld warning from using gcc main.o to link:
ld: warning: PIE disabled. Absolute addressing (perhaps -mdynamic-no-pic) not
allowed in code signed PIE, but used in _main from main.o. To fix this warning,
don't compile with -mdynamic-no-pic or link with -Wl,-no_pie
So the linker checks if any 64-bit absolute relocations are used, and if so disables creation of a Position-Independent Executable. A PIE can benefit from ASLR for security. I think shared-library code always has to be position-independent on OS X; I don't know if jump tables or other cases of pointers-as-data are allowed (i.e. fixed up by the dynamic linker), or if they need to be initialized at runtime if you aren't making a position-dependent executable.
mov r64, imm64 is larger (10 bytes) and not faster than lea r64, [RIP_rel32] (7 bytes).
So while you could use mov rsi, qword data_items, the RIP-relative LEA runs about as fast and takes less space in code caches and the uop cache. 64-bit immediates also have a uop-cache fetch penalty on Sandybridge-family (http://agner.org/optimize/): they take 2 cycles to read from a uop cache line instead of 1.
x86 also has a form of mov that loads/stores from/to a 64-bit absolute address, but only for AL/AX/EAX/RAX. See http://felixcloutier.com/x86/MOV.html. You don't want this either, because it's larger and not faster than mov eax, [rel foo].
(Related: an AT&T syntax version of the same question)

How does the following C++ function pass parameters to the following ARM assembly function? [duplicate]

It's been a while since I last coded arm assembler and I'm a little rusty on the details. If I call a C function from arm, I only have to worry about saving r0-r3 and lr, right?
If the C function uses any other registers, is it responsible for saving those on the stack and restoring them? In other words, the compiler would generate code to do this for C functions.
For example if I use r10 in an assembler function, I don't have to push its value on the stack, or to memory, and pop/restore it after a C call, do I?
This is for arm-eabi-gcc 4.3.0.
It depends on the ABI for the platform you are compiling for. On Linux, there are two ARM ABIs: the old one and the new one. AFAIK, the new one (EABI) is in fact ARM's AAPCS. The complete EABI definitions are published on ARM's Infocenter.
From the AAPCS, §5.1.1:
r0-r3 are the argument and scratch registers; r0-r1 are also the result registers
r4-r8 are callee-save registers
r9 might be a callee-save register or not (on some variants of AAPCS it is a special register)
r10-r11 are callee-save registers
r12-r15 are special registers
A callee-save register must be saved by the callee (in opposition to a caller-save register, where the caller saves the register); so, if this is the ABI you are using, you do not have to save r10 before calling another function (the other function is responsible for saving it).
Edit: Which compiler you are using makes no difference; gcc in particular can be configured for several different ABIs, and it can even be changed on the command line. Looking at the prologue/epilogue code it generates is not that useful, since it is tailored for each function and the compiler can use other ways of saving a register (for instance, saving it in the middle of a function).
Terminology: "callee-saved" is a synonym for "non-volatile" or "call-preserved"; see What are callee and caller saved registers?
When making a function call, you can assume that the values in r4-r11 (except maybe r9) are still there after (call-preserved), but not for r0-r3 (call-clobbered / volatile).
32-bit ARM calling conventions are specified by the AAPCS. From the AAPCS, §5.1.1 Core registers:
r0-r3 are the argument and scratch registers; r0-r1 are also the result registers
r4-r8 are callee-save registers
r9 might be a callee-save register or not (on some variants of AAPCS it is a special register)
r10-r11 are callee-save registers
r12-r15 are special registers
From the AAPCS, §5.1.2.1 VFP register usage conventions:
s16–s31 (d8–d15, q4–q7) must be preserved
s0–s15 (d0–d7, q0–q3) and d16–d31 (q8–q15) do not need to be preserved
Original post: arm-to-c-calling-convention-neon-registers-to-save
64-bit ARM calling conventions are specified by the AAPCS64. Its General-purpose Registers section specifies which registers must be preserved:
r0-r7 are parameter/result registers
r9-r15 are temporary registers
r19-r28 are callee-saved registers.
All others (r8, r16-r18, r29, r30, SP) have special meaning and some might be treated as temporary registers.
The SIMD and Floating-Point Registers section specifies the NEON and floating-point registers.
For 64-bit ARM, A64 (from the Procedure Call Standard for the ARM 64-bit Architecture):
There are thirty-one, 64-bit, general-purpose (integer) registers visible to the A64 instruction set; these are labeled r0-r30. In a 64-bit context these registers are normally referred to using the names x0-x30; in a 32-bit context the registers are specified by using w0-w30. Additionally, a stack-pointer register, SP, can be used with a restricted number of instructions.
SP The Stack Pointer
r30 LR The Link Register
r29 FP The Frame Pointer
r19…r28 Callee-saved registers
r18 The Platform Register, if needed; otherwise a temporary register.
r17 IP1 The second intra-procedure-call temporary register (can be used by call veneers and PLT code); at other times may be used as a temporary register.
r16 IP0 The first intra-procedure-call scratch register (can be used by call veneers and PLT code); at other times may be used as a temporary register.
r9…r15 Temporary registers
r8 Indirect result location register
r0…r7 Parameter/result registers
The first eight registers, r0-r7, are used to pass argument values into a subroutine and to return result values from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).
Registers r16 (IP0) and r17 (IP1) may be used by a linker as a scratch register between a routine and any subroutine it calls. They can also be used within a routine to hold intermediate values between subroutine calls.
The role of register r18 is platform specific. If a platform ABI has need of a dedicated general purpose register to carry inter-procedural state (for example, the thread context) then it should use this register for that purpose. If the platform ABI has no such requirements, then it should use r18 as an additional temporary register. The platform ABI specification must document the usage for this register.
SIMD
The ARM 64-bit architecture also has a further thirty-two registers, v0-v31, which can be used by SIMD and Floating-Point operations. The precise name of the register will change indicating the size of the access.
Note: Unlike in AArch32, in AArch64 the 128-bit and 64-bit views of a SIMD and Floating-Point register do not overlap multiple registers in a narrower view, so q1, d1 and s1 all refer to the same entry in the register bank.
The first eight registers, v0-v7, are used to pass argument values into a subroutine and to return result values from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).
Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64-bits of each value stored in v8-v15 need to be preserved; it is the responsibility of the caller to preserve larger values.
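A quick, hypothetical way to see this rule in action with GCC or Clang targeting AArch64 (in the same spirit as the clobber-list trick shown in a later answer): clobbering v8 in inline assembly forces the compiler to save and restore d8 (only the low 64 bits), while clobbering the call-clobbered v0 produces no save/restore code at all.

// Hypothetical probe for AArch64 (GCC/Clang): compile with -O2 -S and compare the output.
void touch_v8() { asm volatile("" ::: "v8"); }   // compiler must preserve d8 around this
void touch_v0() { asm volatile("" ::: "v0"); }   // v0 is call-clobbered: nothing to save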
The answers by CesarB and Pavel provided quotes from the AAPCS, but open issues remain. Does the callee save r9? What about r12? What about r14? Furthermore, the answers were very general, and not specific to the arm-eabi toolchain as requested. Here's a practical approach to find out which registers are callee-saved and which are not.
The following C code contains an inline assembly block that claims to clobber registers r0-r12 and r14. The compiler will generate the code to save the registers required by the ABI.
void foo() {
asm volatile ( "nop" : : : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12", "r14");
}
Use the command line arm-eabi-gcc-4.7 -O2 -S -o - foo.c
and add the switches for your platform (such as -mcpu=arm7tdmi for example).
The command will print the generated assembly code on STDOUT. It may look something like this:
foo:
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
nop
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
Note that the compiler-generated code saves and restores r4-r11. The compiler does not save r0-r3 or r12. That it restores r14 (alias lr) is purely accidental: I know from experience that the exit code may also load the saved lr into r0 and then do a "bx r0" instead of "bx lr". By adding -mcpu=arm7tdmi -mno-thumb-interwork, or by using -mcpu=cortex-m4 -mthumb, we obtain slightly different assembly code that looks like this:
foo:
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
nop
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}
Again, r4-r11 are saved and restored. But r14 (alias lr) is not restored.
To summarize:
r0-r3 are not callee-saved
r4-r11 are callee-saved
r12 (alias ip) is not callee-saved
r13 (alias sp) is callee-saved
r14 (alias lr) is not callee-saved
r15 (alias pc) is the program counter; on return it is set to the return address that lr received at the function call
This holds at least for arm-eabi-gcc's defaults. There are command line switches (in particular the -mabi switch) that may influence the results.
ARM's aapcs32 and aapcs64 documents summarize all of this in a register-usage table (viewable online).
There is also a difference, at least on the Cortex-M3 architecture, between a function call and an interrupt.
If an interrupt occurs, the hardware automatically pushes R0-R3, R12, LR, and PC onto the stack, and automatically pops them when returning from the IRQ. If you use other registers in the IRQ routine, you have to push/pop them onto the stack manually.
This automatic push and pop does not happen for a function call (a branch instruction). Since the convention says R0-R3 may only be used as argument, result, or scratch registers, there is no need to preserve them across a function call, because no value in them is expected to survive the call. But, just as with an interrupt, you have to save any other CPU registers you use in your function.

Why does assembly code differ depending on the disassembler I use?

I am teaching myself to debug assembly language; I am new to assembly. I have a very simple C++ program and I disassembled it 3 times using different disassemblers: GDB, otool, and godbolt.org. GDB and godbolt.org produced approximately the same amount of code (1 page in a word processor), though many lines differ. The otool -tv command produced about 14 pages of code so there are many differences with respect to the GDB and godbolt.org outputs. The assembly code is too long to post. I was expecting the assembly code outputs to be the same as each other. Why are they different and which disassembler is best?
Here is my C++ program:
#include <iostream>
int main () {
int a = 1;
int b = 2;
int c = 3;
a += b;
a = a + c;
std::cout << "Value of A is " << a << std::endl;
return 0;
}
An example of assembly differences:
GDB:
0x0000000100000f44 <+4>: sub $0x30,%rsp
0x0000000100000f48 <+8>: mov 0x10c1(%rip),%rdi # 0x100002010
0x0000000100000f4f <+15>: lea 0xfb6(%rip),%rsi
Godbolt.org:
sub rsp, 16
mov DWORD PTR [rbp-4], 1
mov DWORD PTR [rbp-8], 2
Otool -tv gave 13 more pages of code than the others so there is an obvious difference there.
The differences you are experiencing are not in the disassembled program, but rather in the syntax used to represent machine instructions.
Assembly is a very low-level language, in which there is a 1-to-1 mapping between machine instructions and mnemonics. The former are sequences of bits, possibly of variable length---as in the case of x86 architectures. This representation is directly interpreted by the CPU to carry out the work associated with the semantic of the instruction. Assembly language is a "human readable" representation of such sequences.
Basically, you can find any way to represent the same machine instruction. This is the assembly syntax.
Notoriously, for x86 architectures there exist two different syntaxes: AT&T and Intel. The output which you obtained from GDB is generated according to the AT&T syntax, while the output you got from Godbolt.org is in Intel syntax.
Intel and AT&T syntax are very different from each other in appearance, and possibly this is why you have been thinking that the outcome is not the same. Actually, it's just a different way to represent the very same instructions.
These two "dialects" for the same architecture's assembly were born with different goals in mind. AT&T syntax was developed at AT&T labs to support the generation of programs for many different CPUs (see the book: Jeff Duntermann, Assembly Language Step-by-Step). At the time, AT&T was playing a major role in the history of computers. AT&T (Bell Labs) has been the source of Unix---its paradigm is currently (although partially) committed to by Linux---the C programming language, and many other fundamental tools that we continue to use today.
On the other hand, Intel syntax has been developed, well... by Intel for their own CPUs. Many adopters of the Intel syntax say that it is much neater when programming on Intel CPUs. This might well be the case, as the syntax has been carefully crafted exactly for what the CPU supports.
While the AT&T syntax is, to the best of my knowledge, no longer used these days to write programs for CPUs other than x86, some of its quirks come from it being designed to be more "general".
Then, which one to learn? My choice would be driven by the environment you work in. The whole Unix ecosystem (comprising Linux and macOS) has a toolchain (such as gas) which directly uses the AT&T syntax. In the Linux kernel (and other low-level pieces of software) you will definitely find inlined assembly code in AT&T syntax to interact with the hardware. Windows toolchains (such as MASM, or the popular NASM assembler) speak the Intel syntax. While command-line flags can ask these tools to switch to the other syntax (such as the -M flag for objdump), the habit is to adopt the "native" syntax.
With respect to the specific examples given in the question, they are "incompatible", in the sense that they refer to different portions of the disassembled code, so there is a higher degree of difference across the two.
Indeed, with respect to this GDB output:
sub $0x30, %rsp
mov 0x10c1(%rip), %rdi
lea 0xfb6(%rip), %rsi
the corresponding Intel disassembly would be:
sub rsp, 0x30
mov rdi, QWORD PTR [rip+0x10c1]
lea rsi, [rip+0xfb6]
On the other hand, with respect to the Godbolt.org output:
sub rsp, 16
mov DWORD PTR [rbp-4], 1
mov DWORD PTR [rbp-8], 2
the corresponding AT&T disassembly would be:
sub $0x10,%rsp
movl $0x1,-0x4(%rbp)
movl $0x2,-0x8(%rbp)
As you can see, the greatest difference, which might cause a lot of headaches, is related to the fact that the AT&T syntax places the source first and then the destination, while Intel syntax works the other way round.
The assembly sequences are not the same code rendered in different syntaxes; they are just different, probably due to using different compilers (or different compiler settings).
First pair:
sub $0x30,%rsp ;rsp -= 0x30
sub rsp,16 ;rsp -= 0x10
Next pair:
mov 0x10c1(%rip),%rdi ;rdi = [rip+0x10c1] (loads a value)
mov DWORD PTR [rbp-4],1 ;[rbp-4] = 1 (stores an immediate value)
Next pair:
lea 0xfb6(%rip),%rsi ;rsi = rip+0xfb6 (computes an address)
mov DWORD PTR [rbp-8],2 ;[rbp-8] = 2 (stores an immediate value)
Both sequences are incomplete, but I don't think it matters much, as the excerpts already show the differences.
Because there is not a 1 to 1 relationship between source code and assembly. The compiler would likely generate the same assembly for the following statements:
x = x + 1
and
x++;
both of which would be compiled to something like
add dword ptr [rdi], 1
So, when we disassemble that, which one should it be disassembled to? x = x+1 or x++? This applies to virtually every statement of your program: if there is more than one way of expressing what happens in the source language, and the effects are the same, the compiler may choose to translate both of them to the same output. After which, you have no way of knowing which one was used.
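As a concrete (hypothetical) pair you can compile and then disassemble yourself: with optimization enabled, both functions typically become the identical add dword ptr [...], 1 sequence, so no disassembler can tell which source form produced it.

// Two source forms, one likely machine-code result (e.g. gcc/clang -O2 on x86-64).
void inc_a(int *x) { *x = *x + 1; }
void inc_b(int *x) { (*x)++; }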

How to get efficient asm for zeroing a tiny struct with MSVC++ for x86-32?

My project is compiled for 32-bit in both Windows and Linux. I have an 8-byte struct that's used just about everywhere:
struct Value {
unsigned char type;
union { // 4 bytes
unsigned long ref;
float num;
};
};
In a lot of places I need to zero out the struct, which is done like so:
#define NULL_VALUE_LITERAL {0, {0L}};
static const Value NULL_VALUE = NULL_VALUE_LITERAL;
// example of clearing a value
var = NULL_VALUE;
This however does not compile to the most efficient code in Visual Studio 2013, even with all optimizations on. What I see in the assembly is that the memory location for NULL_VALUE is being read, then written to the var. This results in two reads from memory and two writes to memory. This clearing however happens a lot, even in routines that are time-sensitive, and I'm looking to optimize.
If I set the value to NULL_VALUE_LITERAL, it's worse. The literal data, which again is all zeroes, is copied into a temporary stack value and THEN copied to the variable--even if the variable is also on the stack. So that's absurd.
There's also a common situation like this:
*pd->v1 = NULL_VALUE;
It has similar assembly code to the var=NULL_VALUE above, but it's something I can't optimize with inline assembly should I choose to go that route.
From my research the very, very fastest way to clear the memory would be something like this:
xor eax, eax
mov byte ptr [var], al
mov dword ptr [var+4], eax
Or better still, since the struct alignment means there's just junk for 3 bytes after the data type:
xor eax, eax
mov dword ptr [var], eax
mov dword ptr [var+4], eax
Can you think of any way I can get code similar to that, optimized to avoid the memory reads that are totally unnecessary?
I tried some other methods, which end up creating what I feel is overly bloated code writing a 32-bit 0 literal to the two addresses, but IIRC writing a literal to memory still isn't as fast as writing a register to memory. I'm looking to eke out any extra performance I can get.
Ideally I would also like the result to be highly readable. Your help is appreciated.
I'd recommend uint32_t or unsigned int for the union with float. long on Linux x86-64 is a 64-bit type, which is probably not what you want.
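A sketch of that adjustment, keeping the layout from the question but using a fixed-width type for the union (assuming <cstdint> is acceptable in your codebase):

#include <cstdint>

struct Value {
    unsigned char type;      // 3 bytes of padding follow on typical ABIs
    union {                  // 4 bytes on both 32-bit and 64-bit targets
        std::uint32_t ref;   // was: unsigned long (8 bytes on x86-64 Linux)
        float num;
    };
};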
I can reproduce the missed-optimization with MSVC CL19 -Ox on the Godbolt compiler explorer for x86-32 and x86-64. Workarounds that work with CL19:
make type an unsigned int instead of char, so there's no padding in the struct, then assign from a literal {0, {0L}} instead of a static const Value object. (Then you get two mov-immediate stores: mov DWORD PTR [eax], 0 / mov DWORD PTR [eax+4], 0).
gcc also has struct-zeroing missed-optimizations with padding in structs, but not as bad as MSVC (Bug 82142). It just defeats merging into wider stores; it doesn't get gcc to create an object on the stack and copy from that.
std::memset: probably the best option, MSVC compiles it to a single 64-bit store using SSE2. xorps xmm0, xmm0 / movq QWORD PTR [mem], xmm0. (gcc -m32 -O3 compiles this memset to two mov-immediate stores.)
#include <cstring>   // for memset

void arg_memset(Value *vp) {
    memset(vp, 0, sizeof(*vp));   // zero the whole 8-byte struct
}
;; x86 (32-bit) MSVC -Ox
mov eax, DWORD PTR _vp$[esp-4]
xorps xmm0, xmm0
movq QWORD PTR [eax], xmm0
ret 0
This is what I'd choose for modern CPUs (Intel and AMD). The penalty for crossing a cache-line is low enough that it's worth saving an instruction if it doesn't happen all the time. xor-zeroing is extremely cheap (especially on Intel SnB-family).
IIRC writing a literal to memory still isn't as fast as writing a register to memory
In asm, constants embedded in the instruction are called immediate data. mov-immediate to memory is mostly fine on x86, but it's a bit bloated for code-size.
(x86-64 only): A store with a RIP-relative addressing mode and an immediate can't micro-fuse on Intel CPUs, so it's 2 fused-domain uops. (See Agner Fog's microarch pdf, and other links in the x86 tag wiki.) This means it's worth it (for front-end bandwidth) to zero a register if you're doing more than one store to a RIP-relative address. Other addressing modes do fuse, though, so it's just a code-size issue.
Related: Micro fusion and addressing modes (indexed addressing modes un-laminate on Sandybridge/Ivybridge, but Haswell and later can keep indexed stores micro-fused.) This isn't dependent on immediate vs. register source.
I think memset would be a very poor fit since this is just an 8-byte struct.
Modern compilers know what some heavily-used / important standard library functions do (memset, memcpy, etc.), and treat them like intrinsics. There's very little difference as far as optimization is concerned between a = b and memcpy(&a, &b, sizeof(a)) if they have the same type.
You might get a function call to the actual library implementation in debug mode, but debug mode is very slow anyway. If you have debug-mode perf requirements, that's unusual. (But does happen for code that needs to keep up with something else...)

Why would a compiler generate this assembly?

While stepping through some Qt code I came across the following. The function QMainWindowLayout::invalidate() has the following implementation:
void QMainWindowLayout::invalidate()
{
QLayout::invalidate();
minSize = szHint = QSize();
}
It is compiled to this:
<invalidate()> push %rbx
<invalidate()+1> mov %rdi,%rbx
<invalidate()+4> callq 0x7ffff4fd9090 <QLayout::invalidate()>
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
<invalidate()+43> pop %rbx
<invalidate()+44> retq
The assembly from invalidate+9 to invalidate+36 seems stupid. First the code writes -1 to %rbx+0x564 and %rbx+0x568, but then it loads that -1 from %rbx+0x564 back into a register just to write it out to %rbx+0x56c. This seems like something the compiler should easily be able to optimize into just another move immediate.
So is this stupid code (and if so, why wouldn't the compiler optimize it?) or is this somehow very clever and faster than using just another move immediate?
(Note: This code is from the normal release library build shipped by ubuntu, so it was presumably compiled by GCC in optimize mode. The minSize and szHint variables are normal variables of type QSize.)
Not sure you're correct when you're saying it's stupid. I think the compiler might be trying to optimize for code size here. There is no 64-bit immediate-to-memory mov instruction, so the compiler has to generate 2 mov instructions for the 8-byte copy just like it did above. Each mov-immediate to memory would be 10 bytes here, while the load/store pair it generated instead is only 14 bytes (7 + 7). The location has just been written to, so there is most likely no memory latency, and I do not think you'll take any performance hit here.
The code is "less than perfect".
For code size, those 4 instructions add up to 34 bytes. A much smaller sequence (19 bytes) is possible:
00000000 31C0 xor eax,eax
00000002 48F7D0 not rax
00000005 48898364050000 mov [rbx+0x564],rax
0000000C 4889836C050000 mov [rbx+0x56c],rax
;Note: XOR above clears RAX due to zero extension
For performance things aren't so simple. The CPU wants to do many instructions at the same time, and the code above breaks that. For example:
xor eax,eax
not rax ;Must wait until previous instruction finishes
mov [rbx+0x564],rax ;Must wait until previous instruction finishes
mov [rbx+0x56c],rax ;Must wait until "not" finishes
For performance you want to do this:
00000000 48C7C0FFFFFFFF mov rax,0xffffffff
00000007 C78364050000FFFFFFFF mov dword [rbx+0x564],0xffffffff
00000011 C78368050000FFFFFFFF mov dword [rbx+0x568],0xffffffff
0000001B C7836C050000FFFFFFFF mov dword [rbx+0x56c],0xffffffff
00000025 C78370050000FFFFFFFF mov dword [rbx+0x570],0xffffffff
;Note: first MOV sets RAX to 0xFFFFFFFFFFFFFFFF due to sign extension
This allows all of the instructions to be executed in parallel, with no dependencies anywhere. Sadly, it's also much larger (47 bytes).
If you try to get a balance between code size and performance, then you could hope that the first instruction (the one that sets the value in RAX) completes before the last instruction needs to know the value in RAX. This might be something like this:
mov rax,-1
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov qword [rbx+0x56c],rax
This is 34 bytes (the same size as the original code). This is likely to be a good compromise between code size and performance.
Now; let's look at the original code and see why it is bad:
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov rax,[rbx+0x564] ;Massive problem
mov [rbx+0x56C],rax ;Depends on previous instruction
Modern CPUs do have something called "store forwarding", where writes are stored in a buffer and future reads can get the value from this buffer to avoid reading the value from cache. Ironically, this only works if the size of the read is smaller than or equal to the size of the write. The "store forwarding" will not work for this code as there are 2 writes and the read is larger than both of them. This means that the third instruction has to wait until the first 2 instructions have written to cache and then has to read the value from cache; which could easily add up to a penalty of about 30 cycles or more. Then the fourth instruction must wait for the third instruction (and can't happen in parallel with anything) so that's another problem.
I'd break the lines down like this (I think several of the other answers comment on the same steps).
These two lines come from the inline definition of QSize() (http://qt.gitorious.org/qt/qt/blobs/4.7/src/corelib/tools/qsize.h), which sets each field separately. Also, my guess is that 0x564(%rbx) is the address of szHint, which is also set at the same time.
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
These lines are finally setting minSize using 64-bit operations, because the compiler now knows the size of a QSize object. And the address of minSize is 0x56c(%rbx).
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
Note: the first part sets two separate fields, and the next part copies a QSize object (regardless of its content). The question then is: should the compiler be smart enough to build a compound 64-bit value because it saw the preset values just earlier? Not sure about that...
In addition to Guillaume's answer, the 64-bit load/store is not aligned. But according to the Intel optimization guide (p 3-62):
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.
Which imo implies that an unaligned load/store that does not cross a cache line boundary is cheap. In this case the base pointer in the process I was debugging was 0x10f9bb0, so the two variables are 20 and 28 bytes into the cacheline.
Normally Intel processors use store-to-load forwarding, so a load of a value that was just stored doesn't even need to touch the cache. But the same guide also states that a large load composed of several smaller stores does not store-to-load forward, but stalls: (p 3-66, p 3-68)
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of
a load which is forwarded from a store must be completely contained
within the store data.
; A. Large load stall
mov mem, eax ; Store dword to address "MEM"
mov mem + 4, ebx ; Store dword to address "MEM + 4"
fld mem ; Load qword at address "MEM", stalls
So the code in question probably causes a stall, and therefore I'm inclined to believe it is not optimal. I wouldn't be very surprised if GCC does not take such limitations fully into account. Does anyone know if/how much modelling of store-to-load forwarding limitations GCC does?
EDIT: some experimenting with adding filler values before the minSize/szHint fields shows that GCC does not care at all where the cache line boundaries are, and neither does clang.
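For reference, a minimal sketch of that kind of experiment (the type names here are stand-ins, not the real Qt classes): vary the size of filler to move the two fields around relative to a 64-byte boundary, build with g++ -O2 -S, and compare the stores that get emitted. Note that a current GCC may simply merge these into wider stores, unlike the older compiler that built the Qt library in question.

struct QSizeLike { int w = -1; int h = -1; };   // QSize() default-constructs to (-1, -1)

struct LayoutLike {
    char filler[20];     // change this to shift szHint/minSize within the cache line
    QSizeLike szHint;    // mirrors the field written first in the disassembly (0x564)
    QSizeLike minSize;   // mirrors the field copied from szHint (0x56c)
};

void invalidate(LayoutLike *p) {
    p->minSize = p->szHint = QSizeLike();   // same shape as the Qt member assignment
}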