How to tell clang not to save registers to stack? - llvm

The Goal
I'm currently trying out avr-llvm (an LLVM port that supports AVR as a target). My main goal is to use its hopefully better optimizer (compared to gcc's) to achieve smaller binaries. If you know a little about AVRs, you know that you have only very little memory.
I currently work with an ATtiny45: 4 KB of flash and 256 bytes (just bytes, not KB!) of SRAM.
The Problem
I was trying to compile a simple C program (see below) to check what assembly code is produced and how the machine-code size develops. I used "clang -Oz -S test.c" to produce assembly output optimized for minimal size. My problem is the needlessly saved register values, given that this method will never return.
My Questions...
How can I tell LLVM that it may simply clobber any register, if needed, without saving/restoring its contents? Any ideas how to optimize this even further (e.g. a more efficient setup of the stack)?
Details / Example
Here is my test program. As mentioned above, it was compiled using "clang -Oz -S test.c".
#include <stdint.h>

void __attribute__ ((noreturn)) main() {
    volatile uint8_t res = 1;
    while (1) {}
}
As you can see, it has just one volatile variable of type uint8_t (if I didn't make it volatile, everything would be optimized away). This variable is set to 1, and there is an endless loop at the end. Now let us have a look at the assembly output:
.file "test.c"
.text
.globl main
.align 2
.type main,#function
main:
push r28
push r29
in r28, 61
in r29, 62
sbiw r29:r28, 1
in r0, 63
cli
out 62, r29
out 63, r0
out 61, r28
ldi r24, 1
std Y+1, r24
.BB0_1:
rjmp .BB0_1
.tmp0:
.size main, .tmp0-main
Yeah! That's a lot of machine code for such a simple program. I tested some variations and had a look at the AVR reference manual, so I can explain what happens. Let's look at each part.
This here is the "beef", which is just doing what our c program is about. It loads r24 with value "1" which is stored into memory at Y+1 (Stack Pointer + 1). And there is of course our endless loop:
ldi r24, 1
std Y+1, r24
.BB0_1:
rjmp .BB0_1
Note that the endless loop is needed; otherwise the __attribute__ ((noreturn)) is ignored, and the stack pointer and saved registers are restored later (a sketch of this variant follows).
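To illustrate that note, here is a sketch of the variant without the loop (hypothetical function name; same body otherwise):

#include <stdint.h>

/* Hypothetical variant: without the endless loop the function can fall off
   the end, so the compiler still emits an epilogue that restores the stack
   pointer and r28/r29 (and warns that a 'noreturn' function does return). */
void __attribute__((noreturn)) main_no_loop(void) {
    volatile uint8_t res = 1;
}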
Just before that, the pointer in "Y" is set up:
in r28, 61
in r29, 62
sbiw r29:r28, 1
in r0, 63
cli
out 62, r29
out 63, r0
out 61, r28
What happens here is:
Y (the register pair r28:r29 is equivalent to "Y") is loaded from ports 61 and 62; these ports map to the "registers" SPL and SPH (the "L"ow and "H"igh bytes of the "S"tack "P"ointer)
the loaded value is decremented (sbiw r29:r28, 1)
the changed value of the stack pointer is written back to the ports; I guess to avoid problems, interrupts are disabled during the update: the interrupt state ["cli/sei", stored in register 63 (SREG)] is saved to r0 beforehand and later restored to port 63
This setup of the stack registers seems inefficient. To reserve one byte of stack, I would just need to push r0; then I could simply load the value of SPL/SPH into r28:r29. However, this would probably require changes to LLVM's code generator. The code above only makes sense if more than 3 bytes of stack have to be reserved for local variables (when optimizing with -O3; with -Oz, pushing is preferable for up to 6 bytes). In any case, I guess we would need to touch the LLVM sources for that, so this is out of scope.
More interesting is this part:
push r28
push r29
As main() is not intended to return, this doesn't make sense. It just wastes RAM and flash memory on pointless instructions (remember: some devices have only 64, 128, or 256 bytes of SRAM available).
I investigated this a bit further: if we let main return (e.g. no endless loop), the stack pointer is restored, there is a "ret" instruction at the end, AND the registers r28 and r29 are restored from the stack via "pop r29 / pop r28". But the compiler should know that if the scope of the function main is never left, all registers can be clobbered without storing them to the stack.
This problem may seem a bit "silly", as we are talking about 2 bytes of RAM. But just think about what happens once the program starts using the rest of the registers.
All this really changed my view of current compilers. I used to think there wouldn't be much room left today for optimization via assembler, but it seems there is...
So, still the question is...
Do you have any idea how to improve this situation (except for filing a bug report / feature request)?
I mean: Are there just some compiler switches I might have overlooked...?
Additional Info
Using __attribute__ ((OS_main)) works for avr-gcc.
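For reference, a sketch of the source as built with avr-gcc (same body as the test program above; OS_main is an avr-gcc extension):

#include <stdint.h>

/* OS_main tells avr-gcc that this function is the application entry point,
   so call-saved registers need not be preserved for a caller. */
__attribute__((OS_main)) int main(void) {
    volatile uint8_t res = 1;
    while (1) {}
}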
The output is as follows:
.file "test.c"
__SREG__ = 0x3f
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__CCP__ = 0x34
__tmp_reg__ = 0
__zero_reg__ = 1
.global __do_copy_data
.global __do_clear_bss
.text
.global main
.type main, #function
main:
push __tmp_reg__
in r28,__SP_L__
in r29,__SP_H__
/* prologue: function */
/* frame size = 1 */
ldi r24,lo8(1)
std Y+1,r24
.L2:
rjmp .L2
.size main, .-main
This is (in my opinion) optimal in size (6 instructions, or 12 bytes) and also in speed for this sample program. Is there any equivalent attribute for LLVM? (clang version '3.2 (trunk 160228) (based on LLVM 3.2svn)' knows about neither OS_task nor OS_main.)

The answer to the question asked is essentially what Anton brought up in his comment: the problem is not in LLVM, it is in your AVR target. For example, here is an equivalent program run through Clang and LLVM for other targets:
% cat test.c
__attribute__((noreturn)) int main() {
    volatile unsigned char res = 1;
    while (1) {}
}
% ./bin/clang -c -o - -S -Oz test.c # I'm on an x86-64 machine
<snip>
main: # #main
.cfi_startproc
# BB#0: # %entry
movb $1, -1(%rsp)
.LBB0_1: # %while.body
# =>This Inner Loop Header: Depth=1
jmp .LBB0_1
.Ltmp0:
.size main, .Ltmp0-main
.cfi_endproc
% ./bin/clang -c -o - --target=armv6-unknown-linux-gnueabi -S -Oz test.c
<snip>
main:
sub sp, sp, #4
mov r0, #1
strb r0, [sp, #3]
.LBB0_1:
b .LBB0_1
.Ltmp0:
.size main, .Ltmp0-main
% ./bin/clang -c -o - --target=powerpc64-unknown-linux-gnu -S -Oz test.c
<snip>
main:
.align 3
.quad .L.main
.quad .TOC.#tocbase
.quad 0
.text
.L.main:
li 3, 1
stb 3, -9(1)
.LBB0_1:
b .LBB0_1
.long 0
.quad 0
.Ltmp0:
.size main, .Ltmp0-.L.main
As you can see, for all three of these targets the only code generated reserves stack space (when necessary; it isn't on x86-64) and sets the value on the stack. I think this is minimal.
That said, if you do find problems with LLVM's optimizer, the best way to get help is to send email to the development mailing list or to file bugs if you have a specific input IR sequence that should produce more minimal output IR.
Finally, to answer the questions asked in comments on your question: there are actually areas where LLVM's optimizer is significantly more powerful than GCC. However, there are also areas where it is significantly less powerful. =] Benchmark the code you care about.

Related

How are assembly directives instructed?

To elaborate on the question in the title: suppose I declared the following array in C++,
int myarr[10];
This disassembles to the following on x86:
myarr:
.zero 40
Now, AFAIK this .zero directive is used by convention and is not an instruction. Then how exactly is this directive translated into x86 (or any other architecture, it's not the emphasis here) instructions? For all we know, the CPU can only execute instructions. So I guess these directives are somehow translated into instructions, am I correct?
I could generalize the question by also asking how .word, .long, etc. are translated into instructions, but I think the point is clear.
The output of the assembler is an object module. In the object module are representations of various sections for a program. Each section has a size, some attributes, and possibly some data to be put into the section.
For example, a section may be a few thousand bytes, have attributes indicating it contains instructions for execution, and have data that consists of those instructions. Another section might be several hundred bytes but have no data—it is just space to be allocated when the program starts. Another section might be very big and have non-zero data that contains its initial values when the program starts.
To assemble a .zero 40 directive, the assembler just includes forty bytes of zeros in the section it is currently building. When it writes the final output, it will include those zeros in that section. Data directives like this, .word, and such simply tell the assembler what data to put into its output.
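For instance, at the C level the same distinction shows up in which section an object lands (a sketch; section names follow common ELF conventions):

int zeroed[10];                /* .bss (or .comm): only a size is recorded, no bytes */
int inited[10] = { 42 };       /* .data: 40 bytes of initial values are stored       */
const int table[2] = { 1, 2 }; /* .rodata: read-only initialized data                */

In every case the assembler records sizes and bytes, never instructions.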
unsigned int stuff[10];

void fun ( void )
{
    unsigned int r;
    for(r=0;r<10;r++) stuff[r]=r;
}
using ARM...
00000000 <fun>:
0: e3a03000 mov r3, #0
4: e59f2010 ldr r2, [pc, #16] ; 1c <fun+0x1c>
8: e5a23004 str r3, [r2, #4]!
c: e2833001 add r3, r3, #1
10: e353000a cmp r3, #10
14: 1afffffb bne 8 <fun+0x8>
18: e12fff1e bx lr
1c: 00000ffc
Disassembly of section .bss:
00001000 <stuff>:
...
The array stuff is simply data; it is not code, it is not instructions, and it won't be. The directive you asked about won't become code; it can't, it is data.
If you want to see code (instructions), then you need to write lines of high-level language that act on data, for example as shown here. In that case the compiler generates code.
Looking at this compiler's actual output (comments and other non-essentials removed):
fun:
mov r3, #0
ldr r2, .L6
.L2:
str r3, [r2, #4]!
add r3, r3, #1
cmp r3, #10
bne .L2
bx lr
.L7:
.align 2
.L6:
.word stuff-4
...
.comm stuff,40,4
The .comm in this case is how the data representing the array in the high-level language was declared; the other stuff is mostly code. The .align is there so that the address of .L6 is aligned and you don't get an alignment fault when you try to read it.
.word is a directive. What you see here is .text vs .data: it is just one simple C program with the array and the code right next to each other, but code can live in read-only memory like flash while data needs to be in read/write memory, and at compile time the compiler doesn't know where the data will end up relative to the code. So it creates an abstraction by placing a read-only word in the code that the linker fills in later; the code is generic, and it uses whatever the linker puts in there. The linker "places" .text and .bss (in this case the array wasn't initialized, so it isn't actually .data) and then makes that connection in the code.
Labels are directives, if you will, so that the programmer or code generator (compiler) doesn't have to count instructions or the overall size of instructions to make relative jumps. Let the tools do that for you.
1c: 00000ffc
Disassembly of section .bss:
00001000 <stuff>:
...
Based on the way I linked this (not actually working) program, stuff is the only data item in it, and the linker placed it where I asked, at address 0x1000, then went back and filled in that .word directive to be stuff-4, which is 0xFFC, so that the code works as compiled.
Directives are not part of the instruction set, but they are part of the assembly language. Note that an assembly language is defined by the assembler (the tool), not by the instruction set/target. There are countless different x86 assembly languages, and AT&T vs. Intel is not the primary difference: the directives, how you define a label, how you indicate that a number is hex or decimal all vary. Because of the vagueness of the instructions as defined in the early docs, there are lots of adjectives, if you will, to specify which mov instruction you are actually after; and even though those are part of the instruction and not directives, they too varied across assembly languages. ARM, MIPS, and many if not most other targets have had tools created with incompatible assembly languages; .zero, for example, is one of those incompatible things.
In any case, an assembly language needs to be able to define data, and it needs a way for code to reference that data, in order to make useful programs.
The notion of a one-to-one mapping from lines of assembly language to instructions is very misleading; don't get fooled by it. Today's compilers generate almost as much non-code as code in their output: lots of directives and other information.

How to remove "noise" from GCC/clang assembly output?

I want to inspect the assembly output of applying boost::variant in my code in order to see which intermediate calls are optimized away.
When I compile the following example (with GCC 5.3 using g++ -O3 -std=c++14 -S), it seems as if the compiler optimizes away everything and directly returns 100:
(...)
main:
.LFB9320:
.cfi_startproc
movl $100, %eax
ret
.cfi_endproc
(...)
#include <boost/variant.hpp>

struct Foo
{
    int get() { return 100; }
};

struct Bar
{
    int get() { return 999; }
};

using Variant = boost::variant<Foo, Bar>;

int run(Variant v)
{
    return boost::apply_visitor([](auto& x){ return x.get(); }, v);
}

int main()
{
    Foo f;
    return run(f);
}
However, the full assembly output contains much more than the above excerpt, which to me looks like code that is never called. Is there a way to tell GCC/clang to remove all that "noise" and output just what is actually executed when the program is run?
Full assembly output:
.file "main1.cpp"
.section .rodata.str1.8,"aMS",#progbits,1
.align 8
.LC0:
.string "/opt/boost/include/boost/variant/detail/forced_return.hpp"
.section .rodata.str1.1,"aMS",#progbits,1
.LC1:
.string "false"
.section .text.unlikely._ZN5boost6detail7variant13forced_returnIvEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIvEET_v,comdat
.LCOLDB2:
.section .text._ZN5boost6detail7variant13forced_returnIvEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIvEET_v,comdat
.LHOTB2:
.p2align 4,,15
.weak _ZN5boost6detail7variant13forced_returnIvEET_v
.type _ZN5boost6detail7variant13forced_returnIvEET_v, #function
_ZN5boost6detail7variant13forced_returnIvEET_v:
.LFB1197:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $_ZZN5boost6detail7variant13forced_returnIvEET_vE19__PRETTY_FUNCTION__, %ecx
movl $49, %edx
movl $.LC0, %esi
movl $.LC1, %edi
call __assert_fail
.cfi_endproc
.LFE1197:
.size _ZN5boost6detail7variant13forced_returnIvEET_v, .-_ZN5boost6detail7variant13forced_returnIvEET_v
.section .text.unlikely._ZN5boost6detail7variant13forced_returnIvEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIvEET_v,comdat
.LCOLDE2:
.section .text._ZN5boost6detail7variant13forced_returnIvEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIvEET_v,comdat
.LHOTE2:
.section .text.unlikely._ZN5boost6detail7variant13forced_returnIiEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIiEET_v,comdat
.LCOLDB3:
.section .text._ZN5boost6detail7variant13forced_returnIiEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIiEET_v,comdat
.LHOTB3:
.p2align 4,,15
.weak _ZN5boost6detail7variant13forced_returnIiEET_v
.type _ZN5boost6detail7variant13forced_returnIiEET_v, #function
_ZN5boost6detail7variant13forced_returnIiEET_v:
.LFB9757:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $_ZZN5boost6detail7variant13forced_returnIiEET_vE19__PRETTY_FUNCTION__, %ecx
movl $39, %edx
movl $.LC0, %esi
movl $.LC1, %edi
call __assert_fail
.cfi_endproc
.LFE9757:
.size _ZN5boost6detail7variant13forced_returnIiEET_v, .-_ZN5boost6detail7variant13forced_returnIiEET_v
.section .text.unlikely._ZN5boost6detail7variant13forced_returnIiEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIiEET_v,comdat
.LCOLDE3:
.section .text._ZN5boost6detail7variant13forced_returnIiEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIiEET_v,comdat
.LHOTE3:
.section .text.unlikely,"ax",#progbits
.LCOLDB4:
.text
.LHOTB4:
.p2align 4,,15
.globl _Z3runN5boost7variantI3FooJ3BarEEE
.type _Z3runN5boost7variantI3FooJ3BarEEE, #function
_Z3runN5boost7variantI3FooJ3BarEEE:
.LFB9310:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl (%rdi), %eax
cltd
xorl %edx, %eax
cmpl $19, %eax
ja .L7
jmp *.L9(,%rax,8)
.section .rodata
.align 8
.align 4
.L9:
.quad .L30
.quad .L10
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.text
.p2align 4,,10
.p2align 3
.L7:
call _ZN5boost6detail7variant13forced_returnIiEET_v
.p2align 4,,10
.p2align 3
.L30:
movl $100, %eax
.L8:
addq $8, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L10:
.cfi_restore_state
movl $999, %eax
jmp .L8
.cfi_endproc
.LFE9310:
.size _Z3runN5boost7variantI3FooJ3BarEEE, .-_Z3runN5boost7variantI3FooJ3BarEEE
.section .text.unlikely
.LCOLDE4:
.text
.LHOTE4:
.globl _Z3runN5boost7variantI3FooI3BarEEE
.set _Z3runN5boost7variantI3FooI3BarEEE,_Z3runN5boost7variantI3FooJ3BarEEE
.section .text.unlikely
.LCOLDB5:
.section .text.startup,"ax",#progbits
.LHOTB5:
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB9320:
.cfi_startproc
movl $100, %eax
ret
.cfi_endproc
.LFE9320:
.size main, .-main
.section .text.unlikely
.LCOLDE5:
.section .text.startup
.LHOTE5:
.section .rodata
.align 32
.type _ZZN5boost6detail7variant13forced_returnIvEET_vE19__PRETTY_FUNCTION__, #object
.size _ZZN5boost6detail7variant13forced_returnIvEET_vE19__PRETTY_FUNCTION__, 58
_ZZN5boost6detail7variant13forced_returnIvEET_vE19__PRETTY_FUNCTION__:
.string "T boost::detail::variant::forced_return() [with T = void]"
.align 32
.type _ZZN5boost6detail7variant13forced_returnIiEET_vE19__PRETTY_FUNCTION__, #object
.size _ZZN5boost6detail7variant13forced_returnIiEET_vE19__PRETTY_FUNCTION__, 57
_ZZN5boost6detail7variant13forced_returnIiEET_vE19__PRETTY_FUNCTION__:
.string "T boost::detail::variant::forced_return() [with T = int]"
.ident "GCC: (Ubuntu 5.3.0-3ubuntu1~14.04) 5.3.0 20151204"
.section .note.GNU-stack,"",#progbits
Stripping out the .cfi directives, unused labels, and comment lines is a solved problem: the scripts behind Matt Godbolt's compiler explorer are open source on its github project. It can even do colour highlighting to match source lines to asm lines (using the debug info).
You can set it up locally so you can feed it files that are part of your project, with all the #include paths and so on (using -I/...), and so you can use it on private source code that you don't want to send out over the Internet.
Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” shows how to use it (it's pretty self-explanatory but has some neat features if you read the docs on github), and also how to read x86 asm, with a gentle introduction to x86 asm itself for total beginners, and to looking at compiler output. He goes on to show some neat compiler optimizations (e.g. for dividing by a constant), and what kind of functions give useful asm output for looking at optimized compiler output (function args, not int a = 123;).
On the Godbolt compiler explorer, it can be useful to use -g0 -fno-asynchronous-unwind-tables if you want to uncheck the filter option for directives, e.g. because you want to see the .section and .p2align stuff in the compiler output. The default is to add -g to your options to get the debug info it uses to colour-highlight matching source and asm lines, but this means .cfi directives for every stack operation, and .loc for every source line, among other things.
With plain gcc/clang (not g++), -fno-asynchronous-unwind-tables avoids .cfi directives. Possibly also useful: -fno-exceptions -fno-rtti -masm=intel. Make sure to omit -g.
Copy/paste this for local use:
g++ -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -fverbose-asm \
-Wall -Wextra foo.cpp -O3 -masm=intel -S -o- | less
Or -Os can be more readable, e.g. using div for division by non-power-of-2 constants instead of a multiplicative inverse, even though that's a lot worse for performance and only a bit smaller, if at all.
But really, I'd recommend just using Godbolt directly (online or set it up locally)! You can quickly flip between versions of gcc and clang to see if old or new compilers do something dumb. (Or what ICC does, or even what MSVC does.) There's even ARM / ARM64 gcc 6.3, and various gcc for PowerPC, MIPS, AVR, MSP430. (It can be interesting to see what happens on a machine where int is wider than a register, or isn't 32-bit. Or on a RISC vs. x86).
For C instead of C++, you can use -xc -std=gnu11 to avoid flipping the language drop-down to C, which resets your source pane and compiler choices, and has a different set of compilers available.
Useful compiler options for making asm for human consumption:
Remember, your code only has to compile, not link: passing a pointer to an external function like void ext(void*p) is a good way to stop something from optimizing away. You only need a prototype for it, with no definition, so the compiler can't inline it or make any assumptions about what it does. (Or inline asm like Benchmark::DoNotOptimize can force a compiler to materialize a value in a register, or forget that it is a known constant, if you know GNU C inline asm syntax well enough to use constraints and understand the effect you're having on the compiler.)
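A minimal sketch of that trick (ext and keep are hypothetical names):

void ext(void *p);   /* prototype only: the optimizer can't see what it does */

int keep(int a, int b) {
    int result = a * b;
    ext(&result);    /* result escapes here, so the computation can't be dropped */
    return result;
}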
I'd recommend using -O3 -Wall -Wextra -fverbose-asm -march=haswell for looking at code. (-fverbose-asm can just make the source look noisy, though, when all you get are numbered temporaries as names for the operands.) When you're fiddling with the source to see how it changes the asm, you definitely want compiler warnings enabled. You don't want to waste time scratching your head over the asm when the explanation is that you did something that deserves a warning in the source.
To see how the calling convention works, you often want to look at caller and callee without inlining.
You can use __attribute__((noipa)) foo_t foo(bar_t x) { ... } on a definition, or compile with gcc -O3 -fno-inline-functions -fno-inline-functions-called-once -fno-inline-small-functions to disable inlining. (But those command line options don't disable cloning a function for constant-propagation. noipa = no Inter-Procedural Analysis. It's even stronger than __attribute__((noinline,noclone)).) See From compiler perspective, how is reference for array dealt with, and, why passing by value(not decay) is not allowed? for an example.
Or if you just want to see how functions pass / receive args of different types, you could use different names but the same prototype so the compiler doesn't have a definition to inline. This works with any compiler. Without a definition, a function is just a black box to the optimizer, governed only by the calling convention / ABI.
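A sketch of that approach (hypothetical names; the point is that callee has no visible definition, so only the calling convention constrains it):

typedef struct { int a, b; long c; } bar_t;

bar_t callee(bar_t x);   /* black box: no definition to inline */

bar_t caller(bar_t x) {
    return callee(x);    /* shows how a bar_t is passed and returned */
}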
-ffast-math will get many libm functions to inline, some to a single instruction (esp. with SSE4 available for roundsd). Some will inline with just -fno-math-errno, or other "safer" parts of -ffast-math, without the parts that allow the compiler to round differently. If you have FP code, definitely look at it with/without -ffast-math. If you can't safely enable any of -ffast-math in your regular build, maybe you'll get an idea for a safe change you can make in the source to allow the same optimization without -ffast-math.
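For instance, a sketch of that libm-inlining point (assuming gcc with SSE4.1 enabled; rint doesn't need errno, so it can typically inline even without -ffast-math):

#include <math.h>

/* with gcc -O2 -msse4.1 this usually compiles to a single roundsd;
   without SSE4.1 it stays a call into libm */
double to_nearest(double x) {
    return rint(x);
}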
-O3 -fno-tree-vectorize will optimize without auto-vectorizing, so you can get full optimization without if you want to compare with -O2 (which doesn't enable autovectorization on gcc11 and earlier, but does on all clang).
-Os (optimize for size and speed) can be helpful to keep the code more compact, which means less code to understand. clang's -Oz optimizes for size even when it hurts speed, even using push 1 / pop rax instead of mov eax, 1, so that's only interesting for code golf.
Even -Og (minimal optimization) might be what you want to look at, depending on your goals. -O0 is full of store/reload noise, which makes it harder to follow, unless you use register vars. The only upside is that each C statement compiles to a separate block of instructions, and it makes -fverbose-asm able to use the actual C var names.
clang unrolls loops by default, so -fno-unroll-loops can be useful in complex functions. You can get a sense of "what the compiler did" without having to wade through the unrolled loops. (gcc enables -funroll-loops with -fprofile-use, but not with -O3). (This is a suggestion for human-readable code, not for code that would run faster.)
Definitely enable some level of optimization, unless you specifically want to know what -O0 did. Its "predictable debug behaviour" requirement makes the compiler store/reload everything between every C statement, so you can modify C variables with a debugger and even "jump" to a different source line within the same function, and have execution continue as if you did that in the C source. -O0 output is so noisy with stores/reloads (and so slow) not just from lack of optimization, but forced de-optimization to support debugging. (also related).
To get a mix of source and asm, use gcc -Wa,-adhln -c -g foo.c | less to pass extra options to as. (More discussion of this in a blog post, and another blog.). Note that the output of this isn't valid assembler input, because the C source is there directly, not as an assembler comment. So don't call it a .s. A .lst might make sense if you want to save it to a file.
Godbolt's color highlighting serves a similar purpose, and is great at helping you see when multiple non-contiguous asm instructions come from the same source line. I haven't used that gcc listing command at all, so IDK how well it does, and how easy it is for the eye to see, in that case.
I like the high code density of godbolt's asm pane, so I don't think I'd like having source lines mixed in. At least not for simple functions. Maybe with a function that was too complex to get a handle on the overall structure of what the asm does...
And remember, when you want to just look at the asm, leave out the main() and the compile-time constants. You want to see the code for dealing with a function arg in a register, not for the code after constant-propagation turns it into return 42, or at least optimizes away some stuff.
Removing static and/or inline from functions will produce a stand-alone definition for them, as well as a definition for any callers, so you can just look at that.
Don't put your code in a function called main(). gcc knows that main is special and assumes it will only be called once, so it marks it as "cold" and optimizes it less.
The other thing you can do: If you did make a main(), you can run it and use a debugger. stepi (si) steps by instruction. See the bottom of the x86 tag wiki for instructions. But remember that code might optimize away after inlining into main with compile-time-constant args.
__attribute__((noinline)) may help on a function that you don't want inlined. gcc will also make constant-propagation clones of functions, i.e. a special version with one of the args as a constant, for call sites that know they're passing a constant. The symbol name will be .clone.foo.constprop_1234 or something in the asm output. You can use __attribute__((noclone)) to disable that, too.
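Applied to a definition, that looks like this (hypothetical function, just to show the attributes):

/* stays a real call at every call site; no .constprop clones are made */
__attribute__((noinline, noclone))
int scale(int x) {
    return x * 31;
}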
For example
If you want to see how the compiler multiplies two integers: I put the following code on the Godbolt compiler explorer to get the asm (from gcc -O3 -march=haswell -fverbose-asm) for the wrong way and the right way to test this.
// the wrong way, which people often write when they're used to creating a runnable test-case with a main() and a printf
// or worse, people will actually look at the asm for such a main()
int constants() { int a = 10, b = 20; return a * b; }
mov eax, 200 #,
ret # compiles the same as return 200; not interesting
// the right way: compiler doesn't know anything about the inputs
// so we get asm like what would happen when this inlines into a bigger function.
int variables(int a, int b) { return a * b; }
mov eax, edi # D.2345, a
imul eax, esi # D.2345, b
ret
(This mix of asm and C was hand-crafted by copy-pasting the asm output from godbolt into the right place. I find it's a good way to show how a short function compiles in SO answers / compiler bug reports / emails.)
You can always look at the generated assembly from the object file, instead of using the compiler's assembly output. objdump comes to mind.
You can even tell objdump to intermix source with assembly, making it easier to figure out what source line corresponds to what instructions. Example session:
$ cat test.cc
int foo(int arg)
{
    return arg + 1;
}
$ g++ -g -O3 -std=c++14 -c test.cc -o test.o && objdump -dS -M intel test.o
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z3fooi>:
int foo(int arg)
{
return arg + 1;
0: 8d 47 01 lea eax,[rdi+0x1]
}
3: c3 ret
Explanation of objdump flags:
-d disassembles all executable sections
-S intermixes assembly with source (-g required while compiling with g++)
-M intel chooses Intel syntax over the ugly AT&T syntax (optional)
I like to insert labels that I can easily grep out of the objdump output.
int main() {
    asm volatile ("interesting_part_begin%=:":);
    do_something();
    asm volatile ("interesting_part_end%=:":);
}
I haven't had a problem with this yet, but asm volatile can be very hard on a compiler's optimizer because it tends to leave such code untouched.

Where is the one to one correlation between the assembly and cpp code?

I tried to examine what this code looks like in assembly:
int main(){
    if (0){
        int x = 2;
        x++;
    }
    return 0;
}
I was wondering: what does if (0) mean?
I used the shell command g++ -S helloWorld.cpp on Linux
and got this code:
.file "helloWorld.cpp"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
.section .note.GNU-stack,"",#progbits
I expected that the assembly would contain some JZ, but where is it?
How can I compile the code without optimization?
There is no direct, guaranteed relationship between C++ source code and the generated assembler. The C++ source code defines a certain semantics, and the compiler outputs machine code which will implement the observable behavior of those semantics. How the compiler does this, and the actual code it outputs, can vary enormously, even over the same underlying hardware; I would be very disappointed in a compiler which generated code which compared 0 with 0, and then did a conditional jump if the results were equal, regardless of what the C++ source code was.
In your example, the only observable behavior in your code is to return 0 to the OS. Anything the compiler generates must do this (and have no other observable behavior). The code you show isn't optimal for this:
xorl %eax, %eax
ret
is really all that is needed. But of course, the compiler is free to generate a lot more if it wants. (Your code, for example, sets up a frame to support local variables, even though there aren't any. Many compilers do this systematically, because most debuggers expect it, and get confused if there is no frame.)
With regards to optimization, this depends on the compiler. With g++, -O0 (that's the letter O followed by the number zero) turns off all optimization. This is the default, however, so it is effectively what you are seeing. In addition to having several different levels of optimization, g++ supports turning individual optimizations off or on. You might want to look at the complete list: http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Optimize-Options.html#Optimize-Options
The compiler eliminates that code as dead code, i.e. code that will never run. What you're left with is establishing the stack frame and setting the return value of the function; if (0) is never true, after all. If you want to get a JZ, then you should probably do something like if (variable == 0). Keep in mind that the compiler is in no way required to actually emit a JZ instruction; it may use any other means to achieve the same thing. Compiling a high-level language to assembly very rarely yields a clear, one-to-one correlation.
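A sketch of a version that does produce a conditional branch (hypothetical function; the value is a runtime argument, so the compiler can't fold the test away):

int branches(int variable) {
    if (variable == 0) {   /* typically test/jne or jz; the compiler may
                              still choose a branchless form instead */
        return 2;
    }
    return 0;
}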
The code has probably been optimized.
if (0){
    int x = 2;
    x++;
}
has been eliminated.
movl $0, %eax is where the return value is set. The other instructions are just program init and exit.
There is a possibility that the compiler optimized it away, since it's never true.
The optimizer removed the if conditional and all of the code inside, so it doesn't show up at all.
The if (0) {} block has been optimized out by the compiler, as it will never be executed.
So your function only returns 0 (movl $0, %eax).

Questions re: assembly generated from my C++ by gcc

Compiling this code:
int main ()
{
    return 0;
}
using:
gcc -S filename.cpp
...generates this assembly:
.file "heloworld.cpp"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
movl $0, %eax
popl %ebp
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
My questions:
Is everything after "." a comment?
What is .LFB0:?
What is .LFE0:?
Why is there so much code for just "int main ()" and "return 0;"?
P.S. I read a lot of assembly net books and a lot (at least 30) of tutorials, and all I can do is copy code and paste it or rewrite it. Now I'm trying a different approach to somehow learn it. The problem is that I do understand what movl, pop, etc. are, but I don't understand how to combine these things to make code "flow". I don't know where or how to correctly start writing a program in asm. I'm still static, not dynamic as in C++, but I want to learn assembly.
As others have said, .file, .text, ... are assembler directives and .LFB0, .LFE0 are local labels. The only instructions in the generated code are:
pushl %ebp
movl %esp, %ebp
movl $0, %eax
popl %ebp
ret
The first two instructions are the function prologue: the frame pointer is stored on the stack and updated. The next instruction stores 0 in the eax register (the i386 ABI states that integer return values are returned via the eax register). The last two instructions are the function epilogue: the frame pointer is restored, and then the function returns to its caller via the ret instruction.
If you compile your code with -O3 -fomit-frame-pointer, the code will be compiled to just two instructions:
xorl %eax,%eax
ret
The first sets eax to 0 (it takes only two bytes to encode, while movl $0, %eax takes five), and the second is the ret instruction. The frame pointer manipulation is there to ease debugging (it is possible to get a backtrace without it, but it is more difficult).
.file, .text, etc are assembler directives.
.LFB0, .LFE0 are local labels, which are normally used as branch destinations within a function.
As for the size, there are really only a few actual instructions; most of the above listing consists of directives, etc. For future reference, you might also want to turn up the optimisation level to remove otherwise redundant instructions, i.e. gcc -Wall -O3 -S ....
It's just that there's a lot going on behind your simple program.
If you intend to read assembler output, by no means compile C++. Use plain C; the output is far clearer, for a number of reasons.

while (1) Vs. for (;;) Is there a speed difference?

Long version...
A co-worker asserted today, after seeing my use of while (1) in a Perl script, that for (;;) is faster. I argued that they should be the same, hoping that the interpreter would optimize out any differences. I set up a script that would run 1,000,000,000 for loop iterations and the same number of while loops and record the time between. I could find no appreciable difference. My co-worker said that a professor had told him that the while (1) was doing a comparison 1 == 1 and the for (;;) was not. We repeated the same test with 100x the number of iterations in C++ and the difference was negligible. It was, however, a graphic example of how much faster compiled code can be vs. a scripting language.
Short version...
Is there any reason to prefer a while (1) over a for (;;) if you need an infinite loop to break out of?
Note: in case it's not clear from the question, this was purely a fun academic discussion between a couple of friends. I am aware this is not a super-important concept that all programmers should agonize over. Thanks for all the great answers; I (and I'm sure others) have learned a few things from this discussion.
Update: The aforementioned co-worker weighed in with a response below.
Quoted here in case it gets buried.
It came from an AMD assembly programmer. He stated that C programmers (the people) don't realize that their code has inefficiencies. He said that today, though, gcc compilers are very good and put people like him out of business. He gave an example, and told me about the while (1) vs for (;;). I use it now out of habit, but gcc, and especially interpreters, will do the same operation (a processor jump) for both these days, since they are optimized.
In Perl, they result in the same opcodes:
$ perl -MO=Concise -e 'for(;;) { print "foo\n" }'
a <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 2 -e:1) v ->3
9 <2> leaveloop vK/2 ->a
3 <{> enterloop(next->8 last->9 redo->4) v ->4
- <#> lineseq vK ->9
4 <;> nextstate(main 1 -e:1) v ->5
7 <#> print vK ->8
5 <0> pushmark s ->6
6 <$> const[PV "foo\n"] s ->7
8 <0> unstack v ->4
-e syntax OK
$ perl -MO=Concise -e 'while(1) { print "foo\n" }'
a <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 2 -e:1) v ->3
9 <2> leaveloop vK/2 ->a
3 <{> enterloop(next->8 last->9 redo->4) v ->4
- <#> lineseq vK ->9
4 <;> nextstate(main 1 -e:1) v ->5
7 <#> print vK ->8
5 <0> pushmark s ->6
6 <$> const[PV "foo\n"] s ->7
8 <0> unstack v ->4
-e syntax OK
Likewise in GCC:
#include <stdio.h>

void t_while() {
    while (1)
        printf("foo\n");
}

void t_for() {
    for (;;)
        printf("foo\n");
}
.file "test.c"
.section .rodata
.LC0:
.string "foo"
.text
.globl t_while
.type t_while, #function
t_while:
.LFB2:
pushq %rbp
.LCFI0:
movq %rsp, %rbp
.LCFI1:
.L2:
movl $.LC0, %edi
call puts
jmp .L2
.LFE2:
.size t_while, .-t_while
.globl t_for
.type t_for, #function
t_for:
.LFB3:
pushq %rbp
.LCFI2:
movq %rsp, %rbp
.LCFI3:
.L5:
movl $.LC0, %edi
call puts
jmp .L5
.LFE3:
.size t_for, .-t_for
.section .eh_frame,"a",#progbits
.Lframe1:
.long .LECIE1-.LSCIE1
.LSCIE1:
.long 0x0
.byte 0x1
.string "zR"
.uleb128 0x1
.sleb128 -8
.byte 0x10
.uleb128 0x1
.byte 0x3
.byte 0xc
.uleb128 0x7
.uleb128 0x8
.byte 0x90
.uleb128 0x1
.align 8
.LECIE1:
.LSFDE1:
.long .LEFDE1-.LASFDE1
.LASFDE1:
.long .LASFDE1-.Lframe1
.long .LFB2
.long .LFE2-.LFB2
.uleb128 0x0
.byte 0x4
.long .LCFI0-.LFB2
.byte 0xe
.uleb128 0x10
.byte 0x86
.uleb128 0x2
.byte 0x4
.long .LCFI1-.LCFI0
.byte 0xd
.uleb128 0x6
.align 8
.LEFDE1:
.LSFDE3:
.long .LEFDE3-.LASFDE3
.LASFDE3:
.long .LASFDE3-.Lframe1
.long .LFB3
.long .LFE3-.LFB3
.uleb128 0x0
.byte 0x4
.long .LCFI2-.LFB3
.byte 0xe
.uleb128 0x10
.byte 0x86
.uleb128 0x2
.byte 0x4
.long .LCFI3-.LCFI2
.byte 0xd
.uleb128 0x6
.align 8
.LEFDE3:
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",#progbits
So I guess the answer is: they're the same in many compilers. Of course, for some other compilers this may not necessarily be the case, but chances are the code inside the loop is going to be a few thousand times more expensive than the loop itself anyway, so who cares?
There's not much reason to prefer one over the other. I do think that while(1) and particularly while(true) are more readable than for(;;), but that's just my preference.
Using GCC, they both seem to compile to the same assembly language:
L2:
jmp L2
There is no difference according to the standard. 6.5.3/1 has:
The for statement
for ( for-init-statement ; conditionopt ; expressionopt ) statement
is equivalent to
{
    for-init-statement
    while ( condition ) {
        statement
        expression ;
    }
}
And 6.5.3/2 has:
Either or both of the condition and the expression can be omitted. A missing condition makes the implied while clause equivalent to while(true).
So according to the C++ standard the code:
for (;;);
is exactly the same as:
{
    while (true) {
        ;
        ;
    }
}
for(;;) is one less character to type if you want to go in that direction to optimize things.
The Visual C++ compiler used to emit a warning for
while (1)
(constant expression) but not for
for (;;)
I've continued the practice of preferring for (;;) for that reason, but I don't know if the compiler still does that these days.
With Turbo C and other old compilers, for(;;) resulted in faster code than while(1).
Today gcc, Visual C, and (I think) almost all compilers optimize well, and CPUs at 4.7 MHz are rarely used.
In those days a for( i=10; i; i-- ) was faster than for( i=1; i<=10; i++ ), because comparing i with 0 results in a CPU zero-flag conditional jump. The zero flag was modified by the last decrement operation ( i-- ), so no extra cmp operation was needed:
call __printf_chk
decl %ebx %ebx=iterator i
jnz .L2
movl -4(%ebp), %ebx
leave
and here with for( i=1; i<=10; i++ ), with the extra cmpl:
call __printf_chk
incl %ebx
cmpl $11, %ebx
jne .L2
movl -4(%ebp), %ebx
leave
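For reference, a sketch of the two loops those fragments could have come from (a hypothetical reconstruction; the printf matches the __printf_chk calls shown):

#include <stdio.h>

void count_down(void) {
    int i;
    for (i = 10; i; i--)       /* decl sets the zero flag: jnz alone closes the loop */
        printf("%d\n", i);
}

void count_up(void) {
    int i;
    for (i = 1; i <= 10; i++)  /* needs the extra cmpl $11 before jne */
        printf("%d\n", i);
}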
For all the people arguing that you shouldn't use indefinite while loops and suggesting daft stuff like using open gotos (seriously, ouch):
while (1) {
    last if( condition1 );
    code();
    more_code();
    last if( condition2 );
    even_more_code();
}
It can't really be represented effectively any other way, not without creating an exit variable and doing black magic to keep it synced.
If you have a penchant for the more goto-esque syntax, use something sane that limits scope.
flow: {
    if ( condition ){
        redo flow;
    }
    if ( othercondition ){
        redo flow;
    }
    if ( earlyexit ){
        last flow;
    }
    something(); # doesn't execute when earlyexit is true
}
Ultimately, speed is not that important
Worrying about how efficient different looping constructs are, speed-wise, is a massive waste of time. Premature optimization through and through. I can't think of any situation I've ever seen where profiling code found a bottleneck in my choice of looping construct.
Generally it's the how of the loop and the what of the loop.
You should "optimize" for readability and succinctness, and write whatever is best at explaining the problem to the next poor sucker who finds your code.
If you use the "goto LABEL" trick somebody mentioned, and I have to use your code, be prepared to sleep with one eye open, especially if you do it more than once, because that sort of stuff creates horrifically spaghetti code.
Just because you can create spaghetti code doesn't mean you should
If the compiler didn't do any optimization, for(;;) would always be faster than while(true), because the while statement evaluates the condition every time while the for statement performs an unconditional jump. But if the compiler optimizes the control flow, it may generate the same opcodes for both. You can read the disassembled code very easily.
P.S. You could write an infinite loop like this:
#define EVER ;;
//...
for (EVER) {
//...
}
From Stroustrup, TC++PL (3rd edition), §6.1.1:
The curious notation for (;;) is the standard way to specify an infinite loop; you could pronounce it "forever". [...] while (true) is an alternative.
I prefer for (;;).
I heard about this once.
It came from an AMD assembly programmer. He stated that C programmers (the people) don't realize that their code has inefficiencies. He said that today, though, gcc compilers are very good and put people like him out of business. He gave an example, and told me about the while (1) vs for (;;). I use it now out of habit, but gcc, and especially interpreters, will do the same operation (a processor jump) for both these days, since they are optimized.
In an optimized build of a compiled language, there should be no appreciable difference between the two. Neither should end up performing any comparisons at runtime, they will just execute the loop code until you manually exit the loop (e.g. with a break).
Just came across this thread (although quite a few years late).
I think I found the actual reason why for(;;) is better than while(1).
According to the Barr coding standard 2018:
Kernighan & Ritchie long ago recommended for (;;) , which has the additional benefit
of insuring against the visually-confusing defect of a while (l); referencing a variable ‘l’.
Basically, this is not a speed issue but a readability issue. Depending on the font/print of the code, the number one (1) in a while may look like the lowercase letter l,
i.e. 1 vs. l (in some fonts these look identical).
So while(1) may look like a while loop dependent on a variable named l.
while(true) may also work, but in some older C and embedded C cases true/false are not defined unless stdbool.h is included.
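A sketch of the C99 spelling (hypothetical function; C++ has true built in):

#include <stdbool.h>   /* C99: defines bool, true, false */

void spin(void) {
    while (true) {     /* without <stdbool.h>, pre-C23 C has no 'true' */
    }
}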
I'm surprised no one has offered the more direct form, corresponding to the desired assembly:
forever:
do stuff;
goto forever;
I am surprised that nobody properly tested for (;;) versus while (1) in Perl!
Because Perl is an interpreted language, the time to run a Perl script consists not only of the execution phase (which in this case is the same) but also of the interpretation phase before execution. Both phases have to be taken into account when making a speed comparison.
Luckily perl has a convenient Benchmark module which we can use to implement a benchmark such as follows:
#!/usr/bin/perl -w
use Benchmark qw( cmpthese );
sub t_for { eval 'die; for (;;) { }'; }
sub t_for2 { eval 'die;  for (;;) { }'; }
sub t_while { eval 'die; while (1) { }'; }
cmpthese(-60, { for => \&t_for, for2 => \&t_for2, while => \&t_while });
Note that I am testing two different versions of the infinite for loop: one which is shorter than the while loop and another one which has an extra space to make it the same length as the while loop.
On Ubuntu 11.04 x86_64 with perl 5.10.1 I get the following results:
           Rate   for  for2 while
for    100588/s    --   -0%   -2%
for2   100937/s    0%    --   -1%
while  102147/s    2%    1%    --
The while loop is clearly the winner on this platform.
On FreeBSD 8.2 x86_64 with perl 5.14.1:
           Rate   for  for2 while
for     53453/s    --   -0%   -2%
for2    53552/s    0%    --   -2%
while   54564/s    2%    2%    --
While loop is the winner here too.
On FreeBSD 8.2 i386 with perl 5.14.1:
           Rate while   for  for2
while   24311/s    --   -1%   -1%
for     24481/s    1%    --   -1%
for2    24637/s    1%    1%    --
Surprisingly the for loop with an extra space is the fastest choice here!
My conclusion is that the while loop should be used on x86_64 platform if the programmer is optimizing for speed. Obviously a for loop should be used when optimizing for space. My results are unfortunately inconclusive regarding other platforms.
In theory, a completely naive compiler could store the literal 1 in the binary (wasting space) and check whether 1 == 0 on every iteration (wasting time and more space).
In reality, however, even with "no" optimizations, compilers will still reduce both to the same thing. They may also emit warnings, because a constant condition could indicate a logical error. For instance, the argument of while could be defined somewhere else without you realizing it's constant.
while(1) is an idiom for for(;;) that is recognized by most compilers.
I was glad to see that Perl recognizes until(0), too.
To summarize the for (;;) vs while (1) debate: it is obvious that the former was faster in the days of older, non-optimizing compilers, which is why you tend to see it in older code bases such as the Lions' commentary on Unix source code. However, in the age of badass optimizing compilers those gains are optimized away. Coupled with the fact that the latter is easier to understand than the former, I believe it would be the preferable choice.
I would think that both are the same in terms of performance, but I would prefer while(1) for readability. And I question why you would need an infinite loop in the first place.
They are the same. There are much more important questions to ponder.
My point, which was implied but not explicitly made above, is that a decent compiler will generate the exact same code for both loop forms. The bigger point is that the looping construct is a minor part of the run time of any algorithm; you must first ensure that you have optimized the algorithm and everything else related to it. Optimizing your loop construct should absolutely be at the bottom of your priority list.