How do exceptions work (behind the scenes) in C++?

I keep seeing people say that exceptions are slow, but I never see any proof. So, instead of asking whether they are, I will ask how exceptions work behind the scenes, so I can decide for myself when to use them and whether they are slow.
From what I know, exceptions are the same as doing a return a bunch of times, except that after each return there is also a check of whether another return is needed or it is time to stop. How does it know when to stop returning? I guess there is a second stack that holds the type of the exception and a stack location; it then does returns until it gets there. I am also guessing that the only time this second stack is touched is on a throw and on each try/catch. AFAICT, implementing similar behaviour with return codes would take the same amount of time. But this is all just a guess, so I want to know what really happens.
How do exceptions really work?

Instead of guessing, I decided to actually look at the generated code with a small piece of C++ code and a somewhat old Linux install.
class MyException
{
public:
    MyException() { }
    ~MyException() { }
};

void my_throwing_function(bool throwit)
{
    if (throwit)
        throw MyException();
}

void another_function();
void log(unsigned count);

void my_catching_function()
{
    log(0);
    try
    {
        log(1);
        another_function();
        log(2);
    }
    catch (const MyException& e)
    {
        log(3);
    }
    log(4);
}
I compiled it with g++ -m32 -W -Wall -O3 -save-temps -c, and looked at the generated assembly file.
.file "foo.cpp"
.section .text._ZN11MyExceptionD1Ev,"axG",#progbits,_ZN11MyExceptionD1Ev,comdat
.align 2
.p2align 4,,15
.weak _ZN11MyExceptionD1Ev
.type _ZN11MyExceptionD1Ev, #function
_ZN11MyExceptionD1Ev:
.LFB7:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
popl %ebp
ret
.LFE7:
.size _ZN11MyExceptionD1Ev, .-_ZN11MyExceptionD1Ev
_ZN11MyExceptionD1Ev is MyException::~MyException(), so the compiler decided it needed a non-inline copy of the destructor.
.globl __gxx_personality_v0
.globl _Unwind_Resume
.text
.align 2
.p2align 4,,15
.globl _Z20my_catching_functionv
.type _Z20my_catching_functionv, #function
_Z20my_catching_functionv:
.LFB9:
pushl %ebp
.LCFI2:
movl %esp, %ebp
.LCFI3:
pushl %ebx
.LCFI4:
subl $20, %esp
.LCFI5:
movl $0, (%esp)
.LEHB0:
call _Z3logj
.LEHE0:
movl $1, (%esp)
.LEHB1:
call _Z3logj
call _Z16another_functionv
movl $2, (%esp)
call _Z3logj
.LEHE1:
.L5:
movl $4, (%esp)
.LEHB2:
call _Z3logj
addl $20, %esp
popl %ebx
popl %ebp
ret
.L12:
subl $1, %edx
movl %eax, %ebx
je .L16
.L14:
movl %ebx, (%esp)
call _Unwind_Resume
.LEHE2:
.L16:
.L6:
movl %eax, (%esp)
call __cxa_begin_catch
movl $3, (%esp)
.LEHB3:
call _Z3logj
.LEHE3:
call __cxa_end_catch
.p2align 4,,3
jmp .L5
.L11:
.L8:
movl %eax, %ebx
.p2align 4,,6
call __cxa_end_catch
.p2align 4,,6
jmp .L14
.LFE9:
.size _Z20my_catching_functionv, .-_Z20my_catching_functionv
.section .gcc_except_table,"a",#progbits
.align 4
.LLSDA9:
.byte 0xff
.byte 0x0
.uleb128 .LLSDATT9-.LLSDATTD9
.LLSDATTD9:
.byte 0x1
.uleb128 .LLSDACSE9-.LLSDACSB9
.LLSDACSB9:
.uleb128 .LEHB0-.LFB9
.uleb128 .LEHE0-.LEHB0
.uleb128 0x0
.uleb128 0x0
.uleb128 .LEHB1-.LFB9
.uleb128 .LEHE1-.LEHB1
.uleb128 .L12-.LFB9
.uleb128 0x1
.uleb128 .LEHB2-.LFB9
.uleb128 .LEHE2-.LEHB2
.uleb128 0x0
.uleb128 0x0
.uleb128 .LEHB3-.LFB9
.uleb128 .LEHE3-.LEHB3
.uleb128 .L11-.LFB9
.uleb128 0x0
.LLSDACSE9:
.byte 0x1
.byte 0x0
.align 4
.long _ZTI11MyException
.LLSDATT9:
Surprise! There are no extra instructions at all on the normal code path. The compiler instead generated extra out-of-line fixup code blocks, referenced via a table at the end of the function (which is actually placed in a separate section of the executable). All the work is done behind the scenes by the standard library, based on these tables (_ZTI11MyException is the typeinfo for MyException).
OK, that was not actually a surprise for me, I already knew how this compiler did it. Continuing with the assembly output:
.text
.align 2
.p2align 4,,15
.globl _Z20my_throwing_functionb
.type _Z20my_throwing_functionb, #function
_Z20my_throwing_functionb:
.LFB8:
pushl %ebp
.LCFI6:
movl %esp, %ebp
.LCFI7:
subl $24, %esp
.LCFI8:
cmpb $0, 8(%ebp)
jne .L21
leave
ret
.L21:
movl $1, (%esp)
call __cxa_allocate_exception
movl $_ZN11MyExceptionD1Ev, 8(%esp)
movl $_ZTI11MyException, 4(%esp)
movl %eax, (%esp)
call __cxa_throw
.LFE8:
.size _Z20my_throwing_functionb, .-_Z20my_throwing_functionb
Here we see the code for throwing an exception. While there was no extra overhead simply because an exception might be thrown, there is obviously a lot of overhead in actually throwing and catching an exception. Most of it is hidden within __cxa_throw, which must:
Walk the stack with the help of the exception tables until it finds a handler for that exception.
Unwind the stack until it gets to that handler.
Actually call the handler.
Compare that with the cost of simply returning a value, and you see why exceptions should be used only for exceptional returns.
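To make those three steps concrete, here is a small, self-contained program (my illustration, not part of the original answer) where the runtime has to search past middle()'s frame, run its local destructor during unwinding, and finally enter main()'s handler:

#include <cstdio>
#include <stdexcept>

// Tracer lets us observe the unwinder running destructors (step 2).
struct Tracer {
    const char* name;
    explicit Tracer(const char* n) : name(n) { std::printf("construct %s\n", name); }
    ~Tracer() { std::printf("destroy %s (during unwinding)\n", name); }
};

void inner() { throw std::runtime_error("boom"); }

void middle() {
    Tracer t("middle's local");
    inner();                                 // never returns normally
    std::printf("not reached\n");
}

int main() {
    try {
        middle();
    } catch (const std::runtime_error& e) {  // the handler found in step 1, entered in step 3
        std::printf("caught: %s\n", e.what());
    }
}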
To finish, the rest of the assembly file:
.weak _ZTI11MyException
.section .rodata._ZTI11MyException,"aG",#progbits,_ZTI11MyException,comdat
.align 4
.type _ZTI11MyException, #object
.size _ZTI11MyException, 8
_ZTI11MyException:
.long _ZTVN10__cxxabiv117__class_type_infoE+8
.long _ZTS11MyException
.weak _ZTS11MyException
.section .rodata._ZTS11MyException,"aG",#progbits,_ZTS11MyException,comdat
.type _ZTS11MyException, #object
.size _ZTS11MyException, 14
_ZTS11MyException:
.string "11MyException"
The typeinfo data.
.section .eh_frame,"a",#progbits
.Lframe1:
.long .LECIE1-.LSCIE1
.LSCIE1:
.long 0x0
.byte 0x1
.string "zPL"
.uleb128 0x1
.sleb128 -4
.byte 0x8
.uleb128 0x6
.byte 0x0
.long __gxx_personality_v0
.byte 0x0
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x88
.uleb128 0x1
.align 4
.LECIE1:
.LSFDE3:
.long .LEFDE3-.LASFDE3
.LASFDE3:
.long .LASFDE3-.Lframe1
.long .LFB9
.long .LFE9-.LFB9
.uleb128 0x4
.long .LLSDA9
.byte 0x4
.long .LCFI2-.LFB9
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI3-.LCFI2
.byte 0xd
.uleb128 0x5
.byte 0x4
.long .LCFI5-.LCFI3
.byte 0x83
.uleb128 0x3
.align 4
.LEFDE3:
.LSFDE5:
.long .LEFDE5-.LASFDE5
.LASFDE5:
.long .LASFDE5-.Lframe1
.long .LFB8
.long .LFE8-.LFB8
.uleb128 0x4
.long 0x0
.byte 0x4
.long .LCFI6-.LFB8
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI7-.LCFI6
.byte 0xd
.uleb128 0x5
.align 4
.LEFDE5:
.ident "GCC: (GNU) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)"
.section .note.GNU-stack,"",#progbits
Even more exception handling tables, and assorted extra information.
So, the conclusion, at least for GCC on Linux: the cost is extra space (for the handlers and tables) whether or not exceptions are thrown, plus the extra cost of parsing the tables and executing the handlers when an exception is thrown. If you use exceptions instead of error codes, and an error is rare, it can be faster, since you do not have the overhead of testing for errors anymore.
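As a rough illustration of that last claim (a sketch of mine, not from the answer): with error codes, the happy path pays a test-and-branch at every level, while with exceptions the happy path carries no checks at all and the entire cost moves to the rare throw.

#include <cstdio>

int step_with_code(int x, int* err) {           // error-code style
    if (x < 0) { *err = 1; return 0; }
    return x + 1;
}

int chain_with_codes(int x) {
    int err = 0;
    int a = step_with_code(x, &err);
    if (err) return -1;                         // checked even when errors never occur
    int b = step_with_code(a, &err);
    if (err) return -1;                         // ...and again at every level
    return b;
}

int step_with_throw(int x) {                    // exception style
    if (x < 0) throw 1;
    return x + 1;
}

int chain_with_throws(int x) {
    return step_with_throw(step_with_throw(x)); // no per-call checks on the happy path
}

int main() {
    std::printf("%d %d\n", chain_with_codes(1), chain_with_throws(1));
}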
In case you want more information, in particular what all the __cxa_ functions do, see the original specification they came from:
Itanium C++ ABI

Exceptions being slow was true in the old days.
In most modern compilers this no longer holds true.
Note: just because we have exceptions does not mean we do not use error codes as well. When an error can be handled locally, use error codes. When errors require more context for correction, use exceptions. I wrote about it much more eloquently here: What are the principles guiding your exception handling policy?
The cost of exception handling code when no exceptions are being used is practically zero.
When an exception is thrown there is some work done.
But you have to compare this against the cost of returning error codes and checking them all the way back to the point where the error can be handled, which is more time-consuming both to write and to maintain.
Also, there is one gotcha for novices:
Though exception objects are supposed to be small, some people put lots of stuff inside them. Then you have the cost of copying the exception object. The solution is twofold:
Don't put extra stuff in your exception.
Catch by const reference.
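For example (my sketch, not the answerer's code), a lean exception type caught by const reference costs no copy at the catch site:

#include <iostream>
#include <stdexcept>

void f() { throw std::runtime_error("disk full"); }  // small, standard exception object

int main() {
    try {
        f();
    } catch (const std::runtime_error& e) {  // by const reference: no copy, no slicing
        std::cout << e.what() << '\n';
    }
    // catch (std::runtime_error e) would copy the object at the catch site,
    // and catching a derived exception by value would also slice it.
}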
I would bet that the same code with exceptions is either more efficient than, or at least comparable to, the code without exceptions (which needs all the extra code to check function error results). Remember, you are not getting anything for free: the compiler is generating the code you should have written in the first place to check error codes (and the compiler is usually much more efficient than a human).

There are a number of ways you could implement exceptions, but typically they will rely on some underlying support from the OS. On Windows this is the structured exception handling mechanism.
There is decent discussion of the details on Code Project: How a C++ compiler implements exception handling
The overhead of exceptions occurs because the compiler has to generate code to keep track of which objects must be destroyed in each stack frame (or, more precisely, scope) if an exception propagates out of that scope. If a function has no local variables on the stack that require destructors to be called, then it should not have a performance penalty with respect to exception handling.
Using a return code can only unwind a single level of the stack at a time, whereas an exception handling mechanism can jump much further back up the stack in one operation if there is nothing for it to do in the intermediate stack frames.

Matt Pietrek wrote an excellent article on Win32 Structured Exception Handling. While this article was originally written in 1997, it still applies today (but of course only applies to Windows).

This article examines the issue and basically finds that in practice there is a run-time cost to exceptions, although the cost is fairly low if the exception isn't thrown. Good article, recommended.

A friend of mine wrote a bit about how Visual C++ handles exceptions, some years ago.
http://www.xyzw.de/c160.html

All good answers.
Also, think about how much easier it is to debug code that does 'if checks' as gates at the top of methods instead of allowing the code to throw exceptions.
My motto is that it's easy to write code that works. The most important thing is to write the code for the next person who looks at it. In some cases, it's you in 9 months, and you don't want to be cursing your name!
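A trivial sketch of the "gates at the top" style mentioned above (my example; Record and Database are made-up types):

#include <string>

struct Record   { std::string payload; bool is_valid() const { return !payload.empty(); } };
struct Database { void write(const Record&) { /* ... */ } };

// "If checks" as gates at the top of the method: each precondition failure
// has an obvious, local exit that is easy to step through in a debugger.
bool save_record(const Record* r, Database* db) {
    if (r == nullptr)   return false;  // gate 1
    if (db == nullptr)  return false;  // gate 2
    if (!r->is_valid()) return false;  // gate 3
    db->write(*r);                     // main logic runs only when all gates pass
    return true;
}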

Related

How to remove "noise" from GCC/clang assembly output?

I want to inspect the assembly output of applying boost::variant in my code in order to see which intermediate calls are optimized away.
When I compile the following example (with GCC 5.3 using g++ -O3 -std=c++14 -S), it seems as if the compiler optimizes away everything and directly returns 100:
(...)
main:
.LFB9320:
.cfi_startproc
movl $100, %eax
ret
.cfi_endproc
(...)
#include <boost/variant.hpp>

struct Foo
{
    int get() { return 100; }
};

struct Bar
{
    int get() { return 999; }
};

using Variant = boost::variant<Foo, Bar>;

int run(Variant v)
{
    return boost::apply_visitor([](auto& x){ return x.get(); }, v);
}

int main()
{
    Foo f;
    return run(f);
}
However, the full assembly output contains much more than the above excerpt, which to me looks like it is never called. Is there a way to tell GCC/clang to remove all that "noise" and just output what is actually executed when the program is run?
full assembly output:
.file "main1.cpp"
.section .rodata.str1.8,"aMS",#progbits,1
.align 8
.LC0:
.string "/opt/boost/include/boost/variant/detail/forced_return.hpp"
.section .rodata.str1.1,"aMS",#progbits,1
.LC1:
.string "false"
.section .text.unlikely._ZN5boost6detail7variant13forced_returnIvEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIvEET_v,comdat
.LCOLDB2:
.section .text._ZN5boost6detail7variant13forced_returnIvEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIvEET_v,comdat
.LHOTB2:
.p2align 4,,15
.weak _ZN5boost6detail7variant13forced_returnIvEET_v
.type _ZN5boost6detail7variant13forced_returnIvEET_v, #function
_ZN5boost6detail7variant13forced_returnIvEET_v:
.LFB1197:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $_ZZN5boost6detail7variant13forced_returnIvEET_vE19__PRETTY_FUNCTION__, %ecx
movl $49, %edx
movl $.LC0, %esi
movl $.LC1, %edi
call __assert_fail
.cfi_endproc
.LFE1197:
.size _ZN5boost6detail7variant13forced_returnIvEET_v, .-_ZN5boost6detail7variant13forced_returnIvEET_v
.section .text.unlikely._ZN5boost6detail7variant13forced_returnIvEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIvEET_v,comdat
.LCOLDE2:
.section .text._ZN5boost6detail7variant13forced_returnIvEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIvEET_v,comdat
.LHOTE2:
.section .text.unlikely._ZN5boost6detail7variant13forced_returnIiEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIiEET_v,comdat
.LCOLDB3:
.section .text._ZN5boost6detail7variant13forced_returnIiEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIiEET_v,comdat
.LHOTB3:
.p2align 4,,15
.weak _ZN5boost6detail7variant13forced_returnIiEET_v
.type _ZN5boost6detail7variant13forced_returnIiEET_v, #function
_ZN5boost6detail7variant13forced_returnIiEET_v:
.LFB9757:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $_ZZN5boost6detail7variant13forced_returnIiEET_vE19__PRETTY_FUNCTION__, %ecx
movl $39, %edx
movl $.LC0, %esi
movl $.LC1, %edi
call __assert_fail
.cfi_endproc
.LFE9757:
.size _ZN5boost6detail7variant13forced_returnIiEET_v, .-_ZN5boost6detail7variant13forced_returnIiEET_v
.section .text.unlikely._ZN5boost6detail7variant13forced_returnIiEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIiEET_v,comdat
.LCOLDE3:
.section .text._ZN5boost6detail7variant13forced_returnIiEET_v,"axG",#progbits,_ZN5boost6detail7variant13forced_returnIiEET_v,comdat
.LHOTE3:
.section .text.unlikely,"ax",#progbits
.LCOLDB4:
.text
.LHOTB4:
.p2align 4,,15
.globl _Z3runN5boost7variantI3FooJ3BarEEE
.type _Z3runN5boost7variantI3FooJ3BarEEE, #function
_Z3runN5boost7variantI3FooJ3BarEEE:
.LFB9310:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl (%rdi), %eax
cltd
xorl %edx, %eax
cmpl $19, %eax
ja .L7
jmp *.L9(,%rax,8)
.section .rodata
.align 8
.align 4
.L9:
.quad .L30
.quad .L10
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.quad .L7
.text
.p2align 4,,10
.p2align 3
.L7:
call _ZN5boost6detail7variant13forced_returnIiEET_v
.p2align 4,,10
.p2align 3
.L30:
movl $100, %eax
.L8:
addq $8, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L10:
.cfi_restore_state
movl $999, %eax
jmp .L8
.cfi_endproc
.LFE9310:
.size _Z3runN5boost7variantI3FooJ3BarEEE, .-_Z3runN5boost7variantI3FooJ3BarEEE
.section .text.unlikely
.LCOLDE4:
.text
.LHOTE4:
.globl _Z3runN5boost7variantI3FooI3BarEEE
.set _Z3runN5boost7variantI3FooI3BarEEE,_Z3runN5boost7variantI3FooJ3BarEEE
.section .text.unlikely
.LCOLDB5:
.section .text.startup,"ax",#progbits
.LHOTB5:
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB9320:
.cfi_startproc
movl $100, %eax
ret
.cfi_endproc
.LFE9320:
.size main, .-main
.section .text.unlikely
.LCOLDE5:
.section .text.startup
.LHOTE5:
.section .rodata
.align 32
.type _ZZN5boost6detail7variant13forced_returnIvEET_vE19__PRETTY_FUNCTION__, #object
.size _ZZN5boost6detail7variant13forced_returnIvEET_vE19__PRETTY_FUNCTION__, 58
_ZZN5boost6detail7variant13forced_returnIvEET_vE19__PRETTY_FUNCTION__:
.string "T boost::detail::variant::forced_return() [with T = void]"
.align 32
.type _ZZN5boost6detail7variant13forced_returnIiEET_vE19__PRETTY_FUNCTION__, #object
.size _ZZN5boost6detail7variant13forced_returnIiEET_vE19__PRETTY_FUNCTION__, 57
_ZZN5boost6detail7variant13forced_returnIiEET_vE19__PRETTY_FUNCTION__:
.string "T boost::detail::variant::forced_return() [with T = int]"
.ident "GCC: (Ubuntu 5.3.0-3ubuntu1~14.04) 5.3.0 20151204"
.section .note.GNU-stack,"",#progbits
Stripping out the .cfi directives, unused labels, and comment lines is a solved problem: the scripts behind Matt Godbolt's compiler explorer are open source on its github project. It can even do colour highlighting to match source lines to asm lines (using the debug info).
You can set it up locally so you can feed it files that are part of your project, with all the #include paths and so on (using -I/...). That also lets you use it on private source code that you don't want to send out over the Internet.
Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” shows how to use it (it's pretty self-explanatory but has some neat features if you read the docs on github), and also how to read x86 asm, with a gentle introduction to x86 asm itself for total beginners, and to looking at compiler output. He goes on to show some neat compiler optimizations (e.g. for dividing by a constant), and what kind of functions give useful asm output for looking at optimized compiler output (function args, not int a = 123;).
On the Godbolt compiler explorer, it can be useful to use -g0 -fno-asynchronous-unwind-tables if you want to uncheck the filter option for directives, e.g. because you want to see the .section and .p2align stuff in the compiler output. The default is to add -g to your options to get the debug info it uses to colour-highlight matching source and asm lines, but this means .cfi directives for every stack operation, and .loc for every source line, among other things.
With plain gcc/clang (not g++), -fno-asynchronous-unwind-tables avoids .cfi directives. Possibly also useful: -fno-exceptions -fno-rtti -masm=intel. Make sure to omit -g.
Copy/paste this for local use:
g++ -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -fverbose-asm \
-Wall -Wextra foo.cpp -O3 -masm=intel -S -o- | less
Or -Os can be more readable, e.g. using div for division by non-power-of-2 constants instead of a multiplicative inverse even though that's a lot worse for performance and only a bit smaller, if at all.
But really, I'd recommend just using Godbolt directly (online or set it up locally)! You can quickly flip between versions of gcc and clang to see if old or new compilers do something dumb. (Or what ICC does, or even what MSVC does.) There's even ARM / ARM64 gcc 6.3, and various gcc for PowerPC, MIPS, AVR, MSP430. (It can be interesting to see what happens on a machine where int is wider than a register, or isn't 32-bit. Or on a RISC vs. x86).
For C instead of C++, you can use -xc -std=gnu11 to avoid flipping the language drop-down to C, which resets your source pane and compiler choices, and has a different set of compilers available.
Useful compiler options for making asm for human consumption:
Remember, your code only has to compile, not link: passing a pointer to an external function like void ext(void*p) is a good way to stop something from optimizing away. You only need a prototype for it, with no definition, so the compiler can't inline it or make any assumptions about what it does. (Or inline asm like benchmark::DoNotOptimize can force a compiler to materialize a value in a register, or forget that it is a known constant, if you know GNU C inline asm syntax well enough to use constraints to control the effect you're having on the compiler.)
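A common form of such an inline-asm helper looks like this (a sketch assuming GNU asm syntax, in the spirit of benchmark::DoNotOptimize; the names escape/clobber are just convention):

// The empty asm claims to read the pointed-to object ("g" constraint) and
// to clobber all of memory, so the compiler must materialize the value and
// can't assume it still knows what memory holds afterwards.
static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

static void clobber() {
    asm volatile("" : : : "memory");   // pretend all memory may have been touched
}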
I'd recommend using -O3 -Wall -Wextra -fverbose-asm -march=haswell for looking at code. (-fverbose-asm can just make the source look noisy, though, when all you get are numbered temporaries as names for the operands.) When you're fiddling with the source to see how it changes the asm, you definitely want compiler warnings enabled. You don't want to waste time scratching your head over the asm when the explanation is that you did something that deserves a warning in the source.
To see how the calling convention works, you often want to look at caller and callee without inlining.
You can use __attribute__((noipa)) foo_t foo(bar_t x) { ... } on a definition, or compile with gcc -O3 -fno-inline-functions -fno-inline-functions-called-once -fno-inline-small-functions to disable inlining. (But those command line options don't disable cloning a function for constant-propagation. noipa = no Inter-Procedural Analysis. It's even stronger than __attribute__((noinline,noclone)).) See From compiler perspective, how is reference for array dealt with, and, why passing by value(not decay) is not allowed? for an example.
Or if you just want to see how functions pass / receive args of different types, you could use different names but the same prototype so the compiler doesn't have a definition to inline. This works with any compiler. Without a definition, a function is just a black box to the optimizer, governed only by the calling convention / ABI.
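For instance (my sketch; the names are hypothetical), these declarations compile but never link, which is exactly what you want for looking at argument passing:

struct Pair { long a, b; };

Pair make_pair_elsewhere(long x);       // declared, never defined: a black box
void consume(Pair p);                   // ditto

void demo(long x) {
    consume(make_pair_elsewhere(x));    // the asm shows how Pair travels per the ABI
}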
-ffast-math will get many libm functions to inline, some to a single instruction (esp. with SSE4 available for roundsd). Some will inline with just -fno-math-errno, or other "safer" parts of -ffast-math, without the parts that allow the compiler to round differently. If you have FP code, definitely look at it with/without -ffast-math. If you can't safely enable any of -ffast-math in your regular build, maybe you'll get an idea for a safe change you can make in the source to allow the same optimization without -ffast-math.
-O3 -fno-tree-vectorize will optimize without auto-vectorizing, so you can get full optimization minus auto-vectorization if you want to compare with -O2 (which doesn't enable autovectorization on gcc11 and earlier, but does on all clang versions).
-Os (optimize for size and speed) can be helpful to keep the code more compact, which means less code to understand. clang's -Oz optimizes for size even when it hurts speed, even using push 1 / pop rax instead of mov eax, 1, so that's only interesting for code golf.
Even -Og (minimal optimization) might be what you want to look at, depending on your goals. -O0 is full of store/reload noise, which makes it harder to follow, unless you use register vars. The only upside is that each C statement compiles to a separate block of instructions, and it makes -fverbose-asm able to use the actual C var names.
clang unrolls loops by default, so -fno-unroll-loops can be useful in complex functions. You can get a sense of "what the compiler did" without having to wade through the unrolled loops. (gcc enables -funroll-loops with -fprofile-use, but not with -O3). (This is a suggestion for human-readable code, not for code that would run faster.)
Definitely enable some level of optimization, unless you specifically want to know what -O0 did. Its "predictable debug behaviour" requirement makes the compiler store/reload everything between every C statement, so you can modify C variables with a debugger and even "jump" to a different source line within the same function, and have execution continue as if you did that in the C source. -O0 output is so noisy with stores/reloads (and so slow) not just from lack of optimization, but forced de-optimization to support debugging.
To get a mix of source and asm, use gcc -Wa,-adhln -c -g foo.c | less to pass extra options to as. (More discussion of this in a blog post, and another blog.). Note that the output of this isn't valid assembler input, because the C source is there directly, not as an assembler comment. So don't call it a .s. A .lst might make sense if you want to save it to a file.
Godbolt's color highlighting serves a similar purpose, and is great at helping you see when multiple non-contiguous asm instructions come from the same source line. I haven't used that gcc listing command at all, so IDK how well it does, and how easy it is for the eye to see, in that case.
I like the high code density of godbolt's asm pane, so I don't think I'd like having source lines mixed in. At least not for simple functions. Maybe with a function that was too complex to get a handle on the overall structure of what the asm does...
And remember, when you want to just look at the asm, leave out the main() and the compile-time constants. You want to see the code for dealing with a function arg in a register, not for the code after constant-propagation turns it into return 42, or at least optimizes away some stuff.
Removing static and/or inline from functions will produce a stand-alone definition for them, as well as a definition for any callers, so you can just look at that.
Don't put your code in a function called main(). gcc knows that main is special and assumes it will only be called once, so it marks it as "cold" and optimizes it less.
The other thing you can do: If you did make a main(), you can run it and use a debugger. stepi (si) steps by instruction. See the bottom of the x86 tag wiki for instructions. But remember that code might optimize away after inlining into main with compile-time-constant args.
__attribute__((noinline)) may help on a function that you don't want inlined. gcc will also make constant-propagation clones of functions, i.e. a special version with one of the args as a constant, for call-sites that know they're passing a constant. The symbol name will be .clone.foo.constprop_1234 or something in the asm output. You can use __attribute__((noclone)) to disable that, too.
For example
If you want to see how the compiler multiplies two integers: I put the following code on the Godbolt compiler explorer to get the asm (from gcc -O3 -march=haswell -fverbose-asm) for the wrong way and the right way to test this.
// the wrong way, which people often write when they're used to creating a runnable test-case with a main() and a printf
// or worse, people will actually look at the asm for such a main()
int constants() { int a = 10, b = 20; return a * b; }
mov eax, 200 #,
ret # compiles the same as return 200; not interesting
// the right way: compiler doesn't know anything about the inputs
// so we get asm like what would happen when this inlines into a bigger function.
int variables(int a, int b) { return a * b; }
mov eax, edi # D.2345, a
imul eax, esi # D.2345, b
ret
(This mix of asm and C was hand-crafted by copy-pasting the asm output from godbolt into the right place. I find it's a good way to show how a short function compiles in SO answers / compiler bug reports / emails.)
You can always look at the generated assembly from the object file, instead of using the compiler's assembly output. objdump comes to mind.
You can even tell objdump to intermix source with assembly, making it easier to figure out what source line corresponds to what instructions. Example session:
$ cat test.cc
int foo(int arg)
{
    return arg + 1;
}
$ g++ -g -O3 -std=c++14 -c test.cc -o test.o && objdump -dS -M intel test.o
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z3fooi>:
int foo(int arg)
{
return arg + 1;
0: 8d 47 01 lea eax,[rdi+0x1]
}
3: c3 ret
Explanation of objdump flags:
-d disassembles all executable sections
-S intermixes assembly with source (-g required while compiling with g++)
-M intel chooses Intel syntax over the ugly AT&T syntax (optional)
I like to insert labels that I can easily grep out of the objdump output.
int main() {
    asm volatile ("interesting_part_begin%=:":);
    do_something();
    asm volatile ("interesting_part_end%=:":);
}
I haven't had a problem with this yet, but asm volatile can be very hard on a compiler's optimizer because it tends to leave such code untouched.

assembly output of a simple C++ program

I am trying to understand the assembly output of a simple C++ program. This is my C++ program.
void func()
{}

int main()
{
    func();
}
When I use g++ with the --save-temps option to get the assembly code for the above program, I get the following assembly code.
.file "main.cpp"
.text
.globl _Z4funcv
.type _Z4funcv, #function
_Z4funcv:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size _Z4funcv, .-_Z4funcv
.globl main
.type main, #function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
call _Z4funcv
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",#progbits
According to my knowledge of assembly, an assembly program should have three sections: data, text, and bss. Also, the text section should start with 'global _start'. I can't see any of them in this assembly code.
Can someone please help me understand the above assembly code? If you can relate it to the C++ code as well, that would be great.
Any kind of help is greatly appreciated.
Well, here it is line by line...
.file "main.cpp" # Debugging info (not essential)
.text # Start of text section (i.e. your code)
.globl _Z4funcv # Let the function _Z4funcv be callable
# from outside (e.g. from your main routine)
.type _Z4funcv, #function # Debugging info (possibly not essential)
_Z4funcv: # _Z4funcv is effectively the "name" of your
# function (C++ "mangles" the name; exactly
# how depends on your compiler -- Google "C++
# name mangling" for more).
.LFB0: # Debugging info (possibly not essential)
.cfi_startproc # Provides additional debug info (ditto)
pushq %rbp # Store base pointer of caller function
# (standard function prologue -- Google
# "calling convention" or "cdecl")
.cfi_def_cfa_offset 16 # Provides additional debug info (ditto)
.cfi_offset 6, -16 # Provides additional debug info (ditto)
movq %rsp, %rbp # Reset base pointer to a sensible place
# for this function to put its local
# variables (if any). Standard function
# prologue.
.cfi_def_cfa_register 6 # Debug ...
popq %rbp # Restore the caller's base pointer
# Standard function epilogue
.cfi_def_cfa 7, 8 # Debug...
ret # Return from function
.cfi_endproc # Debug...
.LFE0: # Debug...
.size _Z4funcv, .-_Z4funcv # Debug...
.globl main # Declares that the main function
# is callable from outside
.type main, #function # Debug...
main: # Your main routine (name not mangled)
.LFB1: # Debug...
.cfi_startproc # Debug...
pushq %rbp # Store caller's base pointer
# (standard prologue)
.cfi_def_cfa_offset 16 # Debug...
.cfi_offset 6, -16 # Debug...
movq %rsp, %rbp # Reset base pointer
# (standard prologue)
.cfi_def_cfa_register 6 # Debug...
call _Z4funcv # Call `func` (note name mangled)
movl $0, %eax # Put `0` in eax (eax is return value)
popq %rbp # Restore caller's base pointer
# (standard epilogue)
.cfi_def_cfa 7, 8 # Debug...
ret # Return from main function
.cfi_endproc # Debug...
.LFE1:
.size main, .-main # Debug...
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2" # fluff
.section .note.GNU-stack,"",#progbits # fluff
The linker knows to look for main (and not start) if it is using the standard C or C++ library (which it usually is, unless you tell it otherwise). It links some stub code (which contains start) into the final executable.
So, really, the only important bits are...
.text
.globl _Z4funcv
_Z4funcv:
pushq %rbp
movq %rsp, %rbp
popq %rbp
ret
.globl main
main:
pushq %rbp
movq %rsp, %rbp
call _Z4funcv
movl $0, %eax
popq %rbp
ret
If you want to start from scratch, and not have all the complicated standard library stuff getting in the way of your discovery, you can do something like this and achieve the same result as your C++ code:
.text
.globl _func
_func: # Just as above, really
push %ebp
mov %esp, %ebp
pop %ebp
ret
.globl _start
_start: # A few changes here
push %ebp
mov %esp, %ebp
call _func
movl $1, %eax # Invoke the Linux 'exit' syscall
movl $0, %ebx # With a return value of 0 (pick any char!)
int $0x80 # Actual invocation
The exit syscall is a bit painful, but necessary. If you don't have it, the processor tries to keep going and runs whatever code lies "past" your code. As that could be important code or data, the machine should stop you with a segmentation fault. Having the exit call avoids all this. If you are using the standard library (as happens automatically in your C++ example), the exit stuff is taken care of by the linker.
Compile with gcc -nostdlib -o test test.s (noting that gcc is specifically told not to use the standard library). I should say that this is for a 32-bit system, and quite likely will not work on 64-bit. I don't have a 64-bit system to test on, but perhaps some helpful StackOverflower will chip in with a 64-bit translation.

how does the std::sqrt() function work? [duplicate]

This question already has answers here:
How is the square root function implemented? [closed]
(15 answers)
Closed 4 years ago.
Does anyone know how the std::sqrt() function works? (or at least have an idea?)
I've seen methods on the internet that seemed really slow, using lots of approximations and iterations.
Everyone knows the sqrt() function is slow, but I'd like to know how the one from std works so I can have a vague idea of when it is beneficial to avoid it. (Yes, if I want to be sure I can profile, but it's still nice to have a vague idea.)
EDIT: Didn't really formulate the question too well... What I'm interested in:
what would the fastest C++ function for calculating a square root look like? (More or less; I just want to know the actual logic behind it.)
Nowadays, on modern machines, floating point functions are passed off to the hardware (floating point unit or math-coprocessor).
Sometimes, it uses what the CPU offers:
$ cat main.cc
#include <cmath>
#include <ctime>
#include <cstdlib>
int main(){
    srand (clock());
    const double d = rand();
    return std::sqrt(d) > 2 ? 1 : 0;
}
(the blahblah is just so nothing relevant is optimized away, don't run that program!)
$ g++ -S main.cc
$ cat main.s
.file "main.cc"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB106:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
call clock
movl %eax, %edi
call srand
call rand
cvtsi2sd %eax, %xmm1
sqrtsd %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp .L5
.L2:
xorl %eax, %eax
ucomisd .LC0(%rip), %xmm0
seta %al
addq $8, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L5:
.cfi_restore_state
movapd %xmm1, %xmm0
call sqrt
jmp .L2
.cfi_endproc
.LFE106:
.size main, .-main
.section .rodata.cst8,"aM",#progbits,8
.align 8
.LC0:
.long 0
.long 1073741824
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",#progbits
(hint: it is using a sqrt-cpu-instruction)
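For instance (my example, not from the answer), a bare wrapper typically compiles to a single hardware instruction once errno bookkeeping is disabled:

#include <cmath>

// With g++ -O2 -fno-math-errno on x86-64 this typically compiles to just
//     sqrtsd xmm0, xmm0
//     ret
// Without -fno-math-errno, a fallback call to sqrt() is kept for negative
// inputs so errno can be set, as the ucomisd/jp sequence above shows.
double my_sqrt(double x) { return std::sqrt(x); }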
How the sqrt() function works behind the scenes (as a simple approach): it repeatedly checks mid-points of an interval.
Example: sqrt(16) = 4 and sqrt(4) = 2.
Now for any input between 4 and 16, such as sqrt(10), the answer lies between 2 and 4.
It finds the midpoint of 2 and 4, call it x, then narrows to the half-interval that must contain the root (here, between x and 4; the lower bound is excluded for this input). It repeats this step again and again until it converges on the answer, i.e. sqrt(10) == 3.16227766017, which lies between 2 and 4. Such built-in functions are ultimately derived from calculus: differentiation and integration.
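Written out as code, the interval-halving idea described above looks roughly like this (my sketch; real library implementations use the hardware instruction or faster-converging methods like Newton-Raphson):

#include <cstdio>

// Bisection: repeatedly halve an interval [lo, hi] known to contain sqrt(n).
double bisect_sqrt(double n) {
    double lo = 0.0, hi = (n < 1.0) ? 1.0 : n;
    for (int i = 0; i < 64; ++i) {       // enough halvings for double precision
        double mid = (lo + hi) / 2.0;
        if (mid * mid < n) lo = mid;     // the root lies in the upper half
        else               hi = mid;     // the root lies in the lower half
    }
    return (lo + hi) / 2.0;
}

int main() { std::printf("%.11f\n", bisect_sqrt(10.0)); }  // ~3.16227766017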
The standard does not specify a particular implementation.
One option is to look at a typical implementation, but you'll probably find it's heavily-optimised assembler.

Questions re: assembly generated from my C++ by gcc

Compiling this code:
int main ()
{
    return 0;
}
using:
gcc -S filename.cpp
...generates this assembly:
.file "heloworld.cpp"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
movl $0, %eax
popl %ebp
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
My questions:
Is everything after "." a comment?
What is .LFB0:?
What is .LFE0:?
Why is the code so big just for "int main ()" and "return 0;"?
P.S. I have read a lot of assembly e-books and a lot (at least 30) of tutorials, and all I can do is copy code and paste it or rewrite it. Now I'm trying a different approach to learn it somehow. The problem is that I do understand what movl, pop, etc. are, but I don't understand how to combine these things to make code "flow". I don't know where or how to correctly start writing a program in asm. I'm still static, not dynamic as in C++, but I want to learn assembly.
As others have said, .file, .text, ... are assembler directives and .LFB0, .LFE0 are local labels. The only instructions in the generated code are:
pushl %ebp
movl %esp, %ebp
movl $0, %eax
popl %ebp
ret
The first two instructions are the function prologue: the frame pointer is stored on the stack and updated. The next instruction stores 0 in the eax register (the i386 ABI states that integer return values are returned via the eax register). The last two instructions are the function epilogue: the frame pointer is restored, and then the function returns to its caller via the ret instruction.
If you compile your code with -O3 -fomit-frame-pointer, the code will be compiled to just two instructions:
xorl %eax,%eax
ret
The first sets eax to 0 (it takes only two bytes to encode, while movl $0,%eax takes 5 bytes), and the second is the ret instruction. The frame pointer manipulation is there to ease debugging (it is possible to get a backtrace without it, but it is more difficult).
.file, .text, etc are assembler directives.
.LFB0, .LFE0 are local labels, which are normally used as branch destinations within a function.
As for the size, there are really only a few actual instructions - most of the above listing consists of directives, etc. For future reference, you might also want to turn up the optimisation level to remove otherwise redundant instructions, i.e. gcc -Wall -O3 -S ....
It's just that there's a lot going on behind your simple program.
If you intend to read assembler output, by no means compile C++. Use plain C; the output is far clearer, for a number of reasons.

while (1) Vs. for (;;) Is there a speed difference?

Long version...
A co-worker asserted today, after seeing my use of while (1) in a Perl script, that for (;;) is faster. I argued that they should be the same, hoping that the interpreter would optimize out any differences. I set up a script that would run 1,000,000,000 for loop iterations and the same number of while loops and record the time between. I could find no appreciable difference. My co-worker said that a professor had told him that the while (1) was doing a comparison 1 == 1 and the for (;;) was not. We repeated the same test with 100x the number of iterations in C++ and the difference was negligible. It was, however, a graphic example of how much faster compiled code can be vs. a scripting language.
Short version...
Is there any reason to prefer a while (1) over a for (;;) if you need an infinite loop to break out of?
Note: If it's not clear from the question. This was purely a fun academic discussion between a couple of friends. I am aware this is not a super important concept that all programmers should agonize over. Thanks for all the great answers I (and I'm sure others) have learned a few things from this discussion.
Update: The aforementioned co-worker weighed in with a response below.
Quoted here in case it gets buried.
It came from an AMD assembly programmer. He stated that C programmers
(the people) don't realize that their code has inefficiencies. He said
today though, gcc compilers are very good, and put people like him out
of business. He said for example, and told me about the while 1 vs
for(;;). I use it now out of habit but gcc and especially interpreters
will do the same operation (a processor jump) for both these days,
since they are optimized.
In perl, they result in the same opcodes:
$ perl -MO=Concise -e 'for(;;) { print "foo\n" }'
a <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 2 -e:1) v ->3
9 <2> leaveloop vK/2 ->a
3 <{> enterloop(next->8 last->9 redo->4) v ->4
- <#> lineseq vK ->9
4 <;> nextstate(main 1 -e:1) v ->5
7 <#> print vK ->8
5 <0> pushmark s ->6
6 <$> const[PV "foo\n"] s ->7
8 <0> unstack v ->4
-e syntax OK
$ perl -MO=Concise -e 'while(1) { print "foo\n" }'
a <#> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 2 -e:1) v ->3
9 <2> leaveloop vK/2 ->a
3 <{> enterloop(next->8 last->9 redo->4) v ->4
- <#> lineseq vK ->9
4 <;> nextstate(main 1 -e:1) v ->5
7 <#> print vK ->8
5 <0> pushmark s ->6
6 <$> const[PV "foo\n"] s ->7
8 <0> unstack v ->4
-e syntax OK
Likewise in GCC:
#include <stdio.h>

void t_while() {
    while(1)
        printf("foo\n");
}

void t_for() {
    for(;;)
        printf("foo\n");
}
.file "test.c"
.section .rodata
.LC0:
.string "foo"
.text
.globl t_while
.type t_while, #function
t_while:
.LFB2:
pushq %rbp
.LCFI0:
movq %rsp, %rbp
.LCFI1:
.L2:
movl $.LC0, %edi
call puts
jmp .L2
.LFE2:
.size t_while, .-t_while
.globl t_for
.type t_for, #function
t_for:
.LFB3:
pushq %rbp
.LCFI2:
movq %rsp, %rbp
.LCFI3:
.L5:
movl $.LC0, %edi
call puts
jmp .L5
.LFE3:
.size t_for, .-t_for
.section .eh_frame,"a",#progbits
.Lframe1:
.long .LECIE1-.LSCIE1
.LSCIE1:
.long 0x0
.byte 0x1
.string "zR"
.uleb128 0x1
.sleb128 -8
.byte 0x10
.uleb128 0x1
.byte 0x3
.byte 0xc
.uleb128 0x7
.uleb128 0x8
.byte 0x90
.uleb128 0x1
.align 8
.LECIE1:
.LSFDE1:
.long .LEFDE1-.LASFDE1
.LASFDE1:
.long .LASFDE1-.Lframe1
.long .LFB2
.long .LFE2-.LFB2
.uleb128 0x0
.byte 0x4
.long .LCFI0-.LFB2
.byte 0xe
.uleb128 0x10
.byte 0x86
.uleb128 0x2
.byte 0x4
.long .LCFI1-.LCFI0
.byte 0xd
.uleb128 0x6
.align 8
.LEFDE1:
.LSFDE3:
.long .LEFDE3-.LASFDE3
.LASFDE3:
.long .LASFDE3-.Lframe1
.long .LFB3
.long .LFE3-.LFB3
.uleb128 0x0
.byte 0x4
.long .LCFI2-.LFB3
.byte 0xe
.uleb128 0x10
.byte 0x86
.uleb128 0x2
.byte 0x4
.long .LCFI3-.LCFI2
.byte 0xd
.uleb128 0x6
.align 8
.LEFDE3:
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",#progbits
So I guess the answer is, they're the same in many compilers. Of course, for some other compilers this may not necessarily be the case, but chances are the code inside of the loop is going to be a few thousand times more expensive than the loop itself anyway, so who cares?
There's not much reason to prefer one over the other. I do think that while(1) and particularly while(true) are more readable than for(;;), but that's just my preference.
Using GCC, they both seem to compile to the same assembly language:
L2:
jmp L2
There is no difference according to the standard. 6.5.3/1 has:
The for statement
for ( for-init-statement ; condition_opt ; expression_opt ) statement
is equivalent to
{
    for-init-statement
    while ( condition ) {
        statement
        expression ;
    }
}
And 6.5.3/2 has:
Either or both of the condition and the expression can be omitted. A missing condition makes the implied while clause equivalent to while(true).
So according to the C++ standard the code:
for (;;);
is exactly the same as:
{
    while (true) {
        ;
        ;
    }
}
for(;;) is one less character to type if you want to go in that direction to optimize things.
The Visual C++ compiler used to emit a warning for
while (1)
(constant expression) but not for
for (;;)
I've continued the practice of preferring for (;;) for that reason, but I don't know if the compiler still does that these days.
With old compilers such as Turbo C, for(;;) resulted in faster code than while(1).
Today gcc, Visual C, and (I think) almost all compilers optimize well, and CPUs running at 4.7 MHz are rarely used.
In those days a for( i=10; i; i-- ) was faster than for( i=1; i<=10; i++ ), because comparing i with 0 results in a conditional jump on the CPU zero flag, and the zero flag is already set by the last decrement operation ( i-- ), so no extra cmp operation is needed:
call __printf_chk
decl %ebx %ebx=iterator i
jnz .L2
movl -4(%ebp), %ebx
leave
and here with for(i=1; i<=10; i++) with extra cmpl:
call __printf_chk
incl %ebx
cmpl $11, %ebx
jne .L2
movl -4(%ebp), %ebx
leave
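The answer shows only the interesting part of the loop bodies; C that produces assembly of roughly that shape would look like this (my reconstruction, not the original poster's source):

#include <stdio.h>

void count_down(void) {
    for (int i = 10; i; i--)       /* the exit test reuses the zero flag set by i-- */
        printf("%d\n", i);
}

void count_up(void) {
    for (int i = 1; i <= 10; i++)  /* needs the extra cmpl $11 every iteration */
        printf("%d\n", i);
}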
For all the people arguing that you shouldn't use indefinite while loops, and suggesting daft stuff like using open gotos (seriously, ouch):
while (1) {
    last if( condition1 );
    code();
    more_code();
    last if( condition2 );
    even_more_code();
}
This can't really be represented effectively any other way, not without creating an exit variable and doing black magic to keep it synced.
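For reference, the C++ rendering of the Perl loop above uses break the same way (my translation; the function names mirror the Perl placeholders):

bool condition1();                        // placeholders, as in the Perl snippet
bool condition2();
void code(); void more_code(); void even_more_code();

void event_loop() {
    while (true) {
        if (condition1()) break;          // exits from the middle of the body,
        code();                           // which a plain loop condition cannot
        more_code();                      // express without an extra flag variable
        if (condition2()) break;
        even_more_code();
    }
}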
If you have a penchant for the more goto-esque syntax, use something sane that limits scope.
flow: {
    if ( condition ){
        redo flow;
    }
    if ( othercondition ){
        redo flow;
    }
    if ( earlyexit ){
        last flow;
    }
    something(); # doesn't execute when earlyexit is true
}
Ultimately, speed is not that important.
Worrying about how speed-efficient different looping constructs are is a massive waste of time: premature optimization through and through. I can't think of any situation I've seen where profiling code found a bottleneck in my choice of looping construct.
Generally it's the how of the loop and the what of the loop that matter, not the loop keyword itself.
You should "optimize" for readability and succinctness, and write whatever is best at explaining the problem to the next poor sucker who finds your code.
If you use the "goto LABEL" trick somebody mentioned, and I have to use your code, be prepared to sleep with one eye open, especially if you do it more than once, because that sort of stuff creates horrifically spaghetti code.
Just because you can create spaghetti code doesn't mean you should
If the compiler didn't do any optimization, for(;;) would always be faster than while(true), because the while statement evaluates its condition every time, while the for statement is an unconditional jump. But if the compiler optimizes the control flow, both generate the same opcodes. You can read the disassembly code very easily to check.
P.S. You could write an infinite loop like this:
#define EVER ;;
//...
for (EVER) {
    //...
}
From Stroustrup, TC++PL (3rd edition), §6.1.1:
The curious notation for (;;) is the standard way to specify an infinite loop; you could pronounce it "forever". [...] while (true) is an alternative.
I prefer for (;;).
I heard about this once.
It came from an AMD assembly programmer. He stated that C programmers (the people) don't realize that their code has inefficiencies. He said today though, gcc compilers are very good, and put people like him out of business. He said for example, and told me about the while 1 vs for(;;). I use it now out of habit but gcc and especially interpreters will do the same operation (a processor jump) for both these days, since they are optimized.
In an optimized build of a compiled language, there should be no appreciable difference between the two. Neither should end up performing any comparisons at runtime, they will just execute the loop code until you manually exit the loop (e.g. with a break).
Just came across this thread (although quite a few years late).
I think I found the actual reason why "for(;;)" is better than "while(1)".
According to the Barr coding standard 2018:
Kernighan & Ritchie long ago recommended for (;;), which has the additional benefit of insuring against the visually-confusing defect of a while (l); referencing a variable 'l'.
Basically, this is not a speed issue but a readability issue. Depending on the font/print of the code, the number one (1) in a while may look like the lowercase letter l,
i.e. 1 vs. l (in some fonts these look identical).
So while(1) may look like a while loop dependent on a variable named l.
while(true) may also work, but in some older C and embedded C cases true/false are not defined unless stdbool.h is included.
I'm surprised no one has offered the more direct form, corresponding to the desired assembly:
forever:
do stuff;
goto forever;
I am surprised that nobody properly tested for (;;) versus while (1) in perl!
Because perl is an interpreted language, the time to run a perl script consists not only of the execution phase (which in this case is the same) but also of the interpretation phase before execution. Both phases have to be taken into account when making a speed comparison.
Luckily perl has a convenient Benchmark module which we can use to implement a benchmark such as follows:
#!/usr/bin/perl -w
use Benchmark qw( cmpthese );
sub t_for { eval 'die; for (;;) { }'; }
sub t_for2 { eval 'die; for  (;;) { }'; }
sub t_while { eval 'die; while (1) { }'; }
cmpthese(-60, { for => \&t_for, for2 => \&t_for2, while => \&t_while });
Note that I am testing two different versions of the infinite for loop: one which is shorter than the while loop and another one which has an extra space to make it the same length as the while loop.
On Ubuntu 11.04 x86_64 with perl 5.10.1 I get the following results:
Rate for for2 while
for 100588/s -- -0% -2%
for2 100937/s 0% -- -1%
while 102147/s 2% 1% --
The while loop is clearly the winner on this platform.
On FreeBSD 8.2 x86_64 with perl 5.14.1:
Rate for for2 while
for 53453/s -- -0% -2%
for2 53552/s 0% -- -2%
while 54564/s 2% 2% --
While loop is the winner here too.
On FreeBSD 8.2 i386 with perl 5.14.1:
Rate while for for2
while 24311/s -- -1% -1%
for 24481/s 1% -- -1%
for2 24637/s 1% 1% --
Surprisingly the for loop with an extra space is the fastest choice here!
My conclusion is that the while loop should be used on x86_64 platform if the programmer is optimizing for speed. Obviously a for loop should be used when optimizing for space. My results are unfortunately inconclusive regarding other platforms.
In theory, a completely naive compiler could store the literal '1' in the binary (wasting space) and check to see if 1 == 0 every iteration (wasting time and more space).
In reality, however, even with "no" optimizations, compilers will still reduce both to the same code. They may also emit a warning, because a constant condition could indicate a logical error. For instance, the argument of while could be defined somewhere else, and you might not realize it's constant.
while(1) is an idiom for for(;;) which is recognized by most compilers.
I was glad to see that perl recognizes until(0), too.
To summarize the for (;;) vs. while (1) debate: it is obvious that the former was faster in the days of old, non-optimizing compilers, which is why you tend to see it in older code bases such as the Lions commentary on the Unix source code. In the age of modern optimizing compilers, however, those gains are optimized away; coupled with the fact that the latter is easier to understand, I believe while (1) would be preferable.
I would think that both are the same in terms of performance, but I would prefer while(1) for readability. That said, I question why you need an infinite loop.
They are the same. There are much more important questions to ponder.
My point, which was implied but not made explicit above, is that a decent compiler will generate the exact same code for both loop forms. The bigger point is that the looping construct is a minor part of the run time of any algorithm: you must first ensure that you have optimized the algorithm and everything else related to it. Optimizing your loop construct should absolutely be at the bottom of your priority list.