C++: doubles, precision, virtual machines and GCC

I have the following piece of code:
#include <cstdio>

int main()
{
    if ((1.0 + 0.1) != (1.0 + 0.1))
        printf("not equal\n");
    else
        printf("equal\n");
    return 0;
}
When compiled with -O3 using gcc (4.4, 4.5 and 4.6) and run natively (Ubuntu 10.10), it prints the expected result of "equal".
However, when the same code is compiled as described above and run on a virtual machine (Ubuntu 10.10, VirtualBox image), it outputs "not equal". This happens when the -O3 or -O2 flags are set, but not with -O1 and below. When compiled with clang (-O3 and -O2) and run on the virtual machine, I get the correct result.
I understand 1.1 can't be represented exactly as a double, and I've read "What Every Computer Scientist Should Know About Floating-Point Arithmetic", so please don't point me there. This seems to be some kind of optimisation that GCC does that somehow doesn't work in virtual machines.
Any ideas?
Note: the C++ standard says type promotion in these situations is implementation-dependent. Could it be that GCC uses a more precise internal representation, so that the inequality holds because of the extra precision?
UPDATE1: The following modification of the above piece of code now produces the correct result. It seems that at some point, for whatever reason, the floating point control word is not in the expected mode; setting it explicitly to double precision fixes the comparison.
#include <cstdio>

// Set the x87 control word to 0x27F: precision control = double
// (53-bit mantissa) instead of the default extended precision.
void set_dpfpu() { unsigned int mode = 0x27F; asm ("fldcw %0" : : "m" (*&mode)); }

int main()
{
    set_dpfpu();
    if ((1.0 + 0.1) != (1.0 + 0.1))
        printf("not equal\n");
    else
        printf("equal\n");
    return 0;
}
UPDATE2: For those asking about the constant-expression nature of the code, I've changed it as follows, and it still fails when compiled with GCC. But I assume the optimizer may be turning the following into a constant expression too.
#include <cstdio>

void set_dpfpu() { unsigned int mode = 0x27F; asm ("fldcw %0" : : "m" (*&mode)); }

int main()
{
    //set_dpfpu(); // uncomment to make it work.
    double d1 = 1.0;
    double d2 = 1.0;
    if ((d1 + 0.1) != (d2 + 0.1))
        printf("not equal\n");
    else
        printf("equal\n");
    return 0;
}
UPDATE3 Resolution: Upgrading VirtualBox to version 4.1.8r75467 resolved the issue. However, one question remains: why did the clang build work?

UPDATE: See this posting: How to deal with excess precision in floating-point computations?
It addresses the issue of extended floating point precision. I had forgotten about extended precision on x86. I remember a simulation that should have been deterministic, but gave different results on Intel CPUs than on PowerPC CPUs. The cause was Intel's extended-precision architecture.
This Web page talks about how to throw Intel CPUs into double-precision rounding mode: http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html.
Does VirtualBox guarantee that its floating point operations are identical to the hardware's floating point operations? I could not find such a guarantee with a quick Google search. I also did not find a promise that VirtualBox FP ops conform to IEEE 754.
VMs are emulators that try, and mostly succeed, to emulate a particular instruction set or architecture. They are just emulators, however, and subject to their own implementation quirks and design issues.
If you haven't already, post the question at forums.virtualbox.org and see what the community says about it.

Yep, that is really strange behavior, but it can actually be explained quite easily:
On x86, the floating point registers internally use more precision (e.g. 80 bits instead of 64). This means the computation 1.0 + 0.1 is carried out with more precision in the registers (and since 1.1 can't be represented exactly in binary at all, those extra bits WILL be used). Only when the result is stored to memory is it truncated to 64 bits.
What this means is simple: if you compare a value loaded from memory with a value newly computed in the registers, you get "not equal" back, because one value was truncated while the other wasn't. So this has nothing to do with VM/no VM; it just depends on the code the compiler generates, which can easily fluctuate, as we see here.
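If you need the comparison to be stable without touching the control word, you can force each intermediate result out of the registers and through a 64-bit double in memory. A minimal sketch (assuming x87 code generation; GCC's -ffloat-store flag has a similar effect, and -mfpmath=sse sidesteps the x87 unit entirely on SSE2-capable CPUs):
#include <cstdio>

int main()
{
    double d1 = 1.0;
    double d2 = 1.0;
    // Each volatile store rounds the (possibly 80-bit) register value
    // down to a 64-bit double, so both sides are truncated identically.
    volatile double lhs = d1 + 0.1;
    volatile double rhs = d2 + 0.1;
    printf(lhs != rhs ? "not equal\n" : "equal\n"); // now reliably "equal"
    return 0;
}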
Add it to the growing list of floating point surprises..

I can confirm the same behaviour with your non-VM code, but since I don't have a VM I haven't tested the VM part.
However, both Clang and GCC will evaluate the constant expression at compile time. See the assembly output below (using gcc -O0 test.cpp -S):
.file "test.cpp"
.section .rodata
.LC0:
.string "equal"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $.LC0, %edi
call puts
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
.section .note.GNU-stack,"",@progbits
It looks like you understand assembly, but in any case it's clear that there is only the "equal" string; there is no "not equal". So the comparison is not even done at run time: it just prints "equal".
I would try coding the calculation and comparison in assembly and see if you get the same behavior. If the behavior differs on the VM, then it's the way the VM does the calculation.
UPDATE 1: (Based on the "UPDATE 2" in the original question.) Below is the gcc -O0 -S test.cpp output assembly (for a 64-bit architecture). In it you can see the movabsq $4607182418800017408, %rax line twice: these are the stores for the two variables, and $4607182418800017408 is 0x3FF0000000000000, i.e. 1.0 as an IEEE 754 double (d1 and d2); the 0.1 addend is loaded from .LC1. It would be interesting to compile this on the VM: if you get the same result (two identical lines), then the VM is doing something funny at run time; otherwise it's a combination of VM and compiler.
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movabsq $4607182418800017408, %rax
movq %rax, -16(%rbp)
movabsq $4607182418800017408, %rax
movq %rax, -8(%rbp)
movsd -16(%rbp), %xmm1
movsd .LC1(%rip), %xmm0
addsd %xmm1, %xmm0
movsd -8(%rbp), %xmm2
movsd .LC1(%rip), %xmm1
addsd %xmm2, %xmm1
ucomisd %xmm1, %xmm0
jp .L6
ucomisd %xmm1, %xmm0
je .L7

I see you added another question:
Note: the C++ standard says type promotion in these situations is implementation-dependent. Could it be that GCC uses a more precise internal representation, so that the inequality holds because of the extra precision?
The answer to that one is no. 1.1 is not exactly representable in a binary format, no matter how many bits the format has: its binary expansion repeats forever, so with finitely many bits you can get close, but never exact.
Or did you mean an entirely new internal format for decimals? No, I refuse to believe that. It wouldn't be very compatible if it did.
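You can see the nearest double to 1.1 directly by printing it with more digits than a double can round-trip (a small illustration, assuming IEEE 754 doubles):
#include <cstdio>

int main()
{
    printf("%.17g\n", 1.1); // prints 1.1000000000000001: the closest double to 1.1
    return 0;
}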

Related

do constant calculations in #define consume resources?

As far as I know, constants and definitions in C/C++ do not consume memory or other resources, but it is a different story when we use definitions as macros and put calculations inside them. Take a look at the code:
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
When we use the variable 'd' in our code, does the CPU spend time calculating the sqrt and the multiplication, or are these values calculated by the compiler and just substituted into the code?
If they are calculated by the CPU, is there any way to calculate them beforehand in the preprocessor and assign them as a constant here?
As far as I know, constants and definitions in C/C++ do not consume memory or other resources,
It depends on what you mean by that. Constants appearing in expressions that are potentially evaluated at runtime have to be represented in the program somehow. Under most circumstances, that will take some space.
Even (in C++) a constexpr can be evaluated at runtime, even though implementations can, and probably do, evaluate them at compile time.
but it is a different story when we use definitions as macros and put calculations inside them.
Yes, it is different, because macros are not constants in any applicable sense of that term.
take a look at the code:
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
When we use the variable 'd' in our code,
d is not a variable. It is a macro. Wherever it appears within the scope of those macro definitions, it is exactly equivalent to the expression sqrt(12.2*5.8) appearing at that point.
does the CPU spend time calculating the sqrt and the multiplication, or are these values calculated by the compiler and just substituted into the code?
Either could happen. It depends on your compiler, probably on compilation options, and possibly on other code in the translation unit. Pre-computation is more likely at higher optimization levels.
If they are calculated by the CPU, is there any way to calculate them beforehand in the preprocessor and assign them as a constant here?
Such a calculation is not part of the semantics of the preprocessor per se. To the somewhat artificial extent that we draw a distinction between preprocessor and compiler in modern C and C++ implementations, if a precomputation is performed then it will be performed by the compiler, not the preprocessor.
The C and C++ languages do not define a mechanism to force such evaluations to be performed at compile time, but you can make it more likely by increasing the compiler's optimization level. Or in C++, there is probably a way to use a template to wrap the expression in a constexpr function computing its value, which would make it very likely to be computed at compile time.
But if a constant is what you want, then you always have the option of pre-computing it manually, and defining the macro to expand to an actual constant.
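For instance, here is a hedged sketch of that approach: a constexpr Newton-Raphson square root (sqrt_cx is a made-up name, and this is not how the standard library computes std::sqrt), which forces the evaluation to compile time in C++11 and later:
constexpr double sqrt_cx_iter(double x, double guess, int n)
{
    // One Newton-Raphson step per recursion: guess' = (guess + x/guess) / 2.
    return n == 0 ? guess
                  : sqrt_cx_iter(x, (guess + x / guess) / 2.0, n - 1);
}

constexpr double sqrt_cx(double x)
{
    return sqrt_cx_iter(x, x > 1.0 ? x : 1.0, 40); // 40 steps: far more than needed
}

constexpr double d_const = sqrt_cx(12.2 * 5.8);

// The static_assert proves the value really was computed at compile time.
static_assert(d_const > 8.41 && d_const < 8.42, "sqrt(12.2*5.8) ~ 8.4119");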
The C standard doesn't specify whether the calculation required when using d (i.e. sqrt(12.2 * 5.8)) is done at compile time or at run time. It's left to the individual compiler to decide.
Example:
#include <math.h>
#include <stdio.h>

#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)

float foo() {
    return d;
}

int main(void)
{
    printf("%f\n", foo());
}
may result in (using https://godbolt.org/ and gcc 10.2 -O0)
foo:
pushq %rbp
movq %rsp, %rbp
movss .LC0(%rip), %xmm0
popq %rbp
ret
.LC1:
.string "%f\n"
main:
pushq %rbp
movq %rsp, %rbp
movl $0, %eax
call foo
pxor %xmm1, %xmm1
cvtss2sd %xmm0, %xmm1
movq %xmm1, %rax
movq %rax, %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
movl $0, %eax
popq %rbp
ret
.LC0:
.long 1090950945
and for -O2
foo:
movss .LC0(%rip), %xmm0
ret
.LC2:
.string "%f\n"
main:
subq $8, %rsp
movl $.LC2, %edi
movl $1, %eax
movsd .LC1(%rip), %xmm0
call printf
xorl %eax, %eax
addq $8, %rsp
ret
.LC0:
.long 1090950945
.LC1:
.long 536870912
.long 1075892964
so in this example we see a compile-time calculation (the .long 1090950945 at .LC0 is the IEEE 754 float 8.41189..., i.e. sqrt(12.2*5.8) already computed). In my experience all the major compilers will do that, but it's not a requirement made by the C standard.
To understand this, you need to understand how #define works! #define is not a statement that the compiler even sees; it is a preprocessor directive, which means it is handled by the preprocessor. #define a b literally replaces all occurrences of a with b. So every time you use d, it is replaced by sqrt(a*b), and ultimately by sqrt(12.2*5.8). The standards do not say whether something like this is calculated at run time or at compile time; that has been left to the individual compiler to decide, although optimisation flags will definitely impact the end result. Not to mention, none of these macros is type safe!
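You can watch the substitution happen by running the preprocessor on its own (a sketch; the file name demo.c is made up):
$ cat demo.c
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
double x = d;
$ gcc -E demo.c | tail -n 1
double x = sqrt(12.2*5.8);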

Performance of a string operation with strlen vs stop on zero

While I was writing a string class in C++, I found some strange behavior regarding execution speed.
I'll take as an example the following two implementations of the upper method:
class String {
    char* str;
    ...
    forceinline void upperStrlen();
    forceinline void upperPtr();
};

void String::upperStrlen()
{
    INDEX length = strlen(str);
    for (INDEX i = 0; i < length; i++) {
        str[i] = toupper(str[i]);
    }
}

void String::upperPtr()
{
    char* ptr_char = str;
    for (; *ptr_char != '\0'; ptr_char++) {
        *ptr_char = toupper(*ptr_char);
    }
}
INDEX is simply a typedef of uint_fast32_t.
Now I can test the speed of those methods in my main.cpp:
#define TEST_RECURSIVE(_function)                            \
    {                                                        \
        bool ok = true;                                      \
        clock_t before = clock();                            \
        for (int i = 0; i < TEST_RECURSIVE_TIMES; i++) {     \
            if (!(_function()) && ok)                        \
                ok = false;                                  \
        }                                                    \
        char output[TEST_RECURSIVE_OUTPUT_STR];              \
        sprintf(output, "[%s] Test %s %s: %ld ms\n",         \
                ok ? "OK" : "Failed",                        \
                TEST_RECURSIVE_BUILD_TYPE,                   \
                #_function,                                  \
                (clock() - before) * 1000 / CLOCKS_PER_SEC); \
        fprintf(stdout, output);                             \
        fprintf(file_log, output);                           \
    }
String a;
String b;

bool stringUpperStrlen()
{
    a.upperStrlen();
    return true;
}

bool stringUpperPtr()
{
    b.upperPtr();
    return true;
}

int main(int argc, char** argv) {
    ...
    a = "Hello World!";
    b = "Hello World!";
    TEST_RECURSIVE(stringUpperPtr);
    TEST_RECURSIVE(stringUpperStrlen);
    ...
    return 0;
}
Then I can compile and test with cmake in Debug or Release with the following results.
[OK] Test RELEASE stringUpperPtr: 21 ms
[OK] Test RELEASE stringUpperStrlen: 12 ms
[OK] Test DEBUG stringUpperPtr: 27 ms
[OK] Test DEBUG stringUpperStrlen: 33 ms
So in Debug the behavior is what I expected: the pointer version is faster than the strlen version. But in Release, strlen is faster.
So I looked at the GCC assembly, and there are far fewer instructions in stringUpperPtr than in stringUpperStrlen.
The stringUpperStrlen assembly:
_Z17stringUpperStrlenv:
.LFB72:
.cfi_startproc
pushq %r13
.cfi_def_cfa_offset 16
.cfi_offset 13, -16
xorl %eax, %eax
pushq %r12
.cfi_def_cfa_offset 24
.cfi_offset 12, -24
pushq %rbp
.cfi_def_cfa_offset 32
.cfi_offset 6, -32
xorl %ebp, %ebp
pushq %rbx
.cfi_def_cfa_offset 40
.cfi_offset 3, -40
pushq %rcx
.cfi_def_cfa_offset 48
orq $-1, %rcx
movq a@GOTPCREL(%rip), %r13
movq 0(%r13), %rdi
repnz scasb
movq %rcx, %rdx
notq %rdx
leaq -1(%rdx), %rbx
.L4:
cmpq %rbp, %rbx
je .L3
movq 0(%r13), %r12
addq %rbp, %r12
movsbl (%r12), %edi
incq %rbp
call toupper@PLT
movb %al, (%r12)
jmp .L4
.L3:
popq %rdx
.cfi_def_cfa_offset 40
popq %rbx
.cfi_def_cfa_offset 32
popq %rbp
.cfi_def_cfa_offset 24
popq %r12
.cfi_def_cfa_offset 16
movb $1, %al
popq %r13
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE72:
.size _Z17stringUpperStrlenv, .-_Z17stringUpperStrlenv
.globl _Z14stringUpperPtrv
.type _Z14stringUpperPtrv, @function
The stringUpperPtr assembly:
_Z14stringUpperPtrv:
.LFB73:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movq b@GOTPCREL(%rip), %rax
movq (%rax), %rbx
.L9:
movsbl (%rbx), %edi
testb %dil, %dil
je .L8
call toupper@PLT
movb %al, (%rbx)
incq %rbx
jmp .L9
.L8:
movb $1, %al
popq %rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE73:
.size _Z14stringUpperPtrv, .-_Z14stringUpperPtrv
.section .rodata.str1.1,"aMS",@progbits,1
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc.).
So how do you explain this difference in performance?
Thanks in advance.
EDIT:
CMake generates something like these commands to compile:
/bin/g++-8 -Os -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o xpp-tests libxpp.so
/bin/g++-8 -O3 -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o Release/xpp-tests Release/libxpp.so
# CMAKE generated file: DO NOT EDIT!
# Generated by "Unix Makefiles" Generator, CMake Version 3.16
# compile CXX with /bin/g++-8
CXX_FLAGS = -O3 -DNDEBUG -Wall -pipe -fPIC -march=native -fno-strict-aliasing
CXX_DEFINES = -DPLATFORM_UNIX=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1
The define TEST_RECURSIVE will call _function 1000000 times in my examples.
You have several misconceptions about performance. You need to dispel these misconceptions.
Now I can test the speed of those methods in my main.cpp: (…)
Your benchmarking code calls the benchmarked functions directly. So you're measuring the benchmarked functions as optimized for the specific case of how they're used by the benchmarking code: to call them repeatedly on the same input. This is unlikely to have any relevance to how they behave in a realistic environment.
I think the compiler didn't do anything earth-shattering because it doesn't know what toupper does. If the compiler had known that toupper doesn't transform a nonzero character into zero, it might well have hoisted the strlen call outside the benchmarked loop. And if it had known that toupper(toupper(x)) == toupper(x), it might well have decided to run the loop only once.
To make a somewhat realistic benchmark, put the benchmarked code and the benchmarking code in separate source files, compile them separately, and disable any kind of cross-module or link-time optimization.
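A minimal sketch of that split (the file names and the free-function signature are made up for illustration):
// impl.cpp -- the benchmarked code, in its own translation unit.
#include <cctype>

void upperPtr(char* s)
{
    for (; *s != '\0'; ++s)
        *s = std::toupper(static_cast<unsigned char>(*s));
}

// bench.cpp -- the driver sees only a declaration, so the compiler
// cannot inline upperPtr or specialize it for the benchmark input:
//     void upperPtr(char* s);
//
// Build the two files separately and link without LTO:
//     g++ -O2 -c impl.cpp
//     g++ -O2 -c bench.cpp
//     g++ impl.o bench.o -o bench    (no -flto)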
Then I can compile and test with cmake in Debug or Release
Compiling in debug mode rarely has any relevance to microbenchmarks (benchmarking the speed of an implementation of a small fragment of code, as opposed to benchmarking the relative speed of algorithms in terms of how many elementary functions they call). Compiler optimizations have a significant effect on microbenchmarks.
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc.).
No, absolutely not.
First of all, fewer instructions total is completely irrelevant to the speed of the program. Even on a platform where executing one instruction takes the same amount of time regardless of what the instruction is, which is unusual, what matters is how many instructions are executed, not how many instructions there are in the program. For example, a loop with 100 instructions that is executed 10 times is 10 times faster than a loop with 10 instructions that is executed 1000 times, even though it's 10 times larger. Inlining is a common program transformation that usually makes the code larger and makes it faster often enough that it's considered a common optimization.
Second, on many platforms, such as any PC or server made in the 21st century, any smartphone, and even many lower-end devices, the time it takes to execute an instruction can vary so widely that it's a poor indication of performance. Cache is a major factor: a read from memory can be more than 1000 times slower than a read from cache on a PC. Other factors with less impact include pipelining, which causes the speed of an instruction to depend on the surrounding instructions, and branch prediction, which causes the speed of a conditional instruction to depend on the outcome of previous conditional instructions.
Third, that's just considering processor instructions — what you see in assembly code. Compilers for C, C++ and most other languages optimize programs in such a way that it can be hard to predict what the processor will be doing exactly.
For example, how long does the instruction ++x; take on a PC?
If the compiler has figured out that the addition is unnecessary, for example because nothing uses x afterwards, or because the value of x is known at compile time and therefore so is the value of x+1, it'll optimize it away. So the answer is 0.
If the value of x is already in a register at this point and the value is only needed in a register afterwards, the compiler just needs to generate an addition or increment instruction. So the simplistic, but not quite correct answer is 1 clock cycle. One reason this is not quite correct is that merely decoding the instruction takes many cycles on a high-end processor such as what you find in a 21st century PC or smartphone. However “one cycle” is kind of correct in that while it takes multiple clock cycles from starting the instruction to finishing it, the instruction only takes one cycle in each pipeline stage. Furthermore, even taking this into account, another reason this is not quite correct is that ++x; ++y; might not take 2 clock cycles: modern processors are sophisticated enough that they may be able to decode and execute multiple instructions in parallel (for example, a processor with 4 arithmetic units can perform 4 additions at the same time). Yet another reason this might not be correct is if the type of x is larger or smaller than a register, which might require more than one assembly instruction to perform the addition.
If the value of x needs to be loaded from memory, this takes a lot more than one clock cycle. Anything other than the innermost cache level dwarfs the time it takes to decode the instruction and perform the addition. The amount of time is very different depending on whether x is found in the L3 cache, in the L2 cache, in the L1 cache, or in the “real” RAM. And even that gets more complicated when you consider that x might be part of a cache prefetch (hardware- or software- triggered).
It's even possible that x is currently in swap, so that reading it requires reading from a disk.
And writing the result exhibits somewhat similar variations to reading the input. However the performance characteristics are different for reads and for writes because when you need a value, you need to wait for the read to be complete, whereas when you write a value, you don't need to wait for the write to be complete: a write to memory writes to a buffer in cache, and the time when the buffer is flushed to a higher-level cache or to RAM depends on what else is happening on the system (what else is competing for space in the cache).
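To make the first case concrete, a tiny sketch (assuming GCC or Clang at -O2):
int demo()
{
    int x = 42;
    ++x;        // nothing observes x afterwards, so the increment is dead:
                // at -O2 this compiles to exactly the same code as "return 0;"
    return 0;
}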
Ok, now let's turn to your specific example and look at what happens in their inner loop. I'm not very familiar with x86 assembly but I think I get the gist.
For stringUpperStrlen, the inner loop starts at .L4. Just before entering the inner loop, %rbx is set to the length of the string. Here's what the inner loop contains:
cmpq %rbp, %rbx: Compare the current index to the length, both obtained from registers.
je .L3: conditional jump, to exit the loop if the index is equal to the length.
movq 0(%r13), %r12: Read from memory to get the address of the beginning of the string. (I'm surprised that the address isn't in a register at this point.)
addq %rbp, %r12: an arithmetic operation that depends on the value that was just read from memory.
movsbl (%r12), %edi: Read the current character from the string in memory.
incq %rbp: Increment the index. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
call toupper@PLT
movb %al, (%r12): Write the value returned by the function to the current character of the string in memory.
jmp .L4: Unconditional jump to the beginning of the loop.
For stringUpperPtr, the inner loop starts at .L9. Here's what the inner loop contains:
movsbl (%rbx), %edi: Read the current character from the string in memory.
testb %dil, %dil: test if %dil is zero. %dil is the least significant byte of %edi which was just read from memory.
je .L8: conditional jump, to exit the loop if the character is zero.
call toupper@PLT
movb %al, (%rbx): Write the value returned by the function to the current character of the string in memory.
incq %rbx: Increment the pointer. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
jmp .L9: Unconditional jump to the beginning of the loop.
The differences between the two loops are:
The loops have slightly different lengths, but both are small enough that they fit in a single cache line (or two, if the code happens to straddle a line boundary). So after the first iteration of the loop, the code will be in the innermost instruction cache. Not only that, but if I understand correctly, on modern Intel processors, there is a cache of decoded instructions, which the loop is small enough to fit in, and so no decoding needs to take place.
The stringUpperStrlen loop has one more read. The extra read is from a constant address which is likely to remain in the innermost cache after the first iteration.
The conditional instruction in the stringUpperStrlen loop depends only on values that are in registers. On the other hand, the conditional instruction in the stringUpperPtr loop depends on a value which was just read from memory.
So the difference boils down to an extra data read from the innermost cache, vs having a conditional instruction whose outcome depends on a memory read. An instruction whose outcome depends on the result of another instruction leads to a hazard: the second instruction is blocked until the first instruction is fully executed, which prevents taking advantage from pipelining, and can render speculative execution less effective. In the stringUpperStrlen loop, the processor essentially runs two things in parallel: the load-call-store cycle, which doesn't have any conditional instructions (apart from what happens inside toupper), and the increment-test cycle, which doesn't access memory. This lets the processor work on the conditional instruction while it's waiting for memory. In the stringUpperPtr loop, the conditional instruction depends on a memory read, so the processor can't start working on it until the read is complete. I'd typically expect this to be slower than the extra read from the innermost cache, although it might depend on the processor.
Of course, stringUpperStrlen does need a load-test hazard of its own to determine the end of the string: no matter how it does it, it needs to fetch characters from memory. This is hidden inside repnz scasb. I don't know the internal architecture of an x86 processor, but I suspect that this case (which is extremely common since it's the meat of strlen) is heavily optimized inside the processor, probably to an extent that is impossible to reach with generic instructions.
You may see different results if the string was longer and the two memory accesses in stringUpperStrlen weren't in the same cache line, although possibly not because this only costs one more cache line and there are several. The details would depend on how the caches work and how toupper uses them.

Remove needless assembler statements from g++ output

I am investigating some problem with a local binary. I've noticed that g++ creates a lot of ASM output that seems unnecessary to me. Example with -O0:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp <--- just need 8 bytes for the movq to -8(%rbp), why -16?
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi <--- now we have moved rdi onto itself.
call Base::Base()
leaq 16+vtable for Derived(%rip), %rdx
movq -8(%rbp), %rax <--- effectively %edi, does not point into this area of the stack
movq %rdx, (%rax) <--- thus this wont change -8(%rbp)
movq -8(%rbp), %rax <--- so this statement is unnecessary
movl $4712, 12(%rax)
nop
leave
ret
With the options -O1 -fno-inline -fno-elide-constructors -fno-omit-frame-pointer:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $8, %rsp <--- reserve some stack space and never use it.
movq %rdi, %rbx
call Base::Base()
leaq 16+vtable for Derived(%rip), %rax
movq %rax, (%rbx)
movl $4712, 12(%rbx)
addq $8, %rsp <--- release unused stack space.
popq %rbx
popq %rbp
ret
This code is for the constructor of Derived that calls the Base base constructor and then overrides the vtable pointer at position 0 and sets a constant value to an int member it holds in addition to what Base contains.
Question:
Can I translate my program with as few optimizations as possible and get rid of such stuff? Which options would I have to set? Or is there a reason the compiler cannot detect these cases with -O0 or -O1 and there is no way around them?
Why is the subq $8, %rsp statement generated at all? You cannot optimize away a statement that makes no sense to begin with. Why does the compiler generate it, then? The register allocation algorithm should never, even with -O0, generate code for something that is not there. So why is it done?
is there a reason the compiler cannot detect these cases with -O0 or -O1
Exactly because you're telling the compiler not to. These are optimisation levels that need to be turned off or down for proper debugging. You're also trading compilation time for run time.
You're looking through the telescope the wrong way: check out the awesome optimisations that your compiler will do for you when you crank up optimisation.
I don't see any obvious missed optimizations in your -O1 output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer so clearly you know why GCC didn't optimize that away.
The function has no local variables
Your function is a non-static class member function, so it has one implicit arg: this in rdi. Which g++ spills to the stack because of -O0. Function args count as local variables.
How does a cyclic move without an effect improve the debugging experience? Please elaborate.
To improve C/C++ debugging: debug-info formats can only describe a C variable's location relative to RSP or RBP, not which register it's currently in. Also, so you can modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers (Fun fact: except register int foo: that keyword does affect debug-mode code gen).
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to G++ and other compilers as well.
Which options would I have to set?
If you're reading / debugging the asm, use at least -Og or higher to disable the debug-mode spill-everything-between-statements behaviour of -O0. Preferably -O2 or -O3 unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og or -O1 will do register allocation and make sane loops (with the conditional branch at the bottom), and various simple optimizations. Although still not the standard peephole of xor-zeroing.
How to remove "noise" from GCC/clang assembly output? explains how to write functions that take args and return a value so you can write functions that don't optimize away.
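For example (a sketch; all of these are standard GCC flags):
$ g++ -Og -fverbose-asm -S example.cpp -o example.s    # readable asm, light optimization
$ g++ -O2 -S example.cpp -o example.s                  # closer to what you'd actually ship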
Loading into RAX and then movq %rax, %rdi is just a side-effect of -O0. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0 is to compile quickly, as well as consistent debugging.
Why is the subq $8, %rsp statement generated at all?
Because the ABI requires 16-byte stack alignment before a call instruction, and this function did an even number of 8-byte pushes. (call itself pushes a return address). It will go away at -O1 without -fno-omit-frame-pointer because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.
Why does System V / AMD64 ABI mandate a 16 byte stack alignment?
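A short worked example of that arithmetic, for the -O1 prologue above (assuming the SysV AMD64 ABI):
# RSP was 16-byte aligned at the call that entered this function,
# and the call itself pushed an 8-byte return address:
                  # on entry:  RSP % 16 == 8
pushq %rbp        # now:       RSP % 16 == 0
pushq %rbx        # now:       RSP % 16 == 8  (misaligned again)
subq  $8, %rsp    # now:       RSP % 16 == 0  -> aligned for the next call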
Fun fact: clang will often use a dummy push %rcx/pop or something, depending on -mtune options, instead of an 8-byte sub.
If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0. Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it doesn't ever use. Even with optimization enabled, g++ sometimes rounds its stack-allocation size up too far when aiming for a 16-byte boundary. This is a missed-optimization bug. e.g. Memory allocation and addressing in Assembly

Where is the one to one correlation between the assembly and cpp code?

I tried to examine what this code looks like in assembly:
int main(){
    if (0){
        int x = 2;
        x++;
    }
    return 0;
}
I was wondering: what does if (0) mean?
I used the shell command g++ -S helloWorld.cpp on Linux and got this code:
and got this code:
.file "helloWorld.cpp"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
.section .note.GNU-stack,"",@progbits
I expected that the assembly will contain some JZ but where is it?
How can I compile the code without optimization?
There is no direct, guaranteed relationship between C++ source code and the generated assembler. The C++ source code defines a certain semantics, and the compiler outputs machine code which will implement the observable behavior of those semantics. How the compiler does this, and the actual code it outputs, can vary enormously, even over the same underlying hardware; I would be very disappointed in a compiler which generated code which compared 0 with 0, and then did a conditional jump if the results were equal, regardless of what the C++ source code was.
In your example, the only observable behavior in your code is to return 0 to the OS. Anything the compiler generates must do this (and have no other observable behavior). The code you show isn't optimal for this:
xorl %eax, %eax
ret
is really all that is needed. But of course, the compiler is free to generate a lot more if it wants. (Your code, for example, sets up a frame to support local variables, even though there aren't any. Many compilers do this systematically, because most debuggers expect it, and get confused if there is no frame.)
With regards to optimization, this depends on the compiler. With g++, -O0 (that's the letter O followed by the number zero) turns off all optimization. This is the default, however, so it is effectively what you are seeing. In addition to having several different levels of optimization, g++ supports turning individual optimizations off or on. You might want to look at the complete list: http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Optimize-Options.html#Optimize-Options.
The compiler eliminates that code as dead code, i.e. code that will never run. What you're left with is establishing the stack frame and setting the return value of the function: if(0) is never true, after all. If you want to get a JZ, then you should probably do something like if(variable == 0), as in the sketch below. Keep in mind that the compiler is in no way required to actually emit a JZ instruction; it may use any other means to achieve the same thing. Compiling a high-level language to assembly is very rarely a clear, one-to-one correlation.
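A hedged sketch of that: reading the value through a volatile forces the comparison to survive into the generated code (with g++ -S you should see a cmp plus a conditional jump, though the exact instructions are up to the compiler):
#include <cstdio>

int main()
{
    volatile int flag = 0;  // the compiler cannot assume this is still 0
    if (flag == 0)
        puts("zero");
    return 0;
}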
The code has probably been optimized.
if (0){
int x = 2;
x++;
}
has been eliminated.
movl $0, %eax is where the return value is set. The other instructions are just the function's prologue and epilogue.
There is a possibility that the compiler optimized it away, since it's never true.
The optimizer removed the if conditional and all of the code inside, so it doesn't show up at all.
The if (0) {} block has been optimized out by the compiler, as it will never be executed.
So your function only returns 0 (movl $0, %eax).

how does the std::sqrt() function work? [duplicate]

This question already has answers here:
How is the square root function implemented? [closed]
(15 answers)
Closed 4 years ago.
Does anyone know how the std::sqrt() function works? (or at least have an idea?)
I've seen methods on the internet that seemed really slow, using lots of approximations and iterations.
Everyone knows the sqrt() function is slow, but I'd like to know how the one from std works, so I could have a vague idea of when it is beneficial to avoid it. (Yes, if I want to be sure I can profile, but it's still nice to have a vague idea.)
EDIT: I didn't really formulate the question too well... What I'm interested in:
what would the fastest C++ function for calculating a square root look like? (More or less; I just want to know the actual logic behind it.)
Nowadays, on modern machines, floating point functions are passed off to the hardware (floating point unit or math-coprocessor).
Sometimes, it uses what the CPU offers:
$ cat main.cc
#include <cmath>
#include <ctime>
#include <cstdlib>

int main(){
    srand(clock());
    const double d = rand();
    return std::sqrt(d) > 2 ? 1 : 0;
}
(the blahblah is just so nothing relevant is optimized away, don't run that program!)
$ g++ -S main.cc
$ cat main.s
.file "main.cc"
.text
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB106:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
call clock
movl %eax, %edi
call srand
call rand
cvtsi2sd %eax, %xmm1
sqrtsd %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp .L5
.L2:
xorl %eax, %eax
ucomisd .LC0(%rip), %xmm0
seta %al
addq $8, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L5:
.cfi_restore_state
movapd %xmm1, %xmm0
call sqrt
jmp .L2
.cfi_endproc
.LFE106:
.size main, .-main
.section .rodata.cst8,"aM",@progbits,8
.align 8
.LC0:
.long 0
.long 1073741824
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits
(hint: it is using a sqrt-cpu-instruction)
sqrt() behind the scenes (one simple scheme, at least):
It repeatedly checks midpoints, i.e. it bisects an interval known to contain the root.
Example: sqrt(16) = 4 and sqrt(4) = 2, so for an input between 4 and 16, say sqrt(10), the answer lies between 2 and 4.
The routine takes the midpoint of 2 and 4, call it x; since x*x is below 10, it discards the lower half of the interval and takes the midpoint of x and 4. It repeats this step again and again until the answer is accurate enough: sqrt(10) ≈ 3.16227766017, which indeed lies between 2 and 4. Faster refinements of the same idea, such as Newton's method, come from calculus (the derivative drives each step).
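A minimal sketch of that bisection (for illustration only; a real libm typically uses the hardware instruction or carefully tuned approximations instead):
#include <cstdio>

double bisect_sqrt(double v)
{
    double lo = 0.0;
    double hi = v < 1.0 ? 1.0 : v;   // [lo, hi] always brackets sqrt(v)
    for (int i = 0; i < 64; ++i) {   // each step halves the interval
        double mid = (lo + hi) / 2.0;
        if (mid * mid < v)
            lo = mid;
        else
            hi = mid;
    }
    return (lo + hi) / 2.0;
}

int main()
{
    printf("%.11f\n", bisect_sqrt(10.0)); // prints 3.16227766017
    return 0;
}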
The standard does not specify a particular implementation.
One option is to look at a typical implementation, but you'll probably find it's heavily-optimised assembler.