In my rendering code, I pass string literals as a parameter to my rendering blocks so I can identify them in debug dumps and profiling. It looks similar to this:
// rendering client
draw.begin("east abbey foyer");
// rendering code
draw.end();
void Draw::begin(const char * debugName){
#ifdef DEBUG
// code that uses debugName
#endif
// the rest of the code doesn't use debugName at all
}
In the final program, I wouldn't want these strings to be around anymore. But I also want to avoid using macros in my rendering client code to do this; in fact, I would want the rendering client code to KEEP the strings (in the code itself) but not actually compile them into the final program.
So what I'm wondering is: if I change the code of draw.begin(const char*) to not use its parameter at all, will my compiler optimize that parameter and its associated costs away (perhaps even going so far as to exclude the string from the string table)?
Beware of anyone giving you concrete answers to questions like this. The only way you can be really certain what your compiler is doing in your configuration is to look at the results.
One of the easiest ways to do this is to step through the disassembly in the debugger.
Or you can get the compiler to output an assembly listing (see How to view the assembly behind the code using Visual C++?) and examine exactly what it is really doing.
In practice, it depends on whether the compiler can see the definition of Draw::begin() at the call site.
Here's an illustration:
#include <iostream>
void do_nothing(const char* txt)
{
}
int main()
{
using namespace std;
do_nothing("hello");
return 0;
}
Compiling with -O3 yields:
_main: ## #main
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
xorl %eax, %eax
popq %rbp
retq
.cfi_endproc
i.e. the string and the call to do_nothing are both entirely optimised away.
However, define do_nothing in another translation unit and we get:
_main: ## #main
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
leaq L_.str(%rip), %rdi
callq __Z10do_nothingPKc
xorl %eax, %eax
popq %rbp
retq
.cfi_endproc
.section __TEXT,__cstring,cstring_literals
L_.str: ## #.str
.asciz "hello"
i.e. a redundant load of the string's address, a redundant call, and the string itself kept in the binary.
So if you want the debug info optimised away, you'll want to implement Draw::begin() inline, so that its body is visible at every call site (in the same way the STL does).
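As a minimal sketch of that advice, the class below defines begin() inside the class body in a header, which makes it implicitly inline. The depth_ counter, depth() and the printf call are hypothetical stand-ins for real rendering state and debug output; they are not from the original code. Whether the literal really disappears from a release build still has to be confirmed by looking at the generated assembly.

```cpp
// draw.h (hypothetical header). The body of begin() is visible at every
// call site, so an optimizing compiler is free to drop the unused argument.
#include <cassert>
#include <cstdio>

class Draw {
public:
    // Defined inside the class body, hence implicitly inline.
    void begin(const char* debugName) {
#ifdef DEBUG
        std::printf("begin block: %s\n", debugName);  // debug-only use of the name
#else
        (void)debugName;  // parameter intentionally unused in release builds
#endif
        ++depth_;
    }

    void end() {
        assert(depth_ > 0 && "end() called without a matching begin()");
        --depth_;
    }

    int depth() const { return depth_; }

private:
    int depth_ = 0;  // hypothetical state, present only so the calls do something
};
```

The client code stays exactly as in the question: draw.begin("east abbey foyer"); ... draw.end();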
Yes, you could muck about with optimization and that may or may not work depending on the version of your compiler etc.
A much better way to do this is explicitly, not with MACROs (god forbid), but with a simple function that either generates the tag you want for debug purposes or does nothing for production code.
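You may want to look at the standard assert facility (from <cassert>; note that it is a macro, not std::assert), which compiles to nothing when NDEBUG is defined, and perhaps use the same approach.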
Whether this is possible depends on the visibility of the function at the call site. If the function body is not visible, the optimization discussed is not possible at all. If the function body is visible, then yes, and there may not even be a function call left.
As far as I'm concerned, constants and definitions in C/C++ do not consume memory and other resources, but that is a different story when we use definitions as macros and put some calculations inside them. Take a look at the code:
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
When we use the variable 'd' in our code, does the CPU spend time calculating the sqrt and multiply operations, or are these values calculated by the compiler and just replaced in the code?
If it is calculated by the CPU, is there any way to have it calculated beforehand in the preprocessor and assigned as a constant here?
as far as I'm concerned, constants and definitions in c/c++ do not consume memory and other resources,
It depends on what you mean by that. Constants appearing in expressions that potentially are evaluated at runtime have to be represented in the program somehow. Under most circumstances, that will take some space.
Even in C++, a call to a constexpr function can be evaluated at runtime when its result is not required in a constant expression, although implementations can, and usually do, evaluate such calls at compile time.
but that is a different story when we use definitions as macros and put some calculations inside it.
Yes, it is different, because macros are not constants in any applicable sense of that term.
take a look at the code:
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
When we use the variable 'd' in our code,
d is not a variable. It is a macro. Wherever it appears within the scope of those macro definitions, it is exactly equivalent to the expression sqrt(12.2*5.8) appearing at that point.
does the CPU spend time calculating the sqrt and multiply operations, or are these values calculated by the compiler and just replaced in the code?
Either could happen. It depends on your compiler, probably on compilation options, and possibly on other code in the translation unit. Pre-computation is more likely at higher optimization levels.
If it is calculated by the CPU, is there any way to have it calculated beforehand in the preprocessor and assigned as a constant here?
Such a calculation is not part of the semantics of the preprocessor per se. To the somewhat artificial extent that we draw a distinction between preprocessor and compiler in modern C and C++ implementations, if a precomputation is performed then it will be performed by the compiler, not the preprocessor.
C does not define a mechanism to force such an evaluation to be performed at compile time, but you can make it more likely by increasing the compiler's optimization level. In C++, the initializer of a constexpr variable must be a constant expression, so wrapping the computation in a constexpr function and storing the result in a constexpr variable forces compile-time evaluation. Be aware that std::sqrt is not specified as constexpr before C++26 (some compilers accept it as an extension), so you may need your own constexpr square root.
But if a constant is what you want, then you always have the option of pre-computing it manually, and defining the macro to expand to an actual constant.
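For C++, one way to get a guaranteed compile-time value is a constexpr variable initialized from a constexpr function. The sketch below assumes C++14 or later; sqrt_newton and the consts namespace are hypothetical names introduced for illustration, and the fixed iteration count is an arbitrary choice rather than a tuned one.

```cpp
#include <cassert>

namespace consts {

// Hypothetical helper: square root by Newton's iteration, usable in
// constant expressions (needs C++14 for the loop inside constexpr).
constexpr double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;            // sketch only: non-positive input maps to 0
    double guess = x > 1.0 ? x : 1.0;    // start at or above the true root
    for (int i = 0; i < 64; ++i) {
        guess = 0.5 * (guess + x / guess);
    }
    return guess;
}

constexpr double a = 12.2;
constexpr double b = 5.8;
constexpr double c = a * b;
constexpr double d = sqrt_newton(c);     // initializer must be a constant expression

// Compiles only if d was computed at compile time and lies in this range.
static_assert(d > 8.41 && d < 8.42, "d should be about 8.412");

}  // namespace consts
```

Unlike the macro version, consts::d is a typed double with a single definition, and the static_assert fails the build if the value cannot be computed at compile time.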
The C standard doesn't specify whether the calculation required when using d (i.e. sqrt(12.2 * 5.8)) is done at compile time or at run time. It's left to the individual compiler to decide.
Example:
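#include <math.h>   /* sqrt */
#include <stdio.h>  /* printf */

#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)

/* d expands to sqrt(12.2*5.8); the double result is converted to float */
float foo(void) {
return d;
}

int main(void)
{
printf("%f\n", foo());
}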
may result in (using https://godbolt.org/ and gcc 10.2 with -O0):
foo:
pushq %rbp
movq %rsp, %rbp
movss .LC0(%rip), %xmm0
popq %rbp
ret
.LC1:
.string "%f\n"
main:
pushq %rbp
movq %rsp, %rbp
movl $0, %eax
call foo
pxor %xmm1, %xmm1
cvtss2sd %xmm0, %xmm1
movq %xmm1, %rax
movq %rax, %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
movl $0, %eax
popq %rbp
ret
.LC0:
.long 1090950945
and for -O2
foo:
movss .LC0(%rip), %xmm0
ret
.LC2:
.string "%f\n"
main:
subq $8, %rsp
movl $.LC2, %edi
movl $1, %eax
movsd .LC1(%rip), %xmm0
call printf
xorl %eax, %eax
addq $8, %rsp
ret
.LC0:
.long 1090950945
.LC1:
.long 536870912
.long 1075892964
So in this example we see a compile-time calculation at both optimization levels: the result is loaded from the constant .LC0 and sqrt is never called. In my experience all the major compilers will do that, but it's not a requirement made by the C standard.
To understand this, you need to understand how #define works. #define is not a statement that the compiler proper ever receives. It is a preprocessor directive, which means it is handled by the preprocessor. #define a b textually replaces every later occurrence of the token a with b. So every time you use d, it is replaced by sqrt(a*b), that is, sqrt(12.2*5.8). The standards do not say whether something like this is calculated at run time or at compile time; it is left to the individual compiler to decide, and optimisation flags can affect the end result. Note also that none of these names is a variable, so none of them is type safe.
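Because the replacement is purely textual, an unparenthesized macro body can change meaning depending on where it is used. The sketch below illustrates this with the values from the question; the macro names are renamed (A_RAW, C_PAREN, and so on are hypothetical) only to avoid defining single-letter macros, and the constexpr constants show a typed alternative.

```cpp
#include <cassert>

// Same values as in the question; substitution is token-for-token.
#define A_RAW 12.2
#define B_RAW 5.8
#define C_RAW A_RAW * B_RAW        // expands to: 12.2 * 5.8
#define C_PAREN (A_RAW * B_RAW)    // expands to: (12.2 * 5.8)

// 1.0 / C_RAW   becomes 1.0 / 12.2 * 5.8, i.e. (1.0 / 12.2) * 5.8
inline double inverse_raw() { return 1.0 / C_RAW; }

// 1.0 / C_PAREN becomes 1.0 / (12.2 * 5.8), which is what was intended
inline double inverse_paren() { return 1.0 / C_PAREN; }

// Typed alternative: real constants with a type and a scope.
constexpr double kA = 12.2;
constexpr double kB = 5.8;
constexpr double kC = kA * kB;
inline double inverse_typed() { return 1.0 / kC; }
```

Here inverse_raw() evaluates to roughly 0.475, while inverse_paren() and inverse_typed() both evaluate to roughly 0.0141.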
I am investigating some problem with a local binary. I've noticed that g++ creates a lot of ASM output that seems unnecessary to me. Example with -O0:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp <--- just need 8 bytes for the movq to -8(%rbp), why -16?
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi <--- now we have moved rdi onto itself.
call Base::Base()
leaq 16+vtable for Derived(%rip), %rdx
movq -8(%rbp), %rax <--- effectively %edi, does not point into this area of the stack
movq %rdx, (%rax) <--- thus this wont change -8(%rbp)
movq -8(%rbp), %rax <--- so this statement is unnecessary
movl $4712, 12(%rax)
nop
leave
ret
option -O1 -fno-inline -fno-elide-constructors -fno-omit-frame-pointer:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $8, %rsp <--- reserve some stack space and never use it.
movq %rdi, %rbx
call Base::Base()
leaq 16+vtable for Derived(%rip), %rax
movq %rax, (%rbx)
movl $4712, 12(%rbx)
addq $8, %rsp <--- release unused stack space.
popq %rbx
popq %rbp
ret
This code is the constructor of Derived. It calls the Base constructor, then overwrites the vtable pointer at offset 0 and stores a constant in an int member that Derived holds in addition to what Base contains.
Question:
Can I compile my program with as few optimizations as possible and still get rid of such stuff? Which options would I have to set? Or is there a reason the compiler cannot detect these cases with -O0 or -O1, so that there is no way around them?
Why is the subq $8, %rsp statement generated at all? You cannot optimize in or out a statement that makes no sense to begin with. Why does the compiler generate it then? The register allocation algorithm should never, even with -O0, generate code for something that is not there. So why is it done?
is there a reason the compiler cannot detect these cases with -O0 or -O1
Exactly because you're telling the compiler not to. These are levels at which optimisations are turned off or down for proper debugging. You're also trading run-time speed for compilation time.
You're looking through the telescope the wrong way: check out the awesome optimisations that your compiler will do for you when you crank up optimisation.
I don't see any obvious missed optimizations in your -O1 output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer so clearly you know why GCC didn't optimize that away.
The function has no local variables
Your function is a non-static class member function, so it has one implicit arg: this in rdi. Which g++ spills to the stack because of -O0. Function args count as local variables.
How does a cyclic move without an effect improve the debugging experience? Please elaborate.
To improve C/C++ debugging: debug-info formats can only describe a C variable's location relative to RSP or RBP, not which register it's currently in. Also, so you can modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers (Fun fact: except register int foo: that keyword does affect debug-mode code gen).
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to G++ and other compilers as well.
Which options would I have to set?
If you're reading / debugging the asm, use at least -Og or higher to disable the debug-mode spill-everything-between-statements behaviour of -O0. Preferably -O2 or -O3 unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og or -O1 will do register allocation and make sane loops (with the conditional branch at the bottom), and various simple optimizations. Although still not the standard peephole of xor-zeroing.
How to remove "noise" from GCC/clang assembly output? explains how to write functions that take args and return a value so you can write functions that don't optimize away.
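A minimal sketch of that idea, with hypothetical function names: the first function gets its inputs from the caller and returns its result, so the compiler has to emit the arithmetic; the second uses only constants, so an optimizing compiler will typically reduce it to returning a fixed value.

```cpp
#include <cassert>

// Inputs arrive as arguments and the result is returned, so the
// multiply and add cannot be folded away when compiled on its own.
int scale_and_add(int x, int y) {
    return x * 3 + y;
}

// Contrast: everything is known at compile time. With optimization
// enabled, this typically becomes the equivalent of "return 7".
int folded() {
    int x = 2;
    int y = 1;
    return x * 3 + y;
}
```

Compiling only these functions (without a main that calls them with constants) keeps the interesting instructions in the assembly output.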
Loading into RAX and then movq %rax, %rdi is just a side-effect of -O0. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0 is to compile quickly, as well as consistent debugging.
Why is the subq $8, %rsp statement generated at all?
Because the ABI requires 16-byte stack alignment before a call instruction, and this function did an even number of 8-byte pushes. (call itself pushes a return address). It will go away at -O1 without -fno-omit-frame-pointer because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.
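The alignment arithmetic can be checked with a small sketch. It assumes the x86-64 System V convention described above: on entry, RSP sits 8 bytes past a 16-byte boundary because call pushed the return address. padding_before_call is a hypothetical helper for illustration, not something the compiler exposes.

```cpp
#include <cassert>

// Extra bytes needed so RSP is 16-byte aligned at the next call, given
// the number of 8-byte pushes in the prologue and the bytes of locals.
// The leading 8 accounts for the return address pushed by the caller's call.
constexpr int padding_before_call(int pushes, int local_bytes) {
    return (16 - (8 + 8 * pushes + local_bytes) % 16) % 16;
}

// -O1 listing: push rbp, push rbx, no locals -> subq $8, %rsp
static_assert(padding_before_call(2, 0) == 8, "two pushes need 8 bytes of padding");

// -O0 listing: push rbp plus an 8-byte spill slot for this -> subq $16, %rsp in total
static_assert(padding_before_call(1, 8) == 8, "8 bytes of locals + 8 of padding = 16");

// A single push and no locals: already aligned, no sub needed
static_assert(padding_before_call(1, 0) == 0, "one push is already aligned");
```

The last case corresponds to -O1 without -fno-omit-frame-pointer, where only the call-preserved register is pushed.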
Why does System V / AMD64 ABI mandate a 16 byte stack alignment?
Fun fact: clang will often use a dummy push %rcx/pop or something, depending on -mtune options, instead of an 8-byte sub.
If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0. Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it doesn't ever use. Even sometimes with optimization enabled g++ rounds up its stack allocation size too far when aiming for a 16-byte boundary. This is a missed-optimization bug. e.g. Memory allocation and addressing in Assembly
I was looking at the disassembly of a function call and found this:
movq %rsp, %rbp
pushq %rbx
subq $136, %rsp ; Pad the stack
....
addq $136, %rsp ; Unpad the stack
popq %rbx
popq %rbp
ret
What is the value of doing this?
That's the space for local variables, not padding.
The compiler will create that stack space for any register spills and local variables it has to store while running this function.
You could see some padding when disassembling x86-64 code built for the SysV ABI (most platforms that aren't Windows; I don't know how it works there), since the stack has to be 16-byte aligned at function calls. But in this case it's actually reserving space for local variables.
You might want to look at this or look for more information on how compilers work.
I've got a library that was compiled without C++11 flags (-std=c++11) and an application linking to that library which was built with -std=c++11. The application calls a function in the library, and the program then crashes much deeper within the library. I've found that the disassembly of the function where the crash occurs (a simple function that just returns a pointer stored in a class) differs depending on whether the call stack originates from this program or from the library's test program, which also wasn't built with C++11 flags.
The OS is OS X Mountain Lion and the compiler is Clang++.
Why is there an incompatibility between the C++11 app and the non-C++11 library, and why does the disassembly differ when the code in question is inside the library and thus should be the same?
The two different disassemblies:
TestApplication`Core::GetPointer() const at System.h:xxx:
0x100009690: pushq %rbp
0x100009691: movq %rsp, %rbp
0x100009694: movq %rdi, -8(%rbp)
0x100009698: movq -8(%rbp), %rdi
0x10000969c: movq 64(%rdi), %rax ;<-------Difference
0x1000096a0: popq %rbp
0x1000096a1: ret
Lib1Prototype`Core::GetPointer() const at System.h:xxx:
0x100019c10: pushq %rbp
0x100019c11: movq %rsp, %rbp
0x100019c14: movq %rdi, -8(%rbp)
0x100019c18: movq -8(%rbp), %rdi
0x100019c1c: movq 40(%rdi), %rax ;<------Difference
0x100019c20: popq %rbp
0x100019c21: ret
In the x86-64 ABI used by OS X, the first argument to a function is passed in %rdi, and the function's return value is returned in %rax. So this function takes a pointer to some data structure (here, the implicit this pointer) and returns the 64-bit value stored at offset 64 or 40, depending on how the function was compiled.
So you need to look at the header file that defines that data structure. It's defining the data structure differently depending on whether you're compiling as C++11. Maybe there's something obvious, like a #ifdef that you know is defined differently. Or maybe there's a member whose type is defined differently. If you can't figure it out, edit your question and paste in the definition of the data structure that's being passed (by pointer) to the Core::GetPointer function.
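To make the failure mode concrete, here is a sketch with two hypothetical structs standing in for the same class as seen by two differently configured builds. The member names and the 40 and 64 byte sizes are invented to match the offsets in the disassembly; the real definition in System.h is not shown in the question.

```cpp
#include <cassert>
#include <cstddef>

// What the library's build might see: 40 bytes of members before the pointer.
struct CoreSeenByLibrary {
    unsigned char other_members[40];  // stand-in for the preceding members
    void* pointer;
};

// What the application's build might see: the preceding members take
// 64 bytes, for example because one member's type has a different size.
struct CoreSeenByApp {
    unsigned char other_members[64];
    void* pointer;
};

// Each build compiles the header-defined accessor against its own layout,
// so the same source line reads from a different offset.
static_assert(offsetof(CoreSeenByLibrary, pointer) == 40, "matches 40(%rdi)");
static_assert(offsetof(CoreSeenByApp, pointer) == 64, "matches 64(%rdi)");
```

An object constructed by code using one layout and read by code using the other yields whatever bytes happen to be at the wrong offset, which is consistent with a crash much later inside the library.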