I'm trying to experiment with code by myself here on different compilers.
I've been trying to lookup the advantages of disabling exceptions on certain functions (via the binary footprint) and to compare that to functions that don't disable exceptions, and I've actually stumbled onto a weird case where it's better to have exceptions than not.
I've been using Matt Godbolt's Compiler Explorer to do these checks, and it was checked on x86-64 clang 12.0.1 without any flags (on GCC this weird behavior doesn't exist).
Looking at this simple code:
auto* allocated_int()
{
return new int{};
}
int main()
{
delete allocated_int();
return 0;
}
Very straight-forward, pretty much deletes an allocated pointer returned from the function allocated_int().
As expected, the binary footprint is minimal, as well:
allocated_int(): # #allocated_int()
push rbp
mov rbp, rsp
mov edi, 4
call operator new(unsigned long)
mov rcx, rax
mov rax, rcx
mov dword ptr [rcx], 0
pop rbp
ret
Also, very straight-forward.
But the moment I apply the noexcept keyword to the allocated_int() function, the binary bloats. I'll apply the resulting assembly here:
allocated_int(): # #allocated_int()
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 4
call operator new(unsigned long)
mov rcx, rax
mov qword ptr [rbp - 8], rcx # 8-byte Spill
jmp .LBB0_1
.LBB0_1:
mov rcx, qword ptr [rbp - 8] # 8-byte Reload
mov rax, rcx
mov dword ptr [rcx], 0
add rsp, 16
pop rbp
ret
mov rdi, rax
call __clang_call_terminate
__clang_call_terminate: # #__clang_call_terminate
push rax
call __cxa_begin_catch
call std::terminate()
Why is clang doing this extra code for us? I didn't request any other action but calling new(), and I was expecting the binary to reflect that.
Thank you for those who can explain!
Why is clang doing this extra code for us?
Because the behaviour of the function is different.
I didn't request any other action but calling new()
By declaring the function noexcept, you've requested std::terminate to be called in case an exception propagates out of the function.
allocated_int in the first program never calls std::terminate, while
allocated_int in the second program may call std::terminate. Note that the amount of added code is much less if you remember to enable the optimiser. Comparing non-optimised assembly is mostly futile.
You can use non-throwing allocation to prevent that:
return new(std::nothrow) int{};
It's indeed an astute observation that doing potentially throwing things inside non-throwing function can introduce some extra work that wouldn't need to be done if the same things were done in a potentially throwing function.
I've been trying to lookup the advantages of disabling exceptions on certain functions
The advantage of using non-throwing is potentially realised where such function is called; not within the function itself.
Without nothrow, your function just acts as a front end to the allocation function you call. It doesn't have any real behavior of its own. In fact, in a real executable, if you do link-time optimization there's a pretty good chance that it'll completely disappear.
When you add noexcept, your code is silently transformed into something roughly like this:
auto* allocated_int()
{
try {
return new int{};
}
catch(...) {
terminate();
}
}
The extra code you see generated is what's needed to catch the exception and call terminate when/if needed.
Related
Here is the function definition
const int& test_const_ref(const int& a) {
return a;
}
and calling it from main
int main() {
auto& x = test_const_ref(1);
printf("%d, %p\n", x, &x);
}
output as following
./debug/main
>>> 1, 0x7ffee237285c
and here is the disassembly code of test_const_ref
test_const_ref(int const&):
pushq %rbp
movq %rsp, %rbp
movq %rdi, -0x8(%rbp)
movq -0x8(%rbp), %rax
popq %rbp
retq
The question is: where does the variable x alias or where is the number 1 I passed to function test_const_ref stored ?
The code exhibits undefined behavior - the function test_const_ref returns a reference to a temporary, which lives until the end of the full-expression (the ;), and any dereference of it afterwards accesses a dangling reference.
Appearing to work is a common manifestation of UB. The program is still wrong. With optimization on, for example, Clang 12 -O2 prints: 0.
Note - there's no error in the function test_const_ref itself (apart from a design error). The UB is in main, where the dereference of the dangling int& happens during a call to printf.
Where the temporary int is stored exactly is implementation detail - but in many cases (in a Debug build, when a function isn't inlined), it would be stored on the stack:
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov dword ptr [rbp - 12], 1 # Here the 1 is stored in the stack frame
lea rdi, [rbp - 12]
call test_const_ref(int const&)
mov qword ptr [rbp - 8], rax
mov rax, qword ptr [rbp - 8]
mov esi, dword ptr [rax]
mov rdx, qword ptr [rbp - 8]
movabs rdi, offset .L.str
mov al, 0
call printf
So any subsequent use of the returned reference will access memory at [rbp - 12], that may already have been re-used for other purposes.
Note also that the compiler doesn't actually generate assembly from C++ code; it merely uses the C++ code to understand the intent, and generates another program that produces the intended output. This is known as the as-if rule. In the presence of undefined behavior, the compiler becomes free from this restriction, and may generate any output, rendering the program meaningless.
Good answers are already been given, but wrapperm explained this topic very well in here. It's going to be stored on the stack in most implementations i'm aware of.
1. The function
The language doesn't define where arguments to functions are stored. Different ABIs, for different platforms, define this.
Typically, a function argument, before any optimization, is stored on the stack. A reference is no different in this respect. What's actually stored would be a pointer to the refered-to object. Think of it this way:
const int* test_const_ref(const int* a) {
return a;
}
2. The temporary
If you were to declare a variable int foo; and call test_const_ref(foo), you know that the result would refer to foo. Since you're calling it with a temporary, all bets are off: As #fabian notes in a comment, the language only guarantees the value exist until the end of the assignment statement. Afterwards
In practice, and in your case: A compiler which allocates stack space for the temporary integer 1 will have x refer to that place, and will not use it for something else before x is defined. But if your compiler optimizes that stack allocation away - e.g. passes 1 via a register - then x has nothing to refer to and may hold junk. It might even be undefined behavior (not quite sure about that). If you're lucky, you'll get a compiler warning about it (GodBolt.org).
Let's assume I pass a new object to a function like this:
loadContainer->addControlView( new BmpView( BMP_PICTURE ) );
Now, I want to change a specific characteristic of the BmpView before I pass it to addControlView. The way I do this is like this:
Control* newView = new BmpView( BMP_PICTURE );
newView->changeColor( WHITE );
loadContainer->addControlView( newView );
Does this create an extra temporary/local object? Or is there an equal amount of memory allocated in both cases?
The only added memory allocated in your function is a new pointer *newView, which its size is pretty low and doesn't affect by the actual size of the BmpView. It doesn't allocate twice memory for BmpView.
I'm not considering any memory overhead of calling changeColor, which I assume wasn't the point of this question.
In both cases, there is single call to new, hence the amount of memory used is equal. (Note, this is without speculating about any allocation possibly requested by BmpView constructor, changeColor(), etc.)
However, you may wish to refactor your code to ensure some exception safety, avoid potential leaks hence ensure the amount of memory used is under control:
// C++11
std::unique_ptr<Control> newView(new BmpView(BMP_PICTURE));
// C++14, preferred
//auto newView = std::make_unique<BmpView>(BMP_PICTURE);
newView->changeColor( WHITE );
loadContainer->addControlView( newView.release() );
Reference for the below code/assembly :
https://godbolt.org/g/8DgmC1
#include <cstdio>
class ValueClass {
public:
// Class content not important...
int someValue;
};
void PrintValueClass(ValueClass* ptr) {
printf("%d\n", ptr->someValue);
}
int main() {
PrintValueClass(new ValueClass());
ValueClass* pValueClass = new ValueClass();
pValueClass->someValue = 55;
PrintValueClass(pValueClass);
return 1;
}
Compiled Assembly (PrintValueClass redacted as not important to question at hand) :
Example where you pass the (new ValueClass) directly to the function.
main:
push rbp
mov rbp, rsp
mov edi, 4
call operator new(unsigned long)
mov DWORD PTR [rax], 0
mov rdi, rax
call PrintValueClass(ValueClass*)
mov eax, 1
pop rbp
ret
Example where you create a local variable holding the pointer, do something to it, and then pass it to the function.
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 4
call operator new(unsigned long)
mov DWORD PTR [rax], 0
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 55
mov rax, QWORD PTR [rbp-8]
mov rdi, rax
call PrintValueClass(ValueClass*)
mov eax, 1
leave
ret
Before diving into the assembly, if your question is does the new operation occur twice if you store the pointer in a variable first, the answer is no. As shown through the assembly, the new 'function' is only called once, so only sizeof(ValueClass) is ever being allocated here through some sort of heap allocation function. But I feel the need to answer the question fully, even if it was not exactly intended to ask this question. Is extra memory used? Technically yes, realistically no.
The only difference between these two pieces of code is stack 'allocation', noted by the sub rsp, 16, which essentially means 'allocate' 16 bytes on the stack for local variables. So truly the only difference here is 16 bytes, which will greatly change on what compiler you use, what architecture you target, and many more factors.
At the end of the day, I would go as far to say, you would never care about the extra 16 bytes.
Is reference is compiled as usual pointer or it has other stuff behind?
And how does it differ in clang?
You can think of a reference as an immutable pointer that is automatically de-referenced on usage. This isn't what the C++ standard says, so you cannot rely on that being an actual implementation.
Practically speaking though, it likely to be what you see in many cases.
Take the following example in the case of parameter passing:
#include <stdio.h>
void function (int *const n){
printf("%d",*n);
}
void function (int & n){
printf("%d",n);
}
int main(){
int n = 123;
function(&n);
function(n);
}
Both gcc and clang produce identical code for the functions without any optimizations enabled:
function(int*):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
function(int&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
How does reference translates to asm in gcc?
In general: It depends.
To find that out in a specific case, you can test by reading the generated assembly.
Is reference is compiled as usual pointer or it has other stuff behind?
The implementation of code using a reference is practically identical to one using a pointer to achieve identical indirection. How each are implemented, is not guaranteed by the standard, but there is no need for implementing them differently.
References only differ from pointers in the way that they are allowed to use by the rules of C++. Of course, because the rules are different, pointers can be used in a way that references can not. And in such case you cannot compare whether pointers generate the same assembly.
The limitations of a reference might make some optimizations easier, so there might be a difference, but such optimization would also have been possible with pointers, so there is no guarantee of different assembly output when using references instead of pointers.
And how does it differ in clang?
In general: It depends.
Both compilers are bound by the same rules of the standard. They might generate identical assembly or different. How the generated assembly of particular version of one compiler differs (if at all) from the assembly generated by a particular version of another compiler, with particular compilation options for each, on particular processor architecture on particular operating system, in a particular use case of a reference, can be found by inspecting and comparing the generated assembly in each particular case.
while playing around with godbolt.org I noticed that gcc (6.2, 7.0 snapshot), clang (3.9) and icc (17) when compiling something close to
int a(int* a, int* b) {
if (b - a < 2) return *a = ~*a;
// register intensive code here e.g. sorting network
}
compiles (-O2/-O3) this into somthing like this:
push r15
mov rax, rcx
push r14
sub rax, rdx
push r13
push r12
push rbp
push rbx
sub rsp, 184
mov QWORD PTR [rsp], rdx
cmp rax, 7
jg .L95
not DWORD PTR [rdx]
.L162:
add rsp, 184
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
which obviously has a huge overhead in case of b - a < 2. In case of -Os gcc compiles to:
mov rax, rcx
sub rax, rdx
cmp rax, 7
jg .L74
not DWORD PTR [rdx]
ret
.L74:
Which leads me to beleave that there is no code keeping the compiler from emitting this shorter code.
Is there a reason why compilers do this ? Is there a way to get them compiling to the shorter version without compiling for size?
Here's an example on Godbolt that reproduces this. It seems to have something to do with the complex part being recursive
This is a known compiler limitation, see my comments on the question. IDK why it exists; maybe it's hard for compilers to decide what they can do without spilling when they haven't finished saving regs yet.
Pulling the early-out check into a wrapper is often useful when it's small enough to inline.
Looks like modern gcc can actually sidestep this compiler limitation sometimes.
Using your example on the Godbolt compiler explorer, adding a second caller is enough to get even gcc6.1 -O2 to split the function for you, so it can inline the early-out into the second caller and into the externally visible square() (which ends with jmp square(int*, int*) [clone .part.3] if the early-out return path isn't taken).
code on Godbolt, note I added -std=gnu++14, which is required for clang to compiler your code.
void square_inlinewrapper(int* a, int* b) {
//if (b - a < 16) return; // gcc inlines this part for us, and calls a private clone of the function!
return square(a, b);
}
# gcc6.1 -O2 (default / generic -march= and -mtune=)
mov rax, rsi
sub rax, rdi
cmp rax, 63
jg .L9
rep ret
.L9:
jmp square(int*, int*) [clone .part.3]
square() itself compiles to the same thing, calling the private clone which has the bulk of the code. The recursive calls from inside the clone call the wrapper function, so they don't do the extra push/pop work when it's not needed.
Even gcc7 doesn't do this when there's no other caller, even at -O3. It does still transform one of the recursive calls into a loop, but the other one just calls the big function again.
Clang 3.9 and icc17 don't clone the function, either, so you should write the inlineable wrapper manually (and change the main body of the function to use it for recursive calls, if the check is needed there).
You might want to name the wrapper square, and rename just the main body to a private name (like static void square_impl).
What is difference between
int x=7;
and
register int x=7;
?
I am using C++.
register is a hint to the compiler, advising it to store that variable in a processor register instead of memory (for example, instead of the stack).
The compiler may or may not follow that hint.
According to Herb Sutter in "Keywords That Aren't (or, Comments by Another Name)":
A register specifier has the same
semantics as an auto specifier...
According to Herb Sutter, register is "exactly as meaningful as whitespace" and has no effect on the semantics of a C++ program.
In C++ as it existed in 2010, any program which is valid that uses the keywords "auto" or "register" will be semantically identical to one with those keywords removed (unless they appear in stringized macros or other similar contexts). In that sense the keywords are useless for properly-compiling programs. On the other hand, the keywords might be useful in certain macro contexts to ensure that improper usage of a macro will cause a compile-time error rather than producing bogus code.
In C++11 and later versions of the language, the auto keyword was re-purposed to act as a pseudo-type for objects which are initialized, which a compiler will automatically replace with the type of the initializing expression. Thus, in C++03, the declaration: auto int i=(unsigned char)5; was equivalent to int i=5; when used within a block context, and auto i=(unsigned char)5; was a constraint violation. In C++11, auto int i=(unsigned char)5; became a constraint violation while auto i=(unsigned char)5; became equivalent to auto unsigned char i=5;.
With today's compilers, probably nothing. Is was orginally a hint to place a variable in a register for faster access, but most compilers today ignore that hint and decide for themselves.
register is deprecated in C++11. It is unused and reserved in C++17.
Source: http://en.cppreference.com/w/cpp/keyword/register
Almost certainly nothing.
register is a hint to the compiler that you plan on using x a lot, and that you think it should be placed in a register.
However, compilers are now far better at determining what values should be placed in registers than the average (or even expert) programmer is, so compilers just ignore the keyword, and do what they wants.
The register keyword was useful for:
Inline assembly.
Expert C/C++ programming.
Cacheable variables declaration.
An example of a productive system, where the register keyword was required:
typedef unsigned long long Out;
volatile Out out,tmp;
Out register rax asm("rax");
asm volatile("rdtsc":"=A"(rax));
out=out*tmp+rax;
It has been deprecated since C++11 and is unused and reserved in C++17.
As of gcc 9.3, compiling using -std=c++2a, register produces a compiler warning, but it still has the desired effect and behaves identically to C's register when compiling without -O1–-Ofast optimisation flags in the respect of this answer. Using clang++-7 causes a compiler error however. So yes, register optimisations only make a difference on standard compilation with no optimisation -O flags, but they're basic optimisations that the compiler would figure out even with -O1.
The only difference is that in C++, you are allowed to take the address of the register variable which means that the optimisation only occurs if you don't take the address of the variable or its aliases (to create a pointer) or take a reference of it in the code (only on - O0, because a reference also has an address, because it's a const pointer on the stack, which, like a pointer can be optimised off the stack if compiling using -Ofast, except they will never appear on the stack using -Ofast, because unlike a pointer, they cannot be made volatile and their addresses cannot be taken), otherwise it will behave like you hadn't used register, and the value will be stored on the stack.
On -O0, another difference is that const register on gcc C and gcc C++ do not behave the same. On gcc C, const register behaves like register, because block-scope consts are not optimised on gcc. On clang C, register does nothing and only const block-scope optimisations apply. On gcc C, register optimisations apply but const at block-scope has no optimisation. On gcc C++, both register and const block-scope optimisations combine.
#include <stdio.h> //yes it's C code on C++
int main(void) {
const register int i = 3;
printf("%d", i);
return 0;
}
int i = 3;:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
register int i = 3;:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
push rbx
sub rsp, 8
mov ebx, 3
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
mov rbx, QWORD PTR [rbp-8] //callee restoration
leave
ret
const int i = 3;
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3 //still saves to stack
mov esi, 3 //immediate substitution
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
const register int i = 3;
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
mov esi, 3 //loads straight into esi saving rbx push/pop and extra indirection (because C++ block-scope const is always substituted immediately into the instruction)
mov edi, OFFSET FLAT:.LC0 // can't optimise away because printf only takes const char*
mov eax, 0 //zeroed: https://stackoverflow.com/a/6212755/7194773
call printf
mov eax, 0 //default return value of main is 0
pop rbp //nothing else pushed to stack -- more efficient than leave (rsp == rbp already)
ret
register tells the compiler to 1)store a local variable in a callee saved register, in this case rbx, and 2)optimise out stack writes if address of variable is never taken. const tells the compiler to substitute the value immediately (instead of assigning it a register or loading it from memory) and write the local variable to the stack as default behaviour. const register is the combination of these emboldened optimisations. This is as slimline as it gets.
Also, on gcc C and C++, register on its own seems to create a random 16 byte gap on the stack for the first local on the stack, which doesn't happen with const register.
Compiling using -Ofast however; register has 0 optimisation effect because if it can be put in a register or made immediate, it always will be and if it can't it won't be; const still optimises out the load on C and C++ but at file scope only; volatile still forces the values to be stored and loaded from the stack.
.LC0:
.string "%d"
main:
//optimises out push and change of rbp
sub rsp, 8 //https://stackoverflow.com/a/40344912/7194773
mov esi, 3
mov edi, OFFSET FLAT:.LC0
xor eax, eax //xor 2 bytes vs 5 for mov eax, 0
call printf
xor eax, eax
add rsp, 8
ret
Consider a case when compiler's optimizer has two variables and is forced to spill one onto stack. It so happened that both variables have the same weight to the compiler. Given there is no difference, the compiler will arbitrarily spill one of the variables. On the other hand, the register keyword gives compiler a hint which variable will be accessed more frequently. It is similar to x86 prefetch instruction, but for compiler optimizer.
Obviously register hints are similar to user-provided branch probability hints, and can be inferred from these probability hints. If compiler knows that some branch is taken often, it will keep branch related variables in registers. So I suggest caring more about branch hints, and forgetting about register. Ideally your profiler should communicate somehow with the compiler and spare you from even thinking about such nuances.