C++ inline optimisation [duplicate] - c++

This question already has answers here:
When should I write the keyword 'inline' for a function/method?
(16 answers)
Closed 5 years ago.
See my example here, or the following C++ and matching assembly.
IMPLICIT INLINE
#include <cstdlib> // for atoi
#include <iostream>
int func(int i)
{
return i * i;
}
int main(int argc, char *argv[]) {
auto value = atoi(argv[1]);
std::cout << func(value);
value = atoi(argv[2]);
std::cout << func(value);
return 1;
}
results in
func(int):
imul edi, edi
mov eax, edi
ret
main:
push rbx
mov rdi, QWORD PTR [rsi+8]
mov rbx, rsi
mov edx, 10
xor esi, esi
call strtol
imul eax, eax
mov edi, OFFSET FLAT:std::cout
mov esi, eax
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
mov rdi, QWORD PTR [rbx+16]
mov edx, 10
xor esi, esi
call strtol
imul eax, eax
mov edi, OFFSET FLAT:std::cout
mov esi, eax
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
mov eax, 1
pop rbx
ret
_GLOBAL__sub_I__Z4funci:
sub rsp, 8
mov edi, OFFSET FLAT:std::__ioinit
call std::ios_base::Init::Init()
mov edx, OFFSET FLAT:__dso_handle
mov esi, OFFSET FLAT:std::__ioinit
mov edi, OFFSET FLAT:std::ios_base::Init::~Init()
add rsp, 8
jmp __cxa_atexit
EXPLICIT INLINE
#include <cstdlib> // for atoi
#include <iostream>
inline int func(int i)
{
return i * i;
}
int main(int argc, char *argv[]) {
auto value = atoi(argv[1]);
std::cout << func(value);
value = atoi(argv[2]);
std::cout << func(value);
return 1;
}
results in
main:
push rbx
mov rdi, QWORD PTR [rsi+8]
mov rbx, rsi
mov edx, 10
xor esi, esi
call strtol
imul eax, eax
mov edi, OFFSET FLAT:std::cout
mov esi, eax
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
mov rdi, QWORD PTR [rbx+16]
mov edx, 10
xor esi, esi
call strtol
imul eax, eax
mov edi, OFFSET FLAT:std::cout
mov esi, eax
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
mov eax, 1
pop rbx
ret
_GLOBAL__sub_I_main:
sub rsp, 8
mov edi, OFFSET FLAT:std::__ioinit
call std::ios_base::Init::Init()
mov edx, OFFSET FLAT:__dso_handle
mov esi, OFFSET FLAT:std::__ioinit
mov edi, OFFSET FLAT:std::ios_base::Init::~Init()
add rsp, 8
jmp __cxa_atexit
In the example, when the inline keyword is omitted (the implicit case), optimisation inlines the function 'func' at the two call sites, but it still leaves the assembly of func in the produced binary. However, if 'func' is explicitly declared inline, the function does not exist in the output assembly.
Why does the GCC optimiser leave implicitly inlined functions in the compiled assembly, even though the operations of the inlined function are truly inlined with the calling code?

The function would go away entirely if it were marked static. Because it is not marked static, it has external linkage and may be referenced by another translation unit, so the standalone definition can't be eliminated.
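A minimal sketch of that suggestion, using the same example as above: giving func internal linkage lets the compiler drop the standalone definition once both call sites have been inlined.
#include <cstdlib>
#include <iostream>
// static gives func internal linkage: no other translation unit can refer to
// it, so after both calls are inlined the standalone copy can be discarded.
static int func(int i)
{
    return i * i;
}
int main(int argc, char *argv[]) {
    auto value = atoi(argv[1]);
    std::cout << func(value);
    value = atoi(argv[2]);
    std::cout << func(value);
    return 1;
}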


Why is the object prefix converted to a function argument?

In the learncpp article about the hidden this pointer, the author mentioned that the compiler converts the object prefix to an argument passed by address to the function.
In the example:
simple.setID(2);
Will be converted to:
setID(&simple, 2); // note that simple has been changed from an object prefix to a function argument!
Why does the compiler do this? I've tried searching other documentation about it but couldn't find any. I've asked other people but they say it is a mistake or the compiler doesn't do that.
I have a second question on this topic. Let's go back to the example:
simple.setID(2); //Will be converted to setID(&simple, 2);
If the compiler converts it, won't it just look exactly like a function that has a name of setID and has two parameters?
void setID(MyClass* obj, int id) {
return;
}
int main() {
MyClass simple;
simple.setID(2); //Will be converted to setID(&simple, 2);
setID(&simple, 2);
}
Lines 6 and 7 would look exactly the same.
object prefix to an argument passed by address to the function
This refers to how implementations typically translate it to machine code (but they could do it any other way).
Why does the compiler do this?
You need some way to refer to the object inside the called member function, and one way is simply to handle it like an argument.
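A rough sketch of that idea (the name Simple_setID and the explicit self parameter are invented for illustration; a real compiler uses mangled symbol names and its own calling convention):
struct Simple { int id = 0; };
// What you write as a member function:
//     void Simple::setID(int id) { this->id = id; }
// is conceptually lowered to an ordinary function whose first parameter is
// the address of the object:
void Simple_setID(Simple* self, int id) {
    self->id = id;
}
// so a call like simple.setID(2) behaves roughly like Simple_setID(&simple, 2).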
If the compiler converts it, won't it just look exactly like a function that has a name of setID and has two parameters?
If you have this code:
struct Test {
int v = 0;
Test(int v ) : v(v) {
}
void test(int a) {
int v = this->v;
int r = a;
}
};
void test(Test* t, int a) {
int v = t->v;
int r = a + v;
}
int main() {
Test a(2);
a.test(1);
test(&a, 1);
return 0;
}
gcc-12 will create this assembly code (for x86-64, with optimizations turned off):
Test::Test(int) [base object constructor]:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov DWORD PTR [rbp-12], esi
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rbp-12]
mov DWORD PTR [rax], edx
nop
pop rbp
ret
Test::test(int a):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
// int v = this->v;
mov rax, QWORD PTR [rbp-24]
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-4], eax
// int r = a;
mov eax, DWORD PTR [rbp-28]
mov DWORD PTR [rbp-8], eax
// end of function
nop
pop rbp
ret
test(Test* t, int a):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
// int v = t->v;
mov rax, QWORD PTR [rbp-24]
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-4], eax
// int r = a + v;
mov edx, DWORD PTR [rbp-28]
mov eax, DWORD PTR [rbp-4]
add eax, edx
mov DWORD PTR [rbp-8], eax
// end of function
nop
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 16
lea rax, [rbp-4]
mov esi, 2
mov rdi, rax
call Test::Test(int) [complete object constructor]
// a.test(1);
lea rax, [rbp-4]
mov esi, 1
mov rdi, rax
call Test::test(int)
// test(&a, 1);
lea rax, [rbp-4]
mov esi, 1
mov rdi, rax
call test(Test*, int)
// end of main
mov eax, 0
leave
ret
So the machine code generated with no optimizations looks identical for test(&a, 1) and a.test(1), and that's what the statement refers to.
But again, this is an implementation detail of how the compiler translates C++ to machine code, and not part of C++ itself.
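To the second question specifically: at the C++ language level the member function and the free function remain two distinct functions that can coexist, even though the generated code looks alike; member call syntax only ever finds the member, and the free call only ever finds the free function. A small sketch:
struct Widget {
    int id = 0;
    void setID(int newId) { id = newId; }           // member: hidden Widget* this
};
void setID(Widget* w, int newId) { w->id = newId; } // ordinary free function
int main() {
    Widget w;
    w.setID(2);    // member call, &w passed implicitly
    setID(&w, 2);  // free-function call, &w passed explicitly
}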

Modulus in x64 Linux assembly, C++ [duplicate]

This question already has answers here:
Why does GCC use multiplication by a strange number in implementing integer division?
(5 answers)
Divide Signed Integer By 2 compiles to complex assembly output, not just a shift
(1 answer)
Closed 1 year ago.
I have these functions in C++
int f1(int a)
{
int x = a / 2;
}
int f2(int a)
{
int y = a % 2;
}
int f3(int a)
{
int z = a % 7;
}
int f4(int a,int b)
{
int xy = a % b;
}
I looked at their assembly code but couldn't understand what they are doing. I couldn't even find a good reference or an explained example for them. Here is the assembly:
f1(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
mov edx, eax
shr edx, 31
add eax, edx
sar eax
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f2(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
cdq
shr edx, 31
add eax, edx
and eax, 1
sub eax, edx
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f3(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
movsx rdx, eax
imul rdx, rdx, -1840700269
shr rdx, 32
add edx, eax
sar edx, 2
mov esi, eax
sar esi, 31
mov ecx, edx
sub ecx, esi
mov edx, ecx
sal edx, 3
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f4(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
cdq
idiv DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], edx
nop
pop rbp
ret
Can you please explain, perhaps with an example, what steps these are following to calculate the answers in each of these cases, and why they work just as well as a normal divide?
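A hedged sketch of what the assembly above is doing, reconstructed from it: f1 implements a / 2 as an arithmetic shift plus a sign-bias correction (so the result rounds toward zero, as C++ requires), f2 applies the same bias trick to a % 2, f3 replaces division by 7 with multiplication by the precomputed "magic" constant -1840700269 followed by shifts, and f4 has to fall back to idiv because the divisor is not known at compile time. The snippet below mirrors f1 and f3 and assumes >> on negative values is an arithmetic shift and that narrowing conversions wrap in the usual two's-complement way, which is what GCC provides:
#include <cassert>
#include <cstdint>
// f1: a / 2 without a divide. A bare arithmetic shift would round toward
// negative infinity, so 1 is added first when a is negative to get
// round-toward-zero semantics.
int div2(int a)
{
    int bias = static_cast<int>(static_cast<unsigned>(a) >> 31);          // shr edx, 31
    return (a + bias) >> 1;                                               // add eax, edx ; sar eax
}
// f3: a % 7 via multiplication by 0x92492493 (-1840700269 as signed 32-bit),
// chosen so that the high half of the 64-bit product, shifted and
// sign-corrected, equals a / 7.
int mod7(int a)
{
    int64_t prod = static_cast<int64_t>(a) * -1840700269;                 // imul rdx, rdx, -1840700269
    int32_t hi = static_cast<int32_t>(static_cast<uint64_t>(prod) >> 32); // shr rdx, 32
    int32_t q = ((hi + a) >> 2) - (a >> 31);                              // add, sar 2, subtract sign bit
    return a - (q * 8 - q);                                               // a - 7*q, computed as 8q - q
}
int main()
{
    for (int a = -1000; a <= 1000; ++a) {
        assert(div2(a) == a / 2);
        assert(mod7(a) == a % 7);
    }
    return 0;
}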

What is more efficient in this case, using const char* or std::string

I am using a combination of C and C++ code in my application.
I want to print out whether a boolean flag is true or false, as below, using a ternary operator to determine the string to print.
If I use a const char*, doesn't the compiler more than likely store the string literals "Yes" and "No" in some read-only memory before the program starts?
If I use std::string, the string will be destroyed when it goes out of scope? But I guess the compiler still needs to store the string literals "Yes" and "No" somewhere anyway? I'm not sure.
bool isSet = false;
// More code
//std::string isSetStr = isSet ? "Yes" : "No";
const char* isSetStr = isSet ? "Yes" : "No";
//printf ( "Flag is set ? : %s\n", isSetStr.c_str());
printf ( "Flag is set ? : %s\n", isSetStr);
Either version will allocate the string literals themselves in read-only memory. Either version uses a local variable that goes out of scope, but the string literals remain since they aren't stored locally.
Regarding performance, C++ container classes are almost always going to be less efficient than "raw" C. When testing your code with g++ -O3 I get this:
void test_cstr (bool isSet)
{
const char* isSetStr = isSet ? "Yes" : "No";
printf ( "Flag is set ? : %s\n", isSetStr);
}
Disassembly (x86):
.LC0:
.string "Yes"
.LC1:
.string "No"
.LC2:
.string "Flag is set ? : %s\n"
test_cstr(bool):
test dil, dil
mov eax, OFFSET FLAT:.LC1
mov esi, OFFSET FLAT:.LC0
mov edi, OFFSET FLAT:.LC2
cmove rsi, rax
xor eax, eax
jmp printf
The string literals are loaded into read-only locations and the isSetStr variable is simply optimized away.
Now try this using the same compiler and options (-O3):
void test_cppstr (bool isSet)
{
std::string isSetStr = isSet ? "Yes" : "No";
printf ( "Flag is set ? : %s\n", isSetStr.c_str());
}
Disassembly (x86):
.LC0:
.string "Yes"
.LC1:
.string "No"
.LC2:
.string "Flag is set ? : %s\n"
test_cppstr(bool):
push r12
mov eax, OFFSET FLAT:.LC1
push rbp
push rbx
mov ebx, OFFSET FLAT:.LC0
sub rsp, 32
test dil, dil
cmove rbx, rax
lea rbp, [rsp+16]
mov QWORD PTR [rsp], rbp
mov rdi, rbx
call strlen
xor edx, edx
mov esi, eax
test eax, eax
je .L7
.L6:
mov ecx, edx
add edx, 1
movzx edi, BYTE PTR [rbx+rcx]
mov BYTE PTR [rbp+0+rcx], dil
cmp edx, esi
jb .L6
.L7:
mov QWORD PTR [rsp+8], rax
mov edi, OFFSET FLAT:.LC2
mov BYTE PTR [rsp+16+rax], 0
mov rsi, QWORD PTR [rsp]
xor eax, eax
call printf
mov rdi, QWORD PTR [rsp]
cmp rdi, rbp
je .L1
call operator delete(void*)
.L1:
add rsp, 32
pop rbx
pop rbp
pop r12
ret
mov r12, rax
jmp .L4
test_cppstr(bool) [clone .cold]:
.L4:
mov rdi, QWORD PTR [rsp]
cmp rdi, rbp
je .L5
call operator delete(void*)
.L5:
mov rdi, r12
call _Unwind_Resume
The string literals are still allocated in read-only memory, so that part is the same. But we also get a massive chunk of overhead code.
But on the other hand, the biggest bottleneck by far in this case is the console I/O so the performance of the rest of the code isn't even relevant. Strive to write the most readable code possible and only optimize when you actually need it. Manual string handling in C is fast, but it's also very error-prone and cumbersome.
You can test it with godbolt.
The former (using const char*) gives this:
.LC0:
.string "No"
.LC1:
.string "Yes"
.LC2:
.string "Flag is set ? : %s\n"
a(bool):
test dil, dil
mov eax, OFFSET FLAT:.LC0
mov esi, OFFSET FLAT:.LC1
cmove rsi, rax
mov edi, OFFSET FLAT:.LC2
xor eax, eax
jmp printf
The latter (using std::string) gives this:
.LC0:
.string "Yes"
.LC1:
.string "No"
.LC2:
.string "Flag is set ? : %s\n"
a(bool):
push r12
push rbp
mov r12d, OFFSET FLAT:.LC1
push rbx
mov esi, OFFSET FLAT:.LC0
sub rsp, 32
test dil, dil
lea rax, [rsp+16]
cmovne r12, rsi
or rcx, -1
mov rdi, r12
mov QWORD PTR [rsp], rax
xor eax, eax
repnz scasb
not rcx
lea rbx, [rcx-1]
mov rbp, rcx
cmp rbx, 15
jbe .L3
mov rdi, rcx
call operator new(unsigned long)
mov QWORD PTR [rsp+16], rbx
mov QWORD PTR [rsp], rax
.L3:
cmp rbx, 1
mov rax, QWORD PTR [rsp]
jne .L4
mov dl, BYTE PTR [r12]
mov BYTE PTR [rax], dl
jmp .L5
.L4:
test rbx, rbx
je .L5
mov rdi, rax
mov rsi, r12
mov rcx, rbx
rep movsb
.L5:
mov rax, QWORD PTR [rsp]
mov QWORD PTR [rsp+8], rbx
mov edi, OFFSET FLAT:.LC2
mov BYTE PTR [rax-1+rbp], 0
mov rsi, QWORD PTR [rsp]
xor eax, eax
call printf
mov rdi, QWORD PTR [rsp]
lea rax, [rsp+16]
cmp rdi, rax
je .L6
call operator delete(void*)
jmp .L6
mov rdi, QWORD PTR [rsp]
lea rdx, [rsp+16]
mov rbx, rax
cmp rdi, rdx
je .L8
call operator delete(void*)
.L8:
mov rdi, rbx
call _Unwind_Resume
.L6:
add rsp, 32
xor eax, eax
pop rbx
pop rbp
pop r12
ret
Using std::string_view such as:
#include <stdio.h>
#include <string_view>
int a(bool isSet) {
// More code
std::string_view isSetStr = isSet ? "Yes" : "No";
//const char* isSetStr = isSet ? "Yes" : "No";
printf ( "Flag is set ? : %s\n", isSetStr.data());
//printf ( "Flag is set ? : %s\n", isSetStr);
}
gives:
.LC0:
.string "No"
.LC1:
.string "Yes"
.LC2:
.string "Flag is set ? : %s\n"
a(bool):
test dil, dil
mov eax, OFFSET FLAT:.LC0
mov esi, OFFSET FLAT:.LC1
cmove rsi, rax
mov edi, OFFSET FLAT:.LC2
xor eax, eax
jmp printf
So to sum up, both const char* and string_view give optimal code. string_view is a bit more code to type compared to const char*.
std::string is made to manipulate string content, so it's overkill here and leads to less efficient code.
Another remark about string_view: it does not guarantee that the string is NUL-terminated. In this case it is, since it's built from a NUL-terminated static string. For generic string_view usage with printf, pass the length explicitly (the %.* precision expects an int): printf("%.*s", (int)str.length(), str.data());
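For example, a minimal sketch of that generic usage:
#include <cstdio>
#include <string_view>
// Prints a std::string_view that may not be NUL-terminated by passing an
// explicit length; the %.* precision expects an int, hence the cast.
void print_flag(std::string_view sv)
{
    std::printf("Flag is set ? : %.*s\n", static_cast<int>(sv.length()), sv.data());
}
int main()
{
    std::string_view full = "YesNo";
    print_flag(full.substr(0, 3)); // prints "Yes" even though this view is not NUL-terminated
}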
EDIT: By disabling exception handling (e.g. compiling with -fno-exceptions), you can reduce the std::string version to:
.LC0:
.string "Yes"
.LC1:
.string "No"
.LC2:
.string "Flag is set ? : %s\n"
a(bool):
push r12
mov eax, OFFSET FLAT:.LC1
push rbp
mov ebp, OFFSET FLAT:.LC0
push rbx
sub rsp, 32
test dil, dil
cmove rbp, rax
lea r12, [rsp+16]
mov QWORD PTR [rsp], r12
mov rdi, rbp
call strlen
mov rsi, rbp
mov rdi, r12
lea rdx, [rbp+0+rax]
mov rbx, rax
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy_chars(char*, char const*, char const*)
mov rax, QWORD PTR [rsp]
mov QWORD PTR [rsp+8], rbx
mov edi, OFFSET FLAT:.LC2
mov BYTE PTR [rax+rbx], 0
mov rsi, QWORD PTR [rsp]
xor eax, eax
call printf
mov rdi, rsp
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose()
add rsp, 32
pop rbx
pop rbp
pop r12
ret
which is still a lot more than the string_view version. Note that the compiler was smart enough to remove the heap allocation here, but it is still forced to compute the string's length (even though printf will compute it again itself).
Chill out!
The printf will be orders of magnitude slower than any construction of a std::string from const char[] data embedded in the program source code.
Always use a profiler when examining code performance. Writing a small program in an attempt to test a hypothesis will often fail to tell you anything about what is happening in your big program. In the case you present, a good compiler will optimise to
int main(){printf ( "Flag is set ? : No\n");}
String literals have static storage duration; they are alive until the program ends.
Note that if you use the same string literal several times in a program, the compiler is not required to store it as a single object.
That is, this expression
"Yes" == "Yes"
can yield either true or false depending on compiler options. But usually, by default, identical string literals are stored as one object.
Objects of type std::string that are not declared at namespace scope and without the keyword static have automatic storage duration. That means such an object is created anew each time control enters its block and destroyed each time control leaves it.
The type of isSet ? "Yes" : "No" is const char*, regardless of whether you store it in a std::string or a const char* (or a std::string_view, or ...), so the string literals themselves are treated the same by the compiler.
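A quick compile-time check of that claim (a small sketch; is_same_v needs C++17):
#include <type_traits>
// "Yes" (const char[4]) and "No" (const char[3]) decay inside the conditional
// expression, so its type is const char* regardless of what you store it in.
static_assert(std::is_same_v<decltype(false ? "Yes" : "No"), const char*>,
              "the conditional expression yields a const char*");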
According to quick-bench.com, the std::string version is ~6 times slower, which is understandable as it requires an extra dynamic allocation.
Unless you need the extra features of std::string, you may as well stay with const char*.
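For reference, a sketch of the kind of micro-benchmark such a figure could come from (quick-bench.com runs Google Benchmark; the function names below are invented, and the ~6x number above is the original measurement, not something this sketch claims to reproduce):
#include <string>
#include <benchmark/benchmark.h>
static void BM_cstr(benchmark::State& state) {
    for (auto _ : state) {
        bool isSet = false;
        benchmark::DoNotOptimize(isSet);          // keep the branch alive
        const char* s = isSet ? "Yes" : "No";
        benchmark::DoNotOptimize(s);
    }
}
BENCHMARK(BM_cstr);
static void BM_string(benchmark::State& state) {
    for (auto _ : state) {
        bool isSet = false;
        benchmark::DoNotOptimize(isSet);
        std::string s = isSet ? "Yes" : "No";     // constructs a std::string each iteration
        benchmark::DoNotOptimize(s);
    }
}
BENCHMARK(BM_string);
BENCHMARK_MAIN();  // for a standalone build; quick-bench.com supplies its own main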
Equivalent C++ code:
#include <cstdio>
#include <string>
using namespace std::string_literals;
void test_cppstr (bool isSet)
{
const std::string& isSetStr = isSet ? "Yes"s : "No"s;
printf ( "Flag is set ? : %s\n", isSetStr.c_str());
}
Almost as efficient as the C version.
Edit: this next version has a small setup/teardown overhead at program start and exit (constructing and destroying the global strings), but calls printf with the same efficiency as the C code.
#include <cstdio>
#include <string>
using namespace std::string_literals;
const std::string yes("Yes");
const std::string no("No");
void test_cppstr (bool isSet)
{
const std::string& isSetStr = isSet ? yes : no;
printf ( "Flag is set ? : %s\n", isSetStr.c_str());
}
https://godbolt.org/z/v3ebcsrYE

Why is there no `leave` instruction in the function epilog on x64? [duplicate]

This question already has answers here:
Why does the x86-64 GCC function prologue allocate less stack than the local variables?
(1 answer)
Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
(2 answers)
Closed 4 years ago.
I'm trying to get an idea of how the stack works on x86 and x64 machines. What I observed, however, is that when I write code myself and disassemble it, it differs from the code people provide (e.g. in their questions and tutorials). Here is a little example:
Source
int add(int a, int b) {
int c = 16;
return a + b + c;
}
int main () {
add(3,4);
return 0;
}
x86
add(int, int):
push ebp
mov ebp, esp
sub esp, 16
mov DWORD PTR [ebp-4], 16
mov edx, DWORD PTR [ebp+8]
mov eax, DWORD PTR [ebp+12]
add edx, eax
mov eax, DWORD PTR [ebp-4]
add eax, edx
leave (!)
ret
main:
push ebp
mov ebp, esp
push 4
push 3
call add(int, int)
add esp, 8
mov eax, 0
leave (!)
ret
Now the x64 version:
add(int, int):
push rbp
mov rbp, rsp
(?) where is `sub rsp, X`?
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov DWORD PTR [rbp-4], 16
mov edx, DWORD PTR [rbp-20]
mov eax, DWORD PTR [rbp-24]
add edx, eax
mov eax, DWORD PTR [rbp-4]
add eax, edx
(?) where is `mov rsp, rbp` before popping rbp?
pop rbp
ret
main:
push rbp
mov rbp, rsp
mov esi, 4
mov edi, 3
call add(int, int)
mov eax, 0
(?) where is `mov rsp, rbp` before popping rbp?
pop rbp
ret
As you can see, my main confusion is that when I compile for x86 I see what I expect. For x64, however, the leave instruction is missing, as is the equivalent sequence mov rsp, rbp followed by pop rbp. What's wrong?
UPDATE
It seems like leave is missing simply because rsp was never altered in the first place, so there is nothing to restore. But then another question follows: why is there no allocation for local variables in the frame?
To this question @melpomene gives a pretty straightforward answer: the "red zone". That basically means a function that calls no further functions (a leaf function) can use the 128 bytes below the stack pointer without allocating space. So if I insert a call inside add() to any other dummy function, sub rsp, X and add rsp, X will be added to the prologue and epilogue respectively.
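A small sketch of that experiment (dummy is an invented helper): once add() calls another function it is no longer a leaf, so its locals can no longer live only in the red zone and the compiler has to adjust rsp again.
void dummy() {}
int add(int a, int b) {
    int c = 16;
    dummy();          // any call makes add() a non-leaf function
    return a + b + c; // so expect stack allocation in the prologue again
}
int main() {
    add(3, 4);
    return 0;
}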

Passing an r-value reference to a constructor to reduce copies

I have the following code:
#include <stdio.h>
#include <utility>
class A
{
public: // member functions
explicit A(int && Val)
{
_val = std::move(Val); // \2\
}
virtual ~A(){}
private: // member variables
int _val = 0;
private: // member functions
A(const A &) = delete;
A& operator = (const A &) = delete;
A(A &&) = delete;
A&& operator = (A &&) = delete;
};
int main()
{
A a01{3}; // \1\
return 0;
}
I would like to ask how many copies did I make from \1\ to \2\?
Your code doesn't compile, but after making the changes needed for it to compile, it does nothing and compiles into this x86 assembly because none of its values are ever used:
main:
xor eax, eax
ret
https://godbolt.org/z/q70EMb
Modifying the code so that it requires the output of the _val member variable (with a print statement) shows that with optimizations it simply moves the value 0x03 into a register and prints it:
.LC0:
.string "%d\n"
main:
sub rsp, 8
mov esi, 3
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
https://godbolt.org/z/JG73Ll
If you disable optimizations in an attempt to get the compiler to output a more verbose version of the program:
A::A(int&&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov QWORD PTR [rbp-16], rsi
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 0
mov rax, QWORD PTR [rbp-16]
mov rdi, rax
call std::remove_reference<int&>::type&& std::move<int&>(int&)
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
nop
leave
ret
.LC0:
.string "%d\n"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
lea rdx, [rbp-4]
lea rax, [rbp-8]
mov rsi, rdx
mov rdi, rax
call A::A(int&&)
mov eax, DWORD PTR [rbp-8]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
std::remove_reference<int&>::type&& std::move<int&>(int&):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
pop rbp
ret
https://godbolt.org/z/ZTK40d
The answer to your question depends on how your program is compiled and how copy elision is applied, as well as whether there is any benefit, in the case of an int, to not "copying" the value, since an int* and an int likely take up the same amount of memory.
You are merely assigning a value, not copying. Nevertheless, you can have a static member in your class that is incremented every time the constructor is called:
class A
{
public: // member functions
inline static int counter = 0; // inline (C++17) so the in-class initializer of a non-const static member is valid
explicit A(int && Val)
{
_val = std::move(Val); // \2\
counter++;
}
....
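A possible usage sketch (assuming the rest of class A stays as in the question and counter remains public):
int main()
{
    A a01{3};
    A a02{5};
    // Each construction runs the int&& constructor once; no A objects are
    // copied or moved, so counter simply counts constructions.
    printf("constructor calls: %d\n", A::counter); // expected: 2
    return 0;
}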