Are arguments loaded into the cache for empty functions? - c++

I know that C++ compilers optimize empty (static) functions.
Based on that knowledge I wrote a piece of code that should get optimized away whenever I some identifier is defined (using the -D option of the compiler).
Consider the following dummy example:
#include <iostream>
#ifdef NO_INC
struct T {
static inline void inc(int& v, int i) {}
};
#else
struct T {
static inline void inc(int& v, int i) {
v += i;
}
};
#endif
int main(int argc, char* argv[]) {
int a = 42;
for (int i = 0; i < argc; ++i)
T::inc(a, i);
std::cout << a;
}
The desired behavior would be the following:
Whenever the NO_INC identifier is defined (using -DNO_INC when compiling), all calls to T::inc(...) should be optimized away (due to the empty function body). Otherwise, the call to T::inc(...) should trigger an increment by some given value i.
I got two questions regarding this:
Is my assumption correct that calls to T::inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
I wonder if the variables (a and i) are still loaded into the cache when T::inc(a, i) is called (assuming they are not there yet) although the function body is empty.
Thanks for any advice!

Compiler Explorer is an very useful tool to look at the assembly of your generated program, because there is no other way to figure out if the compiler optimized something or not for sure. Demo.
With actually incrementing, your main looks like:
main: # #main
push rax
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea esi, [rcx + rdi]
add esi, 41
jmp .LBB0_3
.LBB0_1:
mov esi, 42
.LBB0_3:
mov edi, offset std::cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
As you can see, the compiler completely inlined the call to T::inc and does the incrementing directly.
For an empty T::inc you get:
main: # #main
push rax
mov edi, offset std::cout
mov esi, 42
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
The compiler optimized away the entire loop!
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
Yes.
If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
No, for some definition of "complex". Compilers use heuristics to determine whether it's worth it to inline a function or not, and bases its decision on that and on nothing else.
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
No, as demonstrated above, the loop doesn't even exist.

Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized? If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
You are right. I have modified your example (i.e. removed cout which clutters the assembly) in compiler explorer to make it more obvious what happens.
The compiler optimizes everything away and outouts
main: # #main
movl $42, %eax
retq
Only 42 is leaded in eax and returned.
For the more complex case, however, more instructions are needed to compute the return value. See here
main: # #main
testl %edi, %edi
jle .LBB0_1
leal -1(%rdi), %eax
leal -2(%rdi), %ecx
imulq %rax, %rcx
shrq %rcx
leal (%rcx,%rdi), %eax
addl $41, %eax
retq
.LBB0_1:
movl $42, %eax
retq
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
They are only loaded, when the compiler cannot reason that they are unused. See the second example of compiler explorer.
By the way: You do not need to make an instance of T (i.e. T t;) in order to call a static function within a class. This is defeating the purpose. Call it like T::inc(...) rahter than t.inc(...).

Because the inline keword is used, you can safely assume 1. Using these functions shouldn't negatively affect performance.
Running your code through
g++ -c -Os -g
objdump -S
confirms this; An extract:
int main(int argc, char* argv[]) {
T t;
int a = 42;
1020: b8 2a 00 00 00 mov $0x2a,%eax
for (int i = 0; i < argc; ++i)
1025: 31 d2 xor %edx,%edx
1027: 39 fa cmp %edi,%edx
1029: 7d 06 jge 1031 <main+0x11>
v += i;
102b: 01 d0 add %edx,%eax
for (int i = 0; i < argc; ++i)
102d: ff c2 inc %edx
102f: eb f6 jmp 1027 <main+0x7>
t.inc(a, i);
return a;
}
1031: c3 retq
(I replaced the cout with return for better readability)

Related

Which is faster, a struct, or a primitive variable containing the same bytes?

Here is an example piece of code:
#include <stdint.h>
#include <iostream>
typedef struct {
uint16_t low;
uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;
int main() {
//simply to make the answer unknowable at compile time
uint16_t input;
cin >> input;
A a = {15,input};
B b = 0x000f0000 + input;
//a equals b
int resultA = a.low-a.high;
int resultB = b&0xffff - (b>>16)&0xffff;
//use the variables so the optimiser doesn't get rid of everything
return resultA+resultB;
}
Both resultA and resultB calculate the exact same thing - but which is faster (assuming you don't know the answer at compile time).
I tried using Compiler Explorer to look at the output, and I got something - but with any optimisation no matter what I tried it outsmarted me and optimised the whole calculation away (at first, it optimised everything away since it's not used) - I tried using cin to make the answer unknowable at runtime, but then I couldn't even figure out how it was getting the answer at all (I think it managed to still figure it out at compile time?)
Here is the output of Compiler Explorer with no optimisation flag:
push rbp
mov rbp, rsp
sub rsp, 32
mov dword ptr [rbp - 4], 0
movabs rdi, offset std::cin
lea rsi, [rbp - 6]
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
mov word ptr [rbp - 16], 15
mov ax, word ptr [rbp - 6]
mov word ptr [rbp - 14], ax
movzx eax, word ptr [rbp - 6]
add eax, 983040
mov dword ptr [rbp - 20], eax
Begin calculating result A
movzx eax, word ptr [rbp - 16]
movzx ecx, word ptr [rbp - 14]
sub eax, ecx
mov dword ptr [rbp - 24], eax
End of calculation
Begin calculating result B
mov eax, dword ptr [rbp - 20]
mov edx, dword ptr [rbp - 20]
shr edx, 16
mov ecx, 65535
sub ecx, edx
and eax, ecx
and eax, 65535
mov dword ptr [rbp - 28], eax
End of calculation
mov eax, dword ptr [rbp - 24]
add eax, dword ptr [rbp - 28]
add rsp, 32
pop rbp
ret
I will also post the -O1 output, but I can't make any sense of it (I'm quite new to low level assembly stuff).
main: # #main
push rax
lea rsi, [rsp + 6]
mov edi, offset std::cin
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
movzx ecx, word ptr [rsp + 6]
mov eax, ecx
and eax, -16
sub eax, ecx
add eax, 15
pop rcx
ret
Something to consider. While doing operations with the integer is slightly harder, simply accessing it as an integer easier compared to the struct (which you'd have to convert with bitshifts I think?). Does this make a difference?
This originally came up in the context of memory, where I saw someone map a memory address to a struct with a field for the low bits and the high bits. I thought this couldn't possibly be faster than simply using an integer of the right size and bitshifting if you need the low or high bits. In this specific situation - which is faster?
[Why did I add C to the tag list? While the example code I used is in C++, the concept of struct vs variable is very applicable to C too]
Other than the fact that some ABIs require that structs be passed differently than integers, there won't be a difference.
Now, there are important semantic differences between two 16 bit ints and one 32 bit int. If you add to the lower 16 bit int, it will not "overflow" into the higher one, while if you add to the lower 16 bits of a 32 bit int, it will. This difference in possible behavior (even if you, yourself, "know" it could not happen in your code) could change what assembly code is generated by your compiler, and impact performance.
Which of those two would result in a faster result is not going to be knowable without actually testing or a full description of the actual exact problem. So it is a toss up there.
Which means the only real concern is the ABI one. This means, without whole program optimization, a function taking a struct and a function taking an int with the same binary layout will have a different assumptions about where the data is.
This only matters for by-value single arguments however.
The 90/10 rule applies; 90% of your code runs for less than 10% of the time. The odds are this will have no impact on your critical path.
When trying to answer questions of performance, examining unoptimized code is largely irrelevant.
As a matter of fact, even examining the results of -O1 optimization is not particularly useful, because it does not give you the best that the compiler can achieve. You should try at least -O2.
Regardless of the above, the sample code you provided is unsuitable for examination, because you should be making sure that the values of a and b are separately unknowable by the compiler. As the code stands, the compiler does not know what the value of input is, but it does know that a and b will have the same value, so it optimizes the code in ways that make it impossible to derive any useful conclusions from it.
As a general rule, compilers tend to do an exceptionally good job when dealing with structs that fit within machine words, to the point where generally, there is absolutely no performance difference between the two scenarios you are considering, and between any of the special cases you are pondering about.
Using GCC on compiler explorer the version with the struct produces fewer instructions in -O3 mode.
Code:
#include <stdint.h>
typedef struct {
uint16_t low;
uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;
int f1(A a)
{
return a.low - a.high;
}
int f2(B b)
{
return b&0xffff - (b>>16)&0xffff;
}
Assembly:
_Z2f11A:
movzwl %di, %eax
shrl $16, %edi
subl %edi, %eax
ret
_Z2f2j:
movl %edi, %edx
movl $65535, %eax
shrl $16, %edx
subl %edx, %eax
andl %edi, %eax
ret
But this might be because the two functions don't do the same thing as - has a higher precedence than &. When comparing the B case which does the same thing as A, then the exact same assembly is produced.
Code:
int f3(B b)
{
return (b&0xffff) - ((b>>16)&0xffff);
}
Assembly:
_Z2f3j:
movzwl %di, %eax
shrl $16, %edi
subl %edi, %eax
ret
Note that the only way to find out if something is faster is to benchmark it in a real world use case.

Why is there no `mov %rsp, %rbp` in function prologue?

In assembly, many functions begin with the following prologue:
00000001004010e0: main(int, char**)+0 push %rbp
00000001004010e1: main(int, char**)+1 mov %rsp,%rbp
Some functions, like the one below, do not:
int MainEntry(){
MainEntry():
0000000100401104: MainEntry()+0 push %rbp
0000000100401105: MainEntry()+1 push %rbx
0000000100401106: MainEntry()+2 sub $0x48,%rsp
000000010040110a: MainEntry()+6 lea 0x80(%rsp),%rbp
vector<int> v;
0000000100401112: MainEntry()+14 lea -0x60(%rbp),%rax
0000000100401116: MainEntry()+18 mov %rax,%rcx
0000000100401119: MainEntry()+21 callq 0x100401b00 <std::vector<int, std::allocator<int> >::vector()>
return 0;
000000010040111e: MainEntry()+26 mov $0x0,%ebx
0000000100401123: MainEntry()+31 lea -0x60(%rbp),%rax
0000000100401127: MainEntry()+35 mov %rax,%rcx
000000010040112a: MainEntry()+38 callq 0x100401b20 <std::vector<int, std::allocator<int> >::~vector()>
000000010040112f: MainEntry()+43 mov %ebx,%eax
}
Here is the C++ code that compiles into this:
int main(int c, char** args){
MainEntry();
return 0;
}
int MainEntry(){
vector<int> v;
return 0;
}
So here are my two questions:
In the MainEntry function, there is a push %rbp, and then a push %rbx. Why is RBX pushed onto the stack?
If I understand correctly, sub $0x48, %rsp allocates 0x48 bytes on the stack, and lea 0x80(%rsp), %rbp moves 0x80 bytes down on the stack and assigns that as the base. Where is RBP going to end up in the local stack frame and how did it get there?
rbx is pushed onto the stack because the calling convention says it is preserved across calls.
This function is compiled without frame pointers. rbp is just another general purpose register when compiling without frame pointers.
About the question in the title (now improved)
The push rsp, rbp instruction doesn't exist. push always takes one argument. Perhaps you meant to ask why rbp isn't pushed. The answer is that nothing uses it and so no instructions are required to preserve it.

C/C++ : is using the result of comparison as int really branchless?

I have seen in many SO answers that kind of code:
template <typename T>
inline T imax (T a, T b)
{
return (a > b) * a + (a <= b) * b;
}
Where authors say that this branchless.
But is this really branchless on current architectures? (x86, ARM...)
And is there a real standard guarantee that this is branchless?
x86 has the SETcc family of instructions which set a byte register to 1 or 0 depending on the value of a flag. This is commonly used by compilers to implement this kind of code without branches.
If you use the “naïve” approach
int imax(int a, int b) {
return a > b ? a : b;
}
The compiler would generate even more efficient branch-less code using the CMOVcc (conditional move) family of instructions.
ARM has the ability to conditionally execute every instruction which allowed the compiler to compile both your and the naïve implementation efficiently, the naïve implementation being faster.
I stumbled upon this SO question because I was asking me the same… turns out it’s not always. For instance, the following code…
const struct op {
const char *foo;
int bar;
int flags;
} ops[] = {
{ "foo", 5, 16 },
{ "bar", 9, 16 },
{ "baz", 13, 0 },
{ 0, 0, 0 }
};
extern int foo(const struct op *, int);
int
bar(void *a, void *b, int c, const struct op *d)
{
c |= (a == b) && (d->flags & 16);
return foo(d, c) + 1;
}
… compiles to branching code with both gcc 3.4.6 (i386) and 8.3.0 (amd64, i386) in all optimisation levels. The one from 3.4.6 is more manually legibe, I’ll demonstrate with gcc -O2 -S -masm=intel x.c; less x.s:
[…]
.text
.p2align 2,,3
.globl bar
.type bar , #function
bar:
push %ebp
mov %ebp, %esp
push %ebx
push %eax
mov %eax, DWORD PTR [%ebp+12]
xor %ecx, %ecx
cmp DWORD PTR [%ebp+8], %eax
mov %edx, DWORD PTR [%ebp+16]
mov %ebx, DWORD PTR [%ebp+20]
je .L4
.L2:
sub %esp, 8
or %edx, %ecx
push %edx
push %ebx
call foo
inc %eax
mov %ebx, DWORD PTR [%ebp-4]
leave
ret
.p2align 2,,3
.L4:
test BYTE PTR [%ebx+8], 16
je .L2
mov %cl, 1
jmp .L2
.size bar , . - bar
Turns out the pointer comparison operation invokes a comparison and even a subroutine to insert 1.
Not even using !!(a == b) makes a difference here.
tl;dr
Check the actual compiler output (assembly with -S or disassembly with objdump -d -Mintel x.o; drop the -Mintel if not on x86, it merely makes the assembly more legible) of the actual compilation; compilers are unpredictable beasts.

How are non-POD static values initialized? [duplicate]

This question already has answers here:
What is the lifetime of a static variable in a C++ function?
(5 answers)
Closed 8 years ago.
C++, unlike some other languages, allows static data to be of any arbitrary type, not just plain-old-data. Plain-old-data is trivial to initialize (the compiler just writes the value at the appropriate address in the data segment), but the other, more complex types, are not.
How is initialization of non-POD types typically implemented in C++? In particular, what exactly happens when the function foo is executed for the first time? What mechanisms are used to keep track of whether str has already been initialized or not?
#include <string>
void foo() {
static std::string str("Hello, Stack Overflow!");
}
C++11 requires the initialization of function local static variables to be thread-safe. So at least in compilers that are compliant, there'll typically be some sort of synchronization primitive in use that'll need to be checked each time the function is entered.
For example, here's the assembly listing for the code from this program:
#include <string>
void foo() {
static std::string str("Hello, Stack Overflow!");
}
int main() {}
.LC0:
.string "Hello, Stack Overflow!"
foo():
cmpb $0, guard variable for foo()::str(%rip)
je .L14
ret
.L14:
pushq %rbx
movl guard variable for foo()::str, %edi
subq $16, %rsp
call __cxa_guard_acquire
testl %eax, %eax
jne .L15
.L1:
addq $16, %rsp
popq %rbx
ret
.L15:
leaq 15(%rsp), %rdx
movl $.LC0, %esi
movl foo()::str, %edi
call std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)
movl guard variable for foo()::str, %edi
call __cxa_guard_release
movl $__dso_handle, %edx
movl foo()::str, %esi
movl std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string(), %edi
call __cxa_atexit
jmp .L1
movq %rax, %rbx
movl guard variable for foo()::str, %edi
call __cxa_guard_abort
movq %rbx, %rdi
call _Unwind_Resume
main:
xorl %eax, %eax
ret
The __cxa_guard_acquire, __cxa_guard_release etc. are guarding initialization of the static variable.
The implementation that I've seen uses a hidden boolean variable to check if the variable is initialized. Modern compiler will do this thread-safely, but IIRC, some older compilerd did not do that, and if it was called from several threads at the same time you could get the constructor called twice.
Something along the lines of:
static bool __str_initialized = false;
static char __mem_for_str[...]; //std::string str("Hello, Stack Overflow!");
void foo() {
if (!__str_initialized)
{
lock();
__str_initialized = true;
new (__mem_for_str) std::string("Hello, Stack Overflow!");
unlock();
}
}
Then, in the finalization code of the program:
if (__str_initialized)
((std::string&)__mem_for_str).~std::string();
It's implementation specific.
Typically, there'll be a flag (statically initialised to zero) to indicate whether it's initialised, and (in C++11, or earlier thread-safe implementations) some kind of mutex, also statically initialisable, to protect against multiple threads trying to in initialise it.
The generated code would typically behave something along the lines of
static __atomic_flag_type __initialised = false;
static __mutex_type __mutex = __MUTEX_INITIALISER;
if (!__initialised) {
__lock_type __lock(__mutex);
if (!__initialised) {
__initialise(str);
__initialised = true;
}
}
You can check what your compiler does by generating an assembler listing.
MSVC2008 in debug mode generates this code (excluding exception handling prolog/epilog etc):
mov eax, DWORD PTR ?$S1#?1??foo##YA_NXZ#4IA
and eax, 1
jne SHORT $LN1#foo
mov eax, DWORD PTR ?$S1#?1??foo##YA_NXZ#4IA
or eax, 1
mov DWORD PTR ?$S1#?1??foo##YA_NXZ#4IA, eax
mov DWORD PTR __$EHRec$[ebp+8], 0
mov esi, esp
push OFFSET ??_C#_0BH#ENJCLPMJ#Hello?0?5Stack?5Overflow?$CB?$AA#
mov ecx, OFFSET ?str#?1??foo##YA_NXZ#4V?$basic_string#DU?$char_traits#D#std##V?$allocator#D#2##std##A
call DWORD PTR __imp_??0?$basic_string#DU?$char_traits#D#std##V?$allocator#D#2##std##QAE#PBD#Z
cmp esi, esp
call __RTC_CheckEsp
push OFFSET ??__Fstr#?1??foo##YA_NXZ#YAXXZ ; `foo'::`2'::`dynamic atexit destructor for 'str''
call _atexit
add esp, 4
mov DWORD PTR __$EHRec$[ebp+8], -1
$LN1#foo:
i.e there is a static variable referenced by ?$S1#?1??foo##YA_NXZ#4IA this is checked to see if it & 1 is zero. if not it branches to the label $LN1#foo:. Otherwise it or's in 1 to the flag, constructs the string at a known location and then adds a call for its destructor at program exit using 'atexit'. Then continues the function as normal.

Inline with containing complex function

Let's take this for example:
Class TestClass {
public:
int functionInline();
int functionComplex();
};
inline int TestClas::functionInline()
{
// a single instruction
return functionComplex();
}
int TestClas::functionComplex()
{
/* many complex
instructions
*/
}
void myFunc()
{
TestClass testVar;
testVar.functionInline();
}
Suposing that all coments are in fact lines of code that are single line or many and complex lines of code. The equivalent code would be (after compilation):
void myFunc()
{
TestClass testVar;
// a single instruction
return functionComplex();
}
or would be:
void myFunc()
{
TestClass testVar;
// a single instruction
/* many complex
instructions
*/
}
In other words, would a normal function be inserted inline if called inside an inline function or not?
If the compiler can see that the function is not called anywhere else (e.g. it is static in the case of a free function), then at least gcc has inlined it for a long time.
Of course, this also assumes the compiler can actually "see" the source code of the function - only if you use "whole program optimisation" (available in at least MS and GCC compilers), does it inline functions that aren't either in the source file or headers included in the source.
Obviously, inlining a "large" function has very little benefit (because the overhead of making the call is such a small portion of the total runtime), and if the function gets called more than once (or "may be called more than once" by not being static), the compiler will almost certainly not inline a "large" function.
In summary: maybe the large function is inline, but quite likely not.
Please check the assembly code that I generated both for VC++ 2010 and g++.
Both the compilers dont actually treat any of the function as inline in this example.
Code:
class TestClass {
public:
int functionInline();
int functionComplex();
};
inline int TestClass::functionInline()
{
// a single instruction
return functionComplex();
}
int TestClass::functionComplex()
{
/* many complex
instructions
*/
return 0;
}
int main(){
TestClass t;
t.functionInline();
return 0;
}
VC++ 2010:
int main(){
01372E50 push ebp
01372E51 mov ebp,esp
01372E53 sub esp,0CCh
01372E59 push ebx
01372E5A push esi
01372E5B push edi
01372E5C lea edi,[ebp-0CCh]
01372E62 mov ecx,33h
01372E67 mov eax,0CCCCCCCCh
01372E6C rep stos dword ptr es:[edi]
TestClass t;
t.functionInline();
01372E6E lea ecx,[t]
01372E71 call TestClass::functionInline (1371677h)
return 0;
01372E76 xor eax,eax
}
Linux G++:
main:
.LFB3:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
leaq -1(%rbp), %rax
movq %rax, %rdi
call _ZN9TestClass14functionInlineEv
movl $0, %eax
leave
ret
.cfi_endproc
Both the lines
01372E71 call TestClass::functionInline (1371677h)
and
call _ZN9TestClass14functionInlineEv
indicate that the function functionInline is not inline.
Now have a look at functionInline assembly:
inline int TestClass::functionInline()
{
01372E00 push ebp
01372E01 mov ebp,esp
01372E03 sub esp,0CCh
01372E09 push ebx
01372E0A push esi
01372E0B push edi
01372E0C push ecx
01372E0D lea edi,[ebp-0CCh]
01372E13 mov ecx,33h
01372E18 mov eax,0CCCCCCCCh
01372E1D rep stos dword ptr es:[edi]
01372E1F pop ecx
01372E20 mov dword ptr [ebp-8],ecx
// a single instruction
return functionComplex();
01372E23 mov ecx,dword ptr [this]
01372E26 call TestClass::functionComplex (1371627h)
}
Hence, functionComplex is not also inline.
No it would not be inlined. It is impossible beacause the compiler does not have available the body definition of a non-inlined function that could be located in another translation unit. I suppose normal function is a non-inlined function.
No, if you want your complex function be inserted inline, you must specify the inline keyword too.
In practice, use __forceinline keyword (on windows, __always_inline on linux) otherwise the compiler will ignore the keyword if there is a lot of instructions.
First of all inline function is just a directive to the compiler. It is not guaranteed that compiler will do the inlining.
Secondly, when you specify a function as inline, it tells compiler two things
1) Function might be a candidate for inlining. Whether it is going to be inlined is not guaranteed
2) This function has internal linkage. That is, function will be visible only in the translation unit it is compiled. This internal linkage is guaranteed irrespective of whether the function is actually inlined or not.
In you case, functionInline is specified as inline but functionComplex is not. functionComplex has external linkage. Compiler will never do the inlining of function with external linkage.
So, a plain answer to your question is "No" a normal (function without inline keyword and defined outside class)function will never be inlined