Let's take this for example:
Class TestClass {
public:
int functionInline();
int functionComplex();
};
inline int TestClas::functionInline()
{
// a single instruction
return functionComplex();
}
int TestClas::functionComplex()
{
/* many complex
instructions
*/
}
void myFunc()
{
TestClass testVar;
testVar.functionInline();
}
Suposing that all coments are in fact lines of code that are single line or many and complex lines of code. The equivalent code would be (after compilation):
void myFunc()
{
TestClass testVar;
// a single instruction
return functionComplex();
}
or would be:
void myFunc()
{
TestClass testVar;
// a single instruction
/* many complex
instructions
*/
}
In other words, would a normal function be inserted inline if called inside an inline function or not?
If the compiler can see that the function is not called anywhere else (e.g. it is static in the case of a free function), then at least gcc has inlined it for a long time.
Of course, this also assumes the compiler can actually "see" the source code of the function - only if you use "whole program optimisation" (available in at least MS and GCC compilers), does it inline functions that aren't either in the source file or headers included in the source.
Obviously, inlining a "large" function has very little benefit (because the overhead of making the call is such a small portion of the total runtime), and if the function gets called more than once (or "may be called more than once" by not being static), the compiler will almost certainly not inline a "large" function.
In summary: maybe the large function is inline, but quite likely not.
Please check the assembly code that I generated both for VC++ 2010 and g++.
Both the compilers dont actually treat any of the function as inline in this example.
Code:
class TestClass {
public:
int functionInline();
int functionComplex();
};
inline int TestClass::functionInline()
{
// a single instruction
return functionComplex();
}
int TestClass::functionComplex()
{
/* many complex
instructions
*/
return 0;
}
int main(){
TestClass t;
t.functionInline();
return 0;
}
VC++ 2010:
int main(){
01372E50 push ebp
01372E51 mov ebp,esp
01372E53 sub esp,0CCh
01372E59 push ebx
01372E5A push esi
01372E5B push edi
01372E5C lea edi,[ebp-0CCh]
01372E62 mov ecx,33h
01372E67 mov eax,0CCCCCCCCh
01372E6C rep stos dword ptr es:[edi]
TestClass t;
t.functionInline();
01372E6E lea ecx,[t]
01372E71 call TestClass::functionInline (1371677h)
return 0;
01372E76 xor eax,eax
}
Linux G++:
main:
.LFB3:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
leaq -1(%rbp), %rax
movq %rax, %rdi
call _ZN9TestClass14functionInlineEv
movl $0, %eax
leave
ret
.cfi_endproc
Both the lines
01372E71 call TestClass::functionInline (1371677h)
and
call _ZN9TestClass14functionInlineEv
indicate that the function functionInline is not inline.
Now have a look at functionInline assembly:
inline int TestClass::functionInline()
{
01372E00 push ebp
01372E01 mov ebp,esp
01372E03 sub esp,0CCh
01372E09 push ebx
01372E0A push esi
01372E0B push edi
01372E0C push ecx
01372E0D lea edi,[ebp-0CCh]
01372E13 mov ecx,33h
01372E18 mov eax,0CCCCCCCCh
01372E1D rep stos dword ptr es:[edi]
01372E1F pop ecx
01372E20 mov dword ptr [ebp-8],ecx
// a single instruction
return functionComplex();
01372E23 mov ecx,dword ptr [this]
01372E26 call TestClass::functionComplex (1371627h)
}
Hence, functionComplex is not also inline.
No it would not be inlined. It is impossible beacause the compiler does not have available the body definition of a non-inlined function that could be located in another translation unit. I suppose normal function is a non-inlined function.
No, if you want your complex function be inserted inline, you must specify the inline keyword too.
In practice, use __forceinline keyword (on windows, __always_inline on linux) otherwise the compiler will ignore the keyword if there is a lot of instructions.
First of all inline function is just a directive to the compiler. It is not guaranteed that compiler will do the inlining.
Secondly, when you specify a function as inline, it tells compiler two things
1) Function might be a candidate for inlining. Whether it is going to be inlined is not guaranteed
2) This function has internal linkage. That is, function will be visible only in the translation unit it is compiled. This internal linkage is guaranteed irrespective of whether the function is actually inlined or not.
In you case, functionInline is specified as inline but functionComplex is not. functionComplex has external linkage. Compiler will never do the inlining of function with external linkage.
So, a plain answer to your question is "No" a normal (function without inline keyword and defined outside class)function will never be inlined
Related
Here is a snippet of the code I have:
class modbus {
public:
static const uint8_t modbusHeader = 2;
static const uint8_t modbusCRC = 2;
static const uint8_t modbusPDU = modbusHeader + modbusCRC;
static const uint8_t exceptionBase = 0x80;
static const uint32_t transmitTimeout = 5000;
};
It defines some sizes for the modbus packets that I need to create inside the class. I work inside an embedded environment and as such size optimisations and considerations are always there. As such I really want to have only one occurrence of these constant values inside my read only part of the flash.
I have chosen to set these variables as static but is this necessary? Would a compiler infer that these values need only be saved once inside the binary and as such only include them once when I remove the static keyword?
I suppose, technically, if the compiler knew that you never performed sizeof on modbus, and never took the address of these members through different modbus* pointers, and knew that they were only ever initialised with the exact same trivial value, it might use the "as-if" rule to merge them into one and remove them from the class in terms of storage. (If it couldn't guarantee just one of these things, the rules of the language would be violated.)
But that's a tall order (particularly when you consider multiple translation units), and would not really be useful.
So no. I don't expect that this would ever happen.
You should indeed make those things static const (with perhaps a sprinkling of constexpr).
The code
#include <cstdint>
class modbus {
public:
static const uint8_t modbusHeader = 2;
};
int main() {
return modbus::modbusHeader;
}
results in
main:
pushq %rbp
movq %rsp, %rbp
movl $2, %eax
popq %rbp
ret
and
#include <cstdint>
class modbus {
public:
const uint8_t modbusHeader = 2;
};
int main() {
return modbus().modbusHeader;
}
results in
modbus::modbus() [base object constructor]:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov BYTE PTR [rax], 2
nop
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov BYTE PTR [rbp-1], 0
lea rax, [rbp-1]
mov rdi, rax
call modbus::modbus() [complete object constructor]
movzx eax, BYTE PTR [rbp-1]
movzx eax, al
leave
ret
At least here is a large difference. I used g++ 9.2 with -O0
If you insist on using C++, you must const-qualify an object of the class, not members of the class. You must then ensure that const variables with static storage duration end up in flash for your given target. Often the tool chain has a build called "flash-something". It's related to the linker rather than the compiler. If that part is working, it shouldn't matter if the object of the class is static or not, as long as the object is declared at file scope (static storage duration).
What you shouldn't do, is to just const qualify class members, rather than the object they reside in. Because that might cause silent C++ bloat to remain in the executable, in the form of default constructors called by the "CRT". You can verify this by single-stepping through the part of the CRT that deals with calling constructors of objects with static storage duration.
As for making members static, you should only need to do that for purposes like singleton, not for making members end up in flash.
I know that C++ compilers optimize empty (static) functions.
Based on that knowledge I wrote a piece of code that should get optimized away whenever I some identifier is defined (using the -D option of the compiler).
Consider the following dummy example:
#include <iostream>
#ifdef NO_INC
struct T {
static inline void inc(int& v, int i) {}
};
#else
struct T {
static inline void inc(int& v, int i) {
v += i;
}
};
#endif
int main(int argc, char* argv[]) {
int a = 42;
for (int i = 0; i < argc; ++i)
T::inc(a, i);
std::cout << a;
}
The desired behavior would be the following:
Whenever the NO_INC identifier is defined (using -DNO_INC when compiling), all calls to T::inc(...) should be optimized away (due to the empty function body). Otherwise, the call to T::inc(...) should trigger an increment by some given value i.
I got two questions regarding this:
Is my assumption correct that calls to T::inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
I wonder if the variables (a and i) are still loaded into the cache when T::inc(a, i) is called (assuming they are not there yet) although the function body is empty.
Thanks for any advice!
Compiler Explorer is an very useful tool to look at the assembly of your generated program, because there is no other way to figure out if the compiler optimized something or not for sure. Demo.
With actually incrementing, your main looks like:
main: # #main
push rax
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea esi, [rcx + rdi]
add esi, 41
jmp .LBB0_3
.LBB0_1:
mov esi, 42
.LBB0_3:
mov edi, offset std::cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
As you can see, the compiler completely inlined the call to T::inc and does the incrementing directly.
For an empty T::inc you get:
main: # #main
push rax
mov edi, offset std::cout
mov esi, 42
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
The compiler optimized away the entire loop!
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
Yes.
If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
No, for some definition of "complex". Compilers use heuristics to determine whether it's worth it to inline a function or not, and bases its decision on that and on nothing else.
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
No, as demonstrated above, the loop doesn't even exist.
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized? If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
You are right. I have modified your example (i.e. removed cout which clutters the assembly) in compiler explorer to make it more obvious what happens.
The compiler optimizes everything away and outouts
main: # #main
movl $42, %eax
retq
Only 42 is leaded in eax and returned.
For the more complex case, however, more instructions are needed to compute the return value. See here
main: # #main
testl %edi, %edi
jle .LBB0_1
leal -1(%rdi), %eax
leal -2(%rdi), %ecx
imulq %rax, %rcx
shrq %rcx
leal (%rcx,%rdi), %eax
addl $41, %eax
retq
.LBB0_1:
movl $42, %eax
retq
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
They are only loaded, when the compiler cannot reason that they are unused. See the second example of compiler explorer.
By the way: You do not need to make an instance of T (i.e. T t;) in order to call a static function within a class. This is defeating the purpose. Call it like T::inc(...) rahter than t.inc(...).
Because the inline keword is used, you can safely assume 1. Using these functions shouldn't negatively affect performance.
Running your code through
g++ -c -Os -g
objdump -S
confirms this; An extract:
int main(int argc, char* argv[]) {
T t;
int a = 42;
1020: b8 2a 00 00 00 mov $0x2a,%eax
for (int i = 0; i < argc; ++i)
1025: 31 d2 xor %edx,%edx
1027: 39 fa cmp %edi,%edx
1029: 7d 06 jge 1031 <main+0x11>
v += i;
102b: 01 d0 add %edx,%eax
for (int i = 0; i < argc; ++i)
102d: ff c2 inc %edx
102f: eb f6 jmp 1027 <main+0x7>
t.inc(a, i);
return a;
}
1031: c3 retq
(I replaced the cout with return for better readability)
Consider this code:
struct A{
volatile int x;
A() : x(12){
}
};
A foo(){
A ret;
//Do stuff
return ret;
}
int main()
{
A a;
a.x = 13;
a = foo();
}
Using g++ -std=c++14 -pedantic -O3 I get this assembly:
foo():
movl $12, %eax
ret
main:
xorl %eax, %eax
ret
According to my estimation the variable x should be written to at least three times (possibly four), yet it not even written once (the function foo isn't even called!)
Even worse when you add the inline keyword to foo this is the result:
main:
xorl %eax, %eax
ret
I thought that volatile means that every single read or write must happen even if the compiler can not see the point of the read/write.
What is going on here?
Update:
Putting the declaration of A a; outside main like this:
A a;
int main()
{
a.x = 13;
a = foo();
}
Generates this code:
foo():
movl $12, %eax
ret
main:
movl $13, a(%rip)
xorl %eax, %eax
movl $12, a(%rip)
ret
movl $12, a(%rip)
ret
a:
.zero 4
Which is closer to what you would expect....I am even more confused then ever
Visual C++ 2015 does not optimize away the assignments:
A a;
mov dword ptr [rsp+8],0Ch <-- write 1
a.x = 13;
mov dword ptr [a],0Dh <-- write2
a = foo();
mov dword ptr [a],0Ch <-- write3
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
}
xor eax,eax
ret
The same happens both with /O2 (Maximize speed) and /Ox (Full optimization).
The volatile writes are kept also by gcc 3.4.4 using both -O2 and -O3
_main:
pushl %ebp
movl $16, %eax
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
call __alloca
call ___main
movl $12, -4(%ebp) <-- write1
xorl %eax, %eax
movl $13, -4(%ebp) <-- write2
movl $12, -8(%ebp) <-- write3
leave
ret
Using both of these compilers, if I remove the volatile keyword, main() becomes essentially empty.
I'd say you have a case where the compiler over-agressively (and incorrectly IMHO) decides that since 'a' is not used, operations on it arent' necessary and overlooks the volatile member. Making 'a' itself volatile could get you what you want, but as I don't have a compiler that reproduces this, I can't say for sure.
Last (while this is admittedly Microsoft specific), https://msdn.microsoft.com/en-us/library/12a04hfd.aspx says:
If a struct member is marked as volatile, then volatile is propagated to the whole structure.
Which also points towards the behavior you are seeing being a compiler problem.
Last, if you make 'a' a global variable, it is somewhat understandable that the compiler is less eager to deem it unused and drop it. Global variables are extern by default, so it is not possible to say that a global 'a' is unused just by looking at the main function. Some other compilation unit (.cpp file) might be using it.
GCC's page on Volatile access gives some insight into how it works:
The standard encourages compilers to refrain from optimizations concerning accesses to volatile objects, but leaves it implementation defined as to what constitutes a volatile access. The minimum requirement is that at a sequence point all previous accesses to volatile objects have stabilized and no subsequent accesses have occurred. Thus an implementation is free to reorder and combine volatile accesses that occur between sequence points, but cannot do so for accesses across a sequence point. The use of volatile does not allow you to violate the restriction on updating objects multiple times between two sequence points.
In C standardese:
ยง5.1.2.3
2 Accessing a volatile object, modifying an object, modifying a file,
or calling a function that does any of those operations are all side
effects, 11) which are changes in the state of the
execution environment. Evaluation of an expression may produce side
effects. At certain specified points in the execution sequence called
sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have
taken place. (A summary of the sequence points is given in annex C.)
3 In the abstract machine, all expressions are evaluated as specified
by the semantics. An actual implementation need not evaluate part of
an expression if it can deduce that its value is not used and that no
needed side effects are produced (including any caused by calling a
function or accessing a volatile object).
[...]
5 The least requirements on a conforming implementation are:
At sequence points, volatile objects are stable in the sense that previous accesses are complete and subsequent accesses have not yet
occurred. [...]
I chose the C standard because the language is simpler but the rules are essentially the same in C++. See the "as-if" rule.
Now on my machine, -O1 doesn't optimize away the call to foo(), so let's use -fdump-tree-optimized to see the difference:
-O1
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
a = foo ();
a ={v} {CLOBBER};
return 0;
}
And -O3:
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A ret;
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
ret.x ={v} 12;
ret ={v} {CLOBBER};
a ={v} {CLOBBER};
return 0;
}
gdb reveals in both cases that a is ultimately optimized out, but we're worried about foo(). The dumps show us that GCC reordered the accesses so that foo() is not even necessary and subsequently all of the code in main() is optimized out. Is this really true? Let's see the assembly output for -O1:
foo():
mov eax, 12
ret
main:
call foo()
mov eax, 0
ret
This essentially confirms what I said above. Everything is optimized out: the only difference is whether or not the call to foo() is as well.
To get a better understanding of binary files I've prepared a small c++ example and used gdb to disassemble and look for the machine code.
The main() function calls the function func():
int func(void)
{
int a;
int b;
int c;
int d;
d = 4;
c = 3;
b = 2;
a = 1;
return 0;
}
The project is compiled with g++ keeping the debugging information. Next gdb is used to disassemble the source code. What I got for func() looks like:
0x00000000004004cc <+0>: push %rbp
0x00000000004004cd <+1>: mov %rsp,%rbp
0x00000000004004d0 <+4>: movl $0x4,-0x10(%rbp)
0x00000000004004d7 <+11>: movl $0x3,-0xc(%rbp)
0x00000000004004de <+18>: movl $0x2,-0x8(%rbp)
0x00000000004004e5 <+25>: movl $0x1,-0x4(%rbp)
0x00000000004004ec <+32>: mov $0x0,%eax
0x00000000004004f1 <+37>: pop %rbp
0x00000000004004f2 <+38>: retq
Now my problem is that I expect that the stack pointer should be moved by 16 bytes to lower addresses relative to the base pointer, since each integer value needs 4 bytes. But it looks like that the values are putted on the stack without moving the stack pointer.
What did I not understand correctly? Is this a problem with the compiler or did the assembler omit some lines?
Best regards,
NouGHt
There's absolutely no problem with your compiler. The compiler is free to choose how to compile your code, and it chose not to modify the stack pointer. There's no need for it to do so since your function doesn't call any other functions. If it did call another function then it would need to create another stack frame so that the callee did not stomp on the caller's stack frame.
As a general rule, you should avoid trying to make any assumptions on how the compiler will compile your code. For example, your compiler would be perfectly at liberty to opimize away the body of your function.
Can linux context switch after unlock in the below code if so we have a problem if two threads call this
inline bool CMyAutoLock::Lock(
pthread_mutex_t *pLock,
bool bBlockOk
)
throw ()
{
Unlock();
if (pLock == NULL)
return (false);
// **** can context switch happen here ? ****///
return ((((bBlockOk)? pthread_mutex_lock(pLock) :
pthread_mutex_trylock(pLock)) == 0)? (m_pLock = pLock, true) : false);
}
No, it's not atomic.
In fact, it may be especially likely for a context switch to occur after you've unlocked a mutex, because the OS knows if another thread is blocked on that mutex. (On the other hand, the OS doesn't even know whether you're executing an inline function.)
Inline functions are not automatically atomic. The inline keyword just means "when compiling this code, try to optimize away the call by replacing the call with the assembly instructions from the body of the call." You could get context-switched out on any of those assembly instructions just as you could in any other assembly instructions, and so you will need to guard the code with a lock.
inline is a complier hint that that suggests that the compiler inline the code into the caller rather than using function call semantics. However, it's just a hint, and isn't always heeded.
Futhermore, even if heeded, the result is that your code gets inlined into the calling function. It doesn't turn your code into an atomic sequence of instructions.
Inline makes a function work like a macro. Inline is not related to atomic in any way.
AFAIK inline is a hint and gcc might ignore it. When inlining, the code from your inline func, call it B, is copied into the caller func, A. There will be no call from A to B. This probably makes your exe faster at the expense of becomming larger. Probably? The exe could become smaller if your inline function is small. The exe could become slower if the inlining made it more difficult to optimize func A. If you don't specify inline, gcc will make the inline decision for you in a lot of cases. Member functions of classes are default inline. You need to explicitly tell gcc to not do automatic inlines. Also, gcc will not inline when optimizations are off.
The linker wont inline. So if module A extern referenced a function marked as inline, but the code was in module B, Module A will make calls to the function rather than inlining it. You have to define the function in the header file, and you have to declare it as extern inline func foo(a,b,c). Its actually a lot more complicated.
inline void test(void); // __attribute__((always_inline));
inline void test(void)
{
int x = 10;
long y = 10;
long long z = 10;
y++;
z = z + 10;
}
int main(int argc, char** argv)
{
test();
return (0);
}
Not Inline:
!{
main+0: lea 0x4(%esp),%ecx
main+4: and $0xfffffff0,%esp
main+7: pushl -0x4(%ecx)
main+10: push %ebp
main+11: mov %esp,%ebp
main+13: push %ecx
main+14: sub $0x4,%esp
! test();
main+17: call 0x8048354 <test> <--- making a call to test.
! return (0);
main()
main+22: mov $0x0,%eax
!}
main+27: add $0x4,%esp
main+30: pop %ecx
main+31: pop %ebp
main+32: lea -0x4(%ecx),%esp
main+35: ret
Inline:
inline void test(void)__attribute__((always_inline));
! int x = 10;
main+17: movl $0xa,-0x18(%ebp) <-- hey this is test code....in main()!
! long y = 10;
main+24: movl $0xa,-0x14(%ebp)
! long long z = 10;
main+31: movl $0xa,-0x10(%ebp)
main+38: movl $0x0,-0xc(%ebp)
! y++;
main+45: addl $0x1,-0x14(%ebp)
! z = z + 10;
main+49: addl $0xa,-0x10(%ebp)
main+53: adcl $0x0,-0xc(%ebp)
!}
!int main(int argc, char** argv)
!{
main+0: lea 0x4(%esp),%ecx
main+4: and $0xfffffff0,%esp
main+7: pushl -0x4(%ecx)
main+10: push %ebp
main+11: mov %esp,%ebp
main+13: push %ecx
main+14: sub $0x14,%esp
! test(); <-- no jump here
! return (0);
main()
main+57: mov $0x0,%eax
!}
main+62: add $0x14,%esp
main+65: pop %ecx
main+66: pop %ebp
main+67: lea -0x4(%ecx),%esp
main+70: ret
The only functions you can be sure are atomic are the gcc atomic builtins. Probably simple one opcode assembly instructions are atomic as well, but they might not be. In my experience so far on 6x86 setting or reading a 32 bit integer is atomic. You can guess if a line of c code could be atomic by looking at the generated assembly code.
The above code was compile in 32 bit mode. You can see that the long long takes 2 opcodes to load up. I am guessing that isn't atomic. The ints and longs take one opcode to set. Probably atomic. y++ is implemented with addl, which is probably atomic. I keep saying probably because the microcode on the cpu could use more than one instruction to implement an op, and knowledge of this is above my pay grade. I assume that all 32 bit writes and reads are atomic. I assume that increments are not, because they generally are performed with a read and a write.
But check this out, when compiled in 64 bit
! int x = 10;
main+11: movl $0xa,-0x14(%rbp)
! long y = 10;
main+18: movq $0xa,-0x10(%rbp)
! long long z = 10;
main+26: movq $0xa,-0x8(%rbp)
! y++;
main+34: addq $0x1,-0x10(%rbp)
! z = z + 10;
main+39: addq $0xa,-0x8(%rbp)
!}
!int main(int argc, char** argv)
!{
main+0: push %rbp
main+1: mov %rsp,%rbp
main+4: mov %edi,-0x24(%rbp)
main+7: mov %rsi,-0x30(%rbp)
! test();
! return (0);
main()
main+44: mov $0x0,%eax
!}
main+49: leaveq
main+50: retq
I'm guessing that addq could be atomic.
Most statements aren't atomic. Your thread may be interrupted in the middle of an ++i operation. The rule is that anything is not atomic unless it's specifically, explicitly defined as being atomic.