Can Linux context-switch after the Unlock() in the code below? If so, we have a problem if two threads call this:
inline bool CMyAutoLock::Lock(
    pthread_mutex_t *pLock,
    bool bBlockOk
    ) throw ()
{
    Unlock();
    if (pLock == NULL)
        return (false);
    // **** can a context switch happen here? ****
    return ((((bBlockOk) ? pthread_mutex_lock(pLock) :
        pthread_mutex_trylock(pLock)) == 0) ? (m_pLock = pLock, true) : false);
}
No, it's not atomic.
In fact, it may be especially likely for a context switch to occur after you've unlocked a mutex, because the OS knows if another thread is blocked on that mutex. (On the other hand, the OS doesn't even know whether you're executing an inline function.)
Inline functions are not automatically atomic. The inline keyword just means "when compiling this code, try to optimize away the call by replacing the call with the assembly instructions from the body of the call." You could get context-switched out on any of those assembly instructions just as you could in any other assembly instructions, and so you will need to guard the code with a lock.
inline is a compiler hint that suggests the compiler inline the code into the caller rather than using function-call semantics. However, it's just a hint, and it isn't always heeded.
Furthermore, even if it is heeded, the result is simply that your code gets inlined into the calling function. It doesn't turn your code into an atomic sequence of instructions.
Inline makes a function work somewhat like a macro. inline is not related to atomicity in any way.
AFAIK inline is a hint, and gcc may ignore it. When inlining, the code from your inline function, call it B, is copied into the calling function, A. There will be no call from A to B. This probably makes your executable faster, at the expense of becoming larger. Probably? The executable could become smaller if your inline function is small, and it could become slower if the inlining made it harder to optimize function A. If you don't specify inline, gcc makes the inlining decision for you in a lot of cases. Member functions defined inside the class body are inline by default. You need to explicitly tell gcc not to do automatic inlining, and gcc will not inline at all when optimizations are off.
The linker won't inline. So if module A references a function marked inline whose code is in module B, module A will make calls to the function rather than inlining it. You have to define the function in the header file, and declare it extern inline. It's actually a lot more complicated than that.
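As a rough sketch of that header rule (the file and function names here are hypothetical):

// add.h -- the inline definition lives in the header, so every
// translation unit that calls it can see the body.
inline int add(int a, int b)
{
    return a + b;
}

// a.cpp and b.cpp may both #include "add.h"; defining the function in
// each translation unit is allowed precisely because it is inline.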
inline void test(void); // __attribute__((always_inline));
inline void test(void)
{
int x = 10;
long y = 10;
long long z = 10;
y++;
z = z + 10;
}
int main(int argc, char** argv)
{
test();
return (0);
}
Not Inline:
!{
main+0: lea 0x4(%esp),%ecx
main+4: and $0xfffffff0,%esp
main+7: pushl -0x4(%ecx)
main+10: push %ebp
main+11: mov %esp,%ebp
main+13: push %ecx
main+14: sub $0x4,%esp
! test();
main+17: call 0x8048354 <test> <--- making a call to test.
! return (0);
main()
main+22: mov $0x0,%eax
!}
main+27: add $0x4,%esp
main+30: pop %ecx
main+31: pop %ebp
main+32: lea -0x4(%ecx),%esp
main+35: ret
Inline:
inline void test(void)__attribute__((always_inline));
! int x = 10;
main+17: movl $0xa,-0x18(%ebp) <-- hey this is test code....in main()!
! long y = 10;
main+24: movl $0xa,-0x14(%ebp)
! long long z = 10;
main+31: movl $0xa,-0x10(%ebp)
main+38: movl $0x0,-0xc(%ebp)
! y++;
main+45: addl $0x1,-0x14(%ebp)
! z = z + 10;
main+49: addl $0xa,-0x10(%ebp)
main+53: adcl $0x0,-0xc(%ebp)
!}
!int main(int argc, char** argv)
!{
main+0: lea 0x4(%esp),%ecx
main+4: and $0xfffffff0,%esp
main+7: pushl -0x4(%ecx)
main+10: push %ebp
main+11: mov %esp,%ebp
main+13: push %ecx
main+14: sub $0x14,%esp
! test(); <-- no jump here
! return (0);
main()
main+57: mov $0x0,%eax
!}
main+62: add $0x14,%esp
main+65: pop %ecx
main+66: pop %ebp
main+67: lea -0x4(%ecx),%esp
main+70: ret
The only operations you can be sure are atomic are the GCC atomic builtins. Simple one-opcode assembly instructions are probably atomic as well, but they might not be. In my experience so far on x86, setting or reading a 32-bit integer is atomic. You can guess whether a line of C code could be atomic by looking at the generated assembly code.
The above code was compiled in 32-bit mode. You can see that the long long takes two opcodes to load, so I'm guessing that isn't atomic. The ints and longs take one opcode to set: probably atomic. y++ is implemented with addl, which is probably atomic. I keep saying probably because the microcode on the CPU could use more than one micro-operation to implement an instruction, and knowledge of that is above my pay grade. I assume that all 32-bit writes and reads are atomic, and that increments are not, because they are generally performed as a read followed by a write.
But check this out, when compiled in 64-bit mode:
! int x = 10;
main+11: movl $0xa,-0x14(%rbp)
! long y = 10;
main+18: movq $0xa,-0x10(%rbp)
! long long z = 10;
main+26: movq $0xa,-0x8(%rbp)
! y++;
main+34: addq $0x1,-0x10(%rbp)
! z = z + 10;
main+39: addq $0xa,-0x8(%rbp)
!}
!int main(int argc, char** argv)
!{
main+0: push %rbp
main+1: mov %rsp,%rbp
main+4: mov %edi,-0x24(%rbp)
main+7: mov %rsi,-0x30(%rbp)
! test();
! return (0);
main()
main+44: mov $0x0,%eax
!}
main+49: leaveq
main+50: retq
I'm guessing that addq could be atomic.
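For reference, a minimal sketch of the guaranteed route via the GCC atomic builtins mentioned above (the counter and function names are made up):

long long counter = 0;

void safe_increment(void)
{
    // GCC documents __sync_fetch_and_add as an atomic read-modify-write,
    // even for a 64-bit counter like the one that took two opcodes above
    // (on targets that support 64-bit atomics).
    __sync_fetch_and_add(&counter, 1);
}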
Most statements aren't atomic. Your thread may be interrupted in the middle of a ++i operation. The rule is that nothing is atomic unless it's specifically, explicitly defined as being atomic.
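A minimal sketch of making such an increment explicitly atomic with C++11's std::atomic (names are hypothetical):

#include <atomic>

std::atomic<int> counter{0};

void worker()
{
    ++counter;  // a single atomic read-modify-write; safe from many threads
}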
Related
I know that C++ compilers optimize empty (static) functions.
Based on that knowledge, I wrote a piece of code that should get optimized away whenever some identifier is defined (using the -D option of the compiler).
Consider the following dummy example:
#include <iostream>
#ifdef NO_INC
struct T {
static inline void inc(int& v, int i) {}
};
#else
struct T {
static inline void inc(int& v, int i) {
v += i;
}
};
#endif
int main(int argc, char* argv[]) {
int a = 42;
for (int i = 0; i < argc; ++i)
T::inc(a, i);
std::cout << a;
}
The desired behavior would be the following:
Whenever the NO_INC identifier is defined (using -DNO_INC when compiling), all calls to T::inc(...) should be optimized away (due to the empty function body). Otherwise, the call to T::inc(...) should trigger an increment by some given value i.
I have three questions regarding this:
Is my assumption correct that calls to T::inc(...) do not affect the performance negatively when I specify the -DNO_INC option, because the call to the empty function is optimized away?
If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
I wonder if the variables (a and i) are still loaded into the cache when T::inc(a, i) is called (assuming they are not there yet) although the function body is empty.
Thanks for any advice!
Compiler Explorer is a very useful tool for looking at the assembly of your generated program; there is no other way to know for sure whether the compiler optimized something away. Demo.
With actually incrementing, your main looks like:
main: # #main
push rax
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea esi, [rcx + rdi]
add esi, 41
jmp .LBB0_3
.LBB0_1:
mov esi, 42
.LBB0_3:
mov edi, offset std::cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
As you can see, the compiler completely inlined the call to T::inc and does the incrementing directly.
For an empty T::inc you get:
main: # #main
push rax
mov edi, offset std::cout
mov esi, 42
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
The compiler optimized away the entire loop!
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
Yes.
If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
No, for some definition of "complex". Compilers use heuristics to determine whether it's worth inlining a function, and base their decision on that and nothing else.
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
No, as demonstrated above, the loop doesn't even exist.
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized? If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
You are right. I have modified your example (i.e. removed cout which clutters the assembly) in compiler explorer to make it more obvious what happens.
The compiler optimizes everything away and outputs
main: # #main
movl $42, %eax
retq
Only 42 is loaded into eax and returned.
For the more complex case, however, more instructions are needed to compute the return value. See here
main: # #main
testl %edi, %edi
jle .LBB0_1
leal -1(%rdi), %eax
leal -2(%rdi), %ecx
imulq %rax, %rcx
shrq %rcx
leal (%rcx,%rdi), %eax
addl $41, %eax
retq
.LBB0_1:
movl $42, %eax
retq
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
They are only loaded when the compiler cannot reason that they are unused. See the second example on Compiler Explorer.
By the way: you do not need to make an instance of T (i.e. T t;) in order to call a static function within a class; that would defeat the purpose. Call it like T::inc(...) rather than t.inc(...).
Because the inline keyword is used, you can safely assume 1: using these functions shouldn't negatively affect performance.
Running your code through
g++ -c -Os -g
objdump -S
confirms this; an extract:
int main(int argc, char* argv[]) {
T t;
int a = 42;
1020: b8 2a 00 00 00 mov $0x2a,%eax
for (int i = 0; i < argc; ++i)
1025: 31 d2 xor %edx,%edx
1027: 39 fa cmp %edi,%edx
1029: 7d 06 jge 1031 <main+0x11>
v += i;
102b: 01 d0 add %edx,%eax
for (int i = 0; i < argc; ++i)
102d: ff c2 inc %edx
102f: eb f6 jmp 1027 <main+0x7>
t.inc(a, i);
return a;
}
1031: c3 retq
(I replaced the cout with return for better readability)
Suppose I have a number of statements that I want to execute in a fixed order. I want to use g++ with optimization level 2, so some statements could be reordered. What tools does one have to enforce a certain ordering of statements?
Consider the following example.
using Clock = std::chrono::high_resolution_clock;
auto t1 = Clock::now(); // Statement 1
foo(); // Statement 2
auto t2 = Clock::now(); // Statement 3
auto elapsedTime = t2 - t1;
In this example it is important that statements 1-3 are executed in the given order. However, can't the compiler decide that statement 2 is independent of 1 and 3, and execute the code as follows?
using Clock=std::chrono::high_resolution_clock;
foo(); // Statement 2
auto t1 = Clock::now(); // Statement 1
auto t2 = Clock::now(); // Statement 3
auto elapsedTime = t2 - t1;
I'd like to try to provide a somewhat more comprehensive answer after this was discussed with the C++ standards committee. In addition to being a member of the C++ committee, I'm also a developer on the LLVM and Clang compilers.
Fundamentally, there is no way to prevent these transformations with a barrier or some other operation in the sequence. The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation. It can simulate them, it knows they cannot be observed by correct programs, and it is always free to move them around.
We could try to prevent this, but it would have extremely negative results and would ultimately fail.
First, the only way to prevent this in the compiler is to tell it that all of these basic operations are observable. The problem is that this then would preclude the overwhelming majority of compiler optimizations. Inside the compiler, we have essentially no good mechanisms to model that the timing is observable but nothing else. We don't even have a good model of what operations take time. As an example, does converting a 32-bit unsigned integer to a 64-bit unsigned integer take time? It takes zero time on x86-64, but on other architectures it takes non-zero time. There is no generically correct answer here.
But even if we succeed through some heroics at preventing the compiler from reordering these operations, there is no guarantee this will be enough. Consider a valid and conforming way to execute your C++ program on an x86 machine: DynamoRIO. This is a system that dynamically evaluates the machine code of the program. One thing it can do is online optimizations, and it is even capable of speculatively executing the entire range of basic arithmetic instructions outside of the timing. And this behavior isn't unique to dynamic evaluators; the actual x86 CPU will also speculate (a much smaller number of) instructions and reorder them dynamically.
The essential realization is that the fact that arithmetic isn't observable (even at the timing level) is something that permeates the layers of the computer. It is true for the compiler, the runtime, and often even the hardware. Forcing it to be observable would dramatically constrain the compiler, and it would just as dramatically constrain the hardware.
But all of this should not cause you to lose hope. When you want to time the execution of basic mathematical operations, we have well studied techniques that work reliably. Typically these are used when doing micro-benchmarking. I gave a talk about this at CppCon2015: https://youtu.be/nXaxk27zwlk
The techniques shown there are also provided by various micro-benchmark libraries such as Google's: https://github.com/google/benchmark#preventing-optimization
The key to these techniques is to focus on the data. You make the input to the computation opaque to the optimizer and the result of the computation opaque to the optimizer. Once you've done that, you can time it reliably. Let's look at a realistic version of the example in the original question, but with the definition of foo fully visible to the implementation. I've also extracted a (non-portable) version of DoNotOptimize from the Google Benchmark library which you can find here: https://github.com/google/benchmark/blob/v1.0.0/include/benchmark/benchmark_api.h#L208
#include <chrono>
template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
asm volatile("" : "+m"(const_cast<T &>(value)));
}
// The compiler has full knowledge of the implementation.
static int foo(int x) { return x * 2; }
auto time_foo() {
using Clock = std::chrono::high_resolution_clock;
auto input = 42;
auto t1 = Clock::now(); // Statement 1
DoNotOptimize(input);
auto output = foo(input); // Statement 2
DoNotOptimize(output);
auto t2 = Clock::now(); // Statement 3
return t2 - t1;
}
Here we ensure that the input data and the output data are marked as un-optimizable around the computation foo, and only around those markers are the timings computed. Because you are using data to pincer the computation, it is guaranteed to stay between the two timings and yet the computation itself is allowed to be optimized. The resulting x86-64 assembly generated by a recent build of Clang/LLVM is:
% ./bin/clang++ -std=c++14 -c -S -o - so.cpp -O3
.text
.file "so.cpp"
.globl _Z8time_foov
.p2align 4, 0x90
.type _Z8time_foov,#function
_Z8time_foov: # #_Z8time_foov
.cfi_startproc
# BB#0: # %entry
pushq %rbx
.Ltmp0:
.cfi_def_cfa_offset 16
subq $16, %rsp
.Ltmp1:
.cfi_def_cfa_offset 32
.Ltmp2:
.cfi_offset %rbx, -16
movl $42, 8(%rsp)
callq _ZNSt6chrono3_V212system_clock3nowEv
movq %rax, %rbx
#APP
#NO_APP
movl 8(%rsp), %eax
addl %eax, %eax # This is "foo"!
movl %eax, 12(%rsp)
#APP
#NO_APP
callq _ZNSt6chrono3_V212system_clock3nowEv
subq %rbx, %rax
addq $16, %rsp
popq %rbx
retq
.Lfunc_end0:
.size _Z8time_foov, .Lfunc_end0-_Z8time_foov
.cfi_endproc
.ident "clang version 3.9.0 (trunk 273389) (llvm/trunk 273380)"
.section ".note.GNU-stack","",#progbits
Here you can see the compiler optimizing the call to foo(input) down to a single instruction, addl %eax, %eax, but without moving it outside of the timing or eliminating it entirely despite the constant input.
Hope this helps, and the C++ standards committee is looking at the possibility of standardizing APIs similar to DoNotOptimize here.
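For completeness, a hypothetical driver showing how the duration returned by time_foo above might be consumed (this driver is not part of the original answer):

#include <iostream>

int main() {
    auto elapsed = time_foo();
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count()
              << " ns\n";
}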
Summary:
There seems to be no guaranteed way to prevent reordering, but as long as link-time/full-program optimisation is not enabled, locating the called function in a separate compilation unit seems a fairly good bet. (At least with GCC, although logic would suggest that this is likely with other compilers too.) This comes at the cost of the function call - inlined code is by definition in the same compilation unit and open to reordering.
Original answer:
GCC reorders the calls under -O2 optimisation:
#include <chrono>
static int foo(int x) // 'static' or not here doesn't affect ordering.
{
return x*2;
}
int fred(int x)
{
auto t1 = std::chrono::high_resolution_clock::now();
int y = foo(x);
auto t2 = std::chrono::high_resolution_clock::now();
return y;
}
GCC 5.3.0:
g++ -S --std=c++11 -O0 fred.cpp :
_ZL3fooi:
pushq %rbp
movq %rsp, %rbp
movl %ecx, 16(%rbp)
movl 16(%rbp), %eax
addl %eax, %eax
popq %rbp
ret
_Z4fredi:
pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
movl %ecx, 16(%rbp)
call _ZNSt6chrono3_V212system_clock3nowEv
movq %rax, -16(%rbp)
movl 16(%rbp), %ecx
call _ZL3fooi
movl %eax, -4(%rbp)
call _ZNSt6chrono3_V212system_clock3nowEv
movq %rax, -32(%rbp)
movl -4(%rbp), %eax
addq $64, %rsp
popq %rbp
ret
But:
g++ -S --std=c++11 -O2 fred.cpp :
_Z4fredi:
pushq %rbx
subq $32, %rsp
movl %ecx, %ebx
call _ZNSt6chrono3_V212system_clock3nowEv
call _ZNSt6chrono3_V212system_clock3nowEv
leal (%rbx,%rbx), %eax
addq $32, %rsp
popq %rbx
ret
Now, with foo() as an extern function:
#include <chrono>
int foo(int x);
int fred(int x)
{
auto t1 = std::chrono::high_resolution_clock::now();
int y = foo(x);
auto t2 = std::chrono::high_resolution_clock::now();
return y;
}
g++ -S --std=c++11 -O2 fred.cpp :
_Z4fredi:
pushq %rbx
subq $32, %rsp
movl %ecx, %ebx
call _ZNSt6chrono3_V212system_clock3nowEv
movl %ebx, %ecx
call _Z3fooi
movl %eax, %ebx
call _ZNSt6chrono3_V212system_clock3nowEv
movl %ebx, %eax
addq $32, %rsp
popq %rbx
ret
BUT, if this is linked with -flto (link-time optimisation):
0000000100401710 <main>:
100401710: 53 push %rbx
100401711: 48 83 ec 20 sub $0x20,%rsp
100401715: 89 cb mov %ecx,%ebx
100401717: e8 e4 ff ff ff callq 100401700 <__main>
10040171c: e8 bf f9 ff ff callq 1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
100401721: e8 ba f9 ff ff callq 1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
100401726: 8d 04 1b lea (%rbx,%rbx,1),%eax
100401729: 48 83 c4 20 add $0x20,%rsp
10040172d: 5b pop %rbx
10040172e: c3 retq
Reordering may be done by the compiler, or by the processor.
Most compilers offer a platform-specific method to prevent reordering of read-write instructions. On gcc, this is
asm volatile("" ::: "memory");
(More information here)
Note that this only directly prevents the reordering of reads and writes; other operations are kept in order only indirectly, insofar as they depend on those reads / writes.
In practice I haven't yet seen a system where the system call in Clock::now() does have the same effect as such a barrier. You could inspect the resulting assembly to be sure.
It is not uncommon, however, that the function under test gets evaluated during compile time. To enforce "realistic" execution, you may need to derive input for foo() from I/O or a volatile read.
Another option would be to disable inlining for foo() - again, this is compiler specific and usually not portable, but would have the same effect.
On gcc, this would be __attribute__ ((noinline))
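Putting those pieces together, a hedged sketch (foo's body is a stand-in; note that the barrier itself only constrains memory accesses, so the opaque noinline call and the volatile read do most of the work):

#include <chrono>

volatile int input = 42;  // derive the input from a volatile read, as suggested above

__attribute__((noinline)) int foo(int x)
{
    return x * 2;  // stand-in body
}

auto time_it()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    asm volatile("" ::: "memory");  // compiler-level read/write barrier
    int y = foo(input);             // opaque, non-inlined call
    asm volatile("" ::: "memory");  // keep the call between the timestamps
    auto t2 = std::chrono::high_resolution_clock::now();
    (void)y;                        // real code would store or print y
    return t2 - t1;
}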
@Ruslan brings up a fundamental issue: How realistic is this measurement?
Execution time is affected by many factors: one is the actual hardware we are running on; another is concurrent access to shared resources like cache, memory, disk and CPU cores.
So what we usually do to get comparable timings is make sure they are reproducible with a low error margin. This makes them somewhat artificial.
"Hot cache" vs. "cold cache" execution performance can easily differ by an order of magnitude, but in reality it will be something in between ("lukewarm"?).
The C++ language defines what is observable in a number of ways.
If foo() does nothing observable, then it can be eliminated completely. If foo() only does a computation that stores values in "local" state (be it on the stack or in an object somewhere), and the compiler can prove that no safely-derived pointer can get into the Clock::now() code, then there are no observable consequences to moving the Clock::now() calls.
If foo() interacted with a file or the display, and the compiler cannot prove that Clock::now() does not interact with the file or the display, then reordering cannot be done, because interaction with a file or display is observable behavior.
While you can use compiler-specific hacks to force code not to move around (like inline assembly), another approach is to attempt to outsmart your compiler.
Create a dynamically loaded library. Load it prior to the code in question.
That library exposes one thing:
namespace details {
void execute( void(*)(void*), void *);
}
and wraps it like this:
#include <utility>  // for std::forward

template<class F>
void execute( F f ) {
  // Bundle the callable so it can be passed through a void* interface.
  struct bundle_t {
    F f;
  } bundle = {std::forward<F>(f)};
  auto tmp_f = [](void* ptr)->void {
    auto* pb = static_cast<bundle_t*>(ptr);
    (pb->f)();
  };
  details::execute( tmp_f, &bundle );
}
which packs up a nullary lambda and uses the dynamic library to run it in a context that the compiler cannot understand.
Inside the dynamic library, we do:
void details::execute( void(*f)(void*), void *p) {
f(p);
}
which is pretty simple.
Now, for the compiler to reorder the calls to execute, it must understand the dynamic library, which it cannot do while compiling your test code.
It can still eliminate foo()s with zero side effects, but you win some, you lose some.
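Usage would then look something like this (a sketch; foo, x, y and Clock are as in the question):

auto t1 = Clock::now();
execute([&]{ y = foo(x); });  // opaque to the optimizer at compile time
auto t2 = Clock::now();
auto elapsedTime = t2 - t1;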
No it can't. According to the C++ standard [intro.execution]:
14 Every value computation and side effect associated with a
full-expression is sequenced before every value computation and side
effect associated with the next full-expression to be evaluated.
A full-expression is basically a statement terminated by a semicolon. As you can see, the above rule stipulates that statements must be executed in order. It is within statements that the compiler is allowed more free rein (i.e. it is under some circumstances allowed to evaluate the expressions that make up a statement in orders other than left-to-right or anything else specific).
Note the conditions for the as-if rule to apply are not met here. It is unreasonable to think that any compiler would be able to prove that reordering calls to get the system time would not affect observable program behaviour. If there was a circumstance in which two calls to get the time could be reordered without changing observed behaviour, it would be extremely inefficient to actually produce a compiler that analyses a program with enough understanding to be able to infer this with certainty.
No.
Sometimes, by the "as-if" rule, statements may be re-ordered. This is not because they are logically independent of each other, but because that independence allows such a re-ordering to occur without changing the semantics of the program.
Moving a system call that obtains the current time obviously does not satisfy that condition. A compiler that knowingly or unknowingly does so is non-compliant and really silly.
In general, I wouldn't expect any expression that results in a system call to be "second-guessed" by even an aggressively optimizing compiler. It just doesn't know enough about what that system call does.
noinline function + inline assembly black box + full data dependencies
This is based on https://stackoverflow.com/a/38025837/895245 but because I didn't see any clear justification of why the ::now() cannot be reordered there, I would rather be paranoid and put it inside a noinline function together with the asm.
This way I'm pretty sure the reordering cannot happen, since the noinline ties the ::now() to the data dependency.
main.cpp
#include <chrono>
#include <iostream>
#include <string>
// noinline ensures that the ::now() cannot be split from the __asm__
template <class T>
__attribute__((noinline)) auto get_clock(T& value) {
// Make the compiler think we actually use / modify the value.
// It can't "see" what is going on inside the assembly string.
__asm__ __volatile__ ("" : "+g" (value));
return std::chrono::high_resolution_clock::now();
}
template <class T>
static T foo(T niters) {
T result = 42;
for (T i = 0; i < niters; ++i) {
result = (result * result) - (3 * result) + 1;
}
return result;
}
int main(int argc, char **argv) {
unsigned long long input;
if (argc > 1) {
input = std::stoull(argv[1], NULL, 0);
} else {
input = 1;
}
// Must come before because it could modify input
// which is passed as a reference.
auto t1 = get_clock(input);
auto output = foo(input);
// Must come after as it could use the output.
auto t2 = get_clock(output);
std::cout << "output " << output << std::endl;
std::cout << "time (ns) "
<< std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
<< std::endl;
}
GitHub upstream.
Compile and run:
g++ -ggdb3 -O3 -std=c++14 -Wall -Wextra -pedantic -o main.out main.cpp
./main.out 1000
./main.out 10000
./main.out 100000
The only minor downside of this method is that we add one extra callq instruction over an inline method. objdump -CD shows that main contains:
11b5: e8 26 03 00 00 callq 14e0 <auto get_clock<unsigned long long>(unsigned long long&)>
11ba: 48 8b 34 24 mov (%rsp),%rsi
11be: 48 89 c5 mov %rax,%rbp
11c1: b8 2a 00 00 00 mov $0x2a,%eax
11c6: 48 85 f6 test %rsi,%rsi
11c9: 74 1a je 11e5 <main+0x65>
11cb: 31 d2 xor %edx,%edx
11cd: 0f 1f 00 nopl (%rax)
11d0: 48 8d 48 fd lea -0x3(%rax),%rcx
11d4: 48 83 c2 01 add $0x1,%rdx
11d8: 48 0f af c1 imul %rcx,%rax
11dc: 48 83 c0 01 add $0x1,%rax
11e0: 48 39 d6 cmp %rdx,%rsi
11e3: 75 eb jne 11d0 <main+0x50>
11e5: 48 89 df mov %rbx,%rdi
11e8: 48 89 44 24 08 mov %rax,0x8(%rsp)
11ed: e8 ee 02 00 00 callq 14e0 <auto get_clock<unsigned long long>(unsigned long long&)>
so we see that foo was inlined, but get_clock was not, and the calls to it surround foo.
get_clock itself, however, is extremely efficient, consisting of a single tail-call-optimized instruction that doesn't even touch the stack:
00000000000014e0 <auto get_clock<unsigned long long>(unsigned long long&)>:
14e0: e9 5b fb ff ff jmpq 1040 <std::chrono::_V2::system_clock::now()#plt>
Since the clock precision is itself limited, I think it is unlikely that you will be able to notice the timing effects of one extra jmpq. Note that one call is required regardless, since ::now() lives in a shared library.
Call ::now() from inline assembly with a data dependency
This would be the most efficient solution possible, overcoming even the extra jmpq mentioned above.
This is unfortunately extremely hard to do correctly as shown at: Calling printf in extended inline ASM
If your time measurement can be done directly in inline assembly without a call however, then this technique can be used. This is the case for example for gem5 magic instrumentation instructions, x86 RDTSC (not sure if this is representative anymore) and possibly other performance counters.
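For the RDTSC case, a hedged sketch of such a reading on x86-64 with GNU inline assembly (RDTSC is not serializing, so careful measurements usually pair it with a fence such as lfence or cpuid):

#include <cstdint>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    // RDTSC returns the low 32 bits of the timestamp counter in EAX
    // and the high 32 bits in EDX.
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}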
Related threads:
Is it legal for a C++ optimizer to reorder calls to clock()?
Tested with GCC 8.3.0, Ubuntu 19.04.
Consider this code:
struct A{
volatile int x;
A() : x(12){
}
};
A foo(){
A ret;
//Do stuff
return ret;
}
int main()
{
A a;
a.x = 13;
a = foo();
}
Using g++ -std=c++14 -pedantic -O3 I get this assembly:
foo():
movl $12, %eax
ret
main:
xorl %eax, %eax
ret
According to my estimation the variable x should be written to at least three times (possibly four), yet it is not written even once (the function foo isn't even called!).
Even worse, when you add the inline keyword to foo, this is the result:
main:
xorl %eax, %eax
ret
I thought that volatile means that every single read or write must happen, even if the compiler cannot see the point of the read/write.
What is going on here?
Update:
Putting the declaration of A a; outside main like this:
A a;
int main()
{
a.x = 13;
a = foo();
}
Generates this code:
foo():
movl $12, %eax
ret
main:
movl $13, a(%rip)
xorl %eax, %eax
movl $12, a(%rip)
ret
a:
.zero 4
Which is closer to what you would expect... I am even more confused than ever.
Visual C++ 2015 does not optimize away the assignments:
A a;
mov dword ptr [rsp+8],0Ch <-- write 1
a.x = 13;
mov dword ptr [a],0Dh <-- write 2
a = foo();
mov dword ptr [a],0Ch <-- write 3
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
}
xor eax,eax
ret
The same happens both with /O2 (Maximize speed) and /Ox (Full optimization).
The volatile writes are kept also by gcc 3.4.4 using both -O2 and -O3
_main:
pushl %ebp
movl $16, %eax
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
call __alloca
call ___main
movl $12, -4(%ebp) <-- write1
xorl %eax, %eax
movl $13, -4(%ebp) <-- write2
movl $12, -8(%ebp) <-- write3
leave
ret
Using both of these compilers, if I remove the volatile keyword, main() becomes essentially empty.
I'd say you have a case where the compiler over-aggressively (and incorrectly, IMHO) decides that since 'a' is not used, operations on it aren't necessary, and it overlooks the volatile member. Making 'a' itself volatile could get you what you want, but as I don't have a compiler that reproduces this, I can't say for sure.
Also (while this is admittedly Microsoft-specific), https://msdn.microsoft.com/en-us/library/12a04hfd.aspx says:
If a struct member is marked as volatile, then volatile is propagated to the whole structure.
Which also points towards the behavior you are seeing being a compiler problem.
Last, if you make 'a' a global variable, it is somewhat understandable that the compiler is less eager to deem it unused and drop it. Global variables are extern by default, so it is not possible to say that a global 'a' is unused just by looking at the main function; some other compilation unit (.cpp file) might be using it.
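A minimal sketch of the "make 'a' itself volatile" suggestion (untested against the compilers discussed here):

volatile A a;  // the whole object is now volatile, not just the member

int main()
{
    a.x = 13;      // an access through a volatile object: must not be elided
    // a = foo();  // note: this would no longer compile, because the
                   // implicitly-declared operator= is not volatile-qualified
    return 0;
}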
GCC's page on Volatile access gives some insight into how it works:
The standard encourages compilers to refrain from optimizations concerning accesses to volatile objects, but leaves it implementation defined as to what constitutes a volatile access. The minimum requirement is that at a sequence point all previous accesses to volatile objects have stabilized and no subsequent accesses have occurred. Thus an implementation is free to reorder and combine volatile accesses that occur between sequence points, but cannot do so for accesses across a sequence point. The use of volatile does not allow you to violate the restriction on updating objects multiple times between two sequence points.
In C standardese:
§5.1.2.3
2 Accessing a volatile object, modifying an object, modifying a file,
or calling a function that does any of those operations are all side
effects, 11) which are changes in the state of the
execution environment. Evaluation of an expression may produce side
effects. At certain specified points in the execution sequence called
sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have
taken place. (A summary of the sequence points is given in annex C.)
3 In the abstract machine, all expressions are evaluated as specified
by the semantics. An actual implementation need not evaluate part of
an expression if it can deduce that its value is not used and that no
needed side effects are produced (including any caused by calling a
function or accessing a volatile object).
[...]
5 The least requirements on a conforming implementation are:
At sequence points, volatile objects are stable in the sense that previous accesses are complete and subsequent accesses have not yet
occurred. [...]
I chose the C standard because the language is simpler but the rules are essentially the same in C++. See the "as-if" rule.
Now on my machine, -O1 doesn't optimize away the call to foo(), so let's use -fdump-tree-optimized to see the difference:
-O1
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
a = foo ();
a ={v} {CLOBBER};
return 0;
}
And -O3:
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A ret;
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
ret.x ={v} 12;
ret ={v} {CLOBBER};
a ={v} {CLOBBER};
return 0;
}
gdb reveals in both cases that a is ultimately optimized out, but we're worried about foo(). The dumps show us that GCC reordered the accesses so that foo() is not even necessary and subsequently all of the code in main() is optimized out. Is this really true? Let's see the assembly output for -O1:
foo():
mov eax, 12
ret
main:
call foo()
mov eax, 0
ret
This essentially confirms what I said above. Everything is optimized out: the only difference is whether or not the call to foo() is as well.
Let's take this for example:
class TestClass {
public:
int functionInline();
int functionComplex();
};
inline int TestClass::functionInline()
{
// a single instruction
return functionComplex();
}
int TestClas::functionComplex()
{
/* many complex
instructions
*/
}
void myFunc()
{
TestClass testVar;
testVar.functionInline();
}
Supposing that all comments are in fact lines of code, whether a single line or many complex lines, the equivalent code (after compilation) would be:
void myFunc()
{
TestClass testVar;
// a single instruction
return functionComplex();
}
or would be:
void myFunc()
{
TestClass testVar;
// a single instruction
/* many complex
instructions
*/
}
In other words, would a normal function be inlined if it is called inside an inline function, or not?
If the compiler can see that the function is not called anywhere else (e.g. it is static in the case of a free function), then at least gcc has inlined it for a long time.
Of course, this also assumes the compiler can actually "see" the source code of the function - only if you use "whole program optimisation" (available in at least MS and GCC compilers), does it inline functions that aren't either in the source file or headers included in the source.
Obviously, inlining a "large" function has very little benefit (because the overhead of making the call is such a small portion of the total runtime), and if the function gets called more than once (or "may be called more than once" by not being static), the compiler will almost certainly not inline a "large" function.
In summary: maybe the large function is inline, but quite likely not.
Please check the assembly code that I generated both for VC++ 2010 and g++.
Neither compiler actually treats any of the functions as inline in this example.
Code:
class TestClass {
public:
int functionInline();
int functionComplex();
};
inline int TestClass::functionInline()
{
// a single instruction
return functionComplex();
}
int TestClass::functionComplex()
{
/* many complex
instructions
*/
return 0;
}
int main(){
TestClass t;
t.functionInline();
return 0;
}
VC++ 2010:
int main(){
01372E50 push ebp
01372E51 mov ebp,esp
01372E53 sub esp,0CCh
01372E59 push ebx
01372E5A push esi
01372E5B push edi
01372E5C lea edi,[ebp-0CCh]
01372E62 mov ecx,33h
01372E67 mov eax,0CCCCCCCCh
01372E6C rep stos dword ptr es:[edi]
TestClass t;
t.functionInline();
01372E6E lea ecx,[t]
01372E71 call TestClass::functionInline (1371677h)
return 0;
01372E76 xor eax,eax
}
Linux G++:
main:
.LFB3:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
leaq -1(%rbp), %rax
movq %rax, %rdi
call _ZN9TestClass14functionInlineEv
movl $0, %eax
leave
ret
.cfi_endproc
Both the lines
01372E71 call TestClass::functionInline (1371677h)
and
call _ZN9TestClass14functionInlineEv
indicate that the function functionInline is not inline.
Now have a look at functionInline assembly:
inline int TestClass::functionInline()
{
01372E00 push ebp
01372E01 mov ebp,esp
01372E03 sub esp,0CCh
01372E09 push ebx
01372E0A push esi
01372E0B push edi
01372E0C push ecx
01372E0D lea edi,[ebp-0CCh]
01372E13 mov ecx,33h
01372E18 mov eax,0CCCCCCCCh
01372E1D rep stos dword ptr es:[edi]
01372E1F pop ecx
01372E20 mov dword ptr [ebp-8],ecx
// a single instruction
return functionComplex();
01372E23 mov ecx,dword ptr [this]
01372E26 call TestClass::functionComplex (1371627h)
}
Hence, functionComplex is not inlined either.
No, it would not be inlined. It is impossible because the compiler does not have the body of the non-inlined function available; it could be located in another translation unit. (I take "normal function" to mean a non-inlined function.)
No: if you want your complex function to be inserted inline, you must specify the inline keyword too.
In practice, use the __forceinline keyword on Windows (or __attribute__((always_inline)) with GCC on Linux); otherwise the compiler may ignore the hint if there are a lot of instructions.
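A hedged sketch of the GCC spelling (the function name is made up; MSVC would put __forceinline in front of the declaration instead):

// Ask GCC to inline this regardless of its size heuristics.
__attribute__((always_inline)) inline int functionComplex2(int x)
{
    /* many complex instructions */
    return x * 2;  // stand-in body
}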
First of all, inline is just a directive to the compiler. It is not guaranteed that the compiler will do the inlining.
Secondly, when you specify a function as inline, it tells the compiler two things:
1) The function might be a candidate for inlining. Whether it is actually going to be inlined is not guaranteed.
2) Its definition may appear in multiple translation units, and the definition must be visible in every translation unit where the function is called. This holds irrespective of whether the function is actually inlined or not.
In your case, functionInline is specified as inline but functionComplex is not. functionComplex has external linkage and its body need not be visible at the call site, so a compiler will generally not inline it across translation units (link-time optimization being the exception).
So a plain answer to your question is "no": a normal function (one without the inline keyword, defined outside the class) will generally not be inlined.
To get a better understanding of binary files, I've prepared a small C++ example and used gdb to disassemble it and look at the machine code.
The main() function calls the function func():
int func(void)
{
int a;
int b;
int c;
int d;
d = 4;
c = 3;
b = 2;
a = 1;
return 0;
}
The project is compiled with g++, keeping the debugging information. Next, gdb is used to disassemble the compiled code. What I got for func() looks like this:
0x00000000004004cc <+0>: push %rbp
0x00000000004004cd <+1>: mov %rsp,%rbp
0x00000000004004d0 <+4>: movl $0x4,-0x10(%rbp)
0x00000000004004d7 <+11>: movl $0x3,-0xc(%rbp)
0x00000000004004de <+18>: movl $0x2,-0x8(%rbp)
0x00000000004004e5 <+25>: movl $0x1,-0x4(%rbp)
0x00000000004004ec <+32>: mov $0x0,%eax
0x00000000004004f1 <+37>: pop %rbp
0x00000000004004f2 <+38>: retq
Now my problem is that I expected the stack pointer to be moved by 16 bytes towards lower addresses relative to the base pointer, since each integer value needs 4 bytes. But it looks like the values are put on the stack without the stack pointer being moved.
What did I not understand correctly? Is this a problem with the compiler, or did the assembler omit some lines?
There's absolutely no problem with your compiler. The compiler is free to choose how to compile your code, and it chose not to modify the stack pointer. There's no need for it to do so since your function doesn't call any other functions. If it did call another function then it would need to create another stack frame so that the callee did not stomp on the caller's stack frame.
As a general rule, you should avoid making assumptions about how the compiler will compile your code. For example, your compiler would be perfectly at liberty to optimize away the body of your function entirely.
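For contrast, a hedged sketch of a variant that does call another function (helper is hypothetical; the exact code generated varies by compiler and ABI):

int helper(int x);  // defined in some other translation unit

int func2(void)
{
    int a = 1;
    // Because func2 now contains a call, the compiler generally has to
    // carve out a real stack frame (e.g. sub $0x10,%rsp) so that the
    // callee cannot clobber func2's locals.
    return helper(a);
}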