Function pointer runs faster than inline function. Why? - c++

I ran a benchmark of mine on my computer (Intel i3-3220 # 3.3GHz, Fedora 18), and got very unexpected results. A function pointer was actually a bit faster than an inline function.
Code:
#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{
std::chrono::duration<double> t;
int total=0;
for(int i=0;i<10000000;i++)
{
auto begin=std::chrono::high_resolution_clock::now();
short a=toBigEndian((short)i);//toBigEndianPtr((short)i);
total+=a;
auto end=std::chrono::high_resolution_clock::now();
t+=std::chrono::duration_cast<std::chrono::duration<double>>(end-begin);
}
std::cout<<t.count()<<", "<<total<<std::endl;
return 0;
}
compiled with
g++ test.cpp -std=c++0x -O0
The 'toBigEndian' loop finishes always at around 0.26-0.27 seconds, while 'toBigEndianPtr' takes 0.21-0.22 seconds.
What makes this even more odd is that when I remove 'total', the function pointer becomes the slower one at 0.35-0.37 seconds, while the inline function is at about 0.27-0.28 seconds.
My question is:
Why is the function pointer faster than the inline function when 'total' exists?

Short answer: it isn't.
You compile with -O0, wich does not optimize (much). Without optimization, you have no saying in "fast", because unptimized code is not as fast as can be.
You take the address of toBigEndian, wich prevents inlining. inline keyword is a hint for the compiler anyway, wich it may or may not follow. You did the best to not make it follow that hint.
So, to give your measurements any meaning,
optimize your code
use two functions, doing the same thing, one that gets inlined, the other one taken the addres of

A common mistake in measuring performance (besides forgetting to optimize) is to use the wrong tool to measure. Using std::chrono would be fine, if you were measuring the performance of your entire, 10000000 or 500000000 iterations. Instead, you are asking it to measure the call / inline of toBigEndian. A function that is all of 6 instructions. So I switched to rdtsc (read time stamp counter, i.e. clock cycles).
Allowing the compiler to really optimize everything in the loop, not cluttering it with recording the time on every tiny iteration, we have a different code sequence. Now, after compiling with g++ -O3 fp_test.cpp -o fp_test -std=c++11, I observe the desired effect. The inlined version averages around 2.15 cycles per iteration, while the function pointer takes around 7.0 cycles per iteration.
Even without using rdtsc, the difference is still quite observable. The wall clock time was 360ms for the inlined code and 1.17s for the function pointer. So one could use std::chrono in place of rdtsc in this code.
Modified code follows:
#include <iostream>
static inline uint64_t rdtsc(void)
{
uint32_t hi, lo;
asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}
inline short toBigEndian(short i)
{
return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
#define LOOP_COUNT 500000000
int main()
{
uint64_t t = 0, begin=0, end=0;
int total=0;
begin=rdtsc();
for(int i=0;i<LOOP_COUNT;i++)
{
short a=0;
a=toBigEndianPtr((short)i);
//a=toBigEndian((short)i);
total+=a;
}
end=rdtsc();
t+=(end-begin);
std::cout<<((double)t/LOOP_COUNT)<<", "<<total<<std::endl;
return 0;
}

Oh s**t (do I need to censor swearing here?), I found it out. It was somehow related to the timing being inside the loop. When I moved it outside as following,
#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{
int total=0;
auto begin=std::chrono::high_resolution_clock::now();
for(int i=0;i<100000000;i++)
{
short a=toBigEndianPtr((short)i);
total+=a;
}
auto end=std::chrono::high_resolution_clock::now();
std::cout<<std::chrono::duration_cast<std::chrono::duration<double>>(end-begin).count()<<", "<<total<<std::endl;
return 0;
}
the results are just as they should be. 0.08 seconds for inline, 0.20 seconds for pointer. Sorry for bothering you guys.

First off, with -O0, you aren't running the optimizer, which means the compiler is ignoring your request to inline, as it is free to do. The cost of the two different calls ought to be nearly identical. Try with -O2.
Second, if you are only running for 0.22 seconds, weirdly variable costs involved with starting your program totally dominate the cost of running the test function. That function call is just a few instructions. If your CPU is running at 2 GHz, it ought to execute that function call in something like 20 nanoseconds, so you can see that whatever it is you're measuring, it's not the cost of running that function.
Try calling the test function in a loop, say 1,000,000 times. Make the number of loops 10x bigger until it takes > 10 seconds to run the test. Then divide the result by the number of loops for an approximation of the cost of the operation.

With many/most self-respecting modern compilers, the code you posted will still inline the function call even when when it is called through the pointer. (Assuming the compiler makes a reasonable effort to optimize the code). The situation is just too easy to see through. In other words, the generated code can easily end up virtually the same in both cases, meaning that your test is not really useful for measuring what you are trying to measure.
If you really want to make sure the call is physically performed through the pointer, you have to make an effort to "confuse" the compiler to the point where it can't figure out the pointer value at compile time. For example, make the pointer value run-time dependent, as in
toBigEndianPtr = rand() % 1000 != 0 ? toBigEndian : NULL;
or something along these lines. You can also declare your function pointer as volatile, which will typically cause a genuine through-the-pointer call each time as well as force the compiler to re-read the pointer value from memory on each iteration.

Related

Function pointer performance; slower on a single call than multiple calls?

I am interested in the execution speed of a function called through a pointer. I found initially that calling a function pointer through a pointer passed in as a parameter is slower than calling a locally declared function pointer. Please see the following code; you can see I have two function calls, both of which ultimately execute a lambda through a function pointer.
#include <chrono>
#include <iostream>
using namespace std;
__attribute__((noinline)) int plus_one(int x) {
return x + 1;
}
typedef int (*FUNC)(int);
#define OUTPUT_TIME(msg) std::cout << "Execution time (ns) of " << msg << ": " << std::chrono::duration_cast<chrono::nanoseconds>(t_end - t_start).count() << std::endl;
#define START_TIMING() auto const t_start = std::chrono::high_resolution_clock::now();
#define END_TIMING(msg) auto const t_end = std::chrono::high_resolution_clock::now(); OUTPUT_TIME(msg);
auto constexpr g_count = 1000000;
__attribute__((noinline)) int speed_test_no_param() {
int r;
auto local_lambda = [](int a) {
return plus_one(a);
};
FUNC f = local_lambda;
START_TIMING();
for (auto i = 0; i < g_count; ++i)
r = f(100);
END_TIMING("speed_test_no_param");
return r;
}
__attribute__((noinline)) int speed_test_with_param(FUNC &f) {
int r;
START_TIMING();
for (auto i = 0; i < g_count; ++i)
r = f(100);
END_TIMING("speed_test_with_param");
return r;
}
int main() {
int ret = 0;
auto main_lambda = [](int a) {
return plus_one(a);
};
ret += speed_test_no_param();
FUNC fp = main_lambda;
ret += speed_test_with_param(fp);
return ret;
}
Built on Ubuntu 20.04 with:
g++ -ggdb -ffunction-sections -O3 -std=c++17 -DNDEBUG=1 -DRELEASE=1 -c speed_test.cpp -o speed_test.o && g++ -o speed_test -Wl,-gc-sections -Wl,--start-group speed_test.o -Wl,--rpath='$ORIGIN' -Wl,--end-group
The results were not surprising; for any given number of runs, we see that the version without the parameter is clearly the fastest. Here is just one run; all of the many times I have run, this yields the same result:
Execution time (ns) of speed_test_no_param: 74
Execution time (ns) of speed_test_with_param: 1173849
When I dig into the assembly, I found what I believe is the reason for this. The code for speed_test_no_param() is:
0x000055555555534b call 0x555555555310 <plus_one(int)>
... whereas the code for speed_test_with_param is more complicated; a fetch of the address of the lambda, then a jump to the plus_one function:
0x000055555555544e call QWORD PTR [rbx]
...
0x0000555555555324 jmp 0x555555555310 <plus_one(int)>
(On compiler explorer at https://godbolt.org/z/b4hqYx7Eo. Different compiler but similar assembly; timing code commented out.)
What I didn't expect though is that when I reduce the number of calls down to 1 from 1000000 (auto constexpr g_count = 1), the results are flipped with the parameter version being the fastest:
Execution time (ns) of speed_test_no_param: 61
Execution time (ns) of speed_test_with_param: 31
I have also run this many times; the parameter version is always the fastest.
I do not understand why this is; I don't now believe a call through a parameter is slower than a local variable due to this conflicting evidence, but looking at the assembly suggests it really should be.
Can someone please explain?
UPDATE
As per the comment below, ordering matters. When I call speed_test_with_param() first, speed_test_no_param() is the fastest of the two! Yet when I call speed_test_no_param() first, speed_test_with_param() is the fastest! Any explanation to this would be greatly appreciated!
With multiple loop iterations in the C++ source, the fast version is only doing one in asm, because you gave the optimizer enough visibility to prove that's equivalent.
Why ordering matters with just one iteration: probably warm-up effects in the library code for std::chrono. Idiomatic way of performance evaluation?
Can you confirm that my suspicion that the call without the parameter technically should be the fastest, because with the parameter involves a memory read to find the location to call?
Much more significant is whether the compiler can constant-propagate the function pointer and see what function is being called; notice how speed_test_with_param has an actual loop that calls g_count times, but speed_test_no_param can see it's calling plus_one. Clang sees through the local lambda and the noinline to notice it has no side-effects, so it only calls it once.
It doesn't inline, but it still does inter-procedural optimization. With GCC, you could block that by using __attribute__((noipa)). GCC's noclone attribute can also stop it from making a copy of the function with constant-propagation into it, but noipa is I think stronger. noinline isn't sufficient for benchmarking stuff that becomes trivial to optimize when the compiler can see everything. But I don't think clang has anything like that.
You can make functions opaque to the optimizer by putting them in separate source files and not using -flto or other option like gcc -fwhole-program
The only reason store/reload is involved with the function pointer is because you passed it by reference for no reason, even though it's just a single pointer. If you pass it by value (https://godbolt.org/z/WEvvsvoxb) you can see call rbx in the loop.
Apparently clang couldn't hoist the load because it wasn't sure the caller's function-pointer wouldn't be modified by the call, because it was making a stand-alone version of speed_test_with_param that would work with any caller and any arg, not just the one main passes. So constprop didn't happen.
An indirect call can mispredict more easily, and yes store/reload adds a few cycles more latency before the prediction can be checked.
So yes, in general you'd expect it to be slower when the function to be called is a function-pointer arg, not a compile-time-constant fptr initialized within the calling function where the compiler can see the definition of what it's calling even if you artificially limit it.
If it becomes a call some_name instead of call rbx, that's still faster even if it does still have to loop like you were trying to make it.
(Microbenchmarking is hard, especially when you're trying to benchmark a C++ concept which can optimize differently depending on context; you have to know enough about compilers, optimization, and assembly to realize what makes the difference and what you're actually measuring. There isn't a meaningful answer to some questions, like "how fast or slow is the + operator?", even if you limit it to integers, because it can optimize away with constants, or vectorize, or not depending on how it's used.)
You're benchmarking a single iteration, which subjects you to cache effects and other warmup costs. The entire reason we normally run benchmarks several times is to amortize out these kinds of effects.
Caching refers to the memory hierarchy: your actual RAM is significantly slower than your CPU (and disk even more so), so to speed things up your CPU has a cache (often, multiple caches) which stores the most recently accessed bits of memory. The first time you start your program, it will need to be loaded from disk into RAM; thereafter, it will need to be loaded from RAM into the CPU caches. Uncached memory accesses can be orders of magnitudes slower than cached memory accesses. As your program runs, various bits of code and data will be loaded from RAM and cached; hence, subsequent executions of the same bit of code will often be faster than the first execution.
Other effects can include things like lazy dynamic linking and lazy initializations, wherein certain functions will perform extra work the first time they're called (for example, resolving dynamic library loads or initializing static data). These can all contribute to the first iteration being slower than subsequent iterations.
To address these issues, always make sure to run your benchmarks multiple times - and when possible, run your entire benchmark suite a few times in one process and take the lowest (fastest) run.

Tail call optimisation seems to slightly worsen performance

In a quicksort implementation, the data on left is for pure -O2 optimized code, and data on right is -O2 optimized code with -fno-optimize-sibling-calls flag turned on i.e with tail-call optimisation turned off. This is average of 3 different runs, variation seemed negligible. Values were of range 1-1000, time in millisecond. Compiler was MinGW g++, version 6.3.0.
size of array with TLO(ms) without TLO(ms)
8M 35,083 34,051
4M 8,952 8,627
1M 613 609
Below is my code:
#include <bits/stdc++.h>
using namespace std;
int N = 4000000;
void qsort(int* arr,int start=0,int finish=N-1){
if(start>=finish) return ;
int i=start+1,j = finish,temp;
auto pivot = arr[start];
while(i!=j){
while (arr[j]>=pivot && j>i) --j;
while (arr[i]<pivot && i<j) ++i;
if(i==j) break;
temp=arr[i];arr[i]=arr[j];arr[j]=temp; //swap big guy to right side
}
if(arr[i]>=arr[start]) --i;
temp = arr[start];arr[start]=arr[i];arr[i]=temp; //swap pivot
qsort(arr,start,i-1);
qsort(arr,i+1,finish);
}
int main(){
srand(time(NULL));
int* arr = new int[N];
for(int i=0;i<N;i++) {arr[i] = rand()%1000+1;}
auto start = clock();
qsort(arr);
cout<<(clock()-start)<<endl;
return 0;
}
I heard clock() isn't the perfect way to measure time. But this effect seems to be consistent.
EDIT: as response to a comment, I guess my question is : Explain how exactly gcc's tail-call optimizer works and how this happened and what should I do to leverage tail-call to speed up my program?
On speed:
As already pointed out in the comments, the primary goal of tail-call-optimization is to reduce the usage of the stack.
However, often there is a collateral: the program becomes faster because there is no overhead needed for a call of a function. This gain is most prominent if the work in the function itself is not that big, so the overhead has some weight.
If there is a lot of work done during a function call, the overhead can be neglected and there is no noticeable speed-up.
On the other hand, if tail call optimization is done, that means that potentially other optimization cannot be done, which could otherwise speed-up your code.
The case of your quick-sort is not that clear cut: There are some calls with a lot of workload and a lot of calls with a very small work load.
So, for 1M elements there are more disadvantages from tail-call-optimization as advantages. On my machine the tail-call-optimized function becomes faster than the non-optimized function for arrays smaller than 50000 elements.
I must confess, I cannot say, why this is the case alone from looking at the assembly. All I can understand, is that the resulting assemblies are pretty different and that the quicksort is really called once for the optimized version.
There is a clear cut example, for which tail-call-optimization is much faster (because there is not very much happening in the function itself and the overhead is noticeable):
//fib.cpp
#include <iostream>
unsigned long long int fib(unsigned long long int n){
if (n==0 || n==1)
return 1;
return fib(n-1)+fib(n-2);
}
int main(){
unsigned long long int N;
std::cin >> N;
std::cout << fib(N);
}
running time echo "40" | ./fib, I get 1.1 vs. 1.6 seconds for tail-call-optimized version vs. non-optimized version. Actually, I'm pretty impressed, that the compiler is able to use tail-call-optimization here - but it really does, as can be see at godbolt.org, - the second call of fib is optimized.
On tail call optimization:
Usually, tail-call optimization can be done if the recursion call is the last operation (prior to return) in the function - the variables on the stack can be reused for the next call, i.e. the function should be of the form
ResType f( InputType input){
//do work
InputType new_input = ...;
return f(new_input);
}
There are some languages which don't do tail call optimization at all (e.g. python) and some for which you can explicitely ask the compiler to do it and the compiler will fail if it were not able to (e.g. clojure). c++ goes a way in beetween: the compiler tries its best (which is amazingly good!), but you have no guarantee it will succseed and if not, it silently falls to a version without tail-call-optimization.
Let's take look at this simple and standard implementation of tail call recursion:
//should be called fac(n,1)
unsigned long long int
fac(unsigned long long int n, unsigned long long int res_so_far){
if (n==0)
return res_so_far;
return fac(n-1, res_so_far*n);
}
This classical form of tail-call makes it easy for compiler to optimize: see result here - no recursive call to fac!
However, the gcc compiler is able to perform the TCO also for less obvious cases:
unsigned long long int
fac(unsigned long long int n){
if (n==0)
return 1;
return n*fac(n-1);
}
It is easier to read and write for us humans, but harder to optimize for compiler (fun fact: TCO is not performed if the return type would be int instead of unsigned long long int): after all the result from the recursive call is used for further calculations (multiplication) before it is returned. But gcc manages to perform TCO here as well!
At hand of this example, we can see the result of TCO at work:
//factorial.cpp
#include <iostream>
unsigned long long int
fac(unsigned long long int n){
if (n==0)
return 1;
return n*fac(n-1);
}
int main(){
unsigned long long int N;
std::cin >> N;
std::cout << fac(N);
}
Running time echo "40000000" | ./factorial will get you the result (0) in no time if the tail-call-optimization was on, or "Segmentation fault" otherwise - because of the stack-overflow due to recursion depth.
Actually it is a simple test to see whether the tail-call-optimization was performed or not: "Segmentation fault" for non-optimized version and large recursion depth.
Corollary:
As already pointed out in the comments: Only the second call of the quick-sort is optimized via TLO. In you implementation, if you are unlucky and the second half of the array always consist of only one element you will need O(n) space on the stack.
However, if the first call would be always with the smaller half and the second call with the larger half were TLO, you would need at most O(log n) recursion depth and thus only O(log n) space on the stack.
That means you should check for which part of the array you call the quicksort first as it plays a huge role.

Timing of using variables passed by reference and by value in C++

I have decided to compare the times of passing by value and by reference in C++ (g++ 5.4.0) with the following code:
#include <iostream>
#include <sys/time.h>
using namespace std;
int fooVal(int a) {
for (size_t i = 0; i < 1000; ++i) {
++a;
--a;
}
return a;
}
int fooRef(int & a) {
for (size_t i = 0; i < 1000; ++i) {
++a;
--a;
}
return a;
}
int main() {
int a = 0;
struct timeval stop, start;
gettimeofday(&start, NULL);
for (size_t i = 0; i < 10000; ++i) {
fooVal(a);
}
gettimeofday(&stop, NULL);
printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);
gettimeofday(&start, NULL);
for (size_t i = 0; i < 10000; ++i) {
fooRef(a);
}
gettimeofday(&stop, NULL);
printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);
return 0;
}
It was expected that the fooRef execution would take much more time in comparison with fooVal case because of "looking up" referenced value in memory while performing operations inside fooRef. But the result proved to be unexpected for me:
The loop has taken 18446744073708648210 microseconds
The loop has taken 99967 microseconds
And the next time I run the code it can produce something like
The loop has taken 97275 microseconds
The loop has taken 99873 microseconds
Most of the time produced values are close to each other (with fooRef being just a little bit slower), but sometimes outbursts like in the output from the first run can happen (both for fooRef and fooVal loops).
Could you please explain this strange result?
UPD: Optimizations were turned off, O0 level.
If gettimeofday() function relies on operating system clock, this clock is not really designed for dealing with microseconds in an accurate manner. The clock is typically updated periodically and only frequently enough to give the appearance of showing seconds accurately for the purpose of working with date/time values. Sampling at the microsecond level may be unreliable for a benchmark such as the one you are performing.
You should be able to work around this limitation by making your test time much longer; for example, several seconds.
Again, as mentioned in other answers and comments, the effects of which type of memory is accessed (register, cache, main, etc.) and whether or not various optimizations are applied, could substantially impact results.
As with working around the time sampling limitation, you might be able to somewhat work around the memory type and optimization issues by making your test data set much larger such that memory optimizations aimed at smaller blocks of memory are effectively bypassed.
Firstly, you should look at the assembly language to see if there are any differences between passing by reference and passing by value.
Secondly, make the functions equivalent by passing by constant reference. Passing by value says that the original variable won't be changed. Passing by constant reference keeps the same principle.
My belief is that the two techniques should be equivalent in both assembly language and performance.
I'm no expert in this area, but I would tend to think that the reason why the two times are somewhat equivalent is due to cache memory.
When you need to access a memory location (Say, address 0xaabbc125 on an IA-32 architecure), the CPU copies the memory block (addresses 0xaabbc000 to 0xaabbcfff) to your cache memory. Reading from and writing to the memory is very slow, but once it's been copied into you cache, you can access values very quickly. This is useful because programs usually require the same range of addresses over and over.
Since you execute the same code over and over and that your code doesn't require a lot of memory, the first time the function is executed, the memory block(s) is (are) copied to your cache once, which probably takes most of the 97000 time units. Any subsequent calls to your fooVal and fooRef functions will require addresses that are already in your cache, so they will require only a few nanoseconds (I'd figure roughly between 10ns and 1µs). Thus, dereferencing the pointer (since a reference is implemented as a pointer) is about double the time compared to just accessing a value, but it's double of not much anyway.
Someone who is more of an expert may have a better or more complete explanation than mine, but I think this could help you understand what's going on here.
A little idea : try to run the fooVal and fooRef functions a few times (say, 10 times) before setting start and beginning the loop. That way, (if my explanation was correct!) the memory block will (should) be already into cache when you begin looping them, which means you won't be taking caching in your times.
About the super-high value you got, I can't explain that. But the value is obviously wrong.
It's not a bug, it's a feature! =)

10 milli-second C++ excution time

I try to find out the exact execution time for "for loop" with 2e6 iteritions.
The following code is ran within 10ms after compiled from g++ for c++ file.
People told me that is optimization code automatically done by C++ compiler so you
get meaningless execution time. In other words,since there is no any output call
such as printf or cout<< for variable a,b,c so the optimized code will do nothing for
that "for loop" that is why I got really short program execution time in 10ms. Right ? Why they said the time result is meaningless for "for loop".
Please advise
int main(){
int max = 2e6;
int a,b,c;
// CODE YOU WANT TO TIME
int start = getMilliCount();
for (int i = 0; i < max; i++) {
a = 1234 + 5678 + i;
b = 1234 * 5678 + i;
c=1234/2+i;
}
int milliSecondsElapsed = getMilliSpan(start);
printf("\n\nElapsed time = %u milliseconds %d\n", milliSecondsElapsed,max);
return 0;
}
The run-time is absolutely not meaningless. It proves at least one important point: the optimizer is smarter than given credit, and it's able to deduce the loop has no side effects, so it cuts it out.
So even if the profile result only proves this one thing, it does have meaning.
To address what you want:
I try to find out the exact execution time for "for loop" with 2e8 iteritions.
The execution time of a for loop with 2e8 can be 0 if there are no observable effects. Or very large if they are. That's why you usually profile actual code using dedicated tools.
The compiler can change the program in any way that does not change anything observable, i.e. all outputs etc. must be exactly the same as the outputs of the un-optimized code. In your example, the compiler may notice that the values of a, b and c after the loop are never used and the loop does nothing else, so it might as well remove the loop from your program.
It could also observe that the value of the variables depend directly on max and just skip all but the last iteration.
In both cases, the result would not depend on max. It still is not meaningless, it just means that you underestimate your compiler.
Edit:
I tested this scenario with g++ -O2, the loop gets completely removed and does not run at all.

Inline speed and compiler optimization

I'm doing a bit of hands on research surrounding the speed benefits of making a function inline. I don't have the book with me, but one text I was reading, was suggesting a fairly large overhead cost to making function calls; and when ever executable size is either negligible, or can be spared, a function should be declared inline, for speed.
I've written the following code to test this theory, and from what I can tell, there is no speed benifit from declaring a function as inline. Both functions, when called 4294967295 times, on my computer, execute in 196 seconds.
My question is, what would be your thoughts as to why this is happening? Is it modern compiler optimization? Would it be the lack of large calculations taking place in the function?
Any insight on the matter would be appreciated. Thanks in advance friends.
#include < iostream >
#include < time.h >
// RESEARCH Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
// Two functions that preform an identacle arbitrary floating point calculation
// one function is inline, the other is not.
double test(double a, double b, double c);
double inlineTest(double a, double b, double c);
double test(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
inline
double inlineTest(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
// ENTRY POINT Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
int main(){
const unsigned int maxUINT = -1;
clock_t start = clock();
//============================ NON-INLINE TEST ===============================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
clock_t end = clock();
std::cout << maxUINT << " calls to non inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
start = clock();
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
end = clock();
std::cout << maxUINT << " calls to inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
getchar(); // Wait for input.
return 0;
} // Main.
Assembly Output
PasteBin
The inline keyword is basically useless. It is a suggestion only. The compiler is free to ignore it and refuse to inline such a function, and it is also free to inline a function declared without the inline keyword.
If you are really interested in doing a test of function call overhead, you should check the resultant assembly to ensure that the function really was (or wasn't) inlined. I'm not intimately familiar with VC++, but it may have a compiler-specific method of forcing or prohibiting the inlining of a function (however the standard C++ inline keyword will not be it).
So I suppose the answer to the larger context of your investigation is: don't worry about explicit inlining. Modern compilers know when to inline and when not to, and will generally make better decisions about it than even very experienced programmers. That's why the inline keyword is often entirely ignored. You should not worry about explicitly forcing or prohibiting inlining of a function unless you have a very specific need to do so (as a result of profiling your program's execution and finding that a bottleneck could be solved by forcing an inline that the compiler has for some reason not done).
Re: the assembly:
; 30 : const unsigned int maxUINT = -1;
; 31 : clock_t start = clock();
mov esi, DWORD PTR __imp__clock
push edi
call esi
mov edi, eax
; 32 :
; 33 : //============================ NON-INLINE TEST ===============================//
; 34 : for(unsigned int i = 0; i < maxUINT; ++i)
; 35 : blank(1.1,2.2,3.3);
; 36 :
; 37 : clock_t end = clock();
call esi
This assembly is:
Reading the clock
Storing the clock value
Reading the clock again
Note what's missing: calling your function a whole bunch of times
The compiler has noticed that you don't do anything with the result of the function and that the function has no side-effects, so it is not being called at all.
You can likely get it to call the function anyway by compiling with optimizations off (in debug mode).
Both the functions could be inlined. The definition of the non-inline function is in the same compilation unit as the usage point, so the compiler is within its rights to inline it even without you asking.
Post the assembly and we can confirm it for you.
EDIT: the MSVC compiler pragma for banning inlining is:
#pragma auto_inline(off)
void myFunction() {
// ...
}
#pragma auto_inline(on)
Two things could be happening:
The compiler may either be inlining both or neither functions. Check your compiler documentation for how to control that.
Your function may be complex enough that the overhead of doing the function call isn't big enough to make a big difference in the tests.
Inlining is great for very small functions but it's not always better. Code bloat can prevent the CPU from caching code.
In general inline getter/setter functions and other one liners. Then during performance tuning you can try inlining functions if you think you'll get a boost.
Your code as posted contains a couple oddities.
1) The math and output of your test functions are completely independent of the function parameters. If the compiler is smart enough to detect that those functions always return the same value, that might give it incentive to optimize them out entirely inline or not.
2) Your main function is calling test for both the inline and non-inline tests. If this is the actual code that you ran, then that would have a rather large role to play in why you saw the same results.
As others have suggested, you would do well to examine the actual assembly code generated by the compiler to determine that you're actually testing what you intended to.
Um, shouldn't
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
be
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
inlineTest(1.1,2.2,3.3);
?
But if that was just a typo, would recommend that look at a dissassembler or reflector to see if the code is actually inline or still stack-ed.
If this test took 196 seconds for each loop, then you must not have turned optimizations on; with optimizations off, generally compilers don't inline anything.
With optimization on, however, the compiler is free to notice that your test function can be completely evaluated at compile time, and crush it down to "return [constant]" -- at which point, it may well decide to inline both functions since they're so trivial, and then notice that the loops are pointless since the function value is not used, and squash that out too! This is basically what I got when I tried it.
So either way, you're not testing what you thought you tested.
Function call overhead ain't what it used to be, compared to the overhead of blowing out the level-1 instruction cache, which is what aggressive inlining does to you. You can easily find reports online of gcc's -Os option (optimize for size) being a better default choice for large projects than -O2, and the big reason for that is that -O2 inlines a lot more aggressively. I would expect it is much the same with MSVC.
The only way I know of to guarantee a function is inline is to #define it
For example:
#define RADTODEG(x) ((x) * 57.29578)
That said, the only time I would bother with such a function would be in an embedded system. On a desktop/server the performance difference is negligible.
Run it in a debugger and have a look at the generated code to see if your function is always or never inlined. I think it's always a good idea to have a look at the assembler code when you want more knowledge about the optimization the compiler does.
Apologies for a small flame ...
Compilers think in assembly language. You should too. Whatever else you do, just step through the code at the assembler level. Then you'll know exactly what the compiler did.
Don't think of performance in absolute terms like "fast" or "slow". It's all relative, percentage-wise. The way software is made fast is by removing, in successive steps, things that take too large a percent of the time.
Here's the flame: If a compiler can do a pretty good job of inlining functions that clearly need it, and if it can do a really good job of managing registers, I think that's just what it should do. If it can do a reasonable job of unrolling loops that clearly could use it, I can live with that. If it's knocking itself out trying to outsmart me by removing function calls that I clearly wrote and intended to be called, or scrambling my code sanctimoniously trying to save a JMP when that JMP occupies 0.000001% of running time (the way Fortran does), I get annoyed, frankly.
There seems to be a notion in the compiler world that there's no such thing as an unhelpful optimization. No matter how smart the compiler is, real optimization is the programmer's job, and nobody else's.