Can a recursive function be inline?
What are the trade-offs of making recursive functions inline?
Recursive functions that can be optimised by tail-end recursion can certainly be inlined. If the last thing a function does is call itself, then it can be converted into a plain loop.
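As a minimal sketch of that conversion (the function names are made up for illustration, and this is written by hand rather than taken from any compiler's output), here is a tail-recursive function next to the loop it is equivalent to:

```cpp
// Tail-recursive: the recursive call is the very last thing the function does,
// so nothing from the current frame is needed once the call has been made.
unsigned gcd_recursive(unsigned a, unsigned b) {
    if (b == 0) return a;
    return gcd_recursive(b, a % b);
}

// The same computation with the call replaced by "update the arguments and
// jump back to the top", which is what a plain loop does.
unsigned gcd_loop(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned next = a % b;
        a = b;
        b = next;
    }
    return a;
}
```

Once the function is in loop form there is no recursion left, so inlining it is no different from inlining any other small function.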
Arbitrary recursive functions can't be inlined for the same reason a snake can't swallow its own tail.
[Edit: just noticed that although your title says "be inlined", your actual question says "making functions inline". The two effectively have nothing to do with one another, they just have confusingly similar names. In modern compilers, the primary effect of inline is the thing that originally in C99 was (I think) just a necessary detail to make inline work at all: to permit multiple definitions of a symbol with external linkage. That's because modern compilers don't pay a whole lot of attention to the programmer's opinion of whether a function should be inlined. They do pay some, though, so the confusion of concepts persists. I've answered the question in the title, which is the decision the compiler makes, not the question in the body, which is the decision the programmer makes.]
Inlining is not necessarily an all-or-nothing deal. One strategy which compilers use to decide whether to inline, is to keep inlining function calls until the resulting code is "too big". "Big" is defined by some hopefully sensible heuristic.
So consider the following recursive function (which deliberately is not simply tail-recursive):
int triangle(int n) {
if (n == 1) return 1;
return n + triangle(n-1);
}
If it's called like this:
int t100() {
return triangle(100);
}
Then there's no particular reason in principle that the usual rules that the compiler uses for inlining shouldn't result in this:
int t100() {
// inline call to triangle(100)
int result;
if (100 == 1) { result = 1; } else {
// inline call to triangle(99)
int t99;
if (100 - 1 == 1) { t99 = 1; } else {
// inline call to triangle(98)
int t98;
if (100 - 1 - 1 == 1) { t98 = 1; } else {
// oops, "too big", no more inlining
t98 = triangle(100 - 1 - 1 - 1) + 98;
}
t99 = t98 + 99;
}
result = t99 + 100;
}
return result;
}
Obviously the optimiser will have a field day with that, so it's much "smaller" than it looks:
int t100() {
return triangle(97) + 297;
}
The code in triangle itself could be "unrolled" a few steps by a few levels of inlining, in exactly the same way, except that it doesn't have the benefits of constants:
int triangle(int n) {
if (n == 1) return 1;
if (n == 2) return 3;
if (n == 3) return 6;
return triangle(n-3) + 3*n - 3;
}
I doubt whether compilers actually do this, though; I don't think I've ever noticed it [Edit: MSVC does if you tell it to, thanks peterchen].
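The constant in the unrolled version comes from expanding three levels by hand: n + (n-1) + (n-2) = 3*n - 3. A small check that the two versions agree might look like this (the names triangle_plain, triangle_unrolled and versions_agree are only for this sketch):

```cpp
// The plain recursive definition.
int triangle_plain(int n) {
    if (n == 1) return 1;
    return n + triangle_plain(n - 1);
}

// Three levels of the recursion expanded by hand.
int triangle_unrolled(int n) {
    if (n == 1) return 1;
    if (n == 2) return 3;
    if (n == 3) return 6;
    return triangle_unrolled(n - 3) + 3 * n - 3;
}

// Returns true when both versions agree for every n in [1, limit].
bool versions_agree(int limit) {
    for (int n = 1; n <= limit; ++n) {
        if (triangle_plain(n) != triangle_unrolled(n)) return false;
    }
    return true;
}
```

Like the original, both versions assume n >= 1.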
There's an obvious potential benefit in saving call overhead, but as against that people don't really expect recursive functions to get inlined, and there's no particular guarantee that the usual inlining heuristics will perform well with recursive functions (where there are two different places, the call site and the recursive call, that might be inlined, with different benefits in each case). Furthermore, it's difficult at compile time to estimate how deep the recursion will go, and the inline heuristics might like to take account of the call depth to make decisions. So it may be that the compiler just doesn't bother.
Functional language compilers are typically a lot more aggressive dealing with recursion than C or C++ compilers. The relevant trade-off there is that so many functions written in functional languages are recursive, that performance might be hopeless if the compiler couldn't optimise tail-recursion. So Lisp programmers typically rely on good optimisation of recursive functions, whereas C and C++ programmers typically don't.
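To make the tail-recursion point concrete, here is a sketch of how the triangle function above could be rewritten with an accumulator so that the recursive call becomes the last operation. The helper name triangle_acc and the wrapper triangle_tail are assumptions for illustration; whether a given C or C++ compiler then turns the call into a jump depends on the compiler and its optimisation settings.

```cpp
// Helper carrying the running total, so no work is left after the recursive call.
int triangle_acc(int n, int acc) {
    if (n == 1) return acc + 1;
    return triangle_acc(n - 1, acc + n);  // tail call
}

// Same interface as triangle(n); assumes n >= 1 like the original.
int triangle_tail(int n) {
    return triangle_acc(n, 0);
}
```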
If your compiler does not support it, you can try manually inlining instead...
int factorial(int n) {
int result = 1;
if (n-- == 0) {
return result;
} else {
result *= 1;
if (n-- == 0) {
return result;
} else {
result *= 2;
if (n-- == 0) {
return result;
} else {
result *= 3;
if (n-- == 0) {
return result;
} else {
result *= 4;
if (n-- == 0) {
return result;
} else {
// ...
}
}
}
}
}
}
See the problem yet?
Tail recursion (a special case of recursion) can be inlined by smart compilers.
Now, hold on. A tail-recursive function could be unrolled and inlined pretty easily. Apparently there are compilers that do this, but I am not aware of specifics.
Of course. Any function can be inlined if it makes sense to do it:
int f(int i)
{
if (i <= 0) return 1;
else return i * f(i - 1);
}
int main()
{
return f(10);
}
pseudo assembly (f is inlined in main):
main:
mov r0, #10 ; Pass 10 to f
f:
cmp r0, #0 ; arg <= 0? ...
bgt 1f
mov r0, #1 ; ... if so, return 1
ret
1:
mov r0, -(sp) ; if not, save arg.
dec r0 ; pass arg - 1 to f
call f ; just because it's inlined doesn't mean I can't call it.
mul r0, (sp)+ ; compute the result
ret ; done.
;-)
When you call an ordinary function, you break the sequential execution order and jump (call or jmp) to the address where the function resides. Inlining means that the function's instructions are placed at every point where the function is called, so there is no longer a single place to jump to. Other optimisations also become possible, such as eliminating the pushing and popping of function parameters.
When you know that the recursive chain will normally not be very long, you could inline up to a predefined level (I don't know whether any existing compiler is intelligent enough for this today).
Inlining a recursive function is much like unrolling a loop. You will end up with much duplicate code -- but in some cases it could be worthwhile:
The number of recursive calls (the length of the chain) is normally short (if it gets longer than the predefined depth, just do normal recursion).
The overhead of the function calls is relatively big compared to the logic, so do some "unrolling", for example five instances, and end up doing a recursive call again. This would save 80% of the call overhead.
Of course there is also the tail-recursive special case, but this was mentioned by others.
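One way to sketch "unroll to a fixed depth, then fall back to normal recursion" by hand in C++17 is a template whose parameter counts the remaining copies. This is only an illustration of the idea, not what any compiler does internally; the names factorial_plain and factorial_unrolled are made up, and the compiler still decides for itself whether each instantiation is actually inlined.

```cpp
// Ordinary recursive fallback, used once the unrolled copies run out.
int factorial_plain(int n) {
    return n <= 1 ? 1 : n * factorial_plain(n - 1);
}

// Depth is the number of further copies of the body that may be expanded.
// Each instantiation is a distinct, non-recursive function, so the usual
// inlining rules can apply to it.
template <int Depth>
inline int factorial_unrolled(int n) {
    if (n <= 1) return 1;
    if constexpr (Depth > 0) {
        return n * factorial_unrolled<Depth - 1>(n - 1);
    } else {
        return n * factorial_plain(n - 1);  // chain was longer than expected
    }
}
```

A call such as factorial_unrolled<4>(10) expands five copies of the body before handing the rest of the chain to the ordinary recursive function.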
Of course it can be declared inline. The inline keyword is just a hint to the compiler. In many cases the compiler simply ignores it, and depending on the compiler this could be one of those situations.
Some compilers can turn tail recursion into plain loops, and thus inline them normally.
Non-tail recursion could be inlined up to a given depth, usually decided by the compiler.
I've never encountered a practical application for that, as the cost of call isn't high enough anymore to offset the increase in code size.
[edit] (to clarify that: even though I like to toy with these things, and often check what code my compiler generates for "funny stuff" just out of curiosity, I haven't encountered a use case where any such unrolling helped significantly. This doesn't mean they don't exist or couldn't be constructed.)
The only place where it would help is precalculating low iterations during compile time. However, in my experience this immensely increases compile times for often negligible runtime performance benefits.
Note that Visual Studio 2008 (and earlier) gives you quite some control over this:
#pragma inline_recursion(on)
#pragma inline_depth(N)
__forceinline
Be careful with the latter, it can easily overload the compiler :)
Inline means that wherever a function marked as inline is called, the compiler places a copy of that function's code there. This avoids the function-calling mechanism and its usual pushing and popping of arguments on the stack, saving time in gazillion-calls-per-second situations. You see the consequences to static variables and stuff like that? All gone...
So, if you had an inlined recursive call, either your compiler is super smart and figures out whether the number of copies is deterministic, or it will say "Cannot make it inline", because it wouldn't know when to stop.
My prof once said, that if-statements are rather slow and should be avoided as much as possible. I'm making a game in OpenGL, where I need a lot of them.
In my tests replacing an if-statement with AND via short-circuiting worked, but is it faster?
#include <cstdlib>
#include <iostream>
bool doSomething();
int main()
{
int randomNumber = std::rand() % 10;
randomNumber == 5 && doSomething();
return 0;
}
bool doSomething()
{
std::cout << "function executed" << std::endl;
return true;
}
My intention is to use this inside the draw function of my renderer. My models are supposed to have flags, if a flag is true, a certain function should execute.
if-statements are rather slow and should be avoided as much as possible.
This is wrong and/or misleading. Most simplified statements about slowness of a program are wrong. There's probably something wrong with this answer too.
C++ statements don't have a speed that can be attributed to them. It's the speed of the compiled program that matters. And that consists of assembly language instructions; not of C++ statements.
What would probably be more correct is to say that branch instructions can be relatively slow (on modern, superscalar CPU architectures) (when the branch cannot be predicted well) (depending on what you are comparing to; there are many things that are much more expensive).
randomNumber == 5 && doSomething();
An if-statement is often compiled into a program that uses a branch instruction. A short-circuiting logical-and operation is also often compiled into a program that uses a branch instruction. Replacing if-statement with a logical-and operator is not a magic bullet that makes the program faster.
If you were to compare the program produced by the logical-and and the corresponding program where it is replaced with if (randomNumber == 5), you would find that the optimiser sees through your trick and produces the same assembly in both cases.
My models are supposed to have flags, if a flag is true, a certain function should execute.
In order to avoid the branch, you must change the premise. Instead of iterating through a sequence of all models, checking flag, and conditionally calling a function, you could create a sequence of all models for which the function should be called, iterate that, and call the function unconditionally -> no branching. Is this alternative faster? There is certainly some overhead of maintaining the data structure and the branch predictor may have made this unnecessary. Only way to know for sure is to measure the program.
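A minimal sketch of that restructuring follows. The Model type, its fields and the function names are hypothetical, since the question does not show the real renderer; the point is only that the flag is tested when the list is built, not inside the per-frame loop.

```cpp
#include <vector>

// Hypothetical model type; the field names are assumptions for illustration.
struct Model {
    bool flagged = false;
    int drawCalls = 0;
};

// Hypothetical per-model work that should only run for flagged models.
void drawSpecial(Model& m) { ++m.drawCalls; }

// Done occasionally, whenever flags change: gather the models that need the call.
std::vector<Model*> collectFlagged(std::vector<Model>& models) {
    std::vector<Model*> flagged;
    for (Model& m : models) {
        if (m.flagged) flagged.push_back(&m);
    }
    return flagged;
}

// Done every frame: the function is called unconditionally for each entry.
void drawAllSpecial(const std::vector<Model*>& flagged) {
    for (Model* m : flagged) drawSpecial(*m);
}
```

The pointers stay valid only as long as the models vector is not resized, which is part of the maintenance overhead mentioned above.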
I agree with the comments above that in almost all practical cases, it's OK to use ifs as much as you need without hesitation.
I also agree that this is not an issue a beginner should spend energy optimizing, and that using logical operators is likely to emit code similar to ifs.
However - there is a valid issue here related to branching in general, so those who are interested are welcome to read on.
Modern CPUs use what we call Instruction pipelining.
Without getting too deep into the technical details:
Within each CPU core there is a level of parallelism.
Each assembly instruction is composed of several stages, and while the current instruction is executed, the next instructions are prepared to a certain degree.
This is called instruction pipelining.
This concept is broken with any kind of branching in general, and conditionals (ifs) in particular.
It's true that there is a mechanism of branch prediction, but it works only to some extent.
So although in most cases ifs are totally OK, there are cases where this should be taken into account.
As always when it comes to optimizations, one should carefully profile.
Take the following piece of code as an example (similar things are common in image processing and other implementations):
unsigned char * pData = ...; // get data from somewhere
int dataSize = 100000000; // something big
bool cond = ...; // initialize some condition relevant for all data
for (int i = 0; i < dataSize; ++i, ++pData)
{
if (cond)
{
*pData = 2; // imagine some small calculation
}
else
{
*pData = 3; // imagine some other small calculation
}
}
It might be better to do it like this (even though it contains duplication which is evil from software engineering point of view):
if (cond)
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 2; // imagine some small calculation
}
}
else
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 3; // imagine some other small calculation
}
}
We still have an if, but it potentially causes a branch only once.
In certain [rare] cases (requires profiling as mentioned above) it will be more efficient to do even something like this:
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = (2 * cond + 3 * (!cond));
}
I know it's not common, but some years ago I encountered specific HW on which the cost of 2 multiplications and 1 addition with negation was less than the cost of branching (due to the reset of the instruction pipeline). This "trick" also supports using different condition values for different parts of the data.
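The last remark, different condition values for different parts of the data, could be sketched like this. The function name fillSelect and the per-element cond array are assumptions for illustration, and whether this beats a plain if on a given machine is something only profiling can tell.

```cpp
#include <cstddef>

// Writes 2 where cond[i] is true and 3 where it is false, using arithmetic
// on the boolean (true converts to 1, false to 0) instead of an if/else.
void fillSelect(unsigned char* data, const bool* cond, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        data[i] = static_cast<unsigned char>(2 * cond[i] + 3 * !cond[i]);
    }
}
```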
Bottom line: ifs are usually OK, but it's good to be aware that sometimes there is a cost.
Look at this function:
float process(float in) {
float out = in;
for (int i = 0; i < 31; ++i) {
if (biquads_[i]) {
out = biquads_[i]->filter(out);
}
}
return out;
}
biquads_ is a std::optional<Biquad>[31].
In this case I check every optional to see whether it is non-empty and then call the filter function of the biquad. If instead I called the filter function unconditionally, changing it to multiply by 1 or simply return the input value, would that be more efficient?
Most likely it won't make a shred of difference (guessing somewhat though since your question is not entirely clear). For two reasons: 1) unless the code is going to be used in a very hot path, it won't matter even if one way is a few nanoseconds faster than the other. 2) most likely your compiler's optimizer will be clever enough to generate code that performs close to (if not identically to) the same in both cases. Did you test it? Did you benchmark/profile it? If not, do so, with optimization enabled.
Strive to write clear, readable, maintainable code. Worry about micro-optimization later when you actually have a problem and your profiler points to your function as a hot-spot.
I noticed that Google's C++ style guide cautions against inlining functions with loops or switch statements:
Another useful rule of thumb: it's typically not cost effective to
inline functions with loops or switch statements (unless, in the
common case, the loop or switch statement is never executed).
Other comments on StackOverflow have reiterated this sentiment.
Why are functions with loops or switch statements (or gotos) not suitable for or compatible with inlining. Does this apply to functions that contain any type of jump? Does it apply to functions with if statements? Also (and this might be somewhat unrelated), why is inlining functions that return a value discouraged?
I am particularly interested in this question because I am working with a segment of performance-sensitive code. I noticed that after inlining a function that contains a series of if statements, performance degrades pretty significantly. I'm using GNU Make 3.81, if that's relevant.
Inlining functions with conditional branches makes it more difficult for the CPU to accurately predict the branch statements, since each instance of the branch is independent.
If there are several branch statements, successful branch prediction saves a lot more cycles than the cost of calling the function.
Similar logic applies to unrolling loops with switch statements.
The Google guide referenced doesn't mention anything about functions returning values, so I'm assuming that reference is elsewhere, and requires a different question with an explicit citation.
While in your case, the performance degradation seems to be caused by branch mispredictions, I don't think that's the reason why the Google style guide advocates against inline functions containing loops or switch statements. There are use cases where the branch predictor can benefit from inlining.
A loop is often executed hundreds of times, so the execution time of the loop is much larger than the time saved by inlining. So the performance benefit is negligible (see Amdahl's law). OTOH, inlining functions results in increase of code size which has negative effects on the instruction cache.
In the case of switch statements, I can only guess. The rationale might be that jump tables can be rather large, wasting much more memory in the code segment than is obvious.
I think the keyword here is cost effective. Functions that cost a lot of cycles or memory are typically not worth inlining.
The purpose of a coding style guide is to tell you that if you are reading it you are unlikely to have added an optimisation to a real compiler, even less likely to have added a useful optimisation (measured by other people on realistic programs over a range of CPUs), therefore quite unlikely to be able to out-guess the guys who did. At least, do not mislead them, for example, by putting the volatile keyword in front of all your variables.
Inlining decisions in a compiler have very little to do with 'Making a Simple Branch Predictor Happy'. Or less confused.
First off, the target CPU may not even have branch prediction.
Second, a concrete example:
Imagine a compiler which has no other optimisation (turned on) except inlining. Then the only positive effect of inlining a function is that bookkeeping related to function calls (saving registers, setting up locals, saving the return address, and jumping to and back) are eliminated. The cost is duplicating code at every single location where the function is called.
In a real compiler dozens of other simple optimisations are done and the hope of inlining decisions is that those optimisations will interact (or cascade) nicely. Here is a very simple example:
int f(int s)
{
...;
switch (s) {
case 1: ...; break;
case 2: ...; break;
case 42: ...; return ...;
}
return ...;
}
void g(...)
{
int x=f(42);
...
}
When the compiler decides to inline f, it replaces the RHS of the assignment with the body of f. It substitutes the actual parameter 42 for the formal parameter s and suddenly it finds that the switch is on a constant value...so it drops all the other branches and hopefully the known value will allow further simplifications (ie they cascade).
If you are really lucky all calls to the function will be inlined (and unless f is visible outside) the original f will completely disappear from your code. So your compiler eliminated all the bookkeeping and made your code smaller at compile time. And made the code more local at runtime.
If you are unlucky, the code size grows, locality at runtime decreases and your code runs slower.
It is trickier to give a nice example when it is beneficial to inline loops because one has to assume other optimisations and the interactions between them.
The point is that it is hellishly difficult to predict what happens to a chunk of code even if you know all the ways the compiler is allowed to change it. I can't remember who said it but one should not be able to recognise the executable code produced by an optimising compiler.
I think it might be worth extending the example provided by user1666959. I'll answer to provide cleaner example code.
Let's consider the following scenario.
/// Counts odd numbers in range [0;number]
size_t countOdd(size_t number)
{
size_t result = 0;
for (size_t i = 0; i <= number; ++i)
{
result += (i % 2);
}
return result;
}
int main()
{
return countOdd(5);
}
If the function is not inlined and has external linkage, it will execute the whole loop. Imagine what happens when you inline it.
int main()
{
size_t result = 0;
for (size_t i = 0; i <= 5; ++i)
{
result += (i % 2);
}
return result;
}
Now let's enable the loop unrolling optimization. Here we know that it iterates from 0 to 5, so it can easily be unrolled, removing the loop conditions from the code.
int main()
{
size_t result = 0;
// iteration 0
size_t i = 0;
result += (i % 2);
// iteration 1
++i;
result += (i % 2);
// iteration 2
++i;
result += (i % 2);
// iteration 3
++i;
result += (i % 2);
// iteration 4
++i;
result += (i % 2);
// iteration 5
++i;
result += (i % 2);
return result;
}
No conditions, so it is faster already, but that's not all. We know the value of i, so why not substitute it directly?
int main()
{
size_t result = 0;
// iteration 0
result += (0 % 2);
// iteration 1
result += (1 % 2);
// iteration 2
result += (2 % 2);
// iteration 3
result += (3 % 2);
// iteration 4
result += (4 % 2);
// iteration 5
result += (5 % 2);
return result;
}
Even simpler, but wait: those operations are constant expressions, so we can calculate them during compilation.
int main()
{
size_t result = 0;
// iteration 0
result += 0;
// iteration 1
result += 1;
// iteration 2
result += 0;
// iteration 3
result += 1;
// iteration 4
result += 0;
// iteration 5
result += 1;
return result;
}
So now the compiler sees that some of those operations have no effect, leaving only those that change the value. After that it removes unnecessary temporary variables and performs as many of the calculations as it can during compilation, so your code ends up as:
int main()
{
return 3;
}
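The end result of that chain can be checked without reading assembly. The sketch below marks countOdd as constexpr (C++14 or later), which is an addition not in the code above. Note that this is a different mechanism from the optimiser's inlining, unrolling and folding: constexpr makes the language itself evaluate the call at compile time. It is used here only to confirm that the value the steps arrive at is 3.

```cpp
#include <cstddef>

/// Counts odd numbers in range [0;number]
constexpr std::size_t countOdd(std::size_t number) {
    std::size_t result = 0;
    for (std::size_t i = 0; i <= number; ++i) {
        result += (i % 2);
    }
    return result;
}

// Fails to compile if the loop does not evaluate to 3 at compile time.
static_assert(countOdd(5) == 3, "countOdd(5) should fold to 3");
```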
I'm a beginner in C++. Yesterday I read about recursive functions, so I decided to write my own. Here's what I wrote:
int returnZero(int anyNumber) {
if(anyNumber == 0)
return 0;
else {
anyNumber--;
return returnZero(anyNumber);
}
}
When I do this: int zero1 = returnZero(4793);, it causes a stack overflow. However, if I pass the value 4792 as the argument, no overflow occurs.
Any ideas as to why?
Whenever you call a function, including recursively, the return address and often the arguments are pushed onto the call stack. The stack is finite, so if the recursion is too deep you'll eventually run out of stack space.
What surprises me is that it only takes 4793 calls on your machine to overflow the stack. This is a pretty small stack. By way of comparison, running the same code on my computer requires ~100x as many calls before the program crashes.
The size of the stack is configurable. On Unix, the command is ulimit -s.
Given that the function is tail-recursive, some compilers might be able to optimize the recursive call away by turning it into a jump. Some compilers might take your example even further: when asked for maximum optimizations, gcc 4.7.2 transforms the entire function into:
int returnZero(int anyNumber) {
return 0;
}
This requires exactly two assembly instructions:
_returnZero:
xorl %eax, %eax
ret
Pretty neat.
You just hit the call stack's size limit of your system, that's what's happening. For some reason the stack in your system is tiny, a depth of 4793 function calls is rather small.
Your stack is limited in size and so when you make 4793 calls you are hitting the limit while 4792 just comes in under. Each function call will use some space on the stack for housekeeping and maybe arguments.
This page gives an example of what a stack looks like during a recursive function call.
My guess is your stack is exactly big enough to fit 4792 entries - today. Tomorrow or the next day, that number might be different. Recursive programming can be dangerous and this example illustrates why. We try not to let recursion get this deep or 'bad' things can happen.
Any "boundless" recursion, that is recursive calls that aren't naturally limited to a small(ish) number will have this effect. Exactly where the limit goes depends on the OS, the environment the function is called in (the compiler, which function calls the recursive function, etc, etc).
If you add another variable, say int x[10]; to your function that calls your recursive function, the number needed to crash it will change (probably by about 5 or so).
Compile it with a different compiler (or even different compiler settings, e.g. optimization turned on) and it will probably change again.
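If the depth cannot be bounded, the usual way out is to write the loop yourself rather than rely on the compiler to remove the recursion. A sketch for the function in the question (the name returnZeroIterative is made up, and non-negative input is assumed, as in the question's calls):

```cpp
// Same result as the recursive returnZero, but uses a fixed amount of stack
// no matter how large the argument is. Assumes anyNumber >= 0.
int returnZeroIterative(int anyNumber) {
    while (anyNumber != 0) {
        anyNumber--;
    }
    return anyNumber;
}
```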
Using recursion, you can achieve SuperDigit:
#include <iostream>
// Recursively sums the decimal digits of n (assumes n >= 0).
int getSum(int n)
{
if (n == 0)
return 0;
// last digit plus the digit sum of the remaining digits
return n % 10 + getSum(n / 10);
}
int main()
{
int n = 8596854;
std::cout << getSum(n);
return 0;
}
I'm doing a bit of hands-on research surrounding the speed benefits of making a function inline. I don't have the book with me, but one text I was reading suggested a fairly large overhead cost to making function calls; and whenever executable size is either negligible, or can be spared, a function should be declared inline, for speed.
I've written the following code to test this theory, and from what I can tell, there is no speed benefit from declaring a function as inline. Both functions, when called 4294967295 times, on my computer, execute in 196 seconds.
My question is, what would be your thoughts as to why this is happening? Is it modern compiler optimization? Would it be the lack of large calculations taking place in the function?
Any insight on the matter would be appreciated. Thanks in advance friends.
#include <iostream>
#include <time.h>
// RESEARCH Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
// Two functions that perform an identical arbitrary floating point calculation
// one function is inline, the other is not.
double test(double a, double b, double c);
double inlineTest(double a, double b, double c);
double test(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
inline
double inlineTest(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
// ENTRY POINT Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
int main(){
const unsigned int maxUINT = -1;
clock_t start = clock();
//============================ NON-INLINE TEST ===============================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
clock_t end = clock();
std::cout << maxUINT << " calls to non inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
start = clock();
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
end = clock();
std::cout << maxUINT << " calls to inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
getchar(); // Wait for input.
return 0;
} // Main.
Assembly Output
PasteBin
The inline keyword is basically useless. It is a suggestion only. The compiler is free to ignore it and refuse to inline such a function, and it is also free to inline a function declared without the inline keyword.
If you are really interested in doing a test of function call overhead, you should check the resultant assembly to ensure that the function really was (or wasn't) inlined. I'm not intimately familiar with VC++, but it may have a compiler-specific method of forcing or prohibiting the inlining of a function (however the standard C++ inline keyword will not be it).
So I suppose the answer to the larger context of your investigation is: don't worry about explicit inlining. Modern compilers know when to inline and when not to, and will generally make better decisions about it than even very experienced programmers. That's why the inline keyword is often entirely ignored. You should not worry about explicitly forcing or prohibiting inlining of a function unless you have a very specific need to do so (as a result of profiling your program's execution and finding that a bottleneck could be solved by forcing an inline that the compiler has for some reason not done).
Re: the assembly:
; 30 : const unsigned int maxUINT = -1;
; 31 : clock_t start = clock();
mov esi, DWORD PTR __imp__clock
push edi
call esi
mov edi, eax
; 32 :
; 33 : //============================ NON-INLINE TEST ===============================//
; 34 : for(unsigned int i = 0; i < maxUINT; ++i)
; 35 : blank(1.1,2.2,3.3);
; 36 :
; 37 : clock_t end = clock();
call esi
This assembly is:
Reading the clock
Storing the clock value
Reading the clock again
Note what's missing: calling your function a whole bunch of times
The compiler has noticed that you don't do anything with the result of the function and that the function has no side-effects, so it is not being called at all.
You can likely get it to call the function anyway by compiling with optimizations off (in debug mode).
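Another approach, with optimizations left on, is to make the work observable so the compiler cannot discard it. The sketch below is an assumption-laden illustration, not the question's code: compute is a made-up function that really depends on its argument, the input is read through a volatile so it cannot be folded to a constant, and the results are summed and returned to the caller.

```cpp
#include <chrono>

// Arithmetic of the same flavour as the question's test function, but the
// result actually depends on the argument (an assumption for this sketch).
double compute(double a) {
    return 9.999 / a + (a * a);
}

// Returns the elapsed time in seconds and stores the accumulated result in
// `total`, so the calls have a visible effect and cannot simply be removed.
double timeCalls(unsigned iterations, double& total) {
    volatile double input = 1.5;  // volatile read keeps the argument opaque
    const auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (unsigned i = 0; i < iterations; ++i) {
        sum += compute(input);
    }
    const auto end = std::chrono::steady_clock::now();
    total = sum;
    return std::chrono::duration<double>(end - start).count();
}
```

The caller should print or otherwise use total. Even then, checking the generated assembly remains the only way to be sure what was measured.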
Both the functions could be inlined. The definition of the non-inline function is in the same compilation unit as the usage point, so the compiler is within its rights to inline it even without you asking.
Post the assembly and we can confirm it for you.
EDIT: the MSVC compiler pragma for banning inlining is:
#pragma auto_inline(off)
void myFunction() {
// ...
}
#pragma auto_inline(on)
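For a per-function switch rather than a pragma region, compilers also offer attributes. The sketch below wraps them in macros; the macro and function names (BENCH_NOINLINE, BENCH_FORCEINLINE, scaleCalled, scaleInlined) are made up for this example, and the non-MSVC branch assumes a GCC- or Clang-compatible compiler.

```cpp
#if defined(_MSC_VER)
// MSVC spellings
#define BENCH_NOINLINE __declspec(noinline)
#define BENCH_FORCEINLINE __forceinline
#else
// Assumes GCC or Clang; other compilers would need their own spellings.
#define BENCH_NOINLINE __attribute__((noinline))
#define BENCH_FORCEINLINE inline __attribute__((always_inline))
#endif

// Asked never to be inlined: every use should be a real call.
BENCH_NOINLINE double scaleCalled(double x) { return x * 2.0 + 1.0; }

// Asked always to be inlined, where the compiler is able to do so.
BENCH_FORCEINLINE double scaleInlined(double x) { return x * 2.0 + 1.0; }
```

With a pair like this, a benchmark can compare a version that is really called against one that is really inlined, instead of leaving both to the optimizer's discretion.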
Two things could be happening:
The compiler may be inlining either both functions or neither. Check your compiler documentation for how to control that.
Your function may be complex enough that the overhead of doing the function call isn't big enough to make a big difference in the tests.
Inlining is great for very small functions but it's not always better. Code bloat can prevent the CPU from caching code.
In general inline getter/setter functions and other one liners. Then during performance tuning you can try inlining functions if you think you'll get a boost.
Your code as posted contains a couple oddities.
1) The math and output of your test functions are completely independent of the function parameters. If the compiler is smart enough to detect that those functions always return the same value, that might give it incentive to optimize them out entirely, inline or not.
2) Your main function is calling test for both the inline and non-inline tests. If this is the actual code that you ran, then that would have a rather large role to play in why you saw the same results.
As others have suggested, you would do well to examine the actual assembly code generated by the compiler to determine that you're actually testing what you intended to.
Um, shouldn't
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
be
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
inlineTest(1.1,2.2,3.3);
?
But if that was just a typo, I would recommend looking at a disassembler or reflector to see whether the code is actually inlined or still called through the stack.
If this test took 196 seconds for each loop, then you must not have turned optimizations on; with optimizations off, generally compilers don't inline anything.
With optimization on, however, the compiler is free to notice that your test function can be completely evaluated at compile time, and crush it down to "return [constant]" -- at which point, it may well decide to inline both functions since they're so trivial, and then notice that the loops are pointless since the function value is not used, and squash that out too! This is basically what I got when I tried it.
So either way, you're not testing what you thought you tested.
Function call overhead ain't what it used to be, compared to the overhead of blowing out the level-1 instruction cache, which is what aggressive inlining does to you. You can easily find reports online of gcc's -Os option (optimize for size) being a better default choice for large projects than -O2, and the big reason for that is that -O2 inlines a lot more aggressively. I would expect it is much the same with MSVC.
The only way I know of to guarantee a function is inline is to #define it.
For example:
#define RADTODEG(x) ((x) * 57.29578)
That said, the only time I would bother with such a function would be in an embedded system. On a desktop/server the performance difference is negligible.
Run it in a debugger and have a look at the generated code to see if your function is always or never inlined. I think it's always a good idea to have a look at the assembler code when you want more knowledge about the optimization the compiler does.
Apologies for a small flame ...
Compilers think in assembly language. You should too. Whatever else you do, just step through the code at the assembler level. Then you'll know exactly what the compiler did.
Don't think of performance in absolute terms like "fast" or "slow". It's all relative, percentage-wise. The way software is made fast is by removing, in successive steps, things that take too large a percent of the time.
Here's the flame: If a compiler can do a pretty good job of inlining functions that clearly need it, and if it can do a really good job of managing registers, I think that's just what it should do. If it can do a reasonable job of unrolling loops that clearly could use it, I can live with that. If it's knocking itself out trying to outsmart me by removing function calls that I clearly wrote and intended to be called, or scrambling my code sanctimoniously trying to save a JMP when that JMP occupies 0.000001% of running time (the way Fortran does), I get annoyed, frankly.
There seems to be a notion in the compiler world that there's no such thing as an unhelpful optimization. No matter how smart the compiler is, real optimization is the programmer's job, and nobody else's.