I found a question somewhat interesting, and went on an attempt to answer it. The author wants to compile -one- source file (which relies on template libraries) with AVX optimizations, and the rest of the project without those.
So, to see what would happen, I've created a test project like this:
main.cpp
#include <iostream>
#include <string>
#include "fn_normal.h"
#include "fn_avx.h"
int main(int argc, char* argv[])
{
int number = 10; // this will come from input, but let's keep it simple for now
int result;
if (std::string(argv[argc - 1]) == "--noavx")
result = FnNormal(number);
else
{
std::cout << "AVX selected\n";
result = FnAVX(number);
}
std::cout << "Double of " << number << " is " << result << std::endl;
return 0;
}
Files fn_normal.h and fn_avx.h contains declarations for functions FnNormal() and FnAVX() respectively, which are defined as follows:
fn_normal.cpp
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble(num);
}
fn_avx.cpp
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
And here's the template function definition:
double.h
template<typename T>
int RtDouble(T number)
{
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
}
Ultimately, I set Enhanced Instruction Set to AVX for the file fn_avx.cpp under "Properties-> C/C++ -> Code Generation", leaving it to Not Set for the other sources, thus it should default to SSE2.
I thought that by doing so, the compiler would instantiate the template once for each source that includes it (and avoid violating the One-Definition Rule by mangling the template function name or some other way), and thus calling the program with the --noavx parameter would make it run fine in cpus without avx support.
But the resulting program will actualy have only one machine-code version of the function, with avx instructions, and will fail on older cpus.
Disabling all other optimizations doesn't solve this issue. Also tried No Enhanced Instructions - /arch:IA32 instead of Not Set as well.
As I'm just now beginning to understand templates and such, could someone point to me the exact details for this behavior and what I could actually do to achieve my goal?
My compiler is MSVC 2013.
Additional info: the .obj files for both fn_normal.cpp and fn_avx.cpp are almost the same size in bytes. I've looked into the generated assembly listings and they are almost the same, with the important difference that the avx-enabled source replaces default sse's movss/mulss with vmovss and vmulss, respectively. But stepping throught the code in Visual Studio's disassembly view (Ctrl+Alt+D), confirms that fnNormal() indeed makes use of the avx specialized instructions.
The compiler will generate two objects (fn_avx.obj and fn_normal.obj), which are compiled with different instruction sets. As you said, outputting the disassembly for both verifies that this is being done correctly:
objdump -d fn_normal.obj:
...
movss -0x1f5c(%ebp,%eax,4),%xmm0
mulss -0x1f5c(%ebp,%ecx,4),%xmm0
mov -0x1f68(%ebp),%edx
mulss -0x1f5c(%ebp,%edx,4),%xmm0
mov -0x1f68(%ebp),%eax
movss %xmm0,-0xfb4(%ebp,%eax,4)
...
objdump -d fn_avx.obj:
...
vmovss -0x1f5c(%ebp,%eax,4),%xmm0
vmulss -0x1f5c(%ebp,%ecx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%edx
vmulss -0x1f5c(%ebp,%edx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%eax
vmovss %xmm0,-0xfb4(%ebp,%eax,4)
...
The look strikingly similar, because by default MSVC 2013 will assume SSE2 availability. If you change the instruction set to IA32, you'll get something with non-vector instructions. So, this is not an issue with the compiler/compilation unit.
The issue here, is RtDouble is defined in a header file as a non-specialized template (perfectly legal). The compiler assumes its definition across multiple translation units will be the same, but, by compiling with different options, that assumption is being violated. It's essentially no different than introducing a divergence with the preprocessor:
double.h:
template<typename T>
int RtDouble(T number)
{
#ifdef SUPER_BAD
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
#else
return 0;
#endif
}
fn_avx.cpp:
#include "fn_avx.h"
#define SUPER_BAD
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
The FnNormal then will just return 0 (and you can verify this with the the disassembly of the new fn_normal.obj). The linker happily chooses one, and does not warn you about either situation. The question then comes down to: should it? That would be extremely helpful in situations like this. However, it would also slow down linking, as it would need to do a comparison of all of the functions that could exist in multiple compilation units (eg. inline functions as well).
When I have faced a similar issue in my code, I choose a different function naming scheme for the optimized version vs. the non-optimized version. Using a template parameter to distinguish them would also work just as well (as suggested in #celtschk's answer).
Basically the compiler needs to minimize the space not mentioning that having the same template instantiated 2x could cause problems if there would be static members. So from what I know the compiler is processing the template either for every source code and then chooses one of the implementations, or it postpones the actual code generation to the link time. Either way it is a problem for this AVX thingy. I ended up solving it the old fashioned way - with some global definitions not depending on any templates or anything. For too complex applications this could be a huge problem though. Intel Compiler has a recently added pragma (I don't recall the exact name), that makes the function implemented right after it use just AVX instructions, which would solve the problem. How reliable it is, that I don't know.
I've worked around this problem successfully by forcing any templated functions that will be used with different compiler options in different source files to be inline. Just using the inline keyword is usually not sufficient, since the compiler will sometimes ignore it for functions larger than some threshold, so you have to force the compiler to do it.
In MSVC++:
template<typename T>
__forceinline int RtDouble(T number) {...}
GCC:
template<typename T>
inline __attribute__((always_inline)) int RtDouble(T number) {...}
Keep in mind you may have to forceinline any other functions that RtDouble may call within the same module in order to keep the compiler flags consistent in those functions as well. Also keep in mind that MSVC++ simply ignores __forceinline when optimizations are disabled, such as in debug builds, and in those cases this trick won't work, so expect different behavior in non-optimized builds. It can make things problematic to debug in any case, but it does indeed work so long as the compiler allows inlining.
I think the simplest solution is to let the compiler know that those functions are indeed intended to be different, by using a template parameter that does nothing but distinguish them:
File double.h:
template<bool avx, typename T>
int RtDouble(T number)
{
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
}
File fn_normal.cpp:
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble<false>(num);
}
File fn_avx.cpp:
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble<true>(num);
}
Related
Consider the following code snippet
#include <vector>
#include <cstdlib>
void __attribute__ ((noinline)) calculate1(double& a, int x) { a += x; };
void __attribute__ ((noinline)) calculate2(double& a, int x) { a *= x; };
void wrapper1(double& a, int x) { calculate1(a, x); }
void wrapper2(double& a, int x) { calculate2(a, x); }
typedef void (*Func)(double&, int);
int main()
{
std::vector<std::pair<double, Func>> pairs = {
std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
};
for (auto& [a, wrapper] : pairs)
(*wrapper)(a, 5);
return pairs[0].first + pairs[1].first;
}
With -O3 optimization the latest gcc and clang versions do not optimize the pointers to wrappers to pointers to underlying functions. See assembly here at line 22:
mov ebp, OFFSET FLAT:wrapper2(double&, int) # tmp118,
which results later in call + jmp, instead of just call had the compiler put a pointer to the calculate1 instead.
Note that I specifically asked for no-inlined calculate functions to illustrate; doing it without noinline results in another flavour of non-optimization where compiler will generate two identical functions to be called by pointer (so still won't optimize, just in a different fashion).
What am I missing here? Is there any way to guide the compiler short of manually plugging in the correct functions (without wrappers)?
Edit 1. Following suggestions in the comments, here is a disassembly with all functions declared static, with exactly the same result (call + jmp instead of call).
Edit 2. Much simpler example of the same pattern:
#include <vector>
#include <cstdlib>
typedef void (*Func)(double&, int);
static void __attribute__ ((noinline)) calculate(double& a, int x) { a += x; };
static void wrapper(double& a, int x) { calculate(a, x); }
int main() {
double a = 5.0;
Func f;
if (rand() % 2)
f = &wrapper; // f = &calculate;
else
f = &wrapper;
f(a, 0);
return 0;
}
gcc 8.2 successfully optimizes this code by throwing pointer to wrapper away and storing &calculate directly in its place (https://gcc.godbolt.org/z/nMIBeo). However changing the line as per comment (that is, performing part of the same optimization manually) breaks the magic and results in pointless jmp.
You seem to be suggesting that &calculate1 should be stored in the vector instead of &wrapper1. In general this is not possible: later code might try to compare the stored pointer against &calculate1 and that must compare false.
I further assume that your suggestion is that the compiler might try to do some static analysis and determine that the function pointers values in the vector are never compared for equality with other function pointers, and in fact that none of the other operations done on the vector elements would produce a change in observable behaviour; and therefore in this exact program it could store &calculate1 instead.
Usually the answer to "why does the compiler not perform some particular optimization" is that nobody has conceived of and implemented that idea. Another common reason is that the static analysis involved is, in the general case, quite difficult and might lead to a slowdown in compilation with no benefit in real programs where the analysis could not be guaranteed to succeed.
You are making a lot of assumptions here. Firstly, your syntax. The second is that compilers are perfect in the eye of the beholder and catch everything. The reality is that it is easy to find and hand optimize compiler output, it is not difficult to write small functions to trip up a compiler that you are well in tune with or write a decent size application and there will be places where you can hand tune. This is all known and expected. Then opinion comes in where on my machine my blah is faster than blah so it should have made these instructions instead.
gcc is not a great compiler for performance, on some targets it has been getting worse for a number of major revs. It is pretty good at what it does, better than pretty good, it deals with a number of pre processors/languages has a common middle and a number of backends. Some backends get better optimization applied front to back others are just hanging on for the ride.There were a number of other compilers that could produce code that could easily outperform gcc.
These were mostly pay-for compilers. More than an individual would pay out of pocket: used car prices, sometimes recurring annually.
There are things that gcc can optimize that are simply amazing and times that it totally goes in the wrong direction. Same goes for clang, often they do similar jobs with similar output, sometimes do some impressive things sometimes just go off into the weeds. I now find it more fun to manipulate the optimizer to make it do good or bad things rater than worry about why didn't it do what I "think" it should have done on a particular occasion. If I need that code faster I take the compiled output and hand fix it and use it as an assembly function.
You get what you pay for with gcc, if you were to look deep in its bowels you will find it is barely held together with duct tape and bailing wire (llvm is catching up). But for a free tool it does a simply amazing job, it is so widely used that you can get free support just about anywhere. We are sadly well into a time where folks think that because gcc interprets the language in a certain way that is how the language is defined and sadly that is not remotely true. But so many folks don't try other compilers to find out what "implementation defined" really means.
Last and most important, it's open source, if you want to "fix" an optimization then just do it. Keep that fix for yourself, post it, or try to push it upstream.
For constexpr functions the only option is to have recursive functions for anything but simple things. The problem with that is that recursive functions are expensive at run time (especially if you are going to be calling yourself a lot of times).
So is it possible to implement 2 functions one for constexpr and the other for normal use:
constexpr int fact(int x){ //Use this at compile time
return x == 0 ? 1 : fact(x-1)*x;
}
int fact(int x){ //Use this for real calls
int ret = 1;
for (int i = 1; i < x+1; i++){
ret *= i;
}
return ret;
}
And along the same lines can you make a special function for inline situations also?
Since C++14, the loop form is a valid constexpr as per (http://en.cppreference.com/w/cpp/language/constexpr), so the second form with constexpr added is valid.
Unfortunately not all compilers support this (The latest version of Visual C++ doesn't, but the latest Clang and GCC ones apparently do (but I am unable to test this)).
In which case you can either:
Rely on the compilers optimizations, and use the first version (you might want to test this for your specific compiler)
Give the two forms different names (such as fact_const for the constexpr function, and make sure you only use the constexpr version when it's arguments are also constexpr (I don't know how to actually check whether this is the case)
Wait till your compiler releases an update that supports this.
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I use a class called Year to return a very distant from now year. In case 1. I keep the class together with the main routine in the same file called YearAllTogether.cpp and in case 2. I place the class in a Year.cpp along with its corresponding header, Year.h. When I run the 1st case where everything is together (YearAllTogether.cpp), I get a runtime of 7.4e-05 sec whereas when I run the code where the class is in its own file and header, the runtime gets up to a huge 1.84526 sec. What is happening/what am I missing here? To get some measurable runtime, I use a for-loop to use the class 10^9 times. I post below the code for the two cases:
Case 1. Class and main in the same .cpp file, that is, YearAllTogether.cpp.
I compile with g++ -Wall -O3 YearAllTogether.cpp
#include <iostream>
#include <sys/time.h>
using namespace std;
class Year
{
private:
long m_nYear;
Year() { };
public:
Year(long nYear);
void SetYear(long nYear);
long GetYear(){return m_nYear;};
};
Year::Year(long nYear)
{
SetYear(nYear);
};
void Year::SetYear(long nYear)
{
m_nYear = nYear;
};
int main ()
{
struct timeval tvalBefore, tvalAfter;
gettimeofday (&tvalBefore, NULL);
Year long_after(0);
for (long i=1; i<=1000000000; i++)
{
Year temp_year(i);
long_after = temp_year;
}
cout<<long_after.GetYear()<<"\n";
gettimeofday (&tvalAfter, NULL);
double runtime = (((tvalAfter.tv_sec - tvalBefore.tv_sec)*1000000L
+tvalAfter.tv_usec) - tvalBefore.tv_usec)/1000000.;
cout << "TIME "<<runtime<<" sec"<<endl;
return 0;
}
Case 2. Class and main in different files, here I compile with g++ -Wall -O3 Year.cpp YearMain.cpp
Year.h
#ifndef YEAR_H
#define YEAR_H
class Year
{
private:
long m_nYear;
Year() { };
public:
Year(long nYear);
void SetYear(long nYear);
long GetYear(){return m_nYear;};
};
#endif
Year.cpp
#include "Year.h"
using namespace std;
Year::Year(long nYear)
{
SetYear(nYear);
};
void Year::SetYear(long nYear)
{
m_nYear = nYear;
};
and YearMain.cpp
#include <iostream>
#include <sys/time.h>
#include "Year.h"
using namespace std;
int main ()
{
struct timeval tvalBefore, tvalAfter;
gettimeofday (&tvalBefore, NULL);
Year long_after(0);
for (long i=1; i<=1000000000; i++)
{
Year temp_year(i);
long_after = temp_year;
}
cout<<long_after.GetYear()<<"\n";
gettimeofday (&tvalAfter, NULL);
double runtime = (((tvalAfter.tv_sec - tvalBefore.tv_sec)*1000000L
+tvalAfter.tv_usec) - tvalBefore.tv_usec)/1000000.;
cout << "TIME "<<runtime<<" sec"<<endl;
return 0;
}
UPDATE:
I have tested things on a more realistic ground as I wrote in the comments and this is what I found:
If we have only two classes, e.g. one to generate random numbers and another to generate 2-d vectors (which also adds and subtracts vectors) and we use in our main() a for-loop to generate in each iteration few 2-d vectors and add/subtract/operate on them somehow, then all-classes-in-one-file gives a better runtime by a 5-10%, depending on the operations on the vectors and their number within the for-loop.
However, if what we do in our main() is not simply iterating over a for-loop and we have more classes, say of order 5, with more complex operations, then the approach every-class-in-its-own-file has no significantly longer runtime compared to all-classes-in-one-file!
Many thanks for the insight!
Compiler optimizations can have impressive effects. Loop optimizations are among the most profitable ones, and what you're seeing is a standard optimization: if the compiler can prove that a loop has no other effects and performs only a tractable action, it can remove the loop entirely and replace it just by the resulting final state.
Clearly, in your case, inclusion of the class member definitions allows the compiler to see that there is no side effect in the Year constructor aand copy-assignment operator, so the only effect of the loop is to set the final value.
For a simpler demonstration, consider this code:
int main()
{
int val = 0;
for (int i = 1; i <= 10; ++i) { val += i; }
return val;
}
Let's look at mildly optimized code:
main:
.LFB0:
.cfi_startproc
mov eax, 55
ret
.cfi_endproc
As you can see, the compiler has figured out what the loop does.
In the separate-compilation case, the compiler must produce object code for the Year class methods and later link it against main(). This means the function calls will be done the normal way. But if you expose all the code in a single translation unit and compile it at once, the compiler (or, well, the optimizer) can see everything and build it for faster performance. Specifically, your constructor can be inlined if the compiler sees all the code, otherwise not. It's even possible that the compiler will completely eliminate the loop or at least the copies within it when it can see all your code.
If you want the same high performance with separate files, simply #include "Year.cpp" in your main file and compile just that one file. This will have the same effect (single translation unit) without cluttering up your main file. To do this "for real" you'd want to define all the methods as inline to avoid linker errors, though in a single-file program this won't matter.
In most cases, the distribution of functions to more than one file, has no impact on performance.
However, there are a few cases of optimization where the compiler can better optimize performance when all the functions are in the same module. One example is inline.
Automatic Inlining
If the compiler can find the implementation of a function in the same file (translation unit) as the calling function, the compiler may be able to inline the function. To inline means to replace the function call with the actual code of the called function. The compiler may invoke this on some of the more advanced optimization settings. This would be performed by the compiler.
One method around this is to declare and define small functions as inline in the header file. This is a suggestion to the compiler to paste the code in from the header file.
Linker Optimization
Some linker options allow for optimizations across translation units. These optimizations may counter any effects of distributing code among different files.
Negligible Loss
In most applications, the loss of performance is negligible. In most cases, functions are distributed to improve development and maintenance times, which are more costly than performance bottlenecks.
I suggest not to worry about any performance loss by distributing functions. After your program works correctly and robustly, then worry about performance. When you need to improve performance optimize first to find where the bottlenecks are and concentrate your performance optimizations in those areas. Most of them will not be related to function distribution.
Cross-procedure optimizations, the most important of which is inlining, require the optimizer to see into all related procedures at the same time. Luckily, the sweet spot for inlining generally corresponds to very short functions that don't deserve to be placed into a separate implementation file.
For your example, I would recommend a header-only implementation:
#ifndef YEAR_H
#define YEAR_H
class Year
{
private:
long m_nYear;
Year(void) : m_nYear(-1) {}
public:
Year(long nYear) : m_nYear(nYear) {}
void SetYear(long nYear) { m_nYear = nYear; }
long GetYear(void) { return m_nYear; }
};
#endif
Now you won't have that #include "some.cpp" code smell.
I've also removed useless semicolons and demonstrated the right way to initialize class subobjects, using the initializer list.
In most cases, prefer to put function bodies inside the class definition. A function that feels too long to put in the header file, is also unlikely to be inlined.
The downside is that you lose benefits of compile firewalling, as may be provided by pimpl. But you can't get aggressive cross-procedure optimization and minimal compile surface at the same time.
According to the GCC manual, the -fipa-pta optimization does:
-fipa-pta: Perform interprocedural pointer analysis and interprocedural modification and reference analysis. This option can cause excessive
memory and compile-time usage on large compilation units. It is not
enabled by default at any optimization level.
What I assume is that GCC tries to differentiate mutable and immutable data based on pointers and references used in a procedure. Can someone with more in-depth GCC knowledge explain what -fipa-pta does?
I think the word "interprocedural" is the key here.
I'm not intimately familiar with gcc's optimizer, but I've worked on optimizing compilers before. The following is somewhat speculative; take it with a small grain of salt, or confirm it with someone who knows gcc's internals.
An optimizing compiler typically performs analysis and optimization only within each individual function (or subroutine, or procedure, depending on the language). For example, given code like this contrived example:
double *ptr = ...;
void foo(void) {
...
*ptr = 123.456;
some_other_function();
printf("*ptr = %f\n", *ptr);
}
the optimizer will not be able to determine whether the value of *ptr has been changed by the call to some_other_function().
If interprocedural analysis is enabled, then the optimizer can analyze the behavior of some_other_function(), and it may be able to prove that it can't modify *ptr. Given such analysis, it can determine that the expression *ptr must still evaluate to 123.456, and in principle it could even replace the printf call with puts("ptr = 123.456");.
(In fact, with a small program similar to the above code snippet I got the same generated code with -O3 and -O3 -fipa-pta, so I'm probably missing something.)
Since a typical program contains a large number of functions, with a huge number of possible call sequences, this kind of analysis can be very expensive.
As quoted from this article:
The "-fipa-pta" optimization takes the bodies of the called functions into account when doing the analysis, so compiling
void __attribute__((noinline))
bar(int *x, int *y)
{
*x = *y;
}
int foo(void)
{
int a, b = 5;
bar(&a, &b);
return b + 10;
}
with -fipa-pta makes the compiler see that bar does not modify b, and the compiler optimizes foo by changing b+10 to 15
int foo(void)
{
int a, b = 5;
bar(&a, &b);
return 15;
}
A more relevant example is the “slow” code from the “Integer division is slow” blog post
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);
for (int i = 0; i < 1000000000; i++) {
volatile auto r = theIntDist(randGenerator);
}
Compiling this with -fipa-pta makes the compiler see that theIntDist is not modified within the loop, and the inlined code can thus be constant-folded in the same way as the “fast” version – with the result that it runs four times faster.
I'm doing a bit of hands on research surrounding the speed benefits of making a function inline. I don't have the book with me, but one text I was reading, was suggesting a fairly large overhead cost to making function calls; and when ever executable size is either negligible, or can be spared, a function should be declared inline, for speed.
I've written the following code to test this theory, and from what I can tell, there is no speed benifit from declaring a function as inline. Both functions, when called 4294967295 times, on my computer, execute in 196 seconds.
My question is, what would be your thoughts as to why this is happening? Is it modern compiler optimization? Would it be the lack of large calculations taking place in the function?
Any insight on the matter would be appreciated. Thanks in advance friends.
#include < iostream >
#include < time.h >
// RESEARCH Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
// Two functions that preform an identacle arbitrary floating point calculation
// one function is inline, the other is not.
double test(double a, double b, double c);
double inlineTest(double a, double b, double c);
double test(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
inline
double inlineTest(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
// ENTRY POINT Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
int main(){
const unsigned int maxUINT = -1;
clock_t start = clock();
//============================ NON-INLINE TEST ===============================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
clock_t end = clock();
std::cout << maxUINT << " calls to non inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
start = clock();
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
end = clock();
std::cout << maxUINT << " calls to inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
getchar(); // Wait for input.
return 0;
} // Main.
Assembly Output
PasteBin
The inline keyword is basically useless. It is a suggestion only. The compiler is free to ignore it and refuse to inline such a function, and it is also free to inline a function declared without the inline keyword.
If you are really interested in doing a test of function call overhead, you should check the resultant assembly to ensure that the function really was (or wasn't) inlined. I'm not intimately familiar with VC++, but it may have a compiler-specific method of forcing or prohibiting the inlining of a function (however the standard C++ inline keyword will not be it).
So I suppose the answer to the larger context of your investigation is: don't worry about explicit inlining. Modern compilers know when to inline and when not to, and will generally make better decisions about it than even very experienced programmers. That's why the inline keyword is often entirely ignored. You should not worry about explicitly forcing or prohibiting inlining of a function unless you have a very specific need to do so (as a result of profiling your program's execution and finding that a bottleneck could be solved by forcing an inline that the compiler has for some reason not done).
Re: the assembly:
; 30 : const unsigned int maxUINT = -1;
; 31 : clock_t start = clock();
mov esi, DWORD PTR __imp__clock
push edi
call esi
mov edi, eax
; 32 :
; 33 : //============================ NON-INLINE TEST ===============================//
; 34 : for(unsigned int i = 0; i < maxUINT; ++i)
; 35 : blank(1.1,2.2,3.3);
; 36 :
; 37 : clock_t end = clock();
call esi
This assembly is:
Reading the clock
Storing the clock value
Reading the clock again
Note what's missing: calling your function a whole bunch of times
The compiler has noticed that you don't do anything with the result of the function and that the function has no side-effects, so it is not being called at all.
You can likely get it to call the function anyway by compiling with optimizations off (in debug mode).
Both the functions could be inlined. The definition of the non-inline function is in the same compilation unit as the usage point, so the compiler is within its rights to inline it even without you asking.
Post the assembly and we can confirm it for you.
EDIT: the MSVC compiler pragma for banning inlining is:
#pragma auto_inline(off)
void myFunction() {
// ...
}
#pragma auto_inline(on)
Two things could be happening:
The compiler may either be inlining both or neither functions. Check your compiler documentation for how to control that.
Your function may be complex enough that the overhead of doing the function call isn't big enough to make a big difference in the tests.
Inlining is great for very small functions but it's not always better. Code bloat can prevent the CPU from caching code.
In general inline getter/setter functions and other one liners. Then during performance tuning you can try inlining functions if you think you'll get a boost.
Your code as posted contains a couple oddities.
1) The math and output of your test functions are completely independent of the function parameters. If the compiler is smart enough to detect that those functions always return the same value, that might give it incentive to optimize them out entirely inline or not.
2) Your main function is calling test for both the inline and non-inline tests. If this is the actual code that you ran, then that would have a rather large role to play in why you saw the same results.
As others have suggested, you would do well to examine the actual assembly code generated by the compiler to determine that you're actually testing what you intended to.
Um, shouldn't
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
be
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
inlineTest(1.1,2.2,3.3);
?
But if that was just a typo, would recommend that look at a dissassembler or reflector to see if the code is actually inline or still stack-ed.
If this test took 196 seconds for each loop, then you must not have turned optimizations on; with optimizations off, generally compilers don't inline anything.
With optimization on, however, the compiler is free to notice that your test function can be completely evaluated at compile time, and crush it down to "return [constant]" -- at which point, it may well decide to inline both functions since they're so trivial, and then notice that the loops are pointless since the function value is not used, and squash that out too! This is basically what I got when I tried it.
So either way, you're not testing what you thought you tested.
Function call overhead ain't what it used to be, compared to the overhead of blowing out the level-1 instruction cache, which is what aggressive inlining does to you. You can easily find reports online of gcc's -Os option (optimize for size) being a better default choice for large projects than -O2, and the big reason for that is that -O2 inlines a lot more aggressively. I would expect it is much the same with MSVC.
The only way I know of to guarantee a function is inline is to #define it
For example:
#define RADTODEG(x) ((x) * 57.29578)
That said, the only time I would bother with such a function would be in an embedded system. On a desktop/server the performance difference is negligible.
Run it in a debugger and have a look at the generated code to see if your function is always or never inlined. I think it's always a good idea to have a look at the assembler code when you want more knowledge about the optimization the compiler does.
Apologies for a small flame ...
Compilers think in assembly language. You should too. Whatever else you do, just step through the code at the assembler level. Then you'll know exactly what the compiler did.
Don't think of performance in absolute terms like "fast" or "slow". It's all relative, percentage-wise. The way software is made fast is by removing, in successive steps, things that take too large a percent of the time.
Here's the flame: If a compiler can do a pretty good job of inlining functions that clearly need it, and if it can do a really good job of managing registers, I think that's just what it should do. If it can do a reasonable job of unrolling loops that clearly could use it, I can live with that. If it's knocking itself out trying to outsmart me by removing function calls that I clearly wrote and intended to be called, or scrambling my code sanctimoniously trying to save a JMP when that JMP occupies 0.000001% of running time (the way Fortran does), I get annoyed, frankly.
There seems to be a notion in the compiler world that there's no such thing as an unhelpful optimization. No matter how smart the compiler is, real optimization is the programmer's job, and nobody else's.