All,
I'm writing some performance sensitive code, including a 3d vector class that will be doing lots of cross-products. As a long-time C++ programmer, I know all about the evils of macros and the various benefits of inline functions. I've long been under the impression that inline functions should be approximately the same speed as macros. However, in performance testing macro vs inline functions, I've come to an interesting discovery that I hope is the result of me making a stupid mistake somewhere: the macro version of my function appears to be over 8 times as fast as the inline version!
First, a ridiculously trimmed down version of a simple vector class:
class Vector3d
{
public:
double m_tX, m_tY, m_tZ;
Vector3d() : m_tX(0), m_tY(0), m_tZ(0) {}
Vector3d(const double &tX, const double &tY, const double &tZ):
m_tX(tX), m_tY(tY), m_tZ(tZ) {}
static inline void CrossAndAssign ( const Vector3d& cV1, const Vector3d& cV2, Vector3d& cV )
{
cV.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY;
cV.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ;
cV.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX;
}
#define FastVectorCrossAndAssign(cV1,cV2,cVOut) { \
cVOut.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY; \
cVOut.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ; \
cVOut.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX; }
};
Here's my sample benchmarking code:
Vector3d right;
Vector3d forward(1.0, 2.2, 3.6);
Vector3d up(3.2, 1.4, 23.6);
clock_t start = clock();
for (long l=0; l < 100000000; l++)
{
Vector3d::CrossAndAssign(forward, up, right); // static inline version
}
clock_t end = clock();
std::cout << end - start << endl;
clock_t start2 = clock();
for (long l=0; l<100000000; l++)
{
FastVectorCrossAndAssign(forward, up, right); // macro version
}
clock_t end2 = clock();
std::cout << end2 - start2 << endl;
The end result: With optimizations turned completely off, the inline version takes 3200 ticks, and the macro version 500 ticks... With optimization turned on (/O2, maximize speed, and other speed tweaks), I can get the inline version down to 1100 ticks, which is better but still not the same.
So I appeal to all of you: is this really true? Have I made a stupid mistake somewhere? Or are inline functions really this much slower -- and if so, why?
NOTE: After posting this answer, the original question was edited to remove this problem. I'll leave the answer as it is instructive on several levels.
The loops differ in what they do!
if we manually expand the macro, we get:
for (long l=0; l<100000000; l++)
right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;
Note the absense of curly brackets. So the compiler sees this as:
for (long l=0; l<100000000; l++)
{
right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
}
right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;
Which makes it obvious why the second loop is so much faster.
Udpate: This is also a good example of why macros are evil :)
please note that if you use the inline keyword, this is only a hint for the compiler. If you turn optimizations off, this might cause the compiler not to inline the function. You should go to Project Settings/C++/Optimization/ and make sure to turn Optimization on. What settings have you used for "Inline Function Expansion"?
it also depends optimizations and compiler settings. also look for your compiler's support for an always inline/force inline declaration. inlining is as fast as a macro.
by default, the keyword is a hint -- force inline/always inline (for the most part) returns the control to the programmer of the original intention of the keyword.
finally, gcc (for example) can be directed to inform you when such a function is not inlined as directed.
Apart from what Philipp mentioned, if your using MSVC, you can use __forceinline or the gcc __attrib__ equivalent to correct the probelems with inlining.
However, there is another possible problem lurking, using a macro will cause the parameters of the macro to be re-evaluated at each point, so if you call the macro like so:
FastVectorCrossAndAssign(getForward(), up, right);
it will expand to:
right.m_tX = getForward().m_tY * up.m_tZ - getForward().m_tZ * up.m_tY;
right.m_tY = getForward().m_tZ * up.m_tX - getForward().m_tX * up.m_tZ;
right.m_tZ = getForward().m_tX * up.m_tY - getForward().m_tY * up.m_tX;
not want you want when your concerned with speed :) (especially if getForward() isn't a lightweight function, or does some incrementing each call, if its an inline function, the compiler might fix the amount of calls, provided it isn't volatile, that still won't fix everything though)
Related
Here's two example. Set float values on a vector every time within a loop:
static constexpr float kFMScale = 10.0f / k2PIf; // somewhere...
for (int i = 0; i < numValues; i++) {
paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v - _mm_set1_ps(0.5f)), _mm_set1_ps(2.0f));
paramKnob.v = _mm_mul_ps(paramKnob.v, _mm_set1_ps(kFMScale));
// code...
}
or do it once and "reload" the same register every time:
static constexpr float kFMScale = 10.0f / k2PIf; // somewhere...
__m128 v05 = _mm_set1_ps(0.5f);
__m128 v2 = _mm_set1_ps(2.0f);
__m128 vFMScale = _mm_set1_ps(kFMScale);
for (int i = 0; i < numValues; i++) {
paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v, v05), v2);
paramKnob.v = _mm_mul_ps(paramKnob.v, vFMScale);
// code...
}
which one is generally the best and suited approch? I'll bet on the second, but vectorization most of the time fool me.
And what if I use them as const in the whole project instead of "within a block"? "load everywhere" instead of "set everywhere" would be better?
i think its all about cache missing rather than latency/throughput used by the operations.
I'm on a windows/64 bit machine, using FLAGS += -O3 -march=nocona -funsafe-math-optimizations
In general, for best performance you want to make the loop body as concise as possible and move any code invariant to the iteration out of the loop body. Whether a given compiler is able to do it for you depends on its ability to perform this kind of optimization and on the currently enabled optimization level.
Modern gcc and clang versions are typically smart enough to convert _mm_set1_ps and similar intrinsics with constant arguments to in-memory constants, so even without hoisting this results in a fairly efficient binary code. On the other hand, MSVC is not very smart in SIMD optimizations.
As a rule of thumb I would still recommend to move constant generation (except for all-zero or all-ones constants) out of the loop body. There are several reasons to do so, even if your compiler is capable to do so on its own:
You are not relying on the compiler to do the optimization, which makes your code more portable across compilers. The code will also perform more similarly between different optimization levels, which may be useful.
Moving the constant out of the loop body may convince the compiler to pre-load the constant before entering the loop instead of referencing it in-memory inside the loop. Again, this is subject to compiler's optimization capabilities and current optimization level.
The constants can be reused in multiple places, including multiple functions. This reduces the binary size and makes your code more cache-friendly in run time. (Some linkers are able to merge equivalent constants when linking the binary, but this feature is not universally supported, it is subject to the optimization level, and in some cases makes the code non-compliant to the C/C++ standards, which can affect code correctness. For this reason, this feature is normally disabled by default, even when supported.)
If you are going to declare a constant in namespace scope, it is highly recommended to use an aligned array and/or a union to ensure static initialization. Compilers may not be able to perform static initialization if you use intrinsics to initialize the constant. You can use something like the following:
template< typename T >
union mm_constant128
{
T as_array[sizeof(__m128i) / sizeof(T)];
__m128i as_m128i;
__m128 as_m128;
__m128d as_m128d;
operator __m128i () const noexcept { return as_m128i; }
operator __m128 () const noexcept { return as_m128; }
operator __m128d () const noexcept { return as_m128d; }
};
constexpr mm_constant128< float > k05 = {{ 0.5f, 0.5f, 0.5f, 0.5f }};
void foo()
{
// Use as follows:
_mm_sub_ps(paramKnob.v, k05)
}
If you are targeting a specific compiler at a given optimization level, you can inspect the generated assembler code to see whether the compiler is doing good enough optimization on its own.
Or do this:
Declare a compile-time constant __m128 outside your function, preferably in a "constants" namespace.
inline constexpr __m128 _val = {0.5f,0.5f,0.5f,0.5f};
//Your function here
This will save you from using the _mm_set1_ps function, which does not evaluate to 1 instruction, it is a rather inefficient sequence of instructions.
Since templates are resolved at compile-time, there will be little to no performance difference between this and the other answer. However, I think this code is significantly simpler.
I wrote following code to calculate intersecting points of two circles. The code is simple and fast enough. Not that I need more optimization, but I can think of optimizing this code more aggressively. For example h/d and 1.0/d are calculated twice (Let's forget about compiler optimizations).
const std::array<point,2> intersect(const circle& a,
const circle& b) {
std::array<point,2> intersect_points;
const float d2 = squared_distance(a.center, b.center);
const float d = std::sqrt(d2);
const float r12 = std::pow(a.radious, 2);
const float r22 = std::pow(b.radious, 2);
const float l = (r12 - r22 + d2) / (2*d);
const float h = std::sqrt(r12 - std::pow(l,2));
const float termx1 = (1.0/d) * (b.center.x - a.center.x) + a.center.x;
const float termx2 = (h/d)*(b.center.y - a.center.y);
const float termy1 = (1.0/d) * (b.center.y - a.center.y) + a.center.y;
const float termy2 = (h/d)*(b.center.x - a.center.x);
intersect_points[0].x = termx1 + termx2;
intersect_points[0].y = termy1 - termy2;
intersect_points[1].x = termx1 - termx2;
intersect_points[1].y = termy1 + termy2;
return intersect_points;
}
My question is how much we can trust C++ compilers (g++ here) to understand the code and optimize final binary? Can g++ avoid doing 1.0/d twice? More precisely I want to know where is the line. When we should leave fine tuning to compiler and when do we do optimization?
Popular compilers are pretty good in optimization nowadays.
It is very likely that the optimizer detects common expressions like 1.0/d, so don't care about this one.
It is much less likely that the optimizer replaces std:pow( x, 2 ) by x * x.
This depends of the exact function you use, the compiler version you are using and the optimization command line switches. So in this case, you're better off to write x * x.
It's hard to say how far an optimizer can go and when you as a human must take over, this depends on how "smart" the optimizer is. But as a rule of thumb, the compiler can ony optimize things it can deduct from the lines of code.
Example:
The compiler will know, that this term is always false: 1 == 2
But it can't know that this is always false as well: 1 == nextPrimeAfter(1), because therefore it would have to have knowledge about what the function nextPrimeAfter() does.
I don't understand why the result is going to be 36. Can somebody explain to me what is happening here and what the preprocessor does?
#include <iostream>
#define QUADRAT(x) ((x) * (x))
using namespace std;
int main()
{
double no = 4.0;
double result = QUADRAT(++no);
cout << result;
return 0;
}
Thanks alot :>
The preprocessor will replace QUADRAT(++no) with ((++no) * (++no)) in that example.
That would increment no twice, if it weren't for the fact that there are no sequence points between the two increments, so you are actually causing undefined behaviour. Any output you see is valid because no one can tell what will happen.
The preprocessor is a basically a copy-and-paste engine. Note that a macro is not a function; instead, it's expanded inline. Consider what happens to your code when the macro is expanded.
This line:
double result = QUADRAT(++no);
gets expanded to this:
double result = ((++no) * (++no));
The way that this ends up running is equivalent to:
no = no + 1;
no = no + 1;
result = no * no;
It runs this way because the increments are performed before the multiplication: the preprocessor does a textual copy of what you pass it, so it copies the "++no" so that it shows up twice in the final code, and the increment of each ++no happens before the result is computed. The way to fix this is to use an inline function:
inline double QUADRAT(double x) { return x * x; }
Most modern compilers will expand this code out without doing a textual substitution - they'll give you something as fast as a preprocessor definition, but without the danger of having issues like the one you are facing.
What happens is a sort of literal substitution of ++no in place of x, it is like you have written:
double result = ((++no) * (++no));
What is the result... it should be undefined behaviour (you get 36 by chance), and g++ with -Wall agrees with me.
I have the following looking code in VC++:
for (int i = (a - 1) * b; i < a * b && i < someObject->someFunction(); i++)
{
// ...
}
As far as I know compilers optimize all these arithmetic operations and they won't be executed on each loop, but I'm not sure if they can tell that the function above also returns the same value each time and it doesn't need to be called each time.
Is it a better practice to save all calculations into variables, or just rely on compiler optimizations to have a more readable code?
int start = (a - 1) * b;
int expra = a * b;
int exprb = someObject->someFunction();
for (int i = startl i < expra && i < exprb; i++)
{
// ...
}
Short answer: it depends. If the compiler can deduce that running someObject->someFunction() every time and caching the result once both produce the same effects, it is allowed (but not guaranteed) to do so. Whether this static analysis is possible depends on your program: specifically, what the static type of someObject is and what its dynamic type is expected to be, as well as what someFunction() actually does, whether it's virtual, and so on.
In general, if it only needs to be done once, write your code in such a way that it can only be done once, bypassing the need to worry about what the compiler might be doing:
int start = (a - 1) * b;
int expra = a * b;
int exprb = someObject->someFunction();
for (int i = start; i < expra && i < exprb; i++)
// ...
Or, if you're into being concise:
for (int i = (a - 1) * b, expra = a * b, exprb = someObject->someFunction();
i < expra && i < exprb; i++)
// ...
From my experience VC++ compiler won't optimize the function call out unless it can see the function implementation at the point of compiling the calling code. So moving the call outside the loop is a good idea.
If a function resides within the same compilation unit as its caller, the compiler can often deduce some facts about it - e.g. that its output might not change for subsequent calls. In general, however, that is not the case.
In your example, assigning variables for these simple arithmetic expressions does not really change anything with regards to the produced object code and, in my opinion, makes the code less readable. Unless you have a bunch of long expressions that cannot reasonably be put within a line or two, you should avoid using temporary variables - if for no other reason, then just to reduce namespace pollution.
Using temporary variables implies a significant management overhead for the programmer, in order to keep them separate and avoid unintended side-effects. It also makes reusing code snippets harder.
On the other hand, assigning the result of the function to a variable can help the compiler optimise your code better by explicitly avoiding multiple function calls.
Personally, I would go with this:
int expr = someObject->someFunction();
for (int i = (a - 1) * b; i < a * b && i < expr; i++)
{
// ...
}
The compiler cannot make any assumption on whether your function will return the same value at each time. Let's imagine that your object is a socket, how could the compiler possibly know what will be its output?
Also, the optimization that a compiler can make in such loops strongly depends on the whether a and b are declared as const or not, and whether or not they are local. With advanced optimization schemes, it may be able to infer that a and b are neither modified in the loop nor in your function (again, you might imagine that your object holds some reference to them).
Well, in short: go for the second version of your code!
It is very likely that the compiler will call the function each time.
If you are concerned with the readability of code, what about using:
int maxindex = min (expra, exprb);
for (i=start; i<maxindex; i++)
IMHO, long lines does not improve readability.
Writing short lines and doing multiple step to get a result, does not impact the performance, this is exactly why we use compilers.
Effectively what you might be asking is whether the compiler will inline the function someFunction() and whether it will see that someObject is the same instance in each loop, and if it does both it will potentially "cache" the return value and not keep re-evaluating it.
Much of this may depend on what optimisation settings you use, with VC++ as well as any other compiler, although I am not sure VC++ gives you quite as many flags as gnu.
I often find it incredible that programmers rely on compilers to optimise things they can easily optimise themselves. Just move the expression to the first section of the for-loop if you know it will evaluate the same each time:
Just do this and don't rely on the compiler:
for (int i = (a - 1) * b, iMax = someObject->someFunction();
i < a * b && i < iMax; ++i)
{
// body
}
I'm doing a bit of hands on research surrounding the speed benefits of making a function inline. I don't have the book with me, but one text I was reading, was suggesting a fairly large overhead cost to making function calls; and when ever executable size is either negligible, or can be spared, a function should be declared inline, for speed.
I've written the following code to test this theory, and from what I can tell, there is no speed benifit from declaring a function as inline. Both functions, when called 4294967295 times, on my computer, execute in 196 seconds.
My question is, what would be your thoughts as to why this is happening? Is it modern compiler optimization? Would it be the lack of large calculations taking place in the function?
Any insight on the matter would be appreciated. Thanks in advance friends.
#include < iostream >
#include < time.h >
// RESEARCH Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
// Two functions that preform an identacle arbitrary floating point calculation
// one function is inline, the other is not.
double test(double a, double b, double c);
double inlineTest(double a, double b, double c);
double test(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
inline
double inlineTest(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
// ENTRY POINT Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
int main(){
const unsigned int maxUINT = -1;
clock_t start = clock();
//============================ NON-INLINE TEST ===============================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
clock_t end = clock();
std::cout << maxUINT << " calls to non inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
start = clock();
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
end = clock();
std::cout << maxUINT << " calls to inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
getchar(); // Wait for input.
return 0;
} // Main.
Assembly Output
PasteBin
The inline keyword is basically useless. It is a suggestion only. The compiler is free to ignore it and refuse to inline such a function, and it is also free to inline a function declared without the inline keyword.
If you are really interested in doing a test of function call overhead, you should check the resultant assembly to ensure that the function really was (or wasn't) inlined. I'm not intimately familiar with VC++, but it may have a compiler-specific method of forcing or prohibiting the inlining of a function (however the standard C++ inline keyword will not be it).
So I suppose the answer to the larger context of your investigation is: don't worry about explicit inlining. Modern compilers know when to inline and when not to, and will generally make better decisions about it than even very experienced programmers. That's why the inline keyword is often entirely ignored. You should not worry about explicitly forcing or prohibiting inlining of a function unless you have a very specific need to do so (as a result of profiling your program's execution and finding that a bottleneck could be solved by forcing an inline that the compiler has for some reason not done).
Re: the assembly:
; 30 : const unsigned int maxUINT = -1;
; 31 : clock_t start = clock();
mov esi, DWORD PTR __imp__clock
push edi
call esi
mov edi, eax
; 32 :
; 33 : //============================ NON-INLINE TEST ===============================//
; 34 : for(unsigned int i = 0; i < maxUINT; ++i)
; 35 : blank(1.1,2.2,3.3);
; 36 :
; 37 : clock_t end = clock();
call esi
This assembly is:
Reading the clock
Storing the clock value
Reading the clock again
Note what's missing: calling your function a whole bunch of times
The compiler has noticed that you don't do anything with the result of the function and that the function has no side-effects, so it is not being called at all.
You can likely get it to call the function anyway by compiling with optimizations off (in debug mode).
Both the functions could be inlined. The definition of the non-inline function is in the same compilation unit as the usage point, so the compiler is within its rights to inline it even without you asking.
Post the assembly and we can confirm it for you.
EDIT: the MSVC compiler pragma for banning inlining is:
#pragma auto_inline(off)
void myFunction() {
// ...
}
#pragma auto_inline(on)
Two things could be happening:
The compiler may either be inlining both or neither functions. Check your compiler documentation for how to control that.
Your function may be complex enough that the overhead of doing the function call isn't big enough to make a big difference in the tests.
Inlining is great for very small functions but it's not always better. Code bloat can prevent the CPU from caching code.
In general inline getter/setter functions and other one liners. Then during performance tuning you can try inlining functions if you think you'll get a boost.
Your code as posted contains a couple oddities.
1) The math and output of your test functions are completely independent of the function parameters. If the compiler is smart enough to detect that those functions always return the same value, that might give it incentive to optimize them out entirely inline or not.
2) Your main function is calling test for both the inline and non-inline tests. If this is the actual code that you ran, then that would have a rather large role to play in why you saw the same results.
As others have suggested, you would do well to examine the actual assembly code generated by the compiler to determine that you're actually testing what you intended to.
Um, shouldn't
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
be
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
inlineTest(1.1,2.2,3.3);
?
But if that was just a typo, would recommend that look at a dissassembler or reflector to see if the code is actually inline or still stack-ed.
If this test took 196 seconds for each loop, then you must not have turned optimizations on; with optimizations off, generally compilers don't inline anything.
With optimization on, however, the compiler is free to notice that your test function can be completely evaluated at compile time, and crush it down to "return [constant]" -- at which point, it may well decide to inline both functions since they're so trivial, and then notice that the loops are pointless since the function value is not used, and squash that out too! This is basically what I got when I tried it.
So either way, you're not testing what you thought you tested.
Function call overhead ain't what it used to be, compared to the overhead of blowing out the level-1 instruction cache, which is what aggressive inlining does to you. You can easily find reports online of gcc's -Os option (optimize for size) being a better default choice for large projects than -O2, and the big reason for that is that -O2 inlines a lot more aggressively. I would expect it is much the same with MSVC.
The only way I know of to guarantee a function is inline is to #define it
For example:
#define RADTODEG(x) ((x) * 57.29578)
That said, the only time I would bother with such a function would be in an embedded system. On a desktop/server the performance difference is negligible.
Run it in a debugger and have a look at the generated code to see if your function is always or never inlined. I think it's always a good idea to have a look at the assembler code when you want more knowledge about the optimization the compiler does.
Apologies for a small flame ...
Compilers think in assembly language. You should too. Whatever else you do, just step through the code at the assembler level. Then you'll know exactly what the compiler did.
Don't think of performance in absolute terms like "fast" or "slow". It's all relative, percentage-wise. The way software is made fast is by removing, in successive steps, things that take too large a percent of the time.
Here's the flame: If a compiler can do a pretty good job of inlining functions that clearly need it, and if it can do a really good job of managing registers, I think that's just what it should do. If it can do a reasonable job of unrolling loops that clearly could use it, I can live with that. If it's knocking itself out trying to outsmart me by removing function calls that I clearly wrote and intended to be called, or scrambling my code sanctimoniously trying to save a JMP when that JMP occupies 0.000001% of running time (the way Fortran does), I get annoyed, frankly.
There seems to be a notion in the compiler world that there's no such thing as an unhelpful optimization. No matter how smart the compiler is, real optimization is the programmer's job, and nobody else's.