Recently I've noticed that in a project with several source files (file1.cpp, file2.cpp, ...), execution time can depend on whether a function A, called by another function B, is defined in the same source file as B or not.
In my case, when both are defined in the same file1.cpp, function B takes about 90% of execution time, and profiler analysis does not return execution time for function A (called by B).
BUT if they are defined in separate files, execution time increases by ~150%, and function A takes ~65% of the time while B takes just ~25% (about 90% in total).
Why has execution time increased? Does the location of a function's definition affect how it is called? I can't figure it out.
I should say at this point that I'm compiling with optimization level 3, so function A should be inlined into B in both cases.
EDIT: I'm using Linux Ubuntu 14.04, and I compile with g++ and the following flags: -O3 -pg -ggdb -Wall -Wextra -std=c++11.
I include A and B below so it can be better understood. As you can see, A is called from B through another function, C, but that one does not seem to be the problem:
A:
size_t A (const Matrix& P, size_t ID) {
    size_t j(0);
    while (P[j][0]!=ID) {
        ++j;
    }
    return j;
}
B:
Matrix B (const Matrix& P, const Matrix& O, Matrix* pREL, double d, const Vector& f) {
    size_t length (O.size());
    Matrix oREL ( *pREL );
    for (size_t i(0); i<length; ++i) {
        for (size_t j(0); j<=i; ++j) {
            double fi(f[O[i][0]-1]);
            if (f.size()==1) fi = 0.0;
            if (i!=j) {
                double gAC, gAD, gBC, gBD, fj(f[O[j][0]-1]);
                if (f.size()==1) fj = 0.0;
                gAC = C(pREL,P,O,i,j,dcol,dcol);
                gAD = C(pREL,P,O,i,j,dcol,scol);
                gBC = C(pREL,P,O,i,j,scol,dcol);
                gBD = C(pREL,P,O,i,j,scol,scol);
                oREL[i][j] = 0.25 * (gAC + gAD + gBC + gBD)
                                  * (1 - d*(fi+fj));
            } else if (i==j) oREL[i][i] = 0.5 * ( 1.0+C(pREL,P,O,i,i,dcol,scol) )
                                              * (1.0-2.0*d*fi);
        }
    }
    delete pREL;
    return oREL;
}
C:
coefficient C (Matrix* pREL, const Matrix& P, const Matrix& O,
               size_t coord1, size_t coord2, unsigned p1, unsigned p2) {
    double g;
    size_t i, j;
    i = A(P,O[coord1][p1]);
    j = A(P,O[coord2][p2]);
    if (i<=j) g = (*pREL)[j][i];
    if (i>j ) g = (*pREL)[i][j];
    return g;
}
Yes. The compiler can only inline a function when it knows the function definition at the point of inlining. It may not know it if you place it in another compilation unit. In your case I'd assume the compiler is "thinking": this function is being called, but I don't know where it is yet, so I emit a normal call and let the linker worry about it later.
For that reason code that should be inlined is very often placed in header files.
First, when you care about performance and benchmarking, you should enable optimizations in your compiler.
I assume you are using GCC on Linux so you compile with g++.
You should compile with g++ -Wall -O2 if you care about performance.
You could enable link-time optimization by compiling and linking with g++ -Wall -O2 -flto.
Of course you could use -O3 instead of -O2.
Related
In the following example of templated function, is the central if inside the for loop guaranteed to be optimized out, leaving the used instructions only?
If this is not guaranteed to be optimized (in GCC 4, MSVC 2013 and llvm 8.0), what are the alternatives, using C++11 at most?
NOTE that this function does nothing usable, and I know that this specific function can be optimized in several ways and so on. But all I want to focus is on how the bool template argument works in generating code.
template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    float ret = (IsMin ? std::numeric_limits<float>::max() : -std::numeric_limits<float>::max());
    for (int x = 0; x < arraySize; x++) {
        // Is this code optimized by the compiler to skip the unnecessary if?
        if (IsMin) {
            if (ret > vals[x]) ret = vals[x];
        } else {
            if (ret < vals[x]) ret = vals[x];
        }
    }
    return ret;
}
In theory no. The C++ standard permits compilers to be not just dumb, but downright hostile. It could inject code doing useless stuff for no reason, so long as the abstract machine behaviour remains the same.1
In practice, yes. Dead code elimination and constant branch detection are easy, and every single compiler I have ever checked eliminates that if branch.
Note that both branches are compiled before one is eliminated, so they both must be fully valid code. The output assembly behaves "as if" both branches exist, but the branch instruction (and unreachable code) is not an observable feature of the abstract machine behaviour.
Naturally if you do not optimize, the branch and dead code may be left in, so you can move the instruction pointer into the "dead code" with your debugger.
1 As an example, nothing prevents a compiler from implementing a+b as a loop calling inc in assembly, or a*b as a loop adding a repeatedly. This is a hostile act by the compiler on almost all platforms, but not banned by the standard.
There is no guarantee that it will be optimized away. There is a pretty good chance that it will be though since it is a compile time constant.
That said, C++17 gives us if constexpr, which will only compile the branch that passes the check. If you want a guarantee then I would suggest you use this feature instead.
Before C++17, if you only want one part of the code to be compiled, you would need to specialize the function and write only the code that pertains to that specialization.
Since you ask for an alternative in C++11, here is one:
float IterateOverArrayImpl(float* vals, int arraySize, std::false_type)
{
    float ret = -std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret < vals[x])
            ret = vals[x];
    }
    return ret;
}

float IterateOverArrayImpl(float* vals, int arraySize, std::true_type)
{
    float ret = std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret > vals[x])
            ret = vals[x];
    }
    return ret;
}

template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    return IterateOverArrayImpl(vals, arraySize, std::integral_constant<bool, IsMin>());
}
You can see it live here.
The idea is to use function overloading to handle the test.
If I do something like this:
static int counter = 0;
counter = std::min(8, counter++);
I get a warning with g++ saying:
operation on 'counter' may be undefined [-Wsequence-point]
This works fine:
static int counter = 0;
counter++;
counter = std::min(8, counter);
Results are the same with ++counter and/or std::max.
I can't work out what's wrong with the first version. Just as an example, I get no warning when using functions from GLM.
Can anyone explain this a little bit for me? I'm using GCC 4.8 on Ubuntu 14.04.
EDIT: A bit more testing (which I should have done first)
If I do cntr = std::min(8, ++cntr);, as I am in my actual application, printing the value after this line results in 1, then 2, then 3, etc. HOWEVER, if I do cntr = std::min(8, cntr++);, the value is 0 EVERY TIME; it never increases at all.
I think Mike Seymour's comment is correct - there should be no UB, because the function argument expressions are sequenced before the function call. From 1.9/15:
When calling a function (whether or not the function is inline), every value computation and side effect associated with any argument expression, or with the postfix expression designating the called function, is sequenced before execution of every expression or statement in the body of the called function.
So gcc is probably incorrect. Note also that in some other cases, the warnings don't show up:
int f(int a, int b);
template <typename T> T g(T a, T b);
template <typename T> const T& h(const T& a, const T& b);
counter = f(8, counter++); // no warning
counter = g(8, counter++); // no warning
counter = h(8, counter++); // gcc gives warning - there should be nothing special
// about h as compared to f,g... so warning is likely
// incorrect
clang gives no warning on any of these.
Given that the function arguments are sequenced before the call, that explains why this:
int counter = 0;
counter = std::min(8, counter++);
Always returns 0. That code is equivalent to:
counter = 0;
int a1 = 8; // these two can be evaluated in either order
int a2 = counter++; // either way, a2 == 0, counter == 1
counter = std::min(a1, a2); // which is min(8, 0) == 0.
Motivation
I created a header file which wraps Matlab's mex functionality in c++11 classes; especially for MxNxC images. Two functions I created are forEach, which iterates over each pixel in the image, and also a forKernel, which given a kernel and pixel in the image, iterates over the kernel around that pixel, handling all kinds of nifty, boiler-plate indexing mathematics.
The idea is that one could program sliding-windows like this:
image.forEach([](Image &image, size_t row, size_t col) {
    // kr and kc specify which pixel is the center of the kernel
    image.forKernel<double>(row, col, kernel, kr, kc, [](Image &image, double w, size_t row, size_t col) {
        // w is the weight/coefficient of the kernel, row/col are the corresponding coordinates in the image.
        // process ...
    });
});
Problem
This provides a nice way to
increase readability: the two function calls are a lot clearer than the corresponding 4 for-loops to do the same,
stay flexible: lambda functions allow you to scope all kinds of variables by value or reference, which are invisible to the implementer of forEach / forKernel, and
increase execution time, unfortunately: this executes around 8x slower than using just for loops.
The latter point is the problem, of course. I was hoping g++ would be able to optimize the lambda-functions out and inline all the code. This does not happen. Hence I created a minimal working example on 1D data:
#include <iostream>
#include <functional>

struct Data {
    size_t d_size;
    double *d_data;
    Data(size_t size) : d_size(size), d_data(new double[size]) {}
    ~Data() { delete[] d_data; }
    double &operator[](size_t i) { return d_data[i]; }
    inline void forEach(std::function<void(Data &, size_t)> f) {
        for (size_t index = 0; index != d_size; ++index)
            f(*this, index);
    }
};

int main() {
    Data im(50000000);
    im.forEach([](Data &im, size_t i) {
        im[i] = static_cast<double>(i);
    });
    double sum = 0;
    im.forEach([&sum](Data &im, size_t i) {
        sum += im[i];
    });
    std::cout << sum << '\n';
}
source: http://ideone.com/hviTwx
I'm guessing the compiler is not able to specialize the code of forEach per lambda function, as the lambda function is not a template parameter. The good thing is that one can compile once and link to it more often with different lambda functions, but the bad thing is that it is slow.
Moreover, the situation discussed in the motivation already contains templates for the data type (double, int, ...), hence the 'good thing' is overruled anyway.
A fast way to implement the previous would be like this:
#include <iostream>
#include <functional>

struct Data {
    size_t d_size;
    double *d_data;
    Data(size_t size) : d_size(size), d_data(new double[size]) {}
    ~Data() { delete[] d_data; }
    double &operator[](size_t i) { return d_data[i]; }
};

int main() {
    size_t len = 50000000;
    Data im(len);
    for (size_t index = 0; index != len; ++index)
        im[index] = static_cast<double>(index);
    double sum = 0;
    for (size_t index = 0; index != len; ++index)
        sum += im[index];
    std::cout << sum << '\n';
}
source: http://ideone.com/UajMMz
It is about 8x faster, but also less readable, especially when we consider more complicated structures like images with kernels.
Question
Is there a way to provide the lambda function as a template argument, such that forEach is compiled for each call, and optimized for each specific instance of the lambda function? Can the lambda function be inlined somehow, since lambda functions are typically not recursive this should be trivial, but what is the syntax?
I found some related posts:
Why C++ lambda is slower than ordinary function when called multiple times?
Understanding the overhead of lambda functions in C++11
C++0x Lambda overhead
But they do not give a solution in the form of a minimal working example, and they do not discuss the possibility of inlining a lambda function. The answer to my question should do that: change the Data::forEach member function and its call such that it is as fast as possible / allows for as many optimizations as possible (not optimizations at run time, but compile-time optimizations that decrease runtime).
Regarding the suggestion of forEveR
Thank you for creating that fix, it's a huge improvement yet still approximately 2x as slow:
test0.cc: http://ideone.com/hviTwx
test1.cc: http://ideone.com/UajMMz
test2.cc: http://ideone.com/8kR3Mw
Results:
herbert@machine ~ $ g++ -std=c++11 -Wall test0.cc -o test0
herbert@machine ~ $ g++ -std=c++11 -Wall test1.cc -o test1
herbert@machine ~ $ g++ -std=c++11 -Wall test2.cc -o test2
herbert@machine ~ $ time ./test0
1.25e+15

real    0m2.563s
user    0m2.541s
sys     0m0.024s
herbert@machine ~ $ time ./test1
1.25e+15

real    0m0.346s
user    0m0.320s
sys     0m0.026s
herbert@machine ~ $ time ./test2
1.25e+15

real    0m0.601s
user    0m0.575s
sys     0m0.026s
herbert@machine ~ $
I re-ran the code with -O2, which fixes the problem. Runtimes of test1 and test2 are now very similar. Thank you @stijn and @forEveR.
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test0.cc -o test0
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test1.cc -o test1
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test2.cc -o test2
herbert@machine ~ $ time ./test0
1.25e+15

real    0m0.256s
user    0m0.229s
sys     0m0.028s
herbert@machine ~ $ time ./test1
1.25e+15

real    0m0.111s
user    0m0.078s
sys     0m0.033s
herbert@machine ~ $ time ./test2
1.25e+15

real    0m0.108s
user    0m0.076s
sys     0m0.032s
herbert@machine ~ $
The problem is that you use std::function, which actually uses type erasure and virtual calls.
You can simply use a template parameter instead of std::function. The call of the lambda function will then be inlined, due to n3376 5.1.2/5:
The closure type for a lambda-expression has a public inline function call operator (13.5.4) whose parameters and return type are described by the lambda-expression's parameter-declaration-clause and trailing-return-type respectively
So, simply write:
template<typename Function>
inline void forEach(Function f) {
    for (size_t index = 0; index != d_size; ++index)
        f(*this, index);
}
Live example
I was messing around with tail-recursive functions in C++, and I've run into a bit of a snag with the g++ compiler.
The following code results in a stack overflow when numbers[] is over a couple hundred integers in size. Examining the assembly code generated by g++ for the following reveals that twoSum_Helper is executing a recursive call instruction to itself.
The question is which of the following is causing this?
A mistake in the following that I am overlooking which prevents tail-recursion.
A mistake with my usage of g++.
A flaw in the detection of tail-recursive functions within the g++ compiler.
I am compiling with g++ -O3 -Wall -fno-stack-protector test.c on Windows Vista x64 via MinGW with g++ 4.5.0.
struct result
{
    int i;
    int j;
    bool found;
};

struct result gen_Result(int i, int j, bool found)
{
    struct result r;
    r.i = i;
    r.j = j;
    r.found = found;
    return r;
}

// Return 2 indexes from numbers that sum up to target.
struct result twoSum_Helper(int numbers[], int size, int target, int i, int j)
{
    if (numbers[i] + numbers[j] == target)
        return gen_Result(i, j, true);
    if (i >= (size - 1))
        return gen_Result(i, j, false);
    if (j >= size)
        return twoSum_Helper(numbers, size, target, i + 1, i + 2);
    else
        return twoSum_Helper(numbers, size, target, i, j + 1);
}
Tail call optimization in C or C++ is extremely limited, and pretty much a lost cause. The reason is that there generally is no safe way to tail-call from a function that passes a pointer or reference to any local variable (as an argument to the call in question, or in fact any other call in the same function) -- which of course is happening all over the place in C/C++ land, and is almost impossible to live without.
The problem you are seeing is probably related: GCC likely compiles returning a struct by actually passing the address of a hidden variable allocated on the caller's stack into which the callee copies it -- which makes it fall into the above scenario.
Try compiling with -O2 instead of -O3.
How do I check if gcc is performing tail-recursion optimization?
Well, it doesn't work with -O2 anyway. The only thing that seems to work is returning the result object through a reference that is given as a parameter.
But really, it's much easier to just remove the tail call and use a loop instead. TCO is there to optimize tail calls that are found when inlining or when performing aggressive unrolling, but you shouldn't attempt to use recursion when handling large values anyway.
I can't get g++ 4.4.0 (under mingw) to perform tail recursion, even on this simple function:
static void f (int x)
{
    if (x == 0) return;
    printf ("%p\n", &x); // or cout in C++, if you prefer
    f (x - 1);
}
I've tried -O3, -O2, -fno-stack-protector, C and C++ variants. No tail recursion.
I would look at two things.
The return call in the if statement is going to have a branch target for the else in the stack frame of the current run of the function that needs to be resolved after the call (which would mean any TCO attempt would not be able to overwrite the executing stack frame, thus negating the TCO).
The numbers[] array argument is a variable-length data structure, which could also prevent TCO, because in TCO the same stack frame is reused in one way or another. If the call is self-referencing (like yours) then it will overwrite the stack-defined (or locally defined) variables with the values/references of the new call. If the tail call is to another function then it will overwrite the entire stack frame with the new function (in a case where TCO can be done in A => B => C, TCO could make this look like A => C in memory during execution). I would try a pointer.
It has been a couple of months since I built anything in C++, so I didn't run any tests, but I think one or both of those are preventing the optimization.
Try changing your code to:
// Return 2 indexes from numbers that sum up to target.
struct result twoSum_Helper(int numbers[], int size, int target, int i, int j)
{
    if (j >= size) {
        i++; // call by value, changing i here does not matter
        j = i + 1;
    }
    if (i >= (size - 1))
        return gen_Result(i, j, false);
    if (numbers[i] + numbers[j] == target)
        return gen_Result(i, j, true);
    return twoSum_Helper(numbers, size, target, i, j + 1); // single tail-call site
}
edit: removed unnecessary parameter as per comment from asker
// Return 2 indexes from numbers that sum up to target.
// (Note: this simplified variant only scans adjacent pairs.)
struct result twoSum_Helper(int numbers[], int size, int target, int i)
{
    if (i >= (size - 1))
        return gen_Result(i, i + 1, false);
    if (numbers[i] + numbers[i + 1] == target)
        return gen_Result(i, i + 1, true);
    return twoSum_Helper(numbers, size, target, i + 1); // call by value
}
Support of Tail Call Optimization (TCO) is limited in C/C++.
So, if the code relies on TCO to avoid stack overflow it may be better to rewrite it with a loop. Otherwise some auto test is needed to be sure that the code is optimized.
Typically TCO may be suppressed by:
passing pointers to objects on stack of recursive function to external functions (in case of C++ also passing such object by reference);
local object with non-trivial destructor even if the tail recursion is valid (the destructor is called before the tail return statement), for example Why isn't g++ tail call optimizing while gcc is?
Here TCO is confused by returning structure by value.
It can be fixed if the result of all recursive calls will be written to the same memory address allocated in other function twoSum (similarly to the answer https://stackoverflow.com/a/30090390/4023446 to Tail-recursion not happening)
struct result
{
    int i;
    int j;
    bool found;
};

struct result gen_Result(int i, int j, bool found)
{
    struct result r;
    r.i = i;
    r.j = j;
    r.found = found;
    return r;
}

struct result* twoSum_Helper(int numbers[], int size, int target,
                             int i, int j, struct result* res_)
{
    if (i >= (size - 1)) {
        *res_ = gen_Result(i, j, false);
        return res_;
    }
    if (numbers[i] + numbers[j] == target) {
        *res_ = gen_Result(i, j, true);
        return res_;
    }
    if (j >= size)
        return twoSum_Helper(numbers, size, target, i + 1, i + 2, res_);
    else
        return twoSum_Helper(numbers, size, target, i, j + 1, res_);
}

// Return 2 indexes from numbers that sum up to target.
struct result twoSum(int numbers[], int size, int target)
{
    struct result r;
    return *twoSum_Helper(numbers, size, target, 0, 1, &r);
}
The value of res_ pointer is constant for all recursive calls of twoSum_Helper.
It can be seen in the assembly output (the -S flag) that the twoSum_Helper tail recursion is optimized as a loop even with two recursive exit points.
Compile options: g++ -O2 -S (g++ version 4.7.2).
I have heard others complain that tail recursion is only optimized with gcc and not g++.
Could you try using gcc?
Since the code of twoSum_Helper is calling itself it shouldn't come as a surprise that the assembly shows exactly that happening. That's the whole point of a recursion :-) So this hasn't got anything to do with g++.
Every recursion creates a new stack frame, and stack space is limited by default. You can increase the stack size (don't know how to do that on Windows, on UNIX the ulimit command is used), but that only defers the crash.
The real solution is to get rid of the recursion. See for example this question and this question.
All,
I'm writing some performance sensitive code, including a 3d vector class that will be doing lots of cross-products. As a long-time C++ programmer, I know all about the evils of macros and the various benefits of inline functions. I've long been under the impression that inline functions should be approximately the same speed as macros. However, in performance testing macro vs inline functions, I've come to an interesting discovery that I hope is the result of me making a stupid mistake somewhere: the macro version of my function appears to be over 8 times as fast as the inline version!
First, a ridiculously trimmed down version of a simple vector class:
class Vector3d
{
public:
    double m_tX, m_tY, m_tZ;

    Vector3d() : m_tX(0), m_tY(0), m_tZ(0) {}

    Vector3d(const double &tX, const double &tY, const double &tZ):
        m_tX(tX), m_tY(tY), m_tZ(tZ) {}

    static inline void CrossAndAssign ( const Vector3d& cV1, const Vector3d& cV2, Vector3d& cV )
    {
        cV.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY;
        cV.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ;
        cV.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX;
    }

#define FastVectorCrossAndAssign(cV1,cV2,cVOut) { \
    cVOut.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY; \
    cVOut.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ; \
    cVOut.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX; }
};
Here's my sample benchmarking code:
Vector3d right;
Vector3d forward(1.0, 2.2, 3.6);
Vector3d up(3.2, 1.4, 23.6);

clock_t start = clock();
for (long l=0; l < 100000000; l++)
{
    Vector3d::CrossAndAssign(forward, up, right); // static inline version
}
clock_t end = clock();
std::cout << end - start << endl;

clock_t start2 = clock();
for (long l=0; l<100000000; l++)
{
    FastVectorCrossAndAssign(forward, up, right); // macro version
}
clock_t end2 = clock();
std::cout << end2 - start2 << endl;
The end result: With optimizations turned completely off, the inline version takes 3200 ticks, and the macro version 500 ticks... With optimization turned on (/O2, maximize speed, and other speed tweaks), I can get the inline version down to 1100 ticks, which is better but still not the same.
So I appeal to all of you: is this really true? Have I made a stupid mistake somewhere? Or are inline functions really this much slower -- and if so, why?
NOTE: After posting this answer, the original question was edited to remove this problem. I'll leave the answer as it is instructive on several levels.
The loops differ in what they do!
if we manually expand the macro, we get:
for (long l=0; l<100000000; l++)
right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;
Note the absence of curly brackets. So the compiler sees this as:
for (long l=0; l<100000000; l++)
{
right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
}
right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;
Which makes it obvious why the second loop is so much faster.
Update: This is also a good example of why macros are evil :)
Please note that if you use the inline keyword, it is only a hint for the compiler. If you turn optimizations off, this might cause the compiler not to inline the function. You should go to Project Settings/C++/Optimization/ and make sure to turn optimization on. What settings have you used for "Inline Function Expansion"?
It also depends on optimizations and compiler settings. Also look for your compiler's support for an always-inline/force-inline declaration; inlining is then as fast as a macro.
By default, the keyword is only a hint; force inline/always inline (for the most part) returns control over the keyword's original intention to the programmer.
Finally, gcc (for example) can be directed to inform you when such a function is not inlined as directed.
Apart from what Philipp mentioned, if you're using MSVC, you can use __forceinline, or the gcc equivalent __attribute__((always_inline)), to correct the problems with inlining.
However, there is another possible problem lurking, using a macro will cause the parameters of the macro to be re-evaluated at each point, so if you call the macro like so:
FastVectorCrossAndAssign(getForward(), up, right);
it will expand to:
right.m_tX = getForward().m_tY * up.m_tZ - getForward().m_tZ * up.m_tY;
right.m_tY = getForward().m_tZ * up.m_tX - getForward().m_tX * up.m_tZ;
right.m_tZ = getForward().m_tX * up.m_tY - getForward().m_tY * up.m_tX;
Not what you want when you're concerned with speed :) (Especially if getForward() isn't a lightweight function, or does some incrementing on each call. If it is an inline function, the compiler might reduce the number of calls, provided it isn't volatile, but that still won't fix everything.)