Worse performance with constexpr? - c++

I am trying to evaluate the performance difference made by constexpr. I am using the following code:
#include<iostream>
using namespace std;
constexpr double factorial(int n) {
return n==0?1:n*factorial(n-1);
}
int main() {
double a=0;
for(int i=0;i<10000000;i++) {
a+=factorial(100);
}
cout<<a<<endl;
}
I tried out two versions of the above program, one with the factorial function as constexpr, and one without. I expected the constexpr version to perform better at run time, but it in fact runs slower. Here are the measurements (in seconds) from 4 trials each:
Without constexpr:
2.691, 2.835, 2.768, 2.748
With constexpr:
2.910, 2.920, 2.903, 2.910
Could someone explain the reason behind this? Am I using constexpr wrong? I am using g++ 4.9.1, and I used the O3 optimization flag.
EDIT: The code originally assigned the factorial to a. It has been updated to add up the results, as suggested in the comments. The performance difference is still visible, though.

constexpr is advantageous when the computation is done at compile time. However, compilers aren't required to do that unless you require it, for example by making the result a constexpr variable. At run time, constexpr makes no difference for a function.
I get very close results in my tests (delta of ~0.1s), as expected.
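To illustrate what "requiring" compile-time evaluation can look like, here is a minimal sketch rather than the asker's benchmark. The names kFactorial100 and sum_precomputed are made up for this illustration; only factorial comes from the question.

```cpp
// factorial is the function from the question.
constexpr double factorial(int n) {
    return n == 0 ? 1 : n * factorial(n - 1);
}

// Initializing a constexpr variable requires a constant expression, so the
// compiler has to evaluate factorial(100) during compilation.
constexpr double kFactorial100 = factorial(100);

// Hypothetical helper: the loop from the question, but adding the
// precomputed constant instead of calling factorial on every iteration.
double sum_precomputed(int iterations) {
    double a = 0;
    for (int i = 0; i < iterations; i++) {
        a += kFactorial100;
    }
    return a;
}
```

Without the constexpr variable, whether factorial(100) is folded at compile time is left to the optimizer.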

Related

Lambda vs. manually inlined code changes GCC's optimizer behavior

The following code:
#include <vector>
extern std::vector<int> rng;
int main()
{
auto is_even=[](int x){return x%2==0;};
int res=0;
for(int x:rng){
if(is_even(x))res+=x;
}
return res;
}
is optimized by GCC 11.1 (link to Godbolt) in a very different way than:
#include <vector>
extern std::vector<int> rng;
int main()
{
int res=0;
for(int x:rng){
if(x%2==0)res+=x;
}
return res;
}
(Link to Godbolt.) Besides, the second version (where the lambda has been replaced by manually inlining its body at the call site) is much faster than the first one.
Is this a GCC bug?
There is no such thing as a vectorized integral modulo operation in the x64 architecture. This means that the code is not inherently vectorizable as written and needs to be transformed before it can be vectorized.
You can see the vectorization working just fine in both cases in the much easier case where a SIMD-friendly evenness test is used: https://godbolt.org/z/hc5ffbePY
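The linked example is not reproduced here. As an assumption about what a "SIMD-friendly evenness test" could look like, the sketch below checks the lowest bit instead of using the modulo operator; sum_even_bitwise is a hypothetical name, and the function takes the vector as a parameter instead of using the extern rng from the question.

```cpp
#include <vector>

// Assumed stand-in for a SIMD-friendly test: looking at the lowest bit avoids
// the modulo operation. On two's-complement targets this agrees with
// x % 2 == 0 for negative values as well.
int sum_even_bitwise(const std::vector<int>& values) {
    auto is_even = [](int x) { return (x & 1) == 0; };
    int res = 0;
    for (int x : values) {
        if (is_even(x)) res += x;
    }
    return res;
}
```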
So if anything, it could be argued that GCC managing to vectorize the inlined version at all, and clang inlining both of them, is actually pretty impressive.
That being said, since we know for a fact that GCC is capable of performing that transformation, it would appear that it is only performed before inlining happens, which is unfortunate, and probably deserves to be brought to the maintainers' attention.
It's a quirk of the code generation. There is no reason why the lambda version shouldn't be vectorized. In fact, clang vectorizes it as-is. If you specify return type as int, GCC vectorizes it too:
auto is_even = [](int x) -> int { return x % 2 == 0; };
If you use std::accumulate, it's also vectorized. You can report this to GCC so they can fix it.
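For reference, the two variants mentioned above could be written roughly as follows. This is only a sketch: the function names are hypothetical, the vector is passed as a parameter instead of using the extern rng, and whether a particular compiler version vectorizes either form is best checked in the generated assembly.

```cpp
#include <numeric>
#include <vector>

// Variant 1: the lambda with an explicit int return type.
int sum_even_lambda(const std::vector<int>& values) {
    auto is_even = [](int x) -> int { return x % 2 == 0; };
    int res = 0;
    for (int x : values) {
        if (is_even(x)) res += x;
    }
    return res;
}

// Variant 2: the same reduction expressed with std::accumulate.
int sum_even_accumulate(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0,
                           [](int acc, int x) { return x % 2 == 0 ? acc + x : acc; });
}
```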

Constexpr Factorial Compilation Results in VS2015 and GCC 5.4.0

Wondering if the following surprises anyone, as it did me? Alex Allain's article here on using constexpr shows the following factorial example:
constexpr int factorial (int n)
{
return n > 0 ? n * factorial( n - 1 ) : 1;
}
And states:
Now you can use factorial(2) and when the compiler sees it, it can
optimize away the call and make the calculation entirely at compile
time.
I tried this in VS2015 in Release mode with full optimizations on (/Ox) and stepped through the code in the debugger viewing the assembly and saw that the factorial calculation was not done at compilation.
Using GCC v5.4.0 with -std=c++14, I must use -O2 or -O3 before the calculation is performed at compile time. I was surprised, though, that with just -O the calculation did not occur at compile time.
My main question is: Why is VS2015 not performing this calculation at compile time?
It depends on the context of the function call.
For example, the following obviously could never be calculated at compile time:
int x;
std::cin >> x;
std::cout << factorial(x);
On the other hand, this context would require the answer at compile time:
class Foo {
int x[factorial(4)];
};
constexpr functions are only guaranteed to be evaluated at compile time if they are called from a context that requires a constant expression; otherwise it is up to the compiler to choose whether or not to evaluate them at compile time (assuming such an optimization is possible, again, depending on the context).
You have to use it in a constant expression, such as:
constexpr auto res = factorial(2);
Otherwise the computation may be done at run time.
constexpr is neither necessary nor sufficient for compile-time evaluation of a function.
It's not sufficient, even aside from the fact that the arguments obviously also have to be constant expressions. Even if that is true, a conforming compiler does not have to evaluate it at compile time. It only has to be evaluated at compile time if it is in a constexpr context, such as assigning the result of the computation to a constexpr variable, using the value as an array size, or using it as a non-type template parameter.
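A compact sketch of the three contexts just listed, using the factorial function from the question with an int return type. The names kFact5, Foo and Tag are placeholders for this illustration.

```cpp
constexpr int factorial(int n) {
    return n > 0 ? n * factorial(n - 1) : 1;
}

// 1. Initializer of a constexpr variable.
constexpr int kFact5 = factorial(5);

// 2. Array size.
struct Foo {
    int x[factorial(4)];
};

// 3. Non-type template argument (Tag is a hypothetical helper template).
template <int N>
struct Tag {
    static const int value = N;
};
typedef Tag<factorial(3)> Six;

// static_assert also needs a constant expression, so these checks only
// compile if the calls above were evaluated during compilation.
static_assert(kFact5 == 120, "factorial(5) should be 120");
static_assert(sizeof(Foo::x) / sizeof(int) == 24, "factorial(4) should be 24");
static_assert(Six::value == 6, "factorial(3) should be 6");
```

In each of these places, a call that cannot be evaluated at compile time is a compile error and does not silently fall back to run time.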
The other point is that the compiler is completely capable of evaluating things at compile time, even without constexpr. There is a lot of confusion about this, and it's not clear why. Compile-time evaluation of constexpr functions fundamentally just boils down to constant propagation, and compilers have been doing this optimization since forever: https://godbolt.org/g/Sy214U.
int factorial(int n) {
if (n <= 1) return 1;
return n * factorial(n-1);
}
int foo() { return factorial(5); }
GCC 6.3 with -O3 (and -std=c++14) yields:
foo():
mov eax, 120
ret
In essence, outside of the specific case where you absolutely force compile time evaluation by assigning a constexpr function to another constexpr variable, compile time evaluation has more to do with the quality of your optimizer than the standard.

How can the compile-time be (exponentially) faster than run-time?

The below code calculates Fibonacci numbers by an exponentially slow algorithm:
#include <cstdlib>
#include <iostream>
#define DEBUG(var) { std::cout << #var << ": " << (var) << std::endl; }
constexpr auto fib(const size_t n) -> long long
{
return n < 2 ? 1: fib(n - 1) + fib(n - 2);
}
int main(int argc, char *argv[])
{
const long long fib91 = fib(91);
DEBUG( fib91 );
DEBUG( fib(45) );
return EXIT_SUCCESS;
}
And I am calculating the 45th Fibonacci number at run-time, and the 91st one at compile time.
The interesting fact is that GCC 4.9 compiles the code and computes fib91 in a fraction of a second, but it takes a while to spit out fib(45).
My question: If GCC is smart enough to optimize the fib(91) computation and not take the exponentially slow path, what stops it from doing the same for fib(45)?
Does the above mean GCC produces two compiled versions of fib function where one is fast and the other exponentially slow?
The question is not how the compiler optimizes fib(91) calculation (yes! It does use a sort of memoization), but if it knows how to optimize the fib function, why does it not do the same for fib(45)? And, are there two separate compilations of the fib function? One slow, and the other fast?
GCC is likely memoizing constexpr functions (enabling a Θ(n) computation of fib(n)). That is safe for the compiler to do because constexpr functions are purely functional.
Compare the Θ(n) "compiler algorithm" (using memoization) to your Θ(φ^n) run time algorithm (where φ is the golden ratio) and suddenly it makes perfect sense that the compiler is so much faster.
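To make the difference in growth concrete, here is a run-time sketch that counts calls for the naive recursion and for a hand-memoized version. It is only an illustration of the complexity argument, not a description of how GCC implements constexpr evaluation; fib_naive, fib_memo and the calls counters are made up for this example, and the cache is assumed to have at least n + 1 zero-initialized entries.

```cpp
#include <cstddef>
#include <vector>

// Naive recursion with the question's convention fib(0) == fib(1) == 1.
// calls is incremented once per invocation.
long long fib_naive(std::size_t n, long long& calls) {
    ++calls;
    if (n < 2) return 1;
    const long long a = fib_naive(n - 1, calls);
    const long long b = fib_naive(n - 2, calls);
    return a + b;
}

// Memoized recursion: each value is computed once and then looked up.
// A cache entry of 0 means "not computed yet" (every real value is >= 1).
long long fib_memo(std::size_t n, std::vector<long long>& cache, long long& calls) {
    ++calls;
    if (n < 2) return 1;
    if (cache[n] != 0) return cache[n];
    const long long a = fib_memo(n - 1, cache, calls);
    const long long b = fib_memo(n - 2, cache, calls);
    cache[n] = a + b;
    return cache[n];
}
```

For n = 20 the naive version makes tens of thousands of calls, while the memoized one makes only a few dozen.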
From the constexpr page on cppreference (emphasis added):
The constexpr specifier declares that it is possible to evaluate the value of the function or variable at compile time.
The constexpr specifier does not declare that it is required to evaluate the value of the function or variable at compile time. So one can only guess what heuristics GCC is using to choose whether to evaluate at compile time or run time when a compile time computation is not required by language rules. It can choose either, on a case-by-case basis, and still be correct.
If you want to force the compiler to evaluate your constexpr function at compile time, here's a simple trick that will do it.
constexpr auto compute_fib(const size_t n) -> long long
{
return n < 2 ? n : compute_fib(n - 1) + compute_fib(n - 2);
}
template <std::size_t N>
struct fib
{
static_assert(N >= 0, "N must be nonnegative.");
static const long long value = compute_fib(N);
};
In the rest of your code you can then access fib<45>::value or fib<91>::value with the guarantee that they'll be evaluated at compile time.
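A usage sketch for the trick above, restated so that it is self-contained. A small N is used here on purpose: compiling large values such as fib<91> in reasonable time relies on the compiler memoizing constexpr calls, which the standard does not promise. kFib20 is a placeholder name.

```cpp
#include <cstddef>

constexpr long long compute_fib(std::size_t n) {
    return n < 2 ? static_cast<long long>(n)
                 : compute_fib(n - 1) + compute_fib(n - 2);
}

template <std::size_t N>
struct fib {
    // An in-class initializer of a static const integral member must be a
    // constant expression, which is what forces compile-time evaluation.
    static const long long value = compute_fib(N);
};

// The value can be used wherever a constant expression is required.
constexpr long long kFib20 = fib<20>::value;
static_assert(kFib20 == 6765, "fib<20> should be 6765");
static_assert(sizeof(char[fib<10>::value]) == 55, "fib<10> should be 55");
```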
At compile time the compiler can memoize the result of the function. This is safe, because the function is constexpr and hence will always return the same result for the same inputs.
At run-time it could in theory do the same. However most C++ programmers would frown at optimization passes that result in hidden memory allocations.
When you ask for fib(91) to give a value to your const fib91 in the source code, the compiler is forced to compute that value from your constant expression. It does not compile the function (as you seem to think); it just sees that to compute fib(91) it needs fib(90) and fib(89), to compute those it needs fib(88) and fib(87), and so on until it reaches fib(1), which is given. This is an O(n) algorithm and the result is computed fast enough.
However, when you ask to evaluate fib(45) at run time, the compiler has to choose whether to emit the actual function call or to precompute the result. Eventually it decides to use the compiled function. Now, the compiled function must execute exactly the exponential algorithm that you wrote; there is no way the compiler could implement memoization to optimize a recursive function (think about the need to allocate some cache, to decide how many values to keep, and to manage them between function calls).
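Since the compiler will not change the algorithm for you at run time, the usual fix is to change it yourself. A minimal sketch, assuming C++14 or later (loops in constexpr functions) and keeping the question's convention fib(0) == fib(1) == 1; fib_iter is a hypothetical name:

```cpp
#include <cstddef>

// Linear-time iterative version. It runs in O(n) whether it is evaluated
// at compile time or at run time.
constexpr long long fib_iter(std::size_t n) {
    long long prev = 1;  // fib(i - 1)
    long long curr = 1;  // fib(i)
    for (std::size_t i = 2; i <= n; ++i) {
        const long long next = prev + curr;
        prev = curr;
        curr = next;
    }
    return curr;
}

static_assert(fib_iter(10) == 89, "fib_iter(10) should be 89");
```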

Function pointer runs faster than inline function. Why?

I ran a benchmark of mine on my computer (Intel i3-3220 @ 3.3GHz, Fedora 18), and got very unexpected results. A function pointer was actually a bit faster than an inline function.
Code:
#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{
std::chrono::duration<double> t(0); // start the accumulated time at zero
int total=0;
for(int i=0;i<10000000;i++)
{
auto begin=std::chrono::high_resolution_clock::now();
short a=toBigEndian((short)i);//toBigEndianPtr((short)i);
total+=a;
auto end=std::chrono::high_resolution_clock::now();
t+=std::chrono::duration_cast<std::chrono::duration<double>>(end-begin);
}
std::cout<<t.count()<<", "<<total<<std::endl;
return 0;
}
compiled with
g++ test.cpp -std=c++0x -O0
The 'toBigEndian' loop always finishes at around 0.26-0.27 seconds, while 'toBigEndianPtr' takes 0.21-0.22 seconds.
What makes this even more odd is that when I remove 'total', the function pointer becomes the slower one at 0.35-0.37 seconds, while the inline function is at about 0.27-0.28 seconds.
My question is:
Why is the function pointer faster than the inline function when 'total' exists?
Short answer: it isn't.
You compile with -O0, which does not optimize (much). Without optimization, you have no say in what is "fast", because unoptimized code is not as fast as it can be.
You take the address of toBigEndian, which prevents inlining. The inline keyword is only a hint for the compiler anyway, which it may or may not follow. You did your best to make it not follow that hint.
So, to give your measurements any meaning,
optimize your code
use two functions doing the same thing, one that gets inlined and one whose address is taken
A common mistake in measuring performance (besides forgetting to optimize) is to use the wrong tool to measure. Using std::chrono would be fine if you were measuring the performance of your entire loop of 10000000 or 500000000 iterations. Instead, you are asking it to measure a single call (or inlined copy) of toBigEndian, a function that is all of 6 instructions. So I switched to rdtsc (read time stamp counter, i.e. clock cycles).
Allowing the compiler to really optimize everything in the loop, not cluttering it with recording the time on every tiny iteration, we have a different code sequence. Now, after compiling with g++ -O3 fp_test.cpp -o fp_test -std=c++11, I observe the desired effect. The inlined version averages around 2.15 cycles per iteration, while the function pointer takes around 7.0 cycles per iteration.
Even without using rdtsc, the difference is still quite observable. The wall clock time was 360ms for the inlined code and 1.17s for the function pointer. So one could use std::chrono in place of rdtsc in this code.
Modified code follows:
#include <cstdint> // uint32_t, uint64_t
#include <iostream>
static inline uint64_t rdtsc(void)
{
uint32_t hi, lo;
asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}
inline short toBigEndian(short i)
{
return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
#define LOOP_COUNT 500000000
int main()
{
uint64_t t = 0, begin=0, end=0;
int total=0;
begin=rdtsc();
for(int i=0;i<LOOP_COUNT;i++)
{
short a=0;
a=toBigEndianPtr((short)i);
//a=toBigEndian((short)i);
total+=a;
}
end=rdtsc();
t+=(end-begin);
std::cout<<((double)t/LOOP_COUNT)<<", "<<total<<std::endl;
return 0;
}
Oh s**t (do I need to censor swearing here?), I found it out. It was somehow related to the timing being inside the loop. When I moved it outside as follows,
#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{
int total=0;
auto begin=std::chrono::high_resolution_clock::now();
for(int i=0;i<100000000;i++)
{
short a=toBigEndianPtr((short)i);
total+=a;
}
auto end=std::chrono::high_resolution_clock::now();
std::cout<<std::chrono::duration_cast<std::chrono::duration<double>>(end-begin).count()<<", "<<total<<std::endl;
return 0;
}
the results are just as they should be. 0.08 seconds for inline, 0.20 seconds for pointer. Sorry for bothering you guys.
First off, with -O0, you aren't running the optimizer, which means the compiler is ignoring your request to inline, as it is free to do. The cost of the two different calls ought to be nearly identical. Try with -O2.
Second, if you are only running for 0.22 seconds, weirdly variable costs involved with starting your program totally dominate the cost of running the test function. That function call is just a few instructions. If your CPU is running at 2 GHz, it ought to execute that function call in something like 20 nanoseconds, so you can see that whatever it is you're measuring, it's not the cost of running that function.
Try calling the test function in a loop, say 1,000,000 times. Make the number of loops 10x bigger until it takes > 10 seconds to run the test. Then divide the result by the number of loops for an approximation of the cost of the operation.
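The procedure described above can be sketched roughly like this. average_ns_per_call is a hypothetical helper, not something from the question; it takes the function under test as a callable, and the 10-second threshold from the text becomes the min_seconds parameter.

```cpp
#include <chrono>

// Hypothetical helper: runs f in a loop, multiplying the iteration count by
// 10 until a single run takes at least min_seconds, then returns the average
// time per iteration in nanoseconds.
template <typename F>
double average_ns_per_call(F f, double min_seconds) {
    using clock = std::chrono::steady_clock;
    volatile unsigned sink = 0;  // keeps the accumulated result observable
    for (long long loops = 1000000;; loops *= 10) {
        unsigned total = 0;
        const auto begin = clock::now();
        for (long long i = 0; i < loops; ++i) {
            total += static_cast<unsigned>(f(static_cast<short>(i)));
        }
        const auto end = clock::now();
        sink = total;
        const std::chrono::duration<double> elapsed = end - begin;
        // The upper bound on loops is a safety net in case the optimizer
        // makes the loop so cheap that min_seconds is never reached.
        if (elapsed.count() >= min_seconds || loops >= 1000000000000LL) {
            return elapsed.count() * 1e9 / static_cast<double>(loops);
        }
    }
}
```

It could be called as `average_ns_per_call([](short s) { return toBigEndian(s); }, 10.0)`. The timer is read only twice per run, so its own cost does not dominate the measurement.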
With many/most self-respecting modern compilers, the code you posted will still inline the function call even when it is called through the pointer (assuming the compiler makes a reasonable effort to optimize the code). The situation is just too easy to see through. In other words, the generated code can easily end up virtually the same in both cases, meaning that your test is not really useful for measuring what you are trying to measure.
If you really want to make sure the call is physically performed through the pointer, you have to make an effort to "confuse" the compiler to the point where it can't figure out the pointer value at compile time. For example, make the pointer value run-time dependent, as in
toBigEndianPtr = rand() % 1000 != 0 ? toBigEndian : NULL;
or something along these lines. You can also declare your function pointer as volatile, which will typically cause a genuine through-the-pointer call each time as well as force the compiler to re-read the pointer value from memory on each iteration.
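For the volatile variant, the declaration syntax is easy to get wrong, so here is a small sketch. toBigEndianVolatilePtr and call_through_volatile are names made up for this example; the volatile qualifier applies to the pointer itself, not to the function it points to.

```cpp
inline short toBigEndian(short i) {
    return (i << 8) | (i >> 8);
}

// A volatile pointer to a function: the compiler has to reload the pointer
// value from memory every time it is used, so it cannot assume the target
// is toBigEndian.
short (*volatile toBigEndianVolatilePtr)(short) = toBigEndian;

// Hypothetical wrapper that performs the call through the volatile pointer.
short call_through_volatile(short i) {
    return toBigEndianVolatilePtr(i);
}
```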

Overloading operators with C++ metaprogramming templates

I'm using some template meta-programming to solve a small problem, but the syntax is a little annoying -- so I was wondering, in the example below, will overloading operators on the meta-class that has an empty constructor cause a (run-time) performance penalty? Will all the temporaries actually be constructed or can it be assumed that they will be optimized out?
template<int value_>
struct Int {
static const int value = value_;
template<typename B>
struct Add : public Int<value + B::value> { };
template<typename B>
Int<value + B::value> operator+(B const&) { return Int<value + B::value>(); }
};
int main()
{
// Is doing this:
int sum = Int<1>::Add<Int<2> >().value;
// any more efficient (at runtime) than this:
int sum = (Int<1>() + Int<2>()).value;
return sum;
}
Alright, I tried my example under GCC.
For the Add version with no optimization (-O0), the resulting assembly just loads a constant into sum, then returns it.
For the operator+ version with no optimization (-O0), the resulting assembly does a bit more (it appears to be calling operator+).
However, with -O3, both versions generate the same assembly, which simply loads 3 directly into the return register; the temporaries, function calls, and sum had been optimized out entirely in both cases.
So, they're equally fast with a decent compiler (as long as optimizations are turned on).
Compare the assembly code generated by g++ -O3 -S for both solutions. It gives the same code for both: the compiler optimizes it down to simply returning 3.
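As a side note, if only the value is needed, both spellings can be checked without creating any objects at run time. The sketch below assumes C++11 and reuses the Int template from the question; decltype only inspects the type of the operator+ expression, so nothing is constructed.

```cpp
template <int value_>
struct Int {
    static const int value = value_;

    template <typename B>
    struct Add : public Int<value + B::value> {};

    template <typename B>
    Int<value + B::value> operator+(B const&) { return Int<value + B::value>(); }
};

// The Add form: a purely type-level computation.
static_assert(Int<1>::Add<Int<2> >::value == 3, "Add form should give 3");

// The operator+ form: decltype yields the result type Int<3> without
// evaluating the expression.
static_assert(decltype(Int<1>() + Int<2>())::value == 3, "operator+ form should give 3");
```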