The code below calculates Fibonacci numbers using an exponentially slow algorithm:
#include <cstdlib>
#include <iostream>
#define DEBUG(var) { std::cout << #var << ": " << (var) << std::endl; }
constexpr auto fib(const size_t n) -> long long
{
    return n < 2 ? 1 : fib(n - 1) + fib(n - 2);
}

int main(int argc, char *argv[])
{
    const long long fib91 = fib(91);
    DEBUG( fib91 );
    DEBUG( fib(45) );
    return EXIT_SUCCESS;
}
And I am calculating the 45th Fibonacci number at run-time, and the 91st one at compile time.
The interesting fact is that GCC 4.9 compiles the code and computes fib91 in a fraction of a second, but it takes a while to spit out fib(45).
My question: If GCC is smart enough to optimize the fib(91) computation and not take the exponentially slow path, what stops it from doing the same for fib(45)?
Does the above mean GCC produces two compiled versions of fib function where one is fast and the other exponentially slow?
The question is not how the compiler optimizes the fib(91) calculation (yes! It does use a sort of memoization), but rather: if it knows how to optimize the fib function, why does it not do the same for fib(45)? And are there two separate compilations of the fib function, one slow and the other fast?
GCC is likely memoizing constexpr functions (enabling a Θ(n) computation of fib(n)). That is safe for the compiler to do because constexpr functions are purely functional.
Compare the Θ(n) "compiler algorithm" (using memoization) to your Θ(φ^n) run-time algorithm (where φ is the golden ratio) and suddenly it makes perfect sense that the compiler is so much faster.
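For intuition, here is a hand-written run-time equivalent of that memoization (a minimal sketch; the cache, the helper name fib_memo, and the use of std::map are my own illustration, not what GCC actually does internally):

#include <cstddef>
#include <map>

// Illustrative only: an explicitly memoized fib, Θ(n) instead of Θ(φ^n).
// Same convention as the question's code: fib(0) == fib(1) == 1.
long long fib_memo(std::size_t n)
{
    static std::map<std::size_t, long long> cache;
    if (n < 2)
        return 1;
    auto it = cache.find(n);
    if (it != cache.end())
        return it->second;              // reuse a previously computed value
    const long long result = fib_memo(n - 1) + fib_memo(n - 2);
    cache[n] = result;
    return result;
}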
From the constexpr page on cppreference (emphasis added):
The constexpr specifier declares that it is possible to evaluate the value of the function or variable at compile time.
The constexpr specifier does not declare that it is required to evaluate the value of the function or variable at compile time. So one can only guess what heuristics GCC is using to choose whether to evaluate at compile time or run time when a compile time computation is not required by language rules. It can choose either, on a case-by-case basis, and still be correct.
If you want to force the compiler to evaluate your constexpr function at compile time, here's a simple trick that will do it.
#include <cstddef>  // for std::size_t

constexpr auto compute_fib(const std::size_t n) -> long long
{
    return n < 2 ? n : compute_fib(n - 1) + compute_fib(n - 2);
}

template <std::size_t N>
struct fib
{
    static_assert(N >= 0, "N must be nonnegative.");
    static const long long value = compute_fib(N);
};
In the rest of your code you can then access fib<45>::value or fib<91>::value with the guarantee that they'll be evaluated at compile time.
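A quick usage sketch (my own check, not from the original answer; note that with the compute_fib above, compute_fib(0) is 0, so fib<10>::value is 55):

static_assert(fib<10>::value == 55, "evaluated at compile time");
constexpr auto fib45 = fib<45>::value;   // usable wherever a constant expression is needed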
At compile time the compiler can memoize the result of the function. This is safe, because the function is constexpr and hence will always return the same result for the same inputs.
At run-time it could in theory do the same. However most C++ programmers would frown at optimization passes that result in hidden memory allocations.
When you ask for fib(91) to give a value to your const fib91 in the source code, the compiler is forced to compute that value from your constexpr. It does not compile and run the function (as you seem to think); it just sees that to compute fib91 it needs fib(90) and fib(89), to compute those it needs fib(88)... and so on until it reaches fib(1), which is given. Since each intermediate value is remembered, this is an O(n) computation and the result is produced fast enough.
However, when you ask it to evaluate fib(45) at run time, the compiler has to choose between emitting an actual function call and precomputing the result. Eventually it decides to use the compiled function. Now, the compiled function must execute exactly the exponential algorithm that you wrote: there is no way the compiler could implement memoization to optimize a recursive function at run time (think about the need to allocate some cache, decide how many values to keep, and manage them between function calls).
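What you can do yourself, of course, is make the run-time path non-exponential. A minimal sketch (my own rewrite, keeping the question's convention fib(0) == fib(1) == 1):

#include <cstddef>

// Iterative Θ(n) replacement for the run-time call.
long long fib_iter(std::size_t n)
{
    long long a = 1, b = 1;                  // fib(0), fib(1)
    for (std::size_t i = 2; i <= n; ++i) {
        const long long next = a + b;
        a = b;
        b = next;
    }
    return b;
}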
This code, when compiled with g++ -O3, does not seem to evaluate get_fibonacci(50) at compile time - as it runs for a very long time.
#include <iostream>

constexpr long long get_fibonacci(int num){
    if(num == 1 || num == 2){ return 1; }
    return get_fibonacci(num - 1) + get_fibonacci(num - 2);
}

int main()
{
    std::cout << get_fibonacci(50) << std::endl;
}
Replacing the code with
#include <iostream>

constexpr long long get_fibonacci(int num){
    if(num == 1 || num == 2){ return 1; }
    return get_fibonacci(num - 1) + get_fibonacci(num - 2);
}

int main()
{
    long long num = get_fibonacci(50);
    std::cout << num << std::endl;
}
worked perfectly fine. I don't know exactly why this is occurring, but my guess is that get_fibonacci(50) is not evaluated at compile time in the first scenario because items given to std::cout are evaluated at runtime. Is my reasoning correct, or is something else happening? Can somebody please point me in the right direction?
Actually, both versions of your code do not have the Fibonacci number computed at compile time, with typical compilers and compilation flags. But, interestingly enough, if you reduce the 50 to, say, 30, both versions of your program do get compile-time evaluation.
Proof: GodBolt
At the link, your first program is compiled and run first with 50 as the argument to get_fibonacci(), then with 30, using GCC 10.2 and clang 11.0.
What you're seeing is the limit of the compiler's willingness to evaluate code at compile time. Both compilers engage in the recursive evaluation at compile time until a certain depth, or a certain evaluation-time budget, has been exceeded. They then give up and leave it for run-time evaluation.
I don't know exactly why this is occurring, but my guess is that get_fibonacci(50) is not evaluated at compile-time in the first scenario because items given std::cout are evaluated at runtime
Your function can be computed at compile time, because it receives a compile-time-known value (50), but it can also be computed at run time, because the returned value is sent to standard output, so it is only used at run time.
It's a gray area where the compiler can choose both solutions.
To force the compile-time computation (ignoring the as-if rule for a moment), you can place the returned value somewhere a value is required at compile time.
For example, in a template parameter, in your first example
std::cout << std::integral_constant<long long, get_fibonacci(50)>::value
<< std::endl;
or in a constexpr variable, in your second example
constexpr long long num = get_fibonacci(50);
But remember the "as-if rule": even in these cases (using constexpr or std::integral_constant), the compiler can still perform the computation at run time, because doing so "does not change the observable behavior of the program".
Assign the result to a constexpr variable to force compile-time evaluation (or to get the compiler to spit out an error message if that is not possible):
constexpr auto val = get_fibonacci(50);
constexpr functions are guaranteed to be evaluated at compile time only in a constexpr context, which includes initialization of constexpr variables, template parameters, array sizes...
A regular function or operator call is not such a context.
std::cout << get_fibonacci(50);
is done at runtime.
Now, the compiler might optimize any function (constexpr or not, inline or not) under the as-if rule, resulting in a constant, a simpler loop, and so on.
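A small self-contained sketch of those contexts (my own example, reusing the question's get_fibonacci; the variable names are illustrative):

#include <array>
#include <iostream>

constexpr long long get_fibonacci(int num){          // same function as in the question
    if(num == 1 || num == 2){ return 1; }
    return get_fibonacci(num - 1) + get_fibonacci(num - 2);
}

constexpr long long num = get_fibonacci(20);   // constexpr variable: compile-time evaluation required
int buffer[get_fibonacci(10)];                 // array bound: compile-time evaluation required
std::array<int, get_fibonacci(10)> arr{};      // non-type template argument: compile-time evaluation required

int main()
{
    std::cout << num << ' ' << sizeof(buffer) / sizeof(buffer[0]) << ' ' << arr.size() << '\n';
}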
In a quicksort implementation, the data on the left is for code optimized with plain -O2, and the data on the right is for -O2 code with the -fno-optimize-sibling-calls flag turned on, i.e. with tail-call optimization turned off. This is the average of 3 different runs; the variation seemed negligible. Values were in the range 1-1000, times in milliseconds. The compiler was MinGW g++, version 6.3.0.
size of array    with TCO (ms)    without TCO (ms)
8M               35,083           34,051
4M               8,952            8,627
1M               613              609
Below is my code:
#include <bits/stdc++.h>
using namespace std;

int N = 4000000;

void qsort(int* arr, int start = 0, int finish = N - 1){
    if(start >= finish) return;
    int i = start + 1, j = finish, temp;
    auto pivot = arr[start];
    while(i != j){
        while (arr[j] >= pivot && j > i) --j;
        while (arr[i] < pivot && i < j) ++i;
        if(i == j) break;
        temp = arr[i]; arr[i] = arr[j]; arr[j] = temp; //swap big guy to right side
    }
    if(arr[i] >= arr[start]) --i;
    temp = arr[start]; arr[start] = arr[i]; arr[i] = temp; //swap pivot
    qsort(arr, start, i - 1);
    qsort(arr, i + 1, finish);
}

int main(){
    srand(time(NULL));
    int* arr = new int[N];
    for(int i = 0; i < N; i++) { arr[i] = rand() % 1000 + 1; }
    auto start = clock();
    qsort(arr);
    cout << (clock() - start) << endl;
    return 0;
}
I heard clock() isn't the perfect way to measure time. But this effect seems to be consistent.
EDIT: In response to a comment, I guess my question is: how exactly does gcc's tail-call optimizer work, why did this happen, and what should I do to leverage tail calls to speed up my program?
On speed:
As already pointed out in the comments, the primary goal of tail-call-optimization is to reduce the usage of the stack.
However, often there is a side benefit: the program becomes faster because there is no overhead for the function call. This gain is most prominent if the work done in the function itself is small, so the call overhead carries some weight.
If there is a lot of work done during a function call, the overhead can be neglected and there is no noticeable speed-up.
On the other hand, if tail-call optimization is done, other optimizations that could otherwise speed up your code potentially cannot be applied.
The case of your quicksort is not that clear cut: there are some calls with a large workload and a lot of calls with a very small workload.
So, for 1M elements there are more disadvantages from tail-call optimization than advantages. On my machine the tail-call-optimized function becomes faster than the non-optimized function for arrays smaller than 50000 elements.
I must confess I cannot say why this is the case just from looking at the assembly. All I can tell is that the resulting assemblies are pretty different and that quicksort is really called only once in the optimized version.
There is a clear-cut example for which tail-call optimization is much faster (because not much happens in the function itself, so the call overhead is noticeable):
//fib.cpp
#include <iostream>

unsigned long long int fib(unsigned long long int n){
    if (n == 0 || n == 1)
        return 1;
    return fib(n - 1) + fib(n - 2);
}

int main(){
    unsigned long long int N;
    std::cin >> N;
    std::cout << fib(N);
}
Running time echo "40" | ./fib, I get 1.1 vs. 1.6 seconds for the tail-call-optimized version vs. the non-optimized version. Actually, I'm pretty impressed that the compiler is able to use tail-call optimization here, but it really does, as can be seen at godbolt.org: the second call of fib is optimized.
On tail call optimization:
Usually, tail-call optimization can be done if the recursive call is the last operation (prior to the return) in the function, so the stack frame can be reused for the next call; i.e. the function should be of the form
ResType f( InputType input){
    //do work
    InputType new_input = ...;
    return f(new_input);
}
There are some languages which don't do tail-call optimization at all (e.g. Python) and some for which you can explicitly ask the compiler to do it, and the compiler will fail if it is not able to (e.g. Clojure). C++ goes a way in between: the compiler tries its best (which is amazingly good!), but you have no guarantee it will succeed, and if it does not, it silently falls back to a version without tail-call optimization.
Let's take a look at this simple and standard implementation of tail recursion:
//should be called fac(n,1)
unsigned long long int
fac(unsigned long long int n, unsigned long long int res_so_far){
    if (n == 0)
        return res_so_far;
    return fac(n - 1, res_so_far * n);
}
This classical form of tail call makes it easy for the compiler to optimize: see the result here - no recursive call to fac!
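For intuition, here is roughly what that optimization amounts to (a hand-written sketch, not the actual generated code): the recursive call becomes a jump back to the top of the function, reusing the same stack frame.

// Sketch: the loop the tail-call-optimized fac(n, res_so_far) effectively becomes.
unsigned long long int
fac_as_loop(unsigned long long int n, unsigned long long int res_so_far){
    for (;;) {
        if (n == 0)
            return res_so_far;
        res_so_far = res_so_far * n;   // the argument the recursive call would have received
        n = n - 1;                     // the next "call" just updates the parameters
    }
}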
However, the gcc compiler is able to perform TCO also for less obvious cases:
unsigned long long int
fac(unsigned long long int n){
    if (n == 0)
        return 1;
    return n * fac(n - 1);
}
It is easier for us humans to read and write, but harder for the compiler to optimize (fun fact: TCO is not performed if the return type is int instead of unsigned long long int): after all, the result of the recursive call is used for a further calculation (the multiplication) before it is returned. But gcc manages to perform TCO here as well!
Using this example, we can see the result of TCO at work:
//factorial.cpp
#include <iostream>

unsigned long long int
fac(unsigned long long int n){
    if (n == 0)
        return 1;
    return n * fac(n - 1);
}

int main(){
    unsigned long long int N;
    std::cin >> N;
    std::cout << fac(N);
}
Running time echo "40000000" | ./factorial will get you the result (0, due to unsigned wrap-around) in no time if tail-call optimization was performed, or a "Segmentation fault" otherwise, because of the stack overflow caused by the recursion depth.
Actually, this is a simple test to see whether tail-call optimization was performed or not: at a large recursion depth, the non-optimized version produces a "Segmentation fault".
Corollary:
As already pointed out in the comments: only the second call of the quicksort is optimized via TCO. In your implementation, if you are unlucky and the second half of the array always consists of only one element, you will need O(n) space on the stack.
However, if the first call were always made with the smaller half and the (tail-call-optimized) second call with the larger half, you would need at most O(log n) recursion depth and thus only O(log n) space on the stack.
That means you should check which part of the array you call the quicksort on first, as it plays a huge role.
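A sketch of that idea, adapted from the question's code (the helper name partition_like_op and the wrapper are mine; the partition logic itself is copied from the question). The recursion always goes into the smaller part, and the larger part is handled by updating start/finish and looping, which is exactly what a guaranteed tail call would do, so the stack depth stays at O(log n) even in the worst case:

// Partition step, lifted from the question's qsort; returns the pivot's final index.
int partition_like_op(int* arr, int start, int finish){
    int i = start + 1, j = finish, temp;
    int pivot = arr[start];
    while(i != j){
        while (arr[j] >= pivot && j > i) --j;
        while (arr[i] < pivot && i < j) ++i;
        if(i == j) break;
        temp = arr[i]; arr[i] = arr[j]; arr[j] = temp;
    }
    if(arr[i] >= arr[start]) --i;
    temp = arr[start]; arr[start] = arr[i]; arr[i] = temp;
    return i;
}

void qsort_small_first(int* arr, int start, int finish){
    while(start < finish){
        int p = partition_like_op(arr, start, finish);
        if(p - start < finish - p){
            qsort_small_first(arr, start, p - 1);   // smaller part: real recursion
            start = p + 1;                          // larger part: loop (the former tail call)
        } else {
            qsort_small_first(arr, p + 1, finish);
            finish = p - 1;
        }
    }
}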
Wondering if the following surprises anyone, as it did me? Alex Allain's article here on using constexpr shows the following factorial example:
constexpr int factorial(int n)
{
    return n > 0 ? n * factorial(n - 1) : 1;
}
And states:
Now you can use factorial(2) and when the compiler sees it, it can optimize away the call and make the calculation entirely at compile time.
I tried this in VS2015 in Release mode with full optimizations on (/Ox), stepped through the code in the debugger viewing the assembly, and saw that the factorial calculation was not done at compile time.
Using GCC v5.4.0 with --std=c++14, I must use -O2 or -O3 before the calculation is performed at compile time. I was surprised, though, that with just -O the calculation did not occur at compile time.
My main question is: why is VS2015 not performing this calculation at compile time?
It depends on the context of the function call.
For example, the following obviously could never be calculated at compile time:
int x;
std::cin >> x;
std::cout << factorial(x);
On the other hand, this context would require the answer at compile time:
class Foo {
    int x[factorial(4)];
};
constexpr functions are only guaranteed to be evaluated at compile time if they are called from a constexpr context; otherwise it is up to the compiler to choose whether or not to eval at compile time (assuming such an optimization is possible, again, depending on the context).
You have to use it in a constant expression, as in:
constexpr auto res = factorial(2);
otherwise the computation can be done at run time.
constexpr is neither necessary nor sufficient for compile-time evaluation of a function.
It's not sufficient, even aside from the fact that the arguments obviously also have to be constant expressions. Even when that is true, a conforming compiler does not have to evaluate it at compile time. It only has to be evaluated at compile time if it is used in a constexpr context, such as assigning the result of the computation to a constexpr variable, using the value as an array size, or using it as a non-type template parameter.
The other point is that the compiler is completely capable of evaluating things at compile time even without constexpr. There is a lot of confusion about this, and it's not clear why. Compile-time evaluation of constexpr functions fundamentally just boils down to constant propagation, and compilers have been doing this optimization forever: https://godbolt.org/g/Sy214U.
int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

int foo() { return factorial(5); }
On gcc 6.3 with -O3 (and -std=c++14) this yields:
foo():
mov eax, 120
ret
In essence, outside of the specific cases where you absolutely force compile-time evaluation (for example by assigning the result of a constexpr function to a constexpr variable), compile-time evaluation has more to do with the quality of your optimizer than with the standard.
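For completeness, here is a small sketch of those forcing contexts, using the article's factorial (my own example, not from the article):

constexpr int factorial(int n)
{
    return n > 0 ? n * factorial(n - 1) : 1;
}

constexpr int fact5 = factorial(5);   // constexpr variable: must be a constant expression
static_assert(factorial(5) == 120, "evaluated by the compiler on any conforming compiler");
int storage[factorial(4)];            // array bound: must be a constant expression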
I found this code, and I am wondering whether I should really implement something like this in my real project or not.
The things confusing me are:
1. It will take more compile time, but I suppose I should not bother about compile time if it saves run time.
2. What if N gets to be a really big number? Is there any file size limit for source code?
3. Or is this just something good to know about, but not to implement?
#include <iostream>
using namespace std;

template<int N>
class Factorial {
public:
    static const int value = N * Factorial<N-1>::value;
};

template <>
class Factorial<1> {
public:
    static const int value = 1;
};

int main() {
    Factorial<5L> f;
    cout << "5! = " << f.value << endl;
}
Output:
5! = 120
A slight modification to the question: as I was playing with the code, I found that
Factorial<12> f1; // works
Factorial<13> f2; // doesn't work
error:
undefined reference to `Factorial<13>::value'
Is it that it can only go up to a depth of 12, and no further?
The answer to 1 is that it depends. Template metaprogramming essentially involves a trade-off: the calculation is done at compile time, with the benefit that it does not have to be done at run time. In general this technique can lead to hard-to-read and hard-to-maintain code. So the answer ultimately depends on your need for faster run-time performance, weighed against slower compile times and possibly harder-to-maintain code.
The article Want speed? Use constexpr meta-programming! explains how in modern C++ you can often use constexpr functions as a replacement for template metaprogramming. This in general leads to code that is more readable and perhaps faster. Compare the template metaprogramming method to the constexpr example:
constexpr int factorial( int n )
{
    return ( n == 0 ? 1 : n * factorial(n - 1) );
}
which is more concise and readable, and it will be executed at compile time for arguments that are constant expressions, although, as the linked answer explains, the standard does not actually say it must be; in practice, current implementations definitely do.
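As a quick sanity check (my own snippet, with both definitions condensed so it is self-contained), the two approaches agree and both are usable in a constant expression:

template<int N> struct Factorial { static const int value = N * Factorial<N - 1>::value; };
template<>      struct Factorial<1> { static const int value = 1; };

constexpr int factorial(int n) { return n == 0 ? 1 : n * factorial(n - 1); }

static_assert(factorial(5) == Factorial<5>::value,
              "the constexpr function and the template metaprogram agree at compile time");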
It is also worth noting that, since the result will quickly overflow value, another advantage of constexpr is that undefined behavior is not a valid constant expression; at least the current implementations of gcc and clang will turn undefined behavior within a constexpr evaluation into an error in most cases. For example:
constexpr int n = factorial(13) ;
for me generates the following error:
error: constexpr variable 'n' must be initialized by a constant expression
constexpr int n = factorial(13) ;
^ ~~~~~~~~~~~~~
note: value 6227020800 is outside the range of representable values of type 'int'
return ( n == 0 ? 1 : n*factorial(n-1) ) ;
^
This is also why your example:
Factorial<13> f2;
fails because a constant expression is required and gcc 4.9 gives a useful error:
error: overflow in constant expression [-fpermissive]
static const int value = N * Factorial<N-1>::value;
^
although older versions of gcc give you the less than helpful error you are seeing.
For question 2, compilers have a template recursion limit, which can usually be configured, but eventually you will run out of system resources. For example, the flag for gcc is -ftemplate-depth=n:
Set the maximum instantiation depth for template classes to n. A limit
on the template instantiation depth is needed to detect endless
recursions during template class instantiation. ANSI/ISO C++
conforming programs must not rely on a maximum depth greater than 17
(changed to 1024 in C++11). The default value is 900, as the compiler
can run out of stack space before hitting 1024 in some situations.
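For example, if a deep (non-overflowing) instantiation ever did hit that limit, you would raise it like this (the file name here is just illustrative):

g++ -std=c++14 -ftemplate-depth=2000 factorial_template.cpp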
In your specific problem, though, you will need to worry about signed integer overflow, which is undefined behavior, long before you have system resource issues.
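One way to push that boundary out and make the behavior well-defined is to switch to an unsigned 64-bit type (my own sketch; unsigned overflow wraps instead of being undefined, and 20! is the largest factorial that fits in 64 bits):

constexpr unsigned long long factorial_u64(unsigned n)
{
    return n == 0 ? 1ULL : n * factorial_u64(n - 1);
}

static_assert(factorial_u64(13) == 6227020800ULL, "13! no longer overflows");
static_assert(factorial_u64(20) == 2432902008176640000ULL, "20! is the last factorial that fits in 64 bits");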
I'm using gcc 4.6.1 and am getting some interesting behavior involving calling a constexpr function. This program runs just fine and straight away prints out 12200160415121876738.
#include <iostream>

extern const unsigned long joe;

constexpr unsigned long fib(unsigned long int x)
{
    return (x <= 1) ? 1 : (fib(x - 1) + fib(x - 2));
}

const unsigned long joe = fib(92);

int main()
{
    ::std::cout << "Here I am!\n";
    ::std::cout << joe << '\n';
    return 0;
}
This program takes forever to run and I've never had the patience to wait for it to print out a value:
#include <iostream>

constexpr unsigned long fib(unsigned long int x)
{
    return (x <= 1) ? 1 : (fib(x - 1) + fib(x - 2));
}

int main()
{
    ::std::cout << "Here I am!\n";
    ::std::cout << fib(92) << '\n';
    return 0;
}
Why is there such a huge difference? Am I doing something wrong in the second program?
Edit: I'm compiling this with g++ -std=c++0x -O3 on a 64-bit platform.
joe is an Integral Constant Expression; it must be usable in array bounds. For that reason, a reasonable compiler will evaluate it at compile time.
In your second program, even though the compiler could calculate it at compile time, there's no reason why it must.
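A minimal fix for the second program, then, is to route the call through something that does require a constant expression, e.g. a constexpr variable (the name fib92 is mine):

#include <iostream>

constexpr unsigned long fib(unsigned long int x)
{
    return (x <= 1) ? 1 : (fib(x - 1) + fib(x - 2));
}

int main()
{
    constexpr unsigned long fib92 = fib(92);   // a constant expression is required here
    ::std::cout << "Here I am!\n";
    ::std::cout << fib92 << '\n';
    return 0;
}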
My best guess would be that program number one has fib(92) evaluated at compile time, with lots of tables and stuff for the compiler to keep track of which values have already been evaluated, making running the program almost trivial;
whereas the second version is actually evaluated at run time without lookup tables of evaluated constant expressions, meaning that evaluating fib(92) makes an enormous number of recursive calls (on the order of 2^64).
In other words the compiler does not optimize the fact that fib(92) is a constant expression.
There's wiggle room for the compiler to decide not to evaluate at compile time if it thinks something is "too complicated". That's in cases where it's not being absolutely forced to do the evaluation in order to generate a correct program that can actually be run (as #MSalters points out).
I thought perhaps the decision affecting compile-time laziness would be the recursion depth limit. (That's suggested in the spec as 512, but you can bump it up with the command-line flag -fconstexpr-depth if you want to.) But rather, that controls when it gives up in any case... even when a compile-time constant is necessary to build the program. So it has no effect on your case.
It seems that if you want a guarantee in the code that the optimization will happen, you've found a technique for that. But if -fconstexpr-depth can't help, I'm not sure there are any other relevant compiler flags...
I also wanted to see how gcc optimizes the code for this new constexpr keyword, and actually it's just because you are calling fib(92) as a parameter of ostream::operator<<
::std::cout << fib(92) << '\n';
that it isn't evaluated at compile time. If you call it somewhere other than as a parameter of another function (like you did in)
const unsigned long joe = fib(92);
it is evaluated at compile time. I did a blog post about this if you want more info; I don't know if this should be mentioned to the gcc developers.