I was messing around with tail-recursive functions in C++, and I've run into a bit of a snag with the g++ compiler.
The following code results in a stack overflow when numbers[] is over a couple hundred integers in size. Examining the assembly code generated by g++ for the following reveals that twoSum_Helper is executing a recursive call instruction to itself.
The question is which of the following is causing this?
A mistake in the following that I am overlooking which prevents tail-recursion.
A mistake with my usage of g++.
A flaw in the detection of tail-recursive functions within the g++ compiler.
I am compiling with g++ -O3 -Wall -fno-stack-protector test.c on Windows Vista x64 via MinGW with g++ 4.5.0.
struct result
{
    int i;
    int j;
    bool found;
};

struct result gen_Result(int i, int j, bool found)
{
    struct result r;
    r.i = i;
    r.j = j;
    r.found = found;
    return r;
}

// Return 2 indexes from numbers that sum up to target.
struct result twoSum_Helper(int numbers[], int size, int target, int i, int j)
{
    if (numbers[i] + numbers[j] == target)
        return gen_Result(i, j, true);
    if (i >= (size - 1))
        return gen_Result(i, j, false);
    if (j >= size)
        return twoSum_Helper(numbers, size, target, i + 1, i + 2);
    else
        return twoSum_Helper(numbers, size, target, i, j + 1);
}
Tail call optimization in C or C++ is extremely limited, and pretty much a lost cause. The reason is that there generally is no safe way to tail-call from a function that passes a pointer or reference to any local variable (as an argument to the call in question, or in fact any other call in the same function) -- which of course is happening all over the place in C/C++ land, and is almost impossible to live without.
The problem you are seeing is probably related: GCC likely compiles returning a struct by actually passing the address of a hidden variable allocated on the caller's stack into which the callee copies it -- which makes it fall into the above scenario.
Try compiling with -O2 instead of -O3.
How do I check if gcc is performing tail-recursion optimization?
Well, it doesn't work with -O2 either. The only thing that seems to work is writing the result object into a reference that is passed as a parameter.
But really, it's much easier to just remove the tail call and use a loop instead. TCO is there to optimize tail calls that are found when inlining or when performing aggressive unrolling, but you shouldn't attempt to use recursion when handling large inputs anyway.
I can't get g++ 4.4.0 (under mingw) to perform tail recursion, even on this simple function:
static void f (int x)
{
    if (x == 0) return;
    printf ("%p\n", &x); // or cout in C++, if you prefer
    f (x - 1);
}
I've tried -O3, -O2, -fno-stack-protector, C and C++ variants. No tail recursion.
I would look at 2 things.
The return call in the if statement produces a branch target (for the else) in the stack frame of the current run of the function that needs to be resolved after the call, which would mean any TCO attempt would not be able to overwrite the executing stack frame, negating the TCO.
The numbers[] array argument is a variable-length data structure, which could also prevent TCO, because TCO reuses the same stack frame in one way or another. If the call is self-referencing (like yours), it overwrites the stack-defined (locally defined) variables with the values/references of the new call. If the tail call is to another function, it overwrites the entire stack frame with the new function's (in a case where TCO can be done in A => B => C, TCO could make this look like A => C in memory during execution). I would try a pointer.
It has been a couple months since I have built anything in C++ so I didn't run any tests, but I think one/both of those are preventing the optimization.
Try changing your code to:
// Return 2 indexes from numbers that sum up to target.
struct result twoSum_Helper(int numbers[], int size, int target, int i, int j)
{
    if (numbers[i] + numbers[j] == target)
        return gen_Result(i, j, true);
    if (i >= (size - 1))
        return gen_Result(i, j, false);
    if (j >= size)
        i++; // call by value, changing i here does not matter
    return twoSum_Helper(numbers, size, target, i, i + 1);
}
edit: removed unnecessary parameter as per comment from asker
// Return 2 indexes from numbers that sum up to target.
struct result twoSum_Helper(int numbers[], int size, int target, int i)
{
    if (numbers[i] + numbers[i+1] == target || i >= (size - 1))
        return gen_Result(i, i+1, true);
    if (i+1 >= size)
        i++; // call by value, changing i here does not matter
    return twoSum_Helper(numbers, size, target, i);
}
Support of Tail Call Optimization (TCO) is limited in C/C++.
So, if the code relies on TCO to avoid stack overflow, it may be better to rewrite it with a loop. Otherwise, an automated test is needed to be sure that the code is optimized.
Typically TCO may be suppressed by:
passing pointers to objects on stack of recursive function to external functions (in case of C++ also passing such object by reference);
a local object with a non-trivial destructor, even if the tail recursion is valid (the destructor is called before the tail return statement); see, for example, Why isn't g++ tail call optimizing while gcc is?
Here, TCO is confused by the structure being returned by value.
It can be fixed if the result of all recursive calls is written to the same memory address, allocated in the outer function twoSum (similar to the answer https://stackoverflow.com/a/30090390/4023446 to Tail-recursion not happening):
struct result
{
    int i;
    int j;
    bool found;
};

struct result gen_Result(int i, int j, bool found)
{
    struct result r;
    r.i = i;
    r.j = j;
    r.found = found;
    return r;
}

struct result* twoSum_Helper(int numbers[], int size, int target,
                             int i, int j, struct result* res_)
{
    if (i >= (size - 1)) {
        *res_ = gen_Result(i, j, false);
        return res_;
    }
    if (numbers[i] + numbers[j] == target) {
        *res_ = gen_Result(i, j, true);
        return res_;
    }
    if (j >= size)
        return twoSum_Helper(numbers, size, target, i + 1, i + 2, res_);
    else
        return twoSum_Helper(numbers, size, target, i, j + 1, res_);
}

// Return 2 indexes from numbers that sum up to target.
struct result twoSum(int numbers[], int size, int target)
{
    struct result r;
    return *twoSum_Helper(numbers, size, target, 0, 1, &r);
}
The value of res_ pointer is constant for all recursive calls of twoSum_Helper.
It can be seen in the assembly output (the -S flag) that the twoSum_Helper tail recursion is optimized as a loop even with two recursive exit points.
Compile options: g++ -O2 -S (g++ version 4.7.2).
I have heard others complain that tail recursion is only optimized with gcc and not g++. Could you try using gcc?
Since the code of twoSum_Helper is calling itself it shouldn't come as a surprise that the assembly shows exactly that happening. That's the whole point of a recursion :-) So this hasn't got anything to do with g++.
Every recursion creates a new stack frame, and stack space is limited by default. You can increase the stack size (don't know how to do that on Windows, on UNIX the ulimit command is used), but that only defers the crash.
The real solution is to get rid of the recursion. See for example this question and this question.
Related
I have been programming using the C++ language for quite some time now. I recently came across a situation for which I need help. For a recursive call without a base condition, why does the compiler not show an error during compilation? I, however, receive an error message during runtime.
Take the following for an example. Thanks!
#include <iostream>
#include <climits>

int fibonacci(int n){
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main(){
    int ans = fibonacci(6);
    std::cout << ans << std::endl;
}
The premise of the question is false. GCC reports:
: In function 'int fibonacci(int)':
:6:5: warning: infinite recursion detected [-Winfinite-recursion]
6 | int fibonacci(int n){
| ^~~~~~~~~
:8:21: note: recursive call
8 | return fibonacci(n - 1) + fibonacci(n - 2);
| ~~~~~~~~~^~~~~~~
Clang reports:
:6:21: warning: all paths through this function will call itself [-Winfinite-recursion]
int fibonacci(int n){
^
1 warning generated.
MSVC reports:
(9) : warning C4717: 'fibonacci': recursive on all control paths, function will cause runtime stack overflow
Modern compilers, in their quest to help you out and generate near-optimal code, will indeed recognize that this function never terminates. However, nothing in the C or C++ language specifications requires that. In contrast to languages like Prolog or Haskell, C/C++ do not guarantee any semantic analysis of your program. A very simple compiler would turn your code
int fibonacci(int n){
    return fibonacci(n - 1) + fibonacci(n - 2);
}
into a sequence of low-level instructions equivalent to
set a = n - 1
set b = n - 2
put a in the stack position or register for the first int argument
call function fibonacci
move the return value into temporary x
put b in the stack position or register for the first int argument
call function fibonacci
move the return value into temporary y
set z = x + y
move z into the stack position or register for the function return value
return to caller
This is a perfectly legal compilation of your program, and does not require any errors or warnings to be generated. Obviously, during execution, the "move the return value into temporary x" and later instructions (most significantly, the "return to caller") will never be reached. This will generate an infinite recursion loop until the machine stack space is exhausted.
I understand I can disable tail recursion optimization in GCC with the option -fno-optimize-sibling-calls. However, it disables the optimization for the whole compilation unit.
Is there a way to disable the optimization for a single function?
It is my understanding that I can change the function so it's not a valid candidate for tail recursion - say, by using the return value in an expression so the return is not the last instruction in the function (e.g.: return f(n) + 1;).
The solution above may still be optimizable, though, and future (or current, I don't know) versions of the compiler may be smart enough to make it into a tail call - say, by changing int f(i) { if(!i) return 0; return f(i - 1) + 1; } into int f(i, r = 0) { if(!i) return r; return f(i - 1, r + 1); }
I'm looking for a cleaner and future proof solution that doesn't require changing the algorithm, if at all possible.
Looking through the documentation I couldn't find a function attribute or built-in that does that, but my search hasn't been exhaustive.
You may be able to use the GCC-specific #pragma optimize() directive (combined with suitable bracketing with push/pop #pragma lines) to achieve a result similar to specifying a function attribute:
#pragma GCC push_options // Save current options
#pragma GCC optimize ("no-optimize-sibling-calls")

int test(int i)
{
    if (i == 1) return 0;
    return i + test(i - 1);
}

#pragma GCC pop_options // Restore saved options

int main()
{
    int i = 5;
    int j = test(i);
    return j;
}
But note that clang doesn't support this form of #pragma optimize. Also, note this warning from the manual:
Not every optimization option that starts with the -f prefix specified
by the attribute necessarily has an effect on the function. The
optimize attribute should be used for debugging purposes only. It is
not suitable in production code.
__attribute__((__optimize__("no-optimize-sibling-calls"))) appears to work on GCC.
Clang gives warning: unknown attribute '__optimize__' ignored.
In a quicksort implementation, the data on the left is for plain -O2-optimized code, and the data on the right is -O2-optimized code with the -fno-optimize-sibling-calls flag turned on, i.e. with tail-call optimization turned off. This is the average of 3 different runs; variation seemed negligible. Values were in the range 1-1000, time in milliseconds. The compiler was MinGW g++, version 6.3.0.
size of array  with TCO (ms)  without TCO (ms)
8M 35,083 34,051
4M 8,952 8,627
1M 613 609
Below is my code:
#include <bits/stdc++.h>
using namespace std;

int N = 4000000;

void qsort(int* arr, int start = 0, int finish = N - 1){
    if (start >= finish) return;
    int i = start + 1, j = finish, temp;
    auto pivot = arr[start];
    while (i != j){
        while (arr[j] >= pivot && j > i) --j;
        while (arr[i] < pivot && i < j) ++i;
        if (i == j) break;
        temp = arr[i]; arr[i] = arr[j]; arr[j] = temp; // swap big guy to right side
    }
    if (arr[i] >= arr[start]) --i;
    temp = arr[start]; arr[start] = arr[i]; arr[i] = temp; // swap pivot
    qsort(arr, start, i - 1);
    qsort(arr, i + 1, finish);
}

int main(){
    srand(time(NULL));
    int* arr = new int[N];
    for (int i = 0; i < N; i++) { arr[i] = rand() % 1000 + 1; }
    auto start = clock();
    qsort(arr);
    cout << (clock() - start) << endl;
    return 0;
}
I heard clock() isn't the perfect way to measure time, but this effect seems to be consistent.
EDIT: in response to a comment, I guess my question is: how exactly does gcc's tail-call optimizer work, why did this happen, and what should I do to leverage tail calls to speed up my program?
On speed:
As already pointed out in the comments, the primary goal of tail-call-optimization is to reduce the usage of the stack.
However, there is often a collateral benefit: the program becomes faster because there is no overhead needed for a function call. This gain is most prominent if the work in the function itself is not that big, so the overhead has some weight.
If there is a lot of work done during a function call, the overhead can be neglected and there is no noticeable speed-up.
On the other hand, if tail-call optimization is done, other optimizations that could otherwise speed up your code potentially cannot be done.
The case of your quicksort is not that clear-cut: there are some calls with a lot of workload and a lot of calls with a very small workload.
So, for 1M elements there are more disadvantages from tail-call optimization than advantages. On my machine, the tail-call-optimized function becomes faster than the non-optimized one for arrays smaller than 50000 elements.
I must confess I cannot say why this is the case just from looking at the assembly. All I can tell is that the resulting assemblies are pretty different and that the quicksort is really called only once in the optimized version.
There is a clear cut example, for which tail-call-optimization is much faster (because there is not very much happening in the function itself and the overhead is noticeable):
//fib.cpp
#include <iostream>

unsigned long long int fib(unsigned long long int n){
    if (n == 0 || n == 1)
        return 1;
    return fib(n - 1) + fib(n - 2);
}

int main(){
    unsigned long long int N;
    std::cin >> N;
    std::cout << fib(N);
}
Running echo "40" | ./fib, I get 1.1 vs. 1.6 seconds for the tail-call-optimized vs. the non-optimized version. Actually, I'm pretty impressed that the compiler is able to use tail-call optimization here - but it really does, as can be seen at godbolt.org - the second call of fib is optimized.
On tail call optimization:
Usually, tail-call optimization can be done if the recursive call is the last operation (prior to the return) in the function - the variables on the stack can be reused for the next call, i.e. the function should be of the form
ResType f(InputType input){
    //do work
    InputType new_input = ...;
    return f(new_input);
}
There are some languages which don't do tail-call optimization at all (e.g. Python) and some for which you can explicitly ask the compiler to do it, and the compilation will fail if it is not able to (e.g. Clojure). C++ goes a way in between: the compiler tries its best (which is amazingly good!), but you have no guarantee it will succeed, and if it doesn't, it silently falls back to a version without tail-call optimization.
Let's take look at this simple and standard implementation of tail call recursion:
//should be called fac(n,1)
unsigned long long int
fac(unsigned long long int n, unsigned long long int res_so_far){
    if (n == 0)
        return res_so_far;
    return fac(n - 1, res_so_far * n);
}
This classical form of tail call makes it easy for the compiler to optimize: see the result here - no recursive call to fac!
However, the gcc compiler is able to perform the TCO also for less obvious cases:
unsigned long long int
fac(unsigned long long int n){
    if (n == 0)
        return 1;
    return n * fac(n - 1);
}
It is easier for us humans to read and write, but harder for the compiler to optimize (fun fact: TCO is not performed if the return type is int instead of unsigned long long int): after all, the result of the recursive call is used for a further calculation (the multiplication) before it is returned. But gcc manages to perform TCO here as well!
With this example, we can see the result of TCO at work:
//factorial.cpp
#include <iostream>

unsigned long long int
fac(unsigned long long int n){
    if (n == 0)
        return 1;
    return n * fac(n - 1);
}

int main(){
    unsigned long long int N;
    std::cin >> N;
    std::cout << fac(N);
}
Running echo "40000000" | ./factorial will get you the result (0) in no time if tail-call optimization was performed, or a "Segmentation fault" otherwise, because of the stack overflow due to the recursion depth.
Actually, this is a simple test to see whether tail-call optimization was performed or not: a "Segmentation fault" at large recursion depth means the non-optimized version.
Corollary:
As already pointed out in the comments, only the second call of the quicksort is optimized via TCO. In your implementation, if you are unlucky and the second half of the array always consists of only one element, you will need O(n) space on the stack.
However, if the first call is always made with the smaller half and the second call (the one that gets tail-call optimized) with the larger half, you need at most O(log n) recursion depth and thus only O(log n) space on the stack.
That means you should check which part of the array you call the quicksort on first, as it plays a huge role.
In the following example of a templated function, is the central if inside the for loop guaranteed to be optimized out, leaving only the used instructions?
If this is not guaranteed to be optimized (in GCC 4, MSVC 2013 and LLVM 8.0), what are the alternatives, using C++11 at most?
NOTE that this function does nothing useful, and I know it can be optimized in several other ways. But all I want to focus on is how the bool template argument works in generating code.
template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    float ret = (IsMin ? std::numeric_limits<float>::max() : -std::numeric_limits<float>::max());
    for (int x = 0; x < arraySize; x++) {
        // Is this code optimized by the compiler to skip the unnecessary if?
        if (IsMin) {
            if (ret > vals[x]) ret = vals[x];
        } else {
            if (ret < vals[x]) ret = vals[x];
        }
    }
    return ret;
}
In theory no. The C++ standard permits compilers to be not just dumb, but downright hostile. It could inject code doing useless stuff for no reason, so long as the abstract machine behaviour remains the same.1
In practice, yes. Dead code elimination and constant branch detection are easy, and every single compiler I have ever checked eliminates that if branch.
Note that both branches are compiled before one is eliminated, so they both must be fully valid code. The output assembly behaves "as if" both branches exist, but the branch instruction (and unreachable code) is not an observable feature of the abstract machine behaviour.
Naturally if you do not optimize, the branch and dead code may be left in, so you can move the instruction pointer into the "dead code" with your debugger.
1 As an example, nothing prevents a compiler from implementing a+b as a loop calling inc in assembly, or a*b as a loop adding a repeatedly. This is a hostile act by the compiler on almost all platforms, but not banned by the standard.
There is no guarantee that it will be optimized away. There is a pretty good chance that it will be though since it is a compile time constant.
That said, C++17 gives us if constexpr, which will only compile the code that passes the check. If you want a guarantee, then I would suggest you use this feature instead.
Before C++17 if you only want one part of the code to be compiled you would need to specialize the function and write only the code that pertains to that specialization.
Since you asked for an alternative in C++11, here is one:
#include <limits>
#include <type_traits>

float IterateOverArrayImpl(float* vals, int arraySize, std::false_type)
{
    float ret = -std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret < vals[x])
            ret = vals[x];
    }
    return ret;
}

float IterateOverArrayImpl(float* vals, int arraySize, std::true_type)
{
    float ret = std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret > vals[x])
            ret = vals[x];
    }
    return ret;
}

template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    return IterateOverArrayImpl(vals, arraySize, std::integral_constant<bool, IsMin>());
}
You can see it live here.
The idea is to use function overloading to handle the test.
I was writing factorial using tail recursion and I have a question. My original function looks like this:
Code snippet A
#include <stdio.h>

int fun(int n, int sofar); /* forward declaration */

int main(void)
{
    int n = 0;
    printf("Enter number to find factorial of : ");
    scanf("%d", &n);
    printf("fact == %d\n", fun(n, 1));
    return 0;
}

int fun(int n, int sofar)
{
    int ret = 0;
    if (n == 0)
        return sofar;
    ret = fun(n - 1, sofar * n);
    return ret;
}
However, even if I do not use the final return, it still works. This does not make sense to me, since I am returning the value only in the base case. Say n == 5: then 120 would be returned at the base case, but what gets returned from the 4th invocation back to the 3rd invocation cannot be predicted, since we are not explicitly specifying any return, unlike in Code snippet A.
Code snippet B
int fun(int n, int sofar)
{
    int ret = 0;
    if (n == 0)
        return sofar;
    ret = fun(n - 1, sofar * n);
}
I am thinking the above works because of some kind of compiler optimization ? Because If I add a printf statement to Code snippet B, it does not work anymore.
Code snippet C
int fun(int n, int sofar)
{
    int ret = 0;
    if (n == 0)
        return sofar;
    ret = fun(n - 1, sofar * n);
    printf("now it should not work\n");
}
Probably the printf causes something to be removed from the stack? Please help me understand this.
Not returning a value from a function which should return a value is undefined behavior.
If it works by any luck, it's not an optimization but just a coincidence, given how and where these automatically allocated values are stored.
One could speculate about the reason why this works, but there is no way to give a definitive answer: reaching the end of a value-returning function without the return statement and using the return value is undefined behavior, with or without optimization.
The reason this "works" in your compiler is that the return mechanism used by your compiler happens to have the right value at the time the end of the function is reached. For example, if your compiler returns integers in the same register that was used for the last computation in your code (i.e. ret = fun(n-1, sofar*n)), then the right value would be loaded into the return register by accident, masking the undefined behavior.
It works because the return value is almost always stored in a specific CPU register (eax for x86). This means that if you don't explicitly return a value, the return register is not explicitly set. Because of that, its value can be anything, but it is often the return value of the last called function. Thus, ending a function with myfunc(); is almost guaranteed to have the same behavior as return myfunc(); (but it's still undefined behavior).
Here is the reason 'something' is calculated:
The call to printf() has a return value (very rarely used): the number of characters printed (including tabs, newlines, etc.).
In Code snippet C, printf() returns 23.
int return values are always returned in the same register.
The printf() call set that register to 23.
So something is returned; nothing was removed from the stack.
The reason why it seems to work in B is that the return value is most likely passed in a register on your architecture. So when returning through all the layers of recursion (or if the compiler optimized the whole thing into iteration), nothing touches that register, and your code appears to work.
A different compiler might not allow this to happen, or maybe the next version of your compiler will optimize this differently. In fact, the compiler can remove most of the function, because it is allowed to assume that, since undefined behavior can't happen, the part of the function where the undefined behavior seems to occur will never be reached and can be safely removed.