C++ recursion exits without obvious reason

I wrote a recursive function. While testing it, it turned out that the function is killed without any obvious reason while the recursion is still running.
To test this, I wrote an infinite recursion.
On my PC this function quits after about 2 seconds and the last output is about 327400.
The last number isn't always the same.
I am using Ubuntu Lucid Lynx, the GCC compiler and Eclipse as IDE. If somebody has an idea what the problem is and how I can prevent the program from exiting, I would be really pleased.
#include <iostream>

void rek(double x) {
    std::cout << x << std::endl;
    rek(x + 1);
}

int main(int argc, char **argv) {
    rek(1);
}

You are most likely overflowing the stack, at which point your program will be summarily killed. The depth of the stack will always limit the amount you can recurse, and if you are hitting that limit, it means your algorithm needs to change.

I think you are right in expecting the code to run forever. As explained in
How do I check if gcc is performing tail-recursion optimization?
your code should be able to run forever if gcc performs tail-call optimization. On my machine it looks like -O3 actually makes gcc generate tail calls and flatten the stack. :-)
I suggest you set the optimization flag to -O2 or -O3.

You are causing a stack overflow (running out of stack space) because you don't provide an exit condition.
void rek(double x) {
    if (x > 10)
        return;
    std::cout << x << std::endl;
    rek(x + 1);
}

Are you expecting this to work forever?
It won't. At some point you're going to run out of stack.

This is funny, talking about stack overflow on stackoverflow.com. ;) The call stack is limited (you can customize its size, e.g. from the project settings), but at some point, with infinitely recursing calls, it will be exceeded and your program terminated.
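If the limit itself is what you want to change, on Linux you can also raise the soft stack limit at runtime, before the deep recursion starts. A minimal sketch, assuming a POSIX system (the helper name and the 64 MiB figure are just illustrative):
#include <sys/resource.h>
#include <cstdio>

// Raise the soft stack limit (capped by the hard limit). On Linux the main
// thread's stack grows on demand up to RLIMIT_STACK, so doing this early in
// main() allows deeper recursion before the stack overflows.
static bool raise_stack_limit(rlim_t bytes) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return false;
    if (bytes > rl.rlim_max)
        bytes = rl.rlim_max;          // cannot exceed the hard limit
    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_STACK, &rl) == 0;
}

int main() {
    if (raise_stack_limit(64UL * 1024 * 1024))   // ask for 64 MiB
        std::printf("stack soft limit raised\n");
    // ... start the recursion here ...
}
This only postpones the crash for an unbounded recursion, of course; it helps when the recursion is deep but finite.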

If you want to avoid a stack overflow with infinite recursion, you're unfortunately going to have to delve into some assembly in order to change the stack so that a new activation record isn't constantly pushed onto it, which after some point will cause the overflow. Because you make the recursive call at the end of the function, this is what other languages where recursion is popular (i.e., Lisp, Scheme, Haskell, etc.) call tail-call optimization. It prevents a stack overflow by basically transforming the tail call into a loop. It would look something like this in C (note: I'm using inline assembly with gcc on x86, and I changed your argument to int from double in order to simplify the assembly. I've also changed from C++ to C in order to avoid name mangling of function names. Finally, the "\n\t" at the end of each statement is not an actual assembly command but is needed for inline assembly in gcc):
#include <stdio.h>

void rek(int x)
{
    printf("Value for x: %d\n", x);
    //we now duplicate the equivalent of `rek(x+1);` with tail-call optimization
    __asm("movl 8(%ebp), %eax\n\t" //get the value of x off the stack
          "incl %eax\n\t"          //add 1 to the value of x
          "movl 4(%ebp), %ecx\n\t" //save the return address
          "movl (%ebp), %edx\n\t"  //save the caller's activation record base pointer
          "addl $12, %ebp\n\t"     //erase the activation record
          "movl %ebp, %esp\n\t"    //reset the stack pointer
          "pushl %eax\n\t"         //push the new value of x on the stack for the function call
          "pushl %ecx\n\t"         //push the return address back on the stack for the eventual return to the caller (i.e., main())
          "movl %edx, %ebp\n\t"    //restore the old value of the caller's stack base pointer
          "jmp rek\n\t");          //jump to the start of rek()
}

int main()
{
    rek(1);
    printf("Finished call\n"); //<== we never get here
    return 0;
}
Compiled with gcc 4.4.3 on Ubuntu 10.04, this ran pretty much "forever" in an infinite loop with no stack overflow, whereas without the tail-call optimization it crashed with a segmentation fault pretty quickly. You can see from the comments in the __asm section how the stack activation record space is being "recycled" so that each new call does not use up additional space on the stack. This involves saving the key values from the old activation record (the previous caller's activation record base pointer and the return address) and restoring them, but with the arguments changed for the next recursive call to the function.
And again, other languages, mainly functional languages, perform tail-call optimization as a base-feature of the language. So a tail-call recursive function in Scheme/Lisp/etc. won't overflow the stack since this type of stack manipulation is done under-the-hood for you when a new function call is made as the last statement of an existing function.
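In portable C++ (without inline assembly), the same transformation amounts to rewriting the tail call as a loop; a minimal sketch of the idea, using the int version of rek from above:
#include <cstdio>

void rek(int x)
{
    for (;;) {                        // the "jmp rek" becomes a loop back-edge
        printf("Value for x: %d\n", x);
        x = x + 1;                    // the argument of the "next call"
    }
}

int main()
{
    rek(1);                           // never returns
}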

Well, you have defined an infinite recursion, which overflows the stack and kills your app. If you really want to print all numbers, then use a loop.
#include <iostream>

int main()
{
    double x = 1;
    while (true)
    {
        std::cout << x << std::endl;
        x += 1;
    }
}

Each recursive method should implement an exit condition, otherwise you will get a stack overflow and the program will terminate.
In your case, there is no condition on the parameter you are passing to the function; hence, it recurses forever and eventually crashes.

Related

Function pointer performance; slower on a single call than multiple calls?

I am interested in the execution speed of a function called through a pointer. I found initially that calling a function pointer through a pointer passed in as a parameter is slower than calling a locally declared function pointer. Please see the following code; you can see I have two function calls, both of which ultimately execute a lambda through a function pointer.
#include <chrono>
#include <iostream>

using namespace std;

__attribute__((noinline)) int plus_one(int x) {
    return x + 1;
}

typedef int (*FUNC)(int);

#define OUTPUT_TIME(msg) std::cout << "Execution time (ns) of " << msg << ": " << std::chrono::duration_cast<chrono::nanoseconds>(t_end - t_start).count() << std::endl;
#define START_TIMING() auto const t_start = std::chrono::high_resolution_clock::now();
#define END_TIMING(msg) auto const t_end = std::chrono::high_resolution_clock::now(); OUTPUT_TIME(msg);

auto constexpr g_count = 1000000;

__attribute__((noinline)) int speed_test_no_param() {
    int r;
    auto local_lambda = [](int a) {
        return plus_one(a);
    };
    FUNC f = local_lambda;
    START_TIMING();
    for (auto i = 0; i < g_count; ++i)
        r = f(100);
    END_TIMING("speed_test_no_param");
    return r;
}

__attribute__((noinline)) int speed_test_with_param(FUNC &f) {
    int r;
    START_TIMING();
    for (auto i = 0; i < g_count; ++i)
        r = f(100);
    END_TIMING("speed_test_with_param");
    return r;
}

int main() {
    int ret = 0;
    auto main_lambda = [](int a) {
        return plus_one(a);
    };
    ret += speed_test_no_param();
    FUNC fp = main_lambda;
    ret += speed_test_with_param(fp);
    return ret;
}
Built on Ubuntu 20.04 with:
g++ -ggdb -ffunction-sections -O3 -std=c++17 -DNDEBUG=1 -DRELEASE=1 -c speed_test.cpp -o speed_test.o && g++ -o speed_test -Wl,-gc-sections -Wl,--start-group speed_test.o -Wl,--rpath='$ORIGIN' -Wl,--end-group
The results were not surprising; for any given number of runs, we see that the version without the parameter is clearly the fastest. Here is just one run; all of the many times I have run, this yields the same result:
Execution time (ns) of speed_test_no_param: 74
Execution time (ns) of speed_test_with_param: 1173849
When I dig into the assembly, I found what I believe is the reason for this. The code for speed_test_no_param() is:
0x000055555555534b call 0x555555555310 <plus_one(int)>
... whereas the code for speed_test_with_param is more complicated; a fetch of the address of the lambda, then a jump to the plus_one function:
0x000055555555544e call QWORD PTR [rbx]
...
0x0000555555555324 jmp 0x555555555310 <plus_one(int)>
(On compiler explorer at https://godbolt.org/z/b4hqYx7Eo. Different compiler but similar assembly; timing code commented out.)
What I didn't expect though is that when I reduce the number of calls down to 1 from 1000000 (auto constexpr g_count = 1), the results are flipped with the parameter version being the fastest:
Execution time (ns) of speed_test_no_param: 61
Execution time (ns) of speed_test_with_param: 31
I have also run this many times; the parameter version is always the fastest.
I do not understand why this is; given this conflicting evidence, I no longer believe a call through a parameter is slower than a call through a local variable, but looking at the assembly suggests it really should be.
Can someone please explain?
UPDATE
As per the comment below, ordering matters. When I call speed_test_with_param() first, speed_test_no_param() is the fastest of the two! Yet when I call speed_test_no_param() first, speed_test_with_param() is the fastest! Any explanation to this would be greatly appreciated!
With multiple loop iterations in the C++ source, the fast version is only doing one in asm, because you gave the optimizer enough visibility to prove that's equivalent.
Why ordering matters with just one iteration: probably warm-up effects in the library code for std::chrono. Idiomatic way of performance evaluation?
Can you confirm that my suspicion that the call without the parameter technically should be the fastest, because with the parameter involves a memory read to find the location to call?
Much more significant is whether the compiler can constant-propagate the function pointer and see what function is being called; notice how speed_test_with_param has an actual loop that calls g_count times, but speed_test_no_param can see it's calling plus_one. Clang sees through the local lambda and the noinline to notice it has no side-effects, so it only calls it once.
It doesn't inline, but it still does inter-procedural optimization. With GCC, you could block that by using __attribute__((noipa)). GCC's noclone attribute can also stop it from making a copy of the function with constant-propagation into it, but noipa is I think stronger. noinline isn't sufficient for benchmarking stuff that becomes trivial to optimize when the compiler can see everything. But I don't think clang has anything like that.
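A sketch of how that attribute might be applied to the plus_one from the question (GCC-specific; I'm assuming GCC 8 or later here, and clang doesn't support it):
// GCC-specific: noipa blocks inter-procedural analysis and constant
// propagation into/out of this function, so calls to it can't be folded away.
__attribute__((noipa)) int plus_one(int x) {
    return x + 1;
}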
You can make functions opaque to the optimizer by putting them in separate source files and not using -flto or other options like gcc -fwhole-program.
The only reason store/reload is involved with the function pointer is because you passed it by reference for no reason, even though it's just a single pointer. If you pass it by value (https://godbolt.org/z/WEvvsvoxb) you can see call rbx in the loop.
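A sketch of that change, reusing the question's FUNC typedef and timing macros (the function name here is just illustrative):
__attribute__((noinline)) int speed_test_with_param_by_value(FUNC f) {
    int r = 0;
    START_TIMING();
    for (auto i = 0; i < g_count; ++i)
        r = f(100);        // the pointer stays in a register; no reload through a reference
    END_TIMING("speed_test_with_param_by_value");
    return r;
}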
Apparently clang couldn't hoist the load because it wasn't sure the caller's function-pointer wouldn't be modified by the call, because it was making a stand-alone version of speed_test_with_param that would work with any caller and any arg, not just the one main passes. So constprop didn't happen.
An indirect call can mispredict more easily, and yes store/reload adds a few cycles more latency before the prediction can be checked.
So yes, in general you'd expect it to be slower when the function to be called is a function-pointer arg, not a compile-time-constant fptr initialized within the calling function where the compiler can see the definition of what it's calling even if you artificially limit it.
If it becomes a call some_name instead of call rbx, that's still faster even if it does still have to loop like you were trying to make it.
(Microbenchmarking is hard, especially when you're trying to benchmark a C++ concept which can optimize differently depending on context; you have to know enough about compilers, optimization, and assembly to realize what makes the difference and what you're actually measuring. There isn't a meaningful answer to some questions, like "how fast or slow is the + operator?", even if you limit it to integers, because it can optimize away with constants, or vectorize, or not depending on how it's used.)
You're benchmarking a single iteration, which subjects you to cache effects and other warmup costs. The entire reason we normally run benchmarks several times is to amortize out these kinds of effects.
Caching refers to the memory hierarchy: your actual RAM is significantly slower than your CPU (and disk even more so), so to speed things up your CPU has a cache (often, multiple caches) which stores the most recently accessed bits of memory. The first time you start your program, it will need to be loaded from disk into RAM; thereafter, it will need to be loaded from RAM into the CPU caches. Uncached memory accesses can be orders of magnitudes slower than cached memory accesses. As your program runs, various bits of code and data will be loaded from RAM and cached; hence, subsequent executions of the same bit of code will often be faster than the first execution.
Other effects can include things like lazy dynamic linking and lazy initializations, wherein certain functions will perform extra work the first time they're called (for example, resolving dynamic library loads or initializing static data). These can all contribute to the first iteration being slower than subsequent iterations.
To address these issues, always make sure to run your benchmarks multiple times - and when possible, run your entire benchmark suite a few times in one process and take the lowest (fastest) run.
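A sketch of that idea: time the same work several times and keep the fastest run, so one-time costs don't dominate (the helper name is just illustrative):
#include <chrono>
#include <cstdint>
#include <limits>

// Returns the fastest of `runs` timings of work(), in nanoseconds.
template <typename F>
std::int64_t best_of(int runs, F &&work) {
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        work();
        auto t1 = std::chrono::high_resolution_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns < best)
            best = ns;
    }
    return best;
}
You would then call it with something like best_of(10, [&]{ speed_test_with_param(fp); }) and report the returned minimum.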

Segmentation fault when using large numbers in native array [duplicate]

I'm a beginner in C++. Yesterday I read about recursive functions, so I decided to write my own. Here's what I wrote:
int returnZero(int anyNumber) {
    if (anyNumber == 0)
        return 0;
    else {
        anyNumber--;
        return returnZero(anyNumber);
    }
}
When I do this: int zero1 = returnZero(4793);, it causes a stack overflow. However, if I pass the value 4792 as the argument, no overflow occurs.
Any ideas as to why?
Whenever you call a function, including recursively, the return address and often the arguments are pushed onto the call stack. The stack is finite, so if the recursion is too deep you'll eventually run out of stack space.
What surprises me is that it only takes 4793 calls on your machine to overflow the stack. This is a pretty small stack. By way of comparison, running the same code on my computer requires ~100x as many calls before the program crashes.
The size of the stack is configurable. On Unix, the command is ulimit -s.
Given that the function is tail-recursive, some compilers might be able to optimize the recursive call away by turning it into a jump. Some compilers might take your example even further: when asked for maximum optimizations, gcc 4.7.2 transforms the entire function into:
int returnZero(int anyNumber) {
return 0;
}
This requires exactly two assembly instructions:
_returnZero:
xorl %eax, %eax
ret
Pretty neat.
You just hit the call stack's size limit of your system, that's what's happening. For some reason the stack in your system is tiny, a depth of 4793 function calls is rather small.
Your stack is limited in size, and so when you make 4793 calls you are hitting the limit while 4792 just comes in under. Each function call will use some space on the stack for housekeeping and maybe arguments.
This page gives an example of what a stack looks like during a recursive function call.
My guess is your stack is exactly big enough to fit 4792 entries - today. Tomorrow or the next day, that number might be different. Recursive programming can be dangerous and this example illustrates why. We try not to let recursion get this deep, or 'bad' things can happen.
Any "boundless" recursion, that is, recursive calls that aren't naturally limited to a small(ish) number, will have this effect. Exactly where the limit goes depends on the OS and the environment the function is called in (the compiler, which function calls the recursive function, etc.).
If you add another variable, say int x[10]; to your function that calls your recursive function, the number needed to crash it will change (probably by about 5 or so).
Compile it with a different compiler (or even different compiler settings, e.g. optimization turned on) and it will probably change again.
Using recursion, you can achieve SuperDigit:
#include <iostream>

// Repeatedly sum the digits until a single digit remains (the "super digit").
int getSum(int n)
{
    int sum = 0;
    while (n > 0) {
        sum += n % 10;   // add the last digit
        n /= 10;         // drop the last digit
    }
    return sum < 10 ? sum : getSum(sum);   // recurse on the digit sum
}

int main()
{
    int n = 8596854;
    std::cout << getSum(n) << std::endl;
}

Why doesn't my C++ compiler optimize these memory writes away?

I created this program. It does nothing of interest but use processing power.
Looking at the output with objdump -d, I can see the three rand calls and corresponding mov instructions near the end, even when compiling with -O3.
Why doesn't the compiler realize that memory isn't going to be used and just replace the bottom half with while(1){}? I'm using gcc, but I'm mostly interested in what is required by the standard.
/*
 * Create a program that does nothing except slow down the computer.
 */
#include <cstdlib>
#include <unistd.h>

int getRand(int max) {
    return rand() % max;
}

int main() {
    for (int thread = 0; thread < 5; thread++) {
        fork();
    }
    int len = 1000;
    int *garbage = (int*)malloc(sizeof(int)*len);
    for (int x = 0; x < len; x++) {
        garbage[x] = x;
    }
    while (true) {
        garbage[getRand(len)] = garbage[getRand(len)] - garbage[getRand(len)];
    }
}
Because GCC isn't smart enough to perform this optimization on dynamically allocated memory. However, if you change garbage to be a local array instead, GCC compiles the loop to this:
.L4:
call rand
call rand
call rand
jmp .L4
This just calls rand repeatedly (which is needed because the call has side effects), but optimizes out the reads and writes.
If GCC were even smarter, it could also optimize out the rand calls, because their side effects only affect any later rand calls, and in this case there aren't any. However, this sort of optimization would probably be a waste of compiler writers' time.
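For reference, the local-array change mentioned above might look like this (a sketch; everything else is unchanged from the question):
#include <cstdlib>
#include <unistd.h>

int getRand(int max) {
    return rand() % max;
}

int main() {
    for (int thread = 0; thread < 5; thread++)
        fork();
    const int len = 1000;
    int garbage[len];                 // a local array instead of malloc: GCC can now
    for (int x = 0; x < len; x++)     // prove the stores are never observed
        garbage[x] = x;
    while (true)
        garbage[getRand(len)] = garbage[getRand(len)] - garbage[getRand(len)];
}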
It can't, in general, tell that rand() doesn't have observable side-effects here, and it isn't required to remove those calls.
It could remove the writes, but it may be the use of arrays is enough to suppress that.
The standard neither requires nor prohibits what it is doing. As long as the program has the correct observable behaviour any optimisation is purely a quality of implementation matter.
This code causes undefined behaviour because it has an infinite loop with no observable behaviour. Therefore any result is permissible.
In C++14 the text is 1.10/27:
The implementation may assume that any thread will eventually do one of the following:
terminate,
make a call to a library I/O function,
access or modify a volatile object, or
perform a synchronization operation or an atomic operation.
[Note: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. —end note ]
I wouldn't say that rand() counts as an I/O function.
Related question
It also leaves a chance to crash via an array overflow! The compiler won't speculate on the range of outputs of getRand.

Can C++ compilers optimize "if" statements inside "for" loops?

Consider an example like this:
if (flag)
for (condition)
do_something();
else
for (condition)
do_something_else();
If flag doesn't change inside the for loops, this should be semantically equivalent to:
for (condition)
if (flag)
do_something();
else
do_something_else();
The difference is that in the first case, the code might be much longer (e.g. if several for loops are used or if do_something() is a code block that is mostly identical to do_something_else()), while in the second case, the flag gets checked many times.
I'm curious whether current C++ compilers (most importantly, g++) would be able to optimize the second example to get rid of the repeated tests inside the for loop. If so, under what conditions is this possible?
Yes, if it is determined that flag doesn't change and can't be changed by do_something or do_something_else, it can be pulled outside the loop. I've heard of this called loop hoisting, but Wikipedia has an entry called "loop invariant code motion".
If flag is a local variable, the compiler should be able to do this optimization since it's guaranteed to have no effect on the behavior of the generated code.
If flag is a global variable, and you call functions inside your loop, it might not perform the optimization - it may not be able to determine whether those functions modify the global.
This can also be affected by the sort of optimization you do - optimizing for size would favor the non-hoisted version while optimizing for speed would probably favor the hoisted version.
In general, this isn't the sort of thing that you should worry about, unless profiling tells you that the function is a hotspot and you see that less than efficient code is actually being generated by going over the assembly the compiler outputs. Micro-optimizations like this you should always just leave to the compiler unless you absolutely have to.
Tried with GCC and -O3:
void foo();
void bar();

int main()
{
    bool doesnt_change = true;
    for (int i = 0; i != 3; ++i) {
        if (doesnt_change) {
            foo();
        }
        else {
            bar();
        }
    }
}
Result for main:
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call ___main
call __Z3foov
call __Z3foov
call __Z3foov
xorl %eax, %eax
leave
ret
So it does optimize away the choice (and unrolls smaller loops).
This optimization is not done if doesnt_change is global.
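For comparison, the global-variable variant referred to above might look like this (same code with the flag moved out of main, mirroring the declarations-only style of the example; in that experiment GCC then kept the test inside the loop):
void foo();
void bar();

bool doesnt_change = true;   // now a global: a called function could, in principle, modify it

int main()
{
    for (int i = 0; i != 3; ++i) {
        if (doesnt_change) {
            foo();
        }
        else {
            bar();
        }
    }
}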
I'm sure that if the compiler can determine that the flag will remain constant, it can do some shuffling:
const bool flag = /* ... */;

for (..; ..; ..)
{
    if (flag)
    {
        // ...
    }
    else
    {
        // ...
    }
}
If the flag is not const, the compiler cannot necessarily optimize the loop, because it can't be sure flag won't change. It can if it does static analysis, but not all compilers do, I think. const is the sure-fire way of telling the compiler the flag won't change, after that it's up to the compiler.
As usual, profile and find out if it's really a problem.
I would be wary of saying that it will. Can it guarantee that the value won't be modified by this, or another, thread?
That said, the second version of the code is generally more readable and it would probably be the last thing to optimize in a block of code.
As many have said: it depends.
If you want to be sure, you should try to force a compile-time decision. Templates often come in handy for this:
for (condition)
do_it<flag>();
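A sketch of what do_it might look like, assuming the do_something() and do_something_else() from the question; because Flag is a template parameter, the branch is resolved at compile time and the untaken side is dropped:
void do_something();
void do_something_else();

template <bool Flag>
void do_it() {
    if (Flag)                 // compile-time constant: the dead branch is eliminated
        do_something();
    else
        do_something_else();
}

// usage (the flag must be known at compile time):
// for (condition)
//     do_it<true>();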
Generally, yes. But there is no guarantee, and the places where the compiler will do it are probably rare.
What most compilers do without a problem is hoisting immutable evaluations out of the loop, e.g. if your condition is
if (a<b) ....
when a and b are not affected by the loop, the comparison will be made once before the loop.
This means that if the compiler can determine the condition does not change, the test is cheap and the jump well predicted. This in turn means the test itself costs one cycle or no cycle at all (really).
In which cases splitting the loop would be beneficial?
a) a very tight loop where the 1 cycle is a significant cost
b) the entire loop with both parts does not fit the code cache
Now, the compiler can only make assumptions about the code cache, and usually can order the code in a way that one branch will fit the cache.
Without any testing, I'd expect a) to be the only case where such an optimization would be applied, because it's not always the better choice:
In which cases splitting the loop would be bad?
When splitting the loop increases code size beyond the code cache, you will take a significant hit. Now, that only affects you if the loop itself is called within another loop, but that's something the compiler usually can't determine.
[edit]
I couldn't get VC9 to split the following loop (one of the few cases where it might actually be beneficial)
extern volatile int vflag = 0;

int foo(int count)
{
    int sum = 0;
    int flag = vflag;
    for (int i = 0; i < count; ++i)
    {
        if (flag)
            sum += i;
        else
            sum -= i;
    }
    return sum;
}
[edit 2]
note that with int flag = true; the second branch does get optimized away. (and no, const doesn't make a difference here ;))
What does that mean? Either it doesn't support that, it doesn't matter, or my analysis is wrong ;-)
Generally, I'd assume this is an optimization that is valuable only in a very few cases, and can be done by hand easily in most scenarios.
It's called a loop invariant and the optimization is called loop invariant code motion and also code hoisting. The fact that it's in a conditional will definitely make the code analysis more complex and the compiler may or may not invert the loop and the conditional depending on how clever the optimizer is.
There is a general answer for any specific case of this kind of question, and that's to compile your program and look at the generated code.