why is this factorial recursive function inefficient? - c++

In the book "Think like a programmer", the following recursive function is said to be "highly inefficient" and I can't figure out why (the book does not explain). It doesn't seem like there are any unnecessary calculations being done. Is it because of the overhead of calling so many functions (well the same function multiple times) and thus setting up environments for each call to the function?
int factorial(int n) {
if (n == 1) return 1;
else return n * factorial(n-1);
}

It is inefficient in two ways, and you hit one of them:
It is recursive instead of iterative. Every call has to set up its own stack frame, which is costly unless the compiler eliminates the recursion (tail-call optimization is the usual name for this). It could have been done like this:
int factorial(int n)
{
int result = 1;
while (n > 0)
{
result *= n;
n--;
}
return result;
}
or, alternatively, with a for loop.
However, as noted in comments above, it doesn't really matter how efficient it is if an int can't even hold the result. It really should be longs, long longs, or even big-ints.
The second inefficiency is simply not having an efficient algorithm. This list of factorial algorithms shows some more efficient ways of computing the factorial by decreasing the number of numerical operations.
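To make the overflow point above concrete, here is a minimal sketch (not from the book) of an iterative factorial in 64 bits that reports overflow instead of silently wrapping. The name checked_factorial and the bool-plus-out-parameter interface are assumptions made for illustration only.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: computes n! into `out` and returns true,
// or returns false if n! does not fit in 64 unsigned bits.
bool checked_factorial(unsigned n, std::uint64_t& out) {
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        // result * i would overflow exactly when result > max / i.
        if (result > std::numeric_limits<std::uint64_t>::max() / i) {
            return false;
        }
        result *= i;
    }
    out = result;
    return true;
}
```

With a 32-bit int the original function already overflows at 13!, and even the 64-bit version above gives up at 21!, so for anything larger a big-integer type is needed.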

There is significant function call overhead in C when not using a compiler that implements tail call optimization.
Function call overhead is the extra time and memory necessary for a computer to properly set up a function call.
Tail call optimization is a method of turning recursive functions like the one given into a loop.

I think the book writer may want to tell readers to not abuse recursion. For this function you could just use:
int factorial(int n) {
int res = 1;
for (int i = 1; i <= n; i++) {
res = res * i;
}
return res;
}

Recursion is slower and also uses more stack memory. Pushing information onto the stack and popping it again takes time. The main advantage of recursion is that it can make an algorithm a little easier to understand, or more "elegant".
To find the factorial we can use a for loop, which needs only constant stack space and avoids the per-call overhead (the number of multiplications stays the same):
int num=4;
int fact = 1;
for (;num>1;num--)
{
fact = fact*num;
}
//display fact

Related

In recursive DP, break up recursion call by storing variables: inefficient?

Suppose I am solving a dynamic programming problem recursively (top down). For example, a recursive solution to the longest common subsequence problem:
LCS(S,n,T,m)
{
if (n==0 || m==0) return 0;
if (S[n] == T[m]) result = 1 + LCS(S,n-1,T,m-1);
else result = max( LCS(S,n-1,T,m), LCS(S,n,T,m-1) );
return result;
}
Often in such a DP problem at some point we have to take the max of some expressions, representing returns to different choices we can make. In the above case we have the max of two simple expressions, but in worse cases it can be the max of three or four quite complicated expressions involving long function calls. In such situations, I am often tempted to give these complicated expressions their own variable names, to make the code more readable. In the above case that would mean I would write
LCS(S,n,T,m)
{
if (n==0 || m==0) return 0;
if (S[n] == T[m]) result = 1 + LCS(S,n-1,T,m-1);
else {
a = LCS(S,n-1,T,m);
b = LCS(S, n, T, m-1);
result = max(a, b);
}
return result;
}
(In this simplified case a and b are not complicated, but in other cases they are, and there may be even more arguments to the max function, so this could really help it be more understandable.)
My Question: Is this a terrible idea? As I understand it, I'm adding a variable to each layer of the call stack, and I'm thinking that could be wasteful. But on the other hand, at each layer it has to calculate the temporary variable LCS(S,n,T,m) anyway (I'm thinking in terms of C++, say), and as far as I know, there might be not much difference in cost between the two ways.
If this is a terrible idea, is there a more efficient way to break up a complicated recursive function call to make it more readable?
C++ has the "As-If" rule, which states that a compiler can do whatever it wants so long as the observable effects are indistinguishable from what is defined by the standard to happen. In this case, it's trivial to prove both fragments have the same meaning, and a compiler will likely emit identical instructions for both.
Note: You aren't doing dynamic programming here, as you don't memoise parameter / result pairs.
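For reference, a memoised version of the recursion might look like the sketch below. It keeps the named temporaries a and b from the question. The function names, the use of std::string, and the -1 sentinel for "not computed yet" are assumptions made for the example, not part of the original code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// memo[n][m] caches the LCS length of the first n chars of s and the
// first m chars of t; -1 means the entry has not been computed yet.
static int lcs_memo(const std::string& s, std::size_t n,
                    const std::string& t, std::size_t m,
                    std::vector<std::vector<int>>& memo) {
    if (n == 0 || m == 0) return 0;
    int& cached = memo[n][m];
    if (cached != -1) return cached;
    if (s[n - 1] == t[m - 1]) {
        cached = 1 + lcs_memo(s, n - 1, t, m - 1, memo);
    } else {
        const int a = lcs_memo(s, n - 1, t, m, memo);  // drop last char of s
        const int b = lcs_memo(s, n, t, m - 1, memo);  // drop last char of t
        cached = std::max(a, b);
    }
    return cached;
}

int lcs_length(const std::string& s, const std::string& t) {
    std::vector<std::vector<int>> memo(
        s.size() + 1, std::vector<int>(t.size() + 1, -1));
    return lcs_memo(s, s.size(), t, t.size(), memo);
}
```

Each (n, m) pair is now computed at most once, which is what turns the exponential recursion into an O(n*m) algorithm. The extra locals have no bearing on that.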

Tail-recursion with objects

I have a recursive function that I would like to make tail-recursive. My actual problem is more complex and context-dependent. But the issue I would like to solve is demonstrated with this simple program:
#include <iostream>
struct obj
{
int n;
operator int&() { return n; }
};
int tail(obj n)
{
return tail(obj{ n + 1 > 1000 ? n - 1000 : n + 1 });
}
int main()
{
tail(obj{ 1 });
}
It seems natural that this is tail-recursive. It is not, though, because the destructor of obj n has to be called each time. At least MSVC13 and MSVC15 do not optimize this. If I replace obj with int and change the calls accordingly, it becomes tail-recursive as expected.
My actual question is: Is there an easy way to make this tail-recursive apart from just replacing obj with int? I am aiming for performance benefits, so playing around with heap-allocated memory and new is most likely not helpful.
Short Answer: No.
Longer Answer: You might find a way to achieve this but certainly no easy one.
Since tail call optimization is not required by the standard, you can never know for sure if some minor change to your program will make the compiler fail to optimize the code.
Worse, consider what happens when you need to debug your program. The compiler will almost certainly not optimize advanced tail calls with debugger flags, which means that your program will only work correctly in release mode. This will make the program much harder to maintain.
Alternative to tail recursion
Just write a loop. It can always be done and it is likely to be much, much less convoluted. It also doesn't use the heap, so the overhead will be much smaller.
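As a rough sketch of what "just write a loop" could look like for the example in the question: the argument of the recursive call becomes an assignment to n. The name tail_loop and the steps parameter are assumptions, added only because the original example never terminates.

```cpp
struct obj {
    int n;
    operator int&() { return n; }
};

// Hypothetical loop form of tail(). `steps` is an assumed stopping
// condition; the function in the question recurses forever.
int tail_loop(obj n, int steps) {
    for (int i = 0; i < steps; ++i) {
        // Same value the recursive call would have received as its argument.
        n = obj{ n + 1 > 1000 ? n - 1000 : n + 1 };
    }
    return n.n;
}
```

Only one obj is alive at any time, so there is no chain of pending destructors and nothing for the optimizer to get wrong.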
Since you use a temporary, I assume you don't need the object after the recursive call.
One fairly hackish solution is to allocate an object, pass a pointer to it, and reallocate it before making the recursive call, to which you pass the object you newly constructed.
struct obj
{
int n;
operator int&() { return n; }
};
int tail_impl(obj*& n)
{
int n1 = *n + 1 > 1000 ? *n - 1000 : *n + 1;
delete n;
n = new obj{n1};
return tail_impl(n);
}
int tail(obj n)
{
obj *n1 = new obj{n};
auto ret = tail_impl(n1);
delete n1;
return ret;
}
int main()
{
tail(obj{ 1 });
}
I've obviously omitted some crucial exception safety details. However GCC is able to turn tail_impl into a loop, since it is indeed tail recursion.

Is it possible to micro-optimize "x = max(a,b); y = min(a,b);"?

I had an algorithm that started out like
int sumLargest2 ( int * arr, size_t n )
{
int largest(max(arr[0], arr[1])), secondLargest(min(arr[0],arr[1]));
// ...
and I realized that the first is probably not optimal because calling max and then min is repetitious when you consider that the information required to know the minimum is already there once you've found the maximum. So I figured out that I could do
int largest = max(arr[0], arr[1]);
int secondLargest = arr[0] == largest ? arr[1] : arr[0];
to shave off the useless invocation of min, but I'm not sure that actually saves any number of operations. Are there any fancy bit-shifting algorithms that can do the equivalent of
int largest(max(arr[0], arr[1])), secondLargest(min(arr[0],arr[1]));
?
In C++, you can use std::minmax to produce a std::pair of the minimum and the maximum. This is particularly easy in combination with std::tie:
#include <algorithm> // std::minmax
#include <tuple>     // std::tie
#include <utility>
int largest, secondLargest;
std::tie(secondLargest, largest) = std::minmax(arr[0], arr[1]);
GCC, at least, is capable of optimizing the call to minmax into a single comparison, identical to the result of the C code below.
In C, you could write the test out yourself:
int largest, secondLargest;
if (arr[0] < arr[1]) {
largest = arr[1];
secondLargest = arr[0];
} else {
largest = arr[0];
secondLargest = arr[1];
}
How about:
int largestIndex = arr[1] > arr[0];
int largest = arr[largestIndex];
int secondLargest = arr[1 - largestIndex];
The first line relies on the implicit conversion of the boolean result to int: 1 for true and 0 for false.
I'm going to assume that you'd rather solve the larger problem... That is, getting the sum of the largest two numbers in an array.
What you are trying to do is a std::partial_sort().
Let's implement it.
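#include <algorithm>  // std::partial_sort
#include <cstddef>    // size_t
#include <functional> // std::greater
// Requires n >= 2; reorders arr so that its two largest values come first.
int sumLargest2(int * arr, size_t n) {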
int * first = arr;
int * middle = arr + 2;
int * last = arr + n;
std::partial_sort(first, middle, last, std::greater<int>());
return arr[0] + arr[1];
}
And if you're unable to modify arr, then I'd recommend looking into std::partial_sort_copy().
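If copying is not wanted either, a single pass that tracks the two largest values is another option. This is only a sketch: the name sumLargest2_scan is made up for the example, and it assumes n >= 2 and that the sum fits in an int.

```cpp
#include <cstddef>
#include <utility>

// Sum of the two largest values in one pass, without modifying arr.
// Assumes n >= 2 and that the result does not overflow int.
int sumLargest2_scan(const int* arr, std::size_t n) {
    int largest = arr[0];
    int second = arr[1];
    if (second > largest) std::swap(largest, second);
    for (std::size_t i = 2; i < n; ++i) {
        if (arr[i] > largest) {
            second = largest;   // old maximum becomes the runner-up
            largest = arr[i];
        } else if (arr[i] > second) {
            second = arr[i];
        }
    }
    return largest + second;
}
```

Duplicates count separately here, so for {3, 9, 1, 7, 9} the result is 18.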
x = max(a, b);
y = a + b - x;
It won't necessarily be faster, but it will be different.
Also beware of overflows.
If your intention is to reduce the number of function calls needed to find the min and max, you can try std::minmax_element, which has been available since C++11.
auto result = std::minmax_element(arr, arr+n);
std::cout<< "min:"<< *result.first<<"\n";
std::cout<< "max :" <<*result.second << "\n";
If you just want to find the bigger of two values go:
if(a > b)
{
largest = a;
second = b;
}
else
{
largest = b;
second = a;
}
No function calls, one comparison, two assignments.
I'm assuming C++...
Short answer, use std::minmax and compile with the right optimizations and the right instruction set parameters.
Long ugly answer: the compiler cannot make all the assumptions necessary to make it really, really fast. You can. In this case, you can change the algorithm to process all data first and you can force alignment on the data. Doing all this, you can use intrinsics to make it faster.
Although I haven't tested it in this particular case, I've seen enormous performance improvements using these guidelines.
Since you're not passing 2 integers to the function, I'm assuming you're using an array and want to iterate over it somehow. You now have a choice to make: make 2 arrays and use min/max, or use 1 array with both a and b. This decision alone can already influence the performance.
If you have 2 arrays, these can be allocated on 32-byte boundaries with aligned malloc's and then processed using intrinsics. If you are going for real, raw performance - this is the way to go.
For example, let's assume you have AVX2. (NOTE: I'm not sure if you do and you SHOULD check this using CPUID!) Go to the cheat sheet here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and pick your poison.
The intrinsics you're looking for are in this case probably:
_mm256_min_epi32
_mm256_max_epi32
_mm256_stream_load_si256
If you have to do this for the entire array, you probably want to keep all the stuff in a single __m256i register before merging the individual items. E.g.: do a min/max per 256-bit vector, and when the loop is done, extract the 32-bit items and do a min/max on those.
Long nicer answer: So ... as for the compiler. Compilers do attempt to optimize these kinds of things, but run into problems.
If you have 2 different arrays that you process, the compiler has to know that they are different in order to be able to optimize it. This is the reason why stuff like restrict exists, which tells the compiler exactly this little thing you probably already knew while writing the code.
Also, the compiler doesn't know your memory is aligned, so it has to check this and branch... for each call. We don't want this, which means we want it to inline its stuff. So, add inline, put it in a header file and that's that. You can also use an alignment attribute to give it a hint.
Your compiler also didn't get the hint that the int* won't change over time. If it cannot change, it's a good idea to tell the compiler so with the const keyword.
A compiler uses an instruction set to do the compilation. Normally, they already use SSE, but AVX2 can help a lot (as I've shown with the intrinsics above). If you can compile it with those flags, make sure to use them - they help a lot.
Run in release mode, compile with optimizations on 'fast' and see what happens under the hood. If you do all this, you should see vpmax... instructions appearing in the inner loops, which means that the compiler is emitting the vector instructions on its own just fine.
I don't know what else you want to do in the loop... if you use all these instructions you should hit the memory speed on big arrays.
How about a time-space trade-off?
#include <utility>
template<typename T>
std::pair<T, T>
minmax(T const& a, T const& b)
{ return b < a ? std::make_pair(b, a) : std::make_pair(a, b); }
//main
std::pair<int, int> values = minmax(a[0], a[1]);
int largest = values.second;
int secondLargest = values.first;

Safe and fast FFT

Inspired by Herb Sutter's compelling lecture Not your father's C++, I decided to take another look at the latest version of C++ using Microsoft's Visual Studio 2010. I was particularly interested by Herb's assertion that C++ is "safe and fast" because I write a lot of performance-critical code.
As a benchmark, I decided to try to write the same simple FFT algorithm in a variety of languages.
I came up with the following C++11 code that uses the built-in complex type and vector collection:
#include <cmath>
#include <complex>
#include <vector>
using namespace std;
// Must provide type or MSVC++ barfs with "ambiguous call to overloaded function"
double pi = 4 * atan(1.0);
void fft(int sign, vector<complex<double>> &zs) {
unsigned int j=0;
// Warning about signed vs unsigned comparison
for(unsigned int i=0; i<zs.size()-1; ++i) {
if (i < j) {
auto t = zs.at(i);
zs.at(i) = zs.at(j);
zs.at(j) = t;
}
int m=zs.size()/2;
j^=m;
while ((j & m) == 0) { m/=2; j^=m; }
}
for(unsigned int j=1; j<zs.size(); j*=2)
for(unsigned int m=0; m<j; ++m) {
auto t = pi * sign * m / j;
auto w = complex<double>(cos(t), sin(t));
for(unsigned int i = m; i<zs.size(); i+=2*j) {
complex<double> zi = zs.at(i), t = w * zs.at(i + j);
zs.at(i) = zi + t;
zs.at(i + j) = zi - t;
}
}
}
Note that this function only works for n-element vectors where n is an integral power of two. Anyone looking for fast FFT code that works for any n should look at FFTW.
As I understand it, the traditional xs[i] syntax from C for indexing a vector does not do bounds checking and, consequently, is not memory safe and can be a source of memory errors such as non-deterministic corruption and memory access violations. So I used xs.at(i) instead.
Now, I want this code to be "safe and fast" but I am not a C++11 expert so I'd like to ask for improvements to this code that would make it more idiomatic or efficient?
I think you are being overly "safe" in your use of at(). In most of your cases the index used is trivially verifiable as being constrained by the size of the container in the for loop.
e.g.
for(unsigned int i=0; i<zs.size()-1; ++i) {
...
auto t = zs.at(i);
The only ones I'd leave as at()s are the (i + j)s. It's not immediately obvious whether they would always be constrained (although if I was really unsure I'd probably manually check - but I'm not familiar with FFTs enough to have an opinion in this case).
There are also some fixed computations being repeated for each loop iteration:
int m=zs.size()/2;
pi * sign
2*j
And the zs.at(i + j) is computed twice.
It's possible that the optimiser may catch these - but if you are treating this as performance critical, and you have your timers testing it, I'd hoist them out of the loops (or, in the case of zs.at(i + j), just take a reference on first use) and see if that impacts the timer.
Talking of second-guessing the optimiser: I'm sure that the calls to .size() will be inlined as, at least, a direct call to an internal member variable - but given how many times you call it I'd also experiment with introducing local variables for zs.size() and zs.size()-1 upfront. They're more likely to be put into registers that way too.
I don't know how much of a difference (if any) all of this will have on your total runtime - some of it may already be caught by the optimiser, and the differences may be small compared to the computations involved - but worth a shot.
As for being idiomatic, my only comment, really, is that size() returns a std::size_t (an unsigned integer type, often wider than unsigned int), so it's more idiomatic to use that type for the loop counters. If you did want to use auto but avoid the warning you could try adding the ul suffix to the 0 - not sure I'd say that is idiomatic, though. I suppose you're already less than idiomatic in not using iterators here, but I can see why you can't do that (easily).
Update
I gave all my suggestions a try and they all had a measurable performance improvement - except the i+j and 2*j precalcs - they actually caused a slight slowdown! I presume they either prevented a compiler optimisation or prevented it from using registers for some things.
Overall I got a >10% perf. improvement with those suggestions.
I suspect more could be had if the second block of loops was refactored a little to avoid the jumps - and having done so enabling SSE2 instruction set may give a significant boost (I did try it as is and saw a slight slowdown).
I think that refactoring, along with using something like MKL for the cos and sin calls should give greater, and less brittle, improvements. And neither of those things would be language dependent (I know this was originally being compared to an F# implementation).
Update 2
I forgot to mention that pre-calculating zs.size() did make a difference.
Update 3
Also forgot to say (until reminded by @xeo in a comment on the OP) that the block following the i < j check can be boiled down to a std::swap. This is more idiomatic and at least as performant - in the worst case it should inline to the same code as written. Indeed when I did it I saw no change in the performance. In other cases it can lead to a performance gain if move constructors are available.
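To illustrate a few of the suggestions together (std::swap, a local for size(), std::size_t counters, and operator[] where the loop condition already bounds the index), the first loop of the FFT could be pulled out roughly as below. The function name bit_reverse_permute is made up for this sketch, and like the original it assumes the size is a power of two.

```cpp
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Bit-reversal reordering, i.e. the first loop of fft() in the question.
// Assumes zs.size() is a power of two (same restriction as the original).
void bit_reverse_permute(std::vector<std::complex<double>>& zs) {
    const std::size_t n = zs.size();   // hoisted: size() is called once
    if (n < 2) return;                 // also avoids n - 1 wrapping around
    std::size_t j = 0;
    for (std::size_t i = 0; i + 1 < n; ++i) {
        // i < j < n here, so unchecked operator[] is safe.
        if (i < j) std::swap(zs[i], zs[j]);
        std::size_t m = n / 2;
        j ^= m;
        while ((j & m) == 0) { m /= 2; j ^= m; }
    }
}
```

For 8 elements this swaps positions 1 and 4, and 3 and 6, giving the index order 0, 4, 2, 6, 1, 5, 3, 7. Whether any of this is measurably faster still has to be checked with a timer, as noted above.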

Tail recursion in C++

Can someone show me a simple tail-recursive function in C++?
Why is tail recursion better, if it even is?
What other kinds of recursion are there besides tail recursion?
A simple tail recursive function:
unsigned int f( unsigned int a ) {
if ( a == 0 ) {
return a;
}
return f( a - 1 ); // tail recursion
}
Tail recursion is basically when:
there is only a single recursive call
that call is the last statement in the function
And it's not "better", except in the sense that a good compiler can remove the recursion, transforming it into a loop. This may be faster and will certainly save on stack usage. The GCC compiler can do this optimisation.
Tail recursion in C++ looks the same as in C or any other language.
void countdown( int count ) {
if ( count ) return countdown( count - 1 );
}
Tail recursion (and tail calling in general) requires clearing the caller's stack frame before executing the tail call. To the programmer, tail recursion is similar to a loop, with return reduced to working like goto first_line;. The compiler needs to detect what you are doing, though, and if it doesn't, there will still be an additional stack frame. Most compilers support it, but writing a loop or goto is usually easier and less risky.
Non-recursive tail calls can enable random branching (like goto to the first line of some other function), which is a more unique facility.
Note that in C++, there cannot be any object with a nontrivial destructor in the scope of the return statement. The end-of-function cleanup would require the callee to return back to the caller, eliminating the tail call.
Also note (in any language) that tail recursion requires the entire state of the algorithm to be passed through the function argument list at each step. (This is clear from the requirement that the function's stack frame be eliminated before the next call begins… you can't be saving any data in local variables.) Furthermore, no operation can be applied to the function's return value before it's tail-returned.
int factorial( int n, int acc = 1 ) {
if ( n == 0 ) return acc;
else return factorial( n-1, acc * n );
}
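To make the "return works like goto first_line" picture concrete, here is the accumulator version next to a hand-written loop that does the same thing. This is only a sketch of the transformation a compiler may apply, not a guarantee that any given compiler does so. The names factorial_rec and factorial_loop are made up so that both can live in one file.

```cpp
// Tail-recursive form: all state travels through the argument list.
int factorial_rec(int n, int acc = 1) {
    if (n == 0) return acc;
    return factorial_rec(n - 1, acc * n);
}

// Equivalent loop: the parameters become variables that are overwritten,
// and the tail call becomes the jump back to the top of the loop.
int factorial_loop(int n, int acc = 1) {
    while (n != 0) {
        acc = acc * n;  // new acc uses the old n, as in the call above
        n = n - 1;
    }
    return acc;
}
```

Both return 120 for n = 5. Like the recursive form, the loop assumes n is non-negative.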
Tail recursion is a special case of a tail call. A tail call is where the compiler can see that there are no operations that need to be done upon return from a called function -- essentially turning the called function's return into its own. The compiler can often do a few stack fix-up operations and then jump (rather than call) to the address of the first instruction of the called function.
One of the great things about this besides eliminating some return calls is that you also cut down on stack usage. On some platforms or in OS code the stack can be quite limited and on advanced machines like the x86 CPUs in our desktops decreasing the stack usage like this will improve data cache performance.
Tail recursion is where the called function is the same as the calling function. This can be turned into loops, which is exactly the same as the jump in the tail call optimization mentioned above. Since this is the same function (callee and caller) there are fewer stack fixups that need to be done before the jump.
The following shows a common way to do a recursive call which would be more difficult for a compiler to turn into a loop:
int sum(int a[], unsigned len) {
if (len==0) {
return 0;
}
return a[0] + sum(a+1,len-1);
}
This is simple enough that many compilers could probably figure it out anyway, but as you can see there is an addition that needs to happen after the called sum returns a number, so a simple tail call optimization is not possible.
If you did:
static int sum_helper(int acc, unsigned len, int a[]) {
if (len == 0) {
return acc;
}
return sum_helper(acc+a[0], len-1, a+1);
}
int sum(int a[], unsigned len) {
return sum_helper(0, len, a);
}
You would be able to take advantage of the calls in both functions being tail calls. Here the sum function's main job is to move a value and clear a register or stack position. The sum_helper does all of the math.
Since you mentioned C++ in your question I'll mention some special things about that.
C++ hides some things from you which C does not. Of these, destructors are the main thing that will get in the way of tail call optimization.
int boo(yin * x, yang *y) {
dharma z = x->foo() + y->bar();
return z.baz();
}
In this example the call to baz is not really a tail call because z needs to be destructed after the return from baz. I believe that the rules of C++ may make the optimization more difficult even in cases where the variable is not needed for the duration of the call, such as:
int boo(yin * x, yang *y) {
dharma z = x->foo() + y->bar();
int u = z.baz();
return qwerty(u);
}
z may have to be destructed after the return from qwerty here.
Another thing would be implicit type conversion, which can happen in C as well, but can be more complicated and more common in C++.
For instance:
static double sum_helper(double acc, unsigned len, double a[]) {
if (len == 0) {
return acc;
}
return sum_helper(acc+a[0], len-1, a+1);
}
int sum(double a[], unsigned len) {
return sum_helper(0.0, len, a);
}
Here sum's call to sum_helper is not a tail call because sum_helper returns a double and sum will need to convert that into an int.
In C++ it is quite common to return an object reference which may have all kinds of different interpretations, each of which could be a different type conversion.
For instance:
bool write_it(int it) {
return static_cast<bool>(cout << it); // explicit cast: operator bool is explicit since C++11
}
Here there is a call made to cout.operator<< as the last statement. cout will return a reference to itself (which is why you can string lots of things together in a list separated by << ), which you then force to be evaluated as a bool, which ends up calling another of cout's methods, operator bool(). This cout.operator bool() could be called as a tail call in this case, but operator<< could not.
EDIT:
One thing that is worth mentioning is that a major reason tail call optimization is possible in C is that the compiler knows the called function will store its return value in the same place where the calling function would have to store its own return value.
Tail recursion is a trick to actually cope with two issues at the same time. The first is executing a loop when it is hard to know the number of iterations to do.
Though this can be worked out with simple recursion, the second problem arises which is that of stack overflow due to the recursive call being executed too many times. The tail call is the solution, when accompanied by a "compute and carry" technique.
In basic CS you learn that a computer algorithm needs to have an invariant and a termination condition. This is the base for building the tail recursion.
All computation happens in the argument passing.
All results must be passed onto function calls.
The tail call is the last call, and occurs at termination.
Simply put, no computation may happen on the return value of your function.
Take for example the computation of a power of 10, which is trivial and can be written by a loop.
Should look something like
template<typename T> T pow10(T const p, T const res = 1)
{
// Return the accumulated result once p reaches 0; otherwise carry 10*res forward.
return p == 0 ? res : pow10<T>(p - 1, 10 * res);
}
For p = 4 this gives the following execution:
ret,p,res
-,4,1
-,3,10
-,2,100
-,1,1000
-,0,10000
10000,-,-
It is clear that the compiler just has to copy values without changing the stack pointer and when the tail call happens just to return the result.
Tail recursion is also important because it lends itself to compile-time evaluation. For example, the above can be rewritten as:
template<int N,int R=1> struct powc10
{
int operator()() const
{
return powc10<N-1, 10*R>()();
}
};
template<int R> struct powc10<0,R>
{
int operator()() const
{
return R;
}
};
This can be used as powc10<4>()() to obtain 10^4, with the value worked out during template instantiation. (powc10<10> would overflow a 32-bit int.)
Most compilers limit the depth of nested template instantiations, so the tail call trick helps. Since there are no loops in template metaprogramming, recursion has to be used.
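Since C++11 the same compute-and-carry recursion can also be written as a constexpr function, which is a swapped-in alternative to the template struct rather than what this answer originally used. The name pow10_c and the choice of long long are assumptions for the sketch.

```cpp
// Compute-and-carry power of 10 as a C++11 constexpr function.
// pow10_c is a hypothetical name; long long is used to delay overflow.
constexpr long long pow10_c(unsigned p, long long res = 1) {
    return p == 0 ? res : pow10_c(p - 1, 10 * res);
}

// These are checked by the compiler; a wrong value would fail to compile.
static_assert(pow10_c(0) == 1, "10^0");
static_assert(pow10_c(4) == 10000, "10^4");
```

The static_assert lines show that the result is available as a compile-time constant, and the same function can still be called with run-time arguments.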
Tail recursion does not exist really at compiler level in C++.
Although you can write programs that use tail recursion, you do not get the inherent benefits of tail recursion implemented by supporting compilers/interpreters/languages. For instance, Scheme supports a tail recursion optimization so that it basically will change recursion into iteration. This makes it faster and invulnerable to stack overflows. C++ does not have such a thing. (At least not in any compiler I've seen.)
Apparently tail recursion optimizations exist in both MSVC++ and GCC. See this question for details.
Wikipedia has a decent article on tail recursion. Basically, tail recursion is better than regular recursion because it's trivial to optimize it into an iterative loop, and iterative loops are generally more efficient than recursive function calls. This is particularly important in functional languages where you don't have loops.
For C++, it's still good if you can write your recursive loops with tail recursion since they can be better optimized, but in such cases, you can generally just do it iteratively in the first place, so the gain is not as great as it would be in a functional language.