Tail call optimisation seems to slightly worsen performance


In a quicksort implementation, the left column shows timings for code compiled with plain -O2, and the right column shows timings for -O2 with the -fno-optimize-sibling-calls flag, i.e. with tail-call optimisation (TCO) turned off. Each value is the average of 3 runs; the variation between runs seemed negligible. Array values were in the range 1-1000 and times are in milliseconds. The compiler was MinGW g++, version 6.3.0.
Array size   With TCO (ms)   Without TCO (ms)
8M           35,083          34,051
4M           8,952           8,627
1M           613             609
Below is my code:
#include <bits/stdc++.h>
using namespace std;
int N = 4000000;
void qsort(int* arr,int start=0,int finish=N-1){
    if(start>=finish) return ;
    int i=start+1,j = finish,temp;
    auto pivot = arr[start];
    while(i!=j){
        while (arr[j]>=pivot && j>i) --j;
        while (arr[i]<pivot && i<j) ++i;
        if(i==j) break;
        temp=arr[i];arr[i]=arr[j];arr[j]=temp; //swap big guy to right side
    }
    if(arr[i]>=arr[start]) --i;
    temp = arr[start];arr[start]=arr[i];arr[i]=temp; //swap pivot
    qsort(arr,start,i-1);
    qsort(arr,i+1,finish);
}
int main(){
    srand(time(NULL));
    int* arr = new int[N];
    for(int i=0;i<N;i++) {arr[i] = rand()%1000+1;}
    auto start = clock();
    qsort(arr);
    cout<<(clock()-start)<<endl;
    return 0;
}
I have heard that clock() isn't the best way to measure time, but the effect seems to be consistent.
EDIT: in response to a comment, I guess my question is: how exactly does gcc's tail-call optimizer work, how did this happen, and what should I do to leverage tail calls to speed up my program?

On speed:
As already pointed out in the comments, the primary goal of tail-call optimization is to reduce stack usage.
However, there is often a side benefit: the program becomes faster because the overhead of a function call is avoided. This gain is most prominent when the function itself does little work, so that the call overhead carries some weight.
If a lot of work is done during each function call, the overhead is negligible and there is no noticeable speed-up.
On the other hand, performing tail-call optimization may prevent other optimizations that could otherwise speed up your code.
The case of your quicksort is not that clear-cut: there are a few calls with a large workload and a lot of calls with a very small workload.
So, for 1M elements, tail-call optimization brings more disadvantages than advantages. On my machine, the tail-call-optimized function is faster than the non-optimized one only for arrays smaller than 50000 elements.
I must confess that I cannot say why this is the case from looking at the assembly alone. All I can tell is that the resulting assemblies are quite different and that, in the optimized version, quicksort really does contain only one recursive call.
Here is a clear-cut example for which tail-call optimization is much faster (because not much happens in the function itself, so the call overhead is noticeable):
//fib.cpp
#include <iostream>
unsigned long long int fib(unsigned long long int n){
    if (n==0 || n==1)
        return 1;
    return fib(n-1)+fib(n-2);
}
int main(){
    unsigned long long int N;
    std::cin >> N;
    std::cout << fib(N);
}
Running time echo "40" | ./fib, I get 1.1 vs. 1.6 seconds for the tail-call-optimized version vs. the non-optimized version. Actually, I'm pretty impressed that the compiler is able to use tail-call optimization here, but it really does, as can be seen at godbolt.org: the second call of fib is optimized.
On tail call optimization:
Usually, tail-call optimization can be done if the recursive call is the last operation (prior to return) in the function, because the variables on the stack can then be reused for the next call. That is, the function should be of the form
ResType f( InputType input){
    //do work
    InputType new_input = ...;
    return f(new_input);
}
There are some languages which don't do tail call optimization at all (e.g. Python) and some in which you can explicitly ask the compiler to do it, and the compiler will fail if it is not able to (e.g. Clojure). C++ takes a middle way: the compiler tries its best (which is amazingly good!), but you have no guarantee it will succeed, and if it doesn't, it silently falls back to a version without tail-call optimization.
Let's take a look at this simple and standard implementation of tail recursion:
//should be called fac(n,1)
unsigned long long int
fac(unsigned long long int n, unsigned long long int res_so_far){
    if (n==0)
        return res_so_far;
    return fac(n-1, res_so_far*n);
}
This classical form of tail call makes it easy for the compiler to optimize: the generated assembly contains no recursive call to fac!
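As a rough illustration of what the optimizer effectively does with the accumulator version, here is a hand-written loop. It is a sketch, not actual compiler output, and the name fac_loop is made up for this example. The tail call turns into a jump back to the top of the function with updated arguments:

```cpp
// Hand-written equivalent of fac(n, res_so_far) after tail-call elimination.
// fac_loop is a hypothetical name used only for this illustration.
unsigned long long int
fac_loop(unsigned long long int n, unsigned long long int res_so_far){
    while (n != 0) {
        res_so_far *= n; // was the second argument of the recursive call
        --n;             // was the first argument of the recursive call
    }
    return res_so_far;
}
```

Calling fac_loop(5, 1) yields 120 and uses a single stack frame no matter how large n is.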
However, gcc is able to perform TCO in less obvious cases as well:
unsigned long long int
fac(unsigned long long int n){
    if (n==0)
        return 1;
    return n*fac(n-1);
}
It is easier for us humans to read and write, but harder for the compiler to optimize (fun fact: TCO is not performed if the return type is int instead of unsigned long long int): after all, the result of the recursive call is used in a further calculation (the multiplication) before it is returned. But gcc manages to perform TCO here as well!
Using this example, we can see the result of TCO at work:
//factorial.cpp
#include <iostream>
unsigned long long int
fac(unsigned long long int n){
    if (n==0)
        return 1;
    return n*fac(n-1);
}
int main(){
    unsigned long long int N;
    std::cin >> N;
    std::cout << fac(N);
}
Running time echo "40000000" | ./factorial gives you the result (0) in no time if tail-call optimization was on, and "Segmentation fault" otherwise, because the recursion depth overflows the stack.
This is actually a simple test for whether tail-call optimization was performed: the non-optimized version ends in a "Segmentation fault" for a large recursion depth.
Corollary:
As already pointed out in the comments, only the second call of the quicksort is optimized via TCO. In your implementation, if you are unlucky and the second half of the array always consists of only one element, you will need O(n) space on the stack.
However, if the first call were always made on the smaller half and the second call, on the larger half, were the one optimized by TCO, you would need at most O(log n) recursion depth and thus only O(log n) space on the stack.
That means you should check which part of the array you call quicksort on first, as it plays a huge role.
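The idea can be sketched as follows. This is an illustration under assumptions, not the asker's code: the partition step is a plain Lomuto partition swapped in for simplicity, the names partition_range and qsort_bounded are made up, and the tail call is written as an explicit loop instead of relying on the compiler.

```cpp
#include <utility> // std::swap

// Lomuto partition (an assumption, not the partition from the question):
// moves the pivot to its final position and returns that index.
static int partition_range(int* arr, int start, int finish){
    const int pivot = arr[finish];
    int store = start;
    for (int k = start; k < finish; ++k)
        if (arr[k] < pivot) std::swap(arr[k], arr[store++]);
    std::swap(arr[store], arr[finish]);
    return store;
}

// Recurse only into the smaller part; the larger part is handled by the
// loop, which is what a guaranteed tail call would amount to.
void qsort_bounded(int* arr, int start, int finish){
    while (start < finish) {
        const int p = partition_range(arr, start, finish);
        if (p - start < finish - p) {
            qsort_bounded(arr, start, p - 1);  // smaller left part
            start = p + 1;                     // continue with larger right part
        } else {
            qsort_bounded(arr, p + 1, finish); // smaller right part
            finish = p - 1;                    // continue with larger left part
        }
    }
}
```

Every real recursive call receives at most half of the current range, so the recursion depth stays within O(log n) even for unlucky pivots.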

Related

What is the maximum amount of iterations in this loop in order to avoid crash?

What is the maximum number of iterations in this case, where n decreases by one every time and the starting number is huge?
#include <bits/stdc++.h>
using namespace std;
int main() {
    unsigned long long n = 1000000000000000000;
    while(n)
        n--;
}

The C++ standard does not specify a maximum for loops that don't do anything (no input or output etc. etc.). The as-if rule for compilation permits the compiler to optimise the result to
unsigned long long n = 0;
So the maximum will be zero for any sensible compiler with appropriate compilation optimisations set.
Reference: https://en.cppreference.com/w/cpp/language/as_if
In your case, a compiler with optimisations set will reduce your entire program to
int main(){
    return 0;
}
(Note that the main has an implicit return 0; on any control path if not given by the programmer.)

The maximum number of iterations equals the initial value of n if it is reduced by 1 in every iteration. If it is reduced by 2, the loop will iterate n/2 times. It depends on how you reduce n at run time.

There is no "iteration limit" in C++. Furthermore, your loop is likely completely removed. Here's what GCC generates:
main:
        xor eax, eax
        ret
As you can see, no looping is done at all. It simply returns zero.
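If you actually want the loop to run (for example to time it), one common approach is to make the counter volatile, so that every read and write counts as observable behaviour under the as-if rule. The sketch below uses a made-up function name, count_down, and also returns the number of iterations performed:

```cpp
// Counts down from n. Because counter is volatile, each access is an
// observable side effect and the compiler may not remove the loop.
// count_down is a hypothetical name used only for this illustration.
unsigned long long count_down(unsigned long long n){
    volatile unsigned long long counter = n;
    unsigned long long iterations = 0;
    while (counter) {
        counter = counter - 1;
        ++iterations;
    }
    return iterations;
}
```

With this version the run time grows with n, so a start value like 1000000000000000000 would take far too long to finish in practice.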

There is no limit on the number of iterations of a loop. For example, your loop can be infinite.
But there is a limit on stack size, so you cannot have infinite recursion. Loops don't create new stack frames.
Most GUI applications run in an infinite loop.

Hash function: Is there a way to optimize my code further?

Above is the hash function.
I wrote the code below. I am not sure if I can use another clever way to make this more efficient. I am using the understanding that I do not need to do the mod at all since unsigned int takes care of that through overflow.
int myHash(string s)
{
    unsigned int hash = 0;
    long long int multiplier = 1;
    for(int i = s.size()-1;i>-1;i--)
    {
        hash += (multiplier * s[i]);
        multiplier *= 31;
    }
    return hash;
}

I would avoid using long long for multiplier. At least if you don't know 100% that your processor does 64-bit multiplies in the same amount of time as a 32-bit multiply. Really modern top of the range processors probably do, older & smaller processors almost certainly take longer to do 64-bit mul operations than 32-bit ones.
Multiplying by 31 can actually be quite fast even on processors that aren't good at multiplying, because x *= 31 can be converted to x = x * 32 - x; or x = (x << 5) - x; - in fact it may be worth trying that [if you haven't compiled the code to assembler and seen that the compiler already does that].
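A minimal sketch of that rewrite, assuming 32-bit unsigned arithmetic (the helper name times31 is made up for this example). Since unsigned arithmetic wraps modulo 2^32, the shifted form agrees with a plain multiplication for every input:

```cpp
#include <cstdint>

// Multiply by 31 with one shift and one subtraction: x*31 == x*32 - x.
// times31 is a hypothetical helper name.
inline std::uint32_t times31(std::uint32_t x){
    return (x << 5) - x;
}
```

Whether this is actually faster than x * 31 depends on the processor and on what the compiler already generates, so it is worth checking the assembly first.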
Beyond that, it is processor- or compiler-specific optimisations that come to mind. Loop unrolling, for example. Or using inline assembler or intrinsics to make use of vector instructions (subject to availability for different processor architectures and different generations). Modern compilers like recent versions of gcc or clang will probably vectorize this code, subject to being given the "right" options.
As with all optimisation projects, measure the time, using a representative workload, keep records of what you changed. Look at the generated code, try to figure out if there's a better way to do it. And don't lose track of the fact that it's the OVERALL program's performance that matter. If you spend 80% of the time in this function, by all means, optimize the heck out of it. If you spend 20% of the time, optimize it a bit, if you spend 2% of the time in it, unless there's OBVIOUS things you can do to improve it, it's not going to give you much. I've seen the results of people writing code to save a few clock-cycles in some code that takes several million cycles in the loop two lines further on. And using bit-fiddling tricks to save 2 bytes in something that takes half a megabyte. It just creates mess, not really worth doing.

I guess you could avoid copying the string for the function call by making s a const string &s instead, or by using std::string_view if you happen to be using C++17. Otherwise it looks fast, to the point where you should leave the rest to the compiler. Try making it optimize with -O2 or your compiler's equivalent.
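Putting these suggestions together, a version could look like the sketch below. It takes the string by const reference, keeps all arithmetic in unsigned int, and evaluates the same polynomial front to back (Horner's scheme). The name myHashRef is made up, and the result should match myHash only as long as the original's signed long long multiplier does not overflow:

```cpp
#include <string>

// Same polynomial as myHash, evaluated from the first character onwards.
// Relies on unsigned wraparound instead of an explicit modulo.
// myHashRef is a hypothetical name used only for this illustration.
unsigned int myHashRef(const std::string& s){
    unsigned int hash = 0;
    for (char c : s)
        hash = hash * 31u + static_cast<unsigned int>(c);
    return hash;
}
```

For example, myHashRef("abc") evaluates 97*31*31 + 98*31 + 99, which is 96354.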

Let me preface this by saying it's probably not worth doing -- it's unlikely that your hash function is going to be the bottleneck in your program, so making the hash function more elaborate in an attempt to make it more efficient will probably just make it harder to understand and maintain while not making your program measurably faster. So don't do this unless you've actually determined that your program spends a significant percentage of its time computing string hashes, and make sure you have a good benchmark routine that you can run "before" and "after" this change to verify that it actually did speed things up significantly, otherwise you might just be chasing rainbows.
That said, one potential way to hash long strings more quickly would be to process the string a word at a time rather than a character at a time, something like this:
unsigned int aSlightlyFasterHash(const string & s)
{
    const unsigned int numWordsInString = s.size()/sizeof(unsigned int);
    const unsigned int numExtraBytesInString = s.size()%sizeof(unsigned int);
    // Compute the bulk of the hash by reading the string a word at a time
    unsigned int hash = 0;
    const unsigned int * iptr = reinterpret_cast<const unsigned int *>(s.c_str());
    for (unsigned int i=0; i<numWordsInString; i++)
    {
        hash += *iptr;
        iptr++;
    }
    // Then any "leftover" bytes at the end we will mix in to the hash the old way
    const unsigned char * cptr = reinterpret_cast<const unsigned char *>(iptr);
    unsigned int multiplier = 1;
    for(unsigned int i=0; i<numExtraBytesInString; i++)
    {
        hash += (multiplier * *cptr);
        cptr++;
        multiplier *= 31;
    }
    return hash;
}
Note that the above function will return different hash values than the hash function you provided.
That cuts down on the number of loop iterations by a factor of four; of course it's likely that the execution of the function is limited by RAM bandwidth rather than CPU cycles anyway, so don't be too surprised if this doesn't go noticeably faster on a modern CPU. If RAM bandwidth is indeed the bottleneck, then there's not too much you can do about it, since you have to read the contents of the string in order to compute a hash code for the string; there's no getting around that (except perhaps by precomputing the hash code in advance and storing it somewhere, but that only works if you know all the strings you are going to use in advance).
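As a side note, reading the characters through a reinterpret_cast'ed unsigned int pointer can run into alignment and aliasing trouble on some platforms. A variant that copies each word with std::memcpy avoids that; compilers typically turn the copy into a single load. The sketch below uses the made-up name wordSumHash, and like the function above, its word sums depend on the machine's byte order:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Word-at-a-time hash using memcpy instead of a pointer cast.
// wordSumHash is a hypothetical name used only for this illustration.
unsigned int wordSumHash(const std::string& s){
    const std::size_t wordSize = sizeof(unsigned int);
    const std::size_t numWords = s.size() / wordSize;
    unsigned int hash = 0;
    const char* p = s.data();
    // Bulk of the string: add up whole words
    for (std::size_t i = 0; i < numWords; ++i, p += wordSize) {
        unsigned int word;
        std::memcpy(&word, p, wordSize);
        hash += word;
    }
    // Leftover bytes: mix in one character at a time
    unsigned int multiplier = 1;
    for (std::size_t i = numWords * wordSize; i < s.size(); ++i) {
        hash += multiplier * static_cast<unsigned char>(s[i]);
        multiplier *= 31;
    }
    return hash;
}
```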

Why do I have to turn on optimization in g++ for simple array access?

I have written a simple Gaussian elimination algorithm using a std::vector of doubles in C++ (gcc / Linux). Now I have seen that the runtime depends on the optimization level of the compiler (up to 5-times faster with -O3). I wrote a small test program and received similar results. The problem is not the allocation of the vector nor any resizing etc.
It's the simple fact that the statement:
v[i] = x + y / z;
(or something like that) is much slower without optimization. I think the problem is the index operator. Without compiler optimization, the std::vector is slower than a raw double *v, but when I turn on optimization, the performance is equal and, to my surprise, even the access to the raw double *v is faster.
Is there an explanation for this behaviour? I'm really not a professional developer, but I thought the compiler should be able to transfer statements like the above one rather directly to hardware instructions. Why is there a need to turn on an optimization and, more importantly, what is the disadvantage of the optimization? (If there is none, I wonder why the optimization is not the standard.)
Here is my vector test code:
const long int count = 100000;
const double pi = 3.1416;
void C_array (long int size)
{
    long int start = time(0);
    double *x = (double*) malloc (size * sizeof(double));
    for (long int n = 0; n < count; n++)
        for (long int i = 0; i < size; i++)
            x[i] = i;
            //x[i] = pi * (i-n);
    printf ("C array : %li s\n", time(0) - start);
    free (x);
}
void CPP_vector (long int size)
{
    long int start = time(0);
    std::vector<double> x(size);
    for (long int n = 0; n < count; n++)
        for (long int i = 0; i < size; i++)
            x[i] = i;
            //x[i] = pi * (i-n);
    printf ("C++ vector: %li s\n", time(0) - start);
}
int main ()
{
    printf ("Size of vector: ");
    long int size;
    scanf ("%li", &size);
    C_array (size);
    CPP_vector (size);
    return 0;
}
I received some weird results. A standard g++ compile produces a runtime 8 s (C array) or 18 s (std::vector) for a vector size of 20 000. If I use the more complex line behind the //.., the runtime is 8 / 15 s (yes, faster). If I turn on -O3 then, the runtime is 5 / 5 s for a 40,000 vector size.

Why do we want optimization/debug releases?
Optimization may completely reorder the sequence of instructions, eliminate variables, inline function calls and make the executable code so far from the source code that you cannot debug it. So, one of the reasons for not using optimization is to keep the possibility of debugging the code. When your code is (when you believe your code is) fully debugged, you can turn on optimization to produce a release build.
Why is the debug code slow?
One thing to keep in mind is that a debug version of the STL may contain additional checks for boundaries and validity of iterators. This can slow down the code by a factor of 10. This is known to be an issue with the Visual C++ STL, but in your case you are not using it. I don't know the state of the art of gcc's STL.
Another possibility is that you are accessing the memory in a non-linear sequence, producing lots of cache misses. In debug mode, the compiler will respect your code and produce this inefficient code. But when optimization is on, it may rewrite your accesses to be sequential and not produce any cache misses.
What to do?
You could try to show a simple compilable example exhibiting the behavior. We could then compile it and look at the assembly to explain what is really going on. The size of the data you're processing is important if you hit a cache issue.
Links
Visual C++ STL is slow in debug mode: http://marknelson.us/2011/11/28/vc-10-hash-table-performance-problems/
What does the debug version of the STL do with Visual C++: http://channel9.msdn.com/Series/C9-Lectures-Stephan-T-Lavavej-Advanced-STL/C9-Lectures-Stephan-T-Lavavej-Advanced-STL-3-of-n
Cache miss and their impact: http://channel9.msdn.com/Events/Build/2014/2-661 , specially from 29'27"
Cache again: https://www.youtube.com/watch?v=fHNmRkzxHWs at 36'34"

How can the compile-time be (exponentially) faster than run-time?

The below code calculates Fibonacci numbers by an exponentially slow algorithm:
#include <cstdlib>
#include <iostream>
#define DEBUG(var) { std::cout << #var << ": " << (var) << std::endl; }
constexpr auto fib(const size_t n) -> long long
{
    return n < 2 ? 1: fib(n - 1) + fib(n - 2);
}
int main(int argc, char *argv[])
{
    const long long fib91 = fib(91);
    DEBUG( fib91 );
    DEBUG( fib(45) );
    return EXIT_SUCCESS;
}
And I am calculating the 45th Fibonacci number at run-time, and the 91st one at compile time.
The interesting fact is that GCC 4.9 compiles the code and computes fib91 in a fraction of a second, but it takes a while to spit out fib(45).
My question: If GCC is smart enough to optimize the fib(91) computation and not take the exponentially slow path, what stops it from doing the same for fib(45)?
Does the above mean GCC produces two compiled versions of fib function where one is fast and the other exponentially slow?
The question is not how the compiler optimizes fib(91) calculation (yes! It does use a sort of memoization), but if it knows how to optimize the fib function, why does it not do the same for fib(45)? And, are there two separate compilations of the fib function? One slow, and the other fast?

GCC is likely memoizing constexpr functions (enabling a Θ(n) computation of fib(n)). That is safe for the compiler to do because constexpr functions are purely functional.
Compare the Θ(n) "compiler algorithm" (using memoization) to your Θ(φ^n) run time algorithm (where φ is the golden ratio) and suddenly it makes perfect sense that the compiler is so much faster.
From the constexpr page on cppreference (emphasis added):
The constexpr specifier declares that it is possible to evaluate the value of the function or variable at compile time.
The constexpr specifier does not declare that it is required to evaluate the value of the function or variable at compile time. So one can only guess what heuristics GCC is using to choose whether to evaluate at compile time or run time when a compile time computation is not required by language rules. It can choose either, on a case-by-case basis, and still be correct.
If you want to force the compiler to evaluate your constexpr function at compile time, here's a simple trick that will do it.
constexpr auto compute_fib(const size_t n) -> long long
{
    return n < 2 ? n : compute_fib(n - 1) + compute_fib(n - 2);
}
template <std::size_t N>
struct fib
{
    static_assert(N >= 0, "N must be nonnegative.");
    static const long long value = compute_fib(N);
};
In the rest of your code you can then access fib<45>::value or fib<91>::value with the guarantee that they'll be evaluated at compile time.
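Another way to get the same guarantee, shown here as a small sketch, is to initialize a constexpr variable with the call: such a variable must be initialized by a constant expression, so the compiler has to evaluate the function during compilation. The variable name fib20 is made up, and a small argument is used so that compilers which do not memoize stay within their evaluation limits:

```cpp
#include <cstddef>

constexpr long long compute_fib(std::size_t n){
    return n < 2 ? static_cast<long long>(n)
                 : compute_fib(n - 1) + compute_fib(n - 2);
}

// Initializing a constexpr variable forces compile-time evaluation.
// fib20 is a hypothetical name used only for this illustration.
constexpr long long fib20 = compute_fib(20);

// Checked by the compiler, so no run-time call is involved here
static_assert(fib20 == 6765, "compute_fib(20) is evaluated at compile time");
```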

At compile-time the compiler can memoize the result of the function. This is safe, because the function is constexpr and hence will always return the same result for the same inputs.
At run-time it could in theory do the same. However most C++ programmers would frown at optimization passes that result in hidden memory allocations.

When you ask for fib(91) to give a value to your const fib91 in the source code, the compiler is forced to compute that value from your constexpr function. It does not compile the function for this (as you seem to think); it just sees that to compute fib91 it needs fib(90) and fib(89), to compute those it needs the values before them, and so on until it reaches fib(1), which is given. This is an O(n) algorithm, and the result is computed fast enough.
However, when you ask to evaluate fib(45) at run time, the compiler has to choose whether to use the actual function call or to precompute the result. In the end it decides to use the compiled function. Now, the compiled function must execute exactly the exponential algorithm that you wrote; there is no way the compiler could implement memoization to optimize a recursive function (think about the need to allocate some cache, to decide how many values to keep, and to manage them between function calls).
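If you want the linear behaviour at run time, you have to write it yourself. A minimal sketch, assuming the question's definition fib(0) == fib(1) == 1 and using the made-up name fib_linear, stores each value once in a table instead of recomputing it:

```cpp
#include <cstddef>
#include <vector>

// Bottom-up version of the question's fib: each value is computed once,
// so the running time is linear in n instead of exponential.
// fib_linear is a hypothetical name used only for this illustration.
long long fib_linear(std::size_t n){
    std::vector<long long> table(n + 1, 1); // table[0] and table[1] stay 1
    for (std::size_t i = 2; i <= n; ++i)
        table[i] = table[i - 1] + table[i - 2];
    return table[n];
}
```

The table is exactly the kind of hidden allocation a compiler will not introduce on its own.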

Function pointer runs faster than inline function. Why?

I ran a benchmark of mine on my computer (Intel i3-3220 @ 3.3GHz, Fedora 18), and got very unexpected results. A function pointer was actually a bit faster than an inline function.
Code:
#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
    return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{
    std::chrono::duration<double> t(0); // start the accumulated time at zero
    int total=0;
    for(int i=0;i<10000000;i++)
    {
        auto begin=std::chrono::high_resolution_clock::now();
        short a=toBigEndian((short)i);//toBigEndianPtr((short)i);
        total+=a;
        auto end=std::chrono::high_resolution_clock::now();
        t+=std::chrono::duration_cast<std::chrono::duration<double>>(end-begin);
    }
    std::cout<<t.count()<<", "<<total<<std::endl;
    return 0;
}
compiled with
g++ test.cpp -std=c++0x -O0
The 'toBigEndian' loop finishes always at around 0.26-0.27 seconds, while 'toBigEndianPtr' takes 0.21-0.22 seconds.
What makes this even more odd is that when I remove 'total', the function pointer becomes the slower one at 0.35-0.37 seconds, while the inline function is at about 0.27-0.28 seconds.
My question is:
Why is the function pointer faster than the inline function when 'total' exists?

Short answer: it isn't.
You compile with -O0, which does not optimize (much). Without optimization, you have no say in what is "fast", because unoptimized code is not as fast as it can be.
You take the address of toBigEndian, which prevents inlining. The inline keyword is only a hint for the compiler anyway, which it may or may not follow. You did your best to make it not follow that hint.
So, to give your measurements any meaning:
- optimize your code
- use two functions doing the same thing, one that gets inlined and one whose address is taken

A common mistake in measuring performance (besides forgetting to optimize) is to use the wrong tool to measure. Using std::chrono would be fine if you were measuring the performance of your entire loop of 10000000 or 500000000 iterations. Instead, you are asking it to measure a single call / inlining of toBigEndian, a function that is all of 6 instructions. So I switched to rdtsc (read time stamp counter, i.e. clock cycles).
Allowing the compiler to really optimize everything in the loop, not cluttering it with recording the time on every tiny iteration, we have a different code sequence. Now, after compiling with g++ -O3 fp_test.cpp -o fp_test -std=c++11, I observe the desired effect. The inlined version averages around 2.15 cycles per iteration, while the function pointer takes around 7.0 cycles per iteration.
Even without using rdtsc, the difference is still quite observable. The wall clock time was 360ms for the inlined code and 1.17s for the function pointer. So one could use std::chrono in place of rdtsc in this code.
Modified code follows:
#include <cstdint> // uint32_t, uint64_t
#include <iostream>
static inline uint64_t rdtsc(void)
{
    uint32_t hi, lo;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}
inline short toBigEndian(short i)
{
    return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
#define LOOP_COUNT 500000000
int main()
{
    uint64_t t = 0, begin=0, end=0;
    int total=0;
    begin=rdtsc();
    for(int i=0;i<LOOP_COUNT;i++)
    {
        short a=0;
        a=toBigEndianPtr((short)i);
        //a=toBigEndian((short)i);
        total+=a;
    }
    end=rdtsc();
    t+=(end-begin);
    std::cout<<((double)t/LOOP_COUNT)<<", "<<total<<std::endl;
    return 0;
}

Oh s**t (do I need to censor swearing here?), I found it out. It was somehow related to the timing being inside the loop. When I moved it outside as follows,
#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
    return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{
    int total=0;
    auto begin=std::chrono::high_resolution_clock::now();
    for(int i=0;i<100000000;i++)
    {
        short a=toBigEndianPtr((short)i);
        total+=a;
    }
    auto end=std::chrono::high_resolution_clock::now();
    std::cout<<std::chrono::duration_cast<std::chrono::duration<double>>(end-begin).count()<<", "<<total<<std::endl;
    return 0;
}
the results are just as they should be. 0.08 seconds for inline, 0.20 seconds for pointer. Sorry for bothering you guys.

First off, with -O0, you aren't running the optimizer, which means the compiler is ignoring your request to inline, as it is free to do. The cost of the two different calls ought to be nearly identical. Try with -O2.
Second, if you are only running for 0.22 seconds, weirdly variable costs involved with starting your program totally dominate the cost of running the test function. That function call is just a few instructions. If your CPU is running at 2 GHz, it ought to execute that function call in something like 20 nanoseconds, so you can see that whatever it is you're measuring, it's not the cost of running that function.
Try calling the test function in a loop, say 1,000,000 times. Make the number of loops 10x bigger until it takes > 10 seconds to run the test. Then divide the result by the number of loops for an approximation of the cost of the operation.
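A small harness along those lines might look like this. It is only a sketch: swapBytes is a made-up stand-in for the question's toBigEndian (written with unsigned types to keep the shifts well defined), nanosPerCall is a made-up name, and the volatile sink is there so the optimizer cannot drop the loop. The figure it returns includes the loop and the sink store, so treat it as an approximation.

```cpp
#include <chrono>

// Hypothetical stand-in for the function under test.
inline unsigned short swapBytes(unsigned short v){
    return static_cast<unsigned short>((v << 8) | (v >> 8));
}

// Times `loops` calls as a whole and returns the average cost per call in
// nanoseconds. loops must be greater than zero.
double nanosPerCall(unsigned long loops){
    volatile unsigned long long sink = 0; // keeps the results observable
    const auto begin = std::chrono::steady_clock::now();
    for (unsigned long i = 0; i < loops; ++i)
        sink = sink + swapBytes(static_cast<unsigned short>(i));
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = end - begin;
    return elapsed.count() / static_cast<double>(loops);
}
```

Increasing loops by a factor of 10 until the whole run takes several seconds, as suggested above, makes the start-up noise negligible.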

With many/most self-respecting modern compilers, the code you posted will still inline the function call even when it is called through the pointer. (Assuming the compiler makes a reasonable effort to optimize the code). The situation is just too easy to see through. In other words, the generated code can easily end up virtually the same in both cases, meaning that your test is not really useful for measuring what you are trying to measure.
If you really want to make sure the call is physically performed through the pointer, you have to make an effort to "confuse" the compiler to the point where it can't figure out the pointer value at compile time. For example, make the pointer value run-time dependent, as in
toBigEndianPtr = rand() % 1000 != 0 ? toBigEndian : NULL;
or something along these lines. You can also declare your function pointer as volatile, which will typically cause a genuine through-the-pointer call each time as well as force the compiler to re-read the pointer value from memory on each iteration.
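The volatile variant could be sketched like this. The names swapBytes, swapBytesPtr and sumThroughPointer are made up for the illustration, and swapBytes is an unsigned stand-in for the question's toBigEndian:

```cpp
// Hypothetical stand-in for the function under test.
inline unsigned short swapBytes(unsigned short v){
    return static_cast<unsigned short>((v << 8) | (v >> 8));
}

// The pointer object itself is volatile, so its value has to be re-read
// from memory for every call and cannot be assumed at compile time.
unsigned short (* volatile swapBytesPtr)(unsigned short) = swapBytes;

// Sums the results of `loops` calls made through the volatile pointer.
unsigned long long sumThroughPointer(unsigned int loops){
    unsigned long long total = 0;
    for (unsigned int i = 0; i < loops; ++i)
        total += swapBytesPtr(static_cast<unsigned short>(i));
    return total;
}
```

Timing sumThroughPointer against the same loop with a direct call to swapBytes then compares a genuine indirect call with a call the compiler is free to inline.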
