It seems to me that it would work perfectly well to do tail-recursion optimization in both C and C++, yet while debugging I never seem to see a call stack that indicates this optimization. That is kind of good, because the stack tells me how deep the recursion is. However, the optimization would be kind of nice as well.
Do any C++ compilers do this optimization? Why? Why not?
How do I go about telling the compiler to do it?
For MSVC: /O2 or /Ox
For GCC: -O2 or -O3
How about checking if the compiler has done this in a certain case?
For MSVC, enable PDB output so that you can trace the optimised code, then inspect the generated code
For GCC..?
I'd still take suggestions for how to determine if a certain function is optimized like this by the compiler (even though I find it reassuring that Konrad tells me to assume it)
It is always possible to check if the compiler does this at all by making an infinite recursion and checking if it results in an infinite loop or a stack overflow (I did this with GCC and found out that -O2 is sufficient), but I want to be able to check a certain function that I know will terminate anyway. I'd love to have an easy way of checking this :)
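For illustration, here is a quick-and-dirty sketch of that test (spin is a hypothetical function, not code from the question). Compile it with and without -O2: if the compiler performs the optimisation, the program just spins; if not, the stack overflows almost immediately.

unsigned spin(unsigned n)
{
    return spin(n + 1);   // tail position: nothing left to do after the call
}

int main()
{
    return (int)spin(0);  // never returns normally; only the stack behaviour matters
}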
After some testing, I discovered that destructors ruin the possibility of making this optimization. It can sometimes be worth it to change the scoping of certain variables and temporaries to make sure they go out of scope before the return-statement starts.
If any destructor needs to be run after the tail-call, the tail-call optimization can not be done.
All current mainstream compilers perform tail call optimisation fairly well (and have done for more than a decade), even for mutually recursive calls such as:
int bar(int, int);

int foo(int n, int acc) {
    return (n == 0) ? acc : bar(n - 1, acc + 2);
}

int bar(int n, int acc) {
    return (n == 0) ? acc : foo(n - 1, acc + 1);
}
Letting the compiler do the optimisation is straightforward: Just switch on optimisation for speed:
For MSVC, use /O2 or /Ox.
For GCC, Clang and ICC, use -O3
An easy way to check whether the compiler did the optimisation is to perform a call that would otherwise result in a stack overflow, or to look at the assembly output.
As an interesting historical note, tail call optimisation for C was added to GCC in the course of a diploma thesis by Mark Probst. The thesis describes some interesting caveats in the implementation. It's worth reading.
As well as the obvious (compilers don't do this sort of optimization unless you ask for it), there is a complexity about tail-call optimization in C++: destructors.
Given something like:
int fn(int j, int i)
{
    if (i <= 0) return j;
    Funky cls(j, i);
    return fn(j, i - 1);
}
The compiler can't (in general) tail-call optimize this, because it needs to call the destructor of cls after the recursive call returns.
Sometimes the compiler can see that the destructor has no externally visible side effects (so it can be done early), but often it can't.
A particularly common form of this is where Funky is actually a std::vector or similar.
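As noted in the question's update, one workaround is to narrow the scope of the local so its destructor runs before the tail call. A hedged sketch (Funky again stands in for any RAII type; whether the compiler then actually performs the optimisation still depends on the compiler and flags):

int fn(int j, int i)
{
    if (i <= 0) return j;
    {
        Funky cls(j, i);    // destructor runs at the end of this block
    }
    return fn(j, i - 1);    // nothing pending after the call: a true tail call
}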
gcc 4.3.2 completely inlines this function (a crappy/trivial atoi() implementation) into main(). Optimization level is -O1. I notice that if I play around with it (even changing it from static to extern), the tail recursion goes away pretty fast, so I wouldn't depend on it for program correctness.
#include <stdio.h>

static int atoi(const char *str, int n)
{
    if (str == 0 || *str == 0)
        return n;
    return atoi(str + 1, n * 10 + *str - '0');
}

int main(int argc, char **argv)
{
    for (int i = 1; i != argc; ++i)
        printf("%s -> %d\n", argv[i], atoi(argv[i], 0));
    return 0;
}
Most compilers don't do any kind of optimisation in a debug build.
If using VC, try a release build with PDB info turned on - this will let you trace through the optimised app and you should hopefully see what you want then. Note, however, that debugging and tracing an optimised build will jump you around all over the place, and often you cannot inspect variables directly as they only ever end up in registers or get optimised away entirely. It's an "interesting" experience...
As Greg mentions, compilers won't do it in debug mode. It's OK for debug builds to be slower than a prod build, but they shouldn't crash more often: and if you depend on a tail call optimization, they may do exactly that. Because of this it is often best to rewrite the tail call as a normal loop. :-(
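As a hedged sketch of that rewrite (a hypothetical accumulating sum, not code from the question), here is a tail-recursive function and the equivalent loop that works in unoptimised debug builds without growing the stack:

int sum(int n, int acc)            // tail-recursive version
{
    return n == 0 ? acc : sum(n - 1, acc + n);
}

int sum_loop(int n, int acc)       // same computation as a plain loop
{
    while (n != 0) {
        acc += n;
        --n;
    }
    return acc;
}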
Related
Consider the following code snippet
#include <vector>
#include <cstdlib>

void __attribute__ ((noinline)) calculate1(double& a, int x) { a += x; }
void __attribute__ ((noinline)) calculate2(double& a, int x) { a *= x; }

void wrapper1(double& a, int x) { calculate1(a, x); }
void wrapper2(double& a, int x) { calculate2(a, x); }

typedef void (*Func)(double&, int);

int main()
{
    std::vector<std::pair<double, Func>> pairs = {
        std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
        std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
    };
    for (auto& [a, wrapper] : pairs)
        (*wrapper)(a, 5);
    return pairs[0].first + pairs[1].first;
}
With -O3 optimization the latest gcc and clang versions do not replace the pointers to the wrappers with pointers to the underlying functions. See the assembly here at line 22:
mov ebp, OFFSET FLAT:wrapper2(double&, int) # tmp118,
which later results in a call followed by a jmp, instead of just a call, as would have happened had the compiler stored a pointer to calculate1 instead.
Note that I specifically asked for non-inlined calculate functions to illustrate the point; doing it without noinline results in another flavour of non-optimization, where the compiler will generate two identical functions to be called by pointer (so it still won't optimize, just in a different fashion).
What am I missing here? Is there any way to guide the compiler short of manually plugging in the correct functions (without wrappers)?
Edit 1. Following suggestions in the comments, here is a disassembly with all functions declared static, with exactly the same result (call + jmp instead of call).
Edit 2. Much simpler example of the same pattern:
#include <vector>
#include <cstdlib>

typedef void (*Func)(double&, int);

static void __attribute__ ((noinline)) calculate(double& a, int x) { a += x; }
static void wrapper(double& a, int x) { calculate(a, x); }

int main() {
    double a = 5.0;
    Func f;
    if (rand() % 2)
        f = &wrapper; // f = &calculate;
    else
        f = &wrapper;
    f(a, 0);
    return 0;
}
gcc 8.2 successfully optimizes this code by throwing the pointer to wrapper away and storing &calculate directly in its place (https://gcc.godbolt.org/z/nMIBeo). However, changing the line as per the comment (that is, performing part of the same optimization manually) breaks the magic and results in a pointless jmp.
You seem to be suggesting that &calculate1 should be stored in the vector instead of &wrapper1. In general this is not possible: later code might try to compare the stored pointer against &calculate1 and that must compare false.
I further assume that your suggestion is that the compiler might try to do some static analysis and determine that the function pointer values in the vector are never compared for equality with other function pointers, and in fact that none of the other operations done on the vector elements would produce a change in observable behaviour; and therefore in this exact program it could store &calculate1 instead.
Usually the answer to "why does the compiler not perform some particular optimization" is that nobody has conceived of and implemented that idea. Another common reason is that the static analysis involved is, in the general case, quite difficult and might lead to a slowdown in compilation with no benefit in real programs where the analysis could not be guaranteed to succeed.
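As a small, hedged illustration of the pointer-comparison point (reusing the question's calculate1/wrapper1 signatures in a standalone snippet), consider:

#include <cassert>

void calculate1(double& a, int x) { a += x; }
void wrapper1(double& a, int x) { calculate1(a, x); }

typedef void (*Func)(double&, int);

int main()
{
    Func stored = &wrapper1;
    // Distinct functions have distinct addresses, so this must hold.
    assert(stored != &calculate1);
    // If the compiler silently stored &calculate1 instead of &wrapper1,
    // this observable comparison would change from unequal to equal.
}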
You are making a lot of assumptions here. The first is about your syntax; the second is that compilers are perfect in the eye of the beholder and catch everything. The reality is that it is easy to find and hand-optimize compiler output: it is not difficult to write small functions that trip up a compiler you are well in tune with, or to write a decent-sized application and find places where you can hand-tune. This is all known and expected. Then opinion comes in: on my machine my blah is faster than blah, so it should have generated these instructions instead.
gcc is not a great compiler for performance; on some targets it has been getting worse for a number of major revisions. It is pretty good at what it does, better than pretty good: it deals with a number of preprocessors/languages, has a common middle end, and a number of backends. Some backends get better optimization applied front to back; others are just hanging on for the ride. There were a number of other compilers that could produce code that would easily outperform gcc.
These were mostly pay-for compilers. More than an individual would pay out of pocket: used car prices, sometimes recurring annually.
There are things that gcc can optimize that are simply amazing, and times when it goes totally in the wrong direction. The same goes for clang: often they do similar jobs with similar output, sometimes they do impressive things, sometimes they just go off into the weeds. I now find it more fun to manipulate the optimizer into doing good or bad things rather than worry about why it didn't do what I "think" it should have done on a particular occasion. If I need that code faster, I take the compiled output, hand-fix it, and use it as an assembly function.
You get what you pay for with gcc; if you were to look deep in its bowels you would find it is barely held together with duct tape and baling wire (llvm is catching up). But for a free tool it does a simply amazing job, and it is so widely used that you can get free support just about anywhere. We are sadly well into a time where folks think that because gcc interprets the language in a certain way, that is how the language is defined, and sadly that is not remotely true. But so many folks don't try other compilers to find out what "implementation defined" really means.
Last and most important, it's open source, if you want to "fix" an optimization then just do it. Keep that fix for yourself, post it, or try to push it upstream.
I'm looking for a compiler flag that will allow me to prevent the compiler optimising away the loop in code like this:
#include <cstdio>
#include <memory>

void func() {
    std::unique_ptr<int> up1(new int(0)), up2;
    up2 = std::move(up1);

    for (int i = 0; i < 1000000000; i++) {
        if (up2) {
            *up2 += 1;
        }
    }
    if (up2)
        printf("%d", *up2);
}
in both C++ and Rust code. I'm trying to compare similar sections of code in terms of speed, and actually running this loop (rather than just evaluating the overall result) is important. Since Rust statically guarantees that ownership of the pointer hasn't been moved away, it doesn't need the null-pointer checks on each iteration of the loop, and I would therefore expect it to produce faster code, provided the loop isn't optimised out for whatever reason.
Rust compiles using an LLVM backend, so I would preferably be using that for C++ as well.
In Rust you can use test::black_box.
In C++ (using gcc) asm volatile("" : "+r" (datum));. See this.
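For example, here is a hedged sketch of that GCC/Clang trick, assuming the extended-asm syntax mentioned above (the variable names are made up):

#include <cstdio>

int main()
{
    long sum = 0;
    for (long i = 0; i < 1000000000; ++i) {
        sum += i;
        // An empty asm that claims to read and modify `sum`: the optimiser
        // cannot see through it, so the loop body has to be kept.
        asm volatile("" : "+r"(sum));
    }
    std::printf("%ld\n", sum);
    return 0;
}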
One typical way to avoid having the compiler optimize away loops is to make their bounds indeterminate at compile time. In this example, rather than looping up to a hard-coded 1000000000, loop up to a count which is read from stdin or argv.
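A minimal sketch of that suggestion, reusing the question's loop with the bound taken from argv (hypothetical driver code):

#include <cstdio>
#include <cstdlib>
#include <memory>

int main(int argc, char** argv)
{
    // The trip count is only known at run time, so the loop cannot be
    // evaluated away at compile time.
    long n = argc > 1 ? std::strtol(argv[1], nullptr, 10) : 1000000000;

    std::unique_ptr<int> up1(new int(0)), up2;
    up2 = std::move(up1);

    for (long i = 0; i < n; ++i) {
        if (up2) {
            *up2 += 1;
        }
    }
    if (up2)
        std::printf("%d\n", *up2);
    return 0;
}

Note that this only prevents the whole loop from being folded at compile time; a clever optimiser may still rewrite the body (for example turning the n increments into a single addition), which is why the asm barrier above is the more robust option.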
This question already has answers here:
Is Loop Hoisting still a valid manual optimization for C code?
My question is specifically about the gcc compiler.
Often, inside a loop, I have to use a value returned by a function that is constant during the whole loop.
I would like to know whether it's better to store this constant return value in a variable beforehand (let's imagine a long loop), or whether a compiler like gcc is able to perform some optimization to cache the constant value, because it would recognize it as constant through the loop.
For example when I loop over chars in a string, I often write something like that:
bool find_something(string s, char something)
{
    size_t sz = s.size();
    for (size_t i = 0; i != sz; i++)
        if (s[i] == something) return true;
    return false;
}
but with a clever compiler, I could use the following (which is shorter and clearer):
bool find_something(string s, char something)
{
    for (size_t i = 0; i != s.size(); i++)
        if (s[i] == something) return true;
    return false;
}
then the compiler could detect that the code inside the loop doesn't modify the string object, and would generate code that caches the value returned by s.size(), instead of making a (slower) function call on each iteration.
Is there such optimization with gcc?
Generally there's nothing in your example that makes it impossible for the compiler to move the .size() computation before the loop. And in fact GCC 5.2.0 will produce exactly the same code for both implementations that you showed.
I would, however, strongly suggest against relying on optimizations like that (in really performance-critical code), because a small change somewhere (GCC's optimizer, the implementation details of std::string, ...) could break GCC's ability to do this optimization.
However I don't see a point in writing the more verbose version in the usual 90% of code that's not really performance-critical.
Given a current C++ compiler though I would go for the even more concise:
bool find_something(std::string s, char something)
{
    for (char ch : s)
        if (ch == something) return true;
    return false;
}
Which BTW also yields very similar machine code with GCC 5.2.0.
The compiler has to know the object is not modified in a different thread. It can tell that the function won't change if the object doesn't change, but it can't tell that the object won't change from some other stimulus.
The compiler will be able to inline the call to the size() member function if you include some form of whole-program optimization
It depends on whether or not the compiler can identify the function call as constant. Consider the following function that may reside in an external library which cannot be analyzed by the compiler.
int odd_size(string s) {
    static int a = 0;
    return a++;
}
This function will return distinct values regardless of the input parameter. The compiler can therefore not assume a constant return value even if the passed string object remains constant. No optimization will be applied.
On the other hand, if the compiler detects a constant function call, which may be the case in your example, it probably moves the constant expression out of the loop.
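For completeness, here is a hedged sketch of hoisting by hand when the call really is opaque (extern_size is a hypothetical function defined in another translation unit, so the compiler cannot prove its result is loop-invariant):

#include <string>

int extern_size(const std::string& s);   // defined elsewhere (hypothetical)

bool contains(const std::string& s, char c)
{
    const int n = extern_size(s);        // hoisted manually: called only once
    for (int i = 0; i < n; ++i)
        if (s[i] == c) return true;
    return false;
}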
Older versions of gcc had an explicit option -floop-optimize which was responsible for that task. From the gcc-3.4.5 documentation:
-floop-optimize
Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well.
Enabled at levels -O, -O2, -O3, -Os.
I cannot find this option in current versions of gcc, but I'm quite sure that they include this type of optimization as well.
I think I have a serious compiler mistrust. Do branches inside inline functions get optimized out if they have constant results?
For the example function:
#define MODE_FROM_X_TO_Y 0
#define MODE_FROM_Y_TO_X 1
inline void swapValues(int & x, int & y, int mode) {
    switch (mode) {
    case MODE_FROM_X_TO_Y:
        y = x;
        break;
    case MODE_FROM_Y_TO_X:
        x = y;
        break;
    }
}
Would:
swapValues(n,m,MODE_FROM_X_TO_Y);
be optimized as:
m = n;
First off, it won't even compile (until you add a return type).
Secondly, swap is an awesomely badly chosen name (since it doesn't do a swap, and conflicts with the std::swap name).
Thirdly, head over to http://gcc.godbolt.org/:
Live On Godbolt
Generally speaking, the answers to these questions are compiler dependent.
To get an answer to your question with your code, with your compiler (and compiler version), and compiler settings (e.g. optimisation flags) you will need to examine the code output by the compiler.
Branches in any code - not just within inlined functions - can potentially be "optimized out" if the compiler can detect that the same branch is always followed.
Some modern compilers are also smart enough to not inline a function declared inline if it evaluates another function as a better candidate for inlining. A number of modern compilers can do a better job making such decisions than a typical C/C++ programmer.
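As a hedged illustration using the question's own function: after inlining, mode is the compile-time constant MODE_FROM_X_TO_Y, so constant propagation can typically remove the switch entirely at -O2 (whether it actually does depends on the compiler and settings):

#define MODE_FROM_X_TO_Y 0
#define MODE_FROM_Y_TO_X 1

inline void swapValues(int & x, int & y, int mode) {
    switch (mode) {
    case MODE_FROM_X_TO_Y:
        y = x;
        break;
    case MODE_FROM_Y_TO_X:
        x = y;
        break;
    }
}

int use(int n)                          // hypothetical caller
{
    int m = 0;
    swapValues(n, m, MODE_FROM_X_TO_Y); // typically folds to just m = n
    return m;
}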
Possible Duplicate:
Can a recursive function be inline?
What are the trade-offs of making recursive functions inline?
Recursive functions that can be optimised by tail-call elimination can certainly be inlined. If the last thing a function does is call itself, then it can be converted into a plain loop.
Arbitrary recursive functions can't be inlined for the same reason a snake can't swallow its own tail.
[Edit: just noticed that although your title says "be inlined", your actual question says "making functions inline". The two effectively have nothing to do with one another, they just have confusingly similar names. In modern compilers, the primary effect of inline is the thing that originally in C99 was (I think) just a necessary detail to make inline work at all: to permit multiple definitions of a symbol with external linkage. That's because modern compilers don't pay a whole lot of attention to the programmer's opinion of whether a function should be inlined. They do pay some, though, so the confusion of concepts persists. I've answered the question in the title, which is the decision the compiler makes, not the question in the body, which is the decision the programmer makes.]
Inlining is not necessarily an all-or-nothing deal. One strategy which compilers use to decide whether to inline, is to keep inlining function calls until the resulting code is "too big". "Big" is defined by some hopefully sensible heuristic.
So consider the following recursive function (which deliberately is not simply tail-recursive):
int triangle(int n) {
    if (n == 1) return 1;
    return n + triangle(n - 1);
}
If it's called like this:
int t100() {
    return triangle(100);
}
Then there's no particular reason in principle that the usual rules that the compiler uses for inlining shouldn't result in this:
int t100() {
    // inline call to triangle(100)
    int result;
    if (100 == 1) { result = 1; } else {
        // inline call to triangle(99)
        int t99;
        if (100 - 1 == 1) { t99 = 1; } else {
            // inline call to triangle(98)
            int t98;
            if (100 - 1 - 1 == 1) { t98 = 1; } else {
                // oops, "too big", no more inlining
                t98 = triangle(100 - 1 - 1 - 1) + 98;
            }
            t99 = t98 + 99;
        }
        result = t99 + 100;
    }
    return result;
}
Obviously the optimiser will have a field day with that, so it's much "smaller" than it looks:
int t100() {
    return triangle(97) + 297;
}
The code in triangle itself could be "unrolled" a few steps by a few levels of inlining, in exactly the same way, except that it doesn't have the benefits of constants:
int triangle(int n) {
    if (n == 1) return 1;
    if (n == 2) return 3;
    if (n == 3) return 6;
    return triangle(n - 3) + 3*n - 3;
}
I doubt whether compilers actually do this, though; I don't think I've ever noticed it. [Edit: MSVC does if you tell it to, thanks peterchen.]
There's an obvious potential benefit in saving call overhead, but as against that people don't really expect recursive functions to get inlined, and there's no particular guarantee that the usual inlining heuristics will perform well with recursive functions (where there are two different places, the call site and the recursive call, that might be inlined, with different benefits in each case). Furthermore, it's difficult at compile time to estimate how deep the recursion will go, and the inline heuristics might like to take account of the call depth to make decisions. So it may be that the compiler just doesn't bother.
Functional language compilers are typically a lot more aggressive dealing with recursion than C or C++ compilers. The relevant trade-off there is that so many functions written in functional languages are recursive, that performance might be hopeless if the compiler couldn't optimise tail-recursion. So Lisp programmers typically rely on good optimisation of recursive functions, whereas C and C++ programmers typically don't.
If your compiler does not support it, you can try manually inlining instead...
int factorial(int n) {
    int result = 1;
    if (n-- == 0) {
        return result;
    } else {
        result *= 1;
        if (n-- == 0) {
            return result;
        } else {
            result *= 2;
            if (n-- == 0) {
                return result;
            } else {
                result *= 3;
                if (n-- == 0) {
                    return result;
                } else {
                    result *= 4;
                    if (n-- == 0) {
                        return result;
                    } else {
                        // ...
                    }
                }
            }
        }
    }
}
See the problem yet?
Tail recursion (a special case of recursion) can be inlined by smart compilers.
Now, hold on. A tail-recursive function could be unrolled and inlined pretty easily. Apparently there are compilers that do this, but I am not aware of specifics.
Of course. Any function can be inlined if it makes sense to do it:
int f(int i)
{
    if (i <= 0) return 1;
    else return i * f(i - 1);
}

int main()
{
    return f(10);
}
pseudo assembly (f is inlined in main):
main:
    mov r0, #10      ; pass 10 to f
f:
    cmp r0, #0       ; arg <= 0? ...
    bgt 1f           ; no: go recurse
    mov r0, #1       ; ... if so, return 1
    ret
1:
    mov r0, -(sp)    ; if not, save arg
    dec r0           ; pass arg - 1 to f
    call f           ; just because it's inlined doesn't mean I can't call it
    mul r0, (sp)+    ; compute the result
    ret              ; done
;-)
When you call an ordinary function, you change the sequential flow of execution and jump (call or jmp) to the address where the function resides. Inlining means that the function's instructions are placed at every place the function is used, so there is no single place to jump to; it also enables other optimisations, such as eliminating the pushing and popping of function arguments.
When you know that the recursive chain will normally not be very long, you could inline up to a predefined depth (I don't know whether any existing compiler is intelligent enough to do this today).
Inlining a recursive function is much like unrolling a loop. You will end up with much duplicate code -- but in some cases it could be worthwhile:
The number of recursive calls (the length of the chain) is normally short (if it gets longer than the predefined limit, just fall back to normal recursion)
The overhead of the function calls is relatively large compared to the logic -- so do some "unrolling", for example five instances, and only then make a recursive call again; this would save roughly 80% of the call overhead (see the sketch after this list)
Of course there is the tail-recursive special case -- but that was mentioned by others.
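Here is a hedged sketch of that partial unrolling, using a simple countdown sum as a stand-in for real work (a hand-written illustration, not output any particular compiler is known to produce):

long sum_to(long n)
{
    if (n <= 0) return 0;
    long acc = 0;
    // Five "levels" handled inline before paying for another call,
    // so the number of actual calls drops to roughly n / 5.
    acc += n; if (--n <= 0) return acc;
    acc += n; if (--n <= 0) return acc;
    acc += n; if (--n <= 0) return acc;
    acc += n; if (--n <= 0) return acc;
    acc += n; if (--n <= 0) return acc;
    return acc + sum_to(n);
}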
Of course it can be declared inline. The inline keyword is just a hint to the compiler. In many cases the compiler simply ignores it, and depending on the compiler you could end up in one of these situations:
Some compilers can turn tail recursion into plain loops, and thus inline them normally.
Non-tail recursion could be inlined up to a given depth, usually decided by the compiler.
I've never encountered a practical application for that, as the cost of a call isn't high enough anymore to offset the increase in code size.
[edit] (to clarify: even though I like to toy with these things, and often check what code my compiler generates for "funny stuff" just out of curiosity, I haven't encountered a use case where any such unrolling helped significantly. This doesn't mean they don't exist or couldn't be constructed.)
The only place where it would help is precalculating the first few levels of recursion at compile time. However, in my experience this immensely increases compile times for often negligible runtime performance benefits.
Note that Visual Studio 2008 (and earlier) gives you quite some control over this:
#pragma inline_recursion(on)
#pragma inline_depth(N)
__forceinline
Be careful with the latter, it can easily overload the compiler :)
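A hedged, MSVC-specific sketch of how those pragmas might be combined (other compilers will ignore the pragmas and may not recognise __forceinline):

#pragma inline_recursion(on)   // allow recursive calls to be considered for inlining
#pragma inline_depth(8)        // but stop expanding after 8 levels

__forceinline int triangle(int n)
{
    return n <= 1 ? 1 : n + triangle(n - 1);
}

int t10()
{
    return triangle(10);       // up to the configured depth may be expanded inline
}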
Inline means that at each place a call to a function marked as inline is made, the compiler places a copy of that function's code there. This avoids the function-calling mechanism and its usual pushing and popping of arguments on the stack, saving time in gazillion-calls-per-second situations. You see the consequences for static variables and stuff like that? All gone...
So, if you had an inlined recursive call, either your compiler is super smart and can figure out whether the number of copies is deterministic, or it will say "Cannot make it inline", because it wouldn't know when to stop.