Why don't compilers optimize trivial wrapper function pointers?

Why don't compilers optimize trivial wrapper function pointers? - c++

Consider the following code snippet
#include <vector>
#include <cstdlib>
void __attribute__ ((noinline)) calculate1(double& a, int x) { a += x; };
void __attribute__ ((noinline)) calculate2(double& a, int x) { a *= x; };
void wrapper1(double& a, int x) { calculate1(a, x); }
void wrapper2(double& a, int x) { calculate2(a, x); }
typedef void (*Func)(double&, int);
int main()
{
std::vector<std::pair<double, Func>> pairs = {
std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
};
for (auto& [a, wrapper] : pairs)
(*wrapper)(a, 5);
return pairs[0].first + pairs[1].first;
}
With -O3 optimization the latest gcc and clang versions do not optimize the pointers to wrappers to pointers to underlying functions. See assembly here at line 22:
mov ebp, OFFSET FLAT:wrapper2(double&, int) # tmp118,
which results later in call + jmp, instead of just call had the compiler put a pointer to the calculate1 instead.
Note that I specifically asked for no-inlined calculate functions to illustrate; doing it without noinline results in another flavour of non-optimization where compiler will generate two identical functions to be called by pointer (so still won't optimize, just in a different fashion).
What am I missing here? Is there any way to guide the compiler short of manually plugging in the correct functions (without wrappers)?
Edit 1. Following suggestions in the comments, here is a disassembly with all functions declared static, with exactly the same result (call + jmp instead of call).
Edit 2. Much simpler example of the same pattern:
#include <vector>
#include <cstdlib>
typedef void (*Func)(double&, int);
static void __attribute__ ((noinline)) calculate(double& a, int x) { a += x; };
static void wrapper(double& a, int x) { calculate(a, x); }
int main() {
double a = 5.0;
Func f;
if (rand() % 2)
f = &wrapper; // f = &calculate;
else
f = &wrapper;
f(a, 0);
return 0;
}
gcc 8.2 successfully optimizes this code by throwing pointer to wrapper away and storing &calculate directly in its place (https://gcc.godbolt.org/z/nMIBeo). However changing the line as per comment (that is, performing part of the same optimization manually) breaks the magic and results in pointless jmp.

You seem to be suggesting that &calculate1 should be stored in the vector instead of &wrapper1. In general this is not possible: later code might try to compare the stored pointer against &calculate1 and that must compare false.
I further assume that your suggestion is that the compiler might try to do some static analysis and determine that the function pointers values in the vector are never compared for equality with other function pointers, and in fact that none of the other operations done on the vector elements would produce a change in observable behaviour; and therefore in this exact program it could store &calculate1 instead.
Usually the answer to "why does the compiler not perform some particular optimization" is that nobody has conceived of and implemented that idea. Another common reason is that the static analysis involved is, in the general case, quite difficult and might lead to a slowdown in compilation with no benefit in real programs where the analysis could not be guaranteed to succeed.

You are making a lot of assumptions here. Firstly, your syntax. The second is that compilers are perfect in the eye of the beholder and catch everything. The reality is that it is easy to find and hand optimize compiler output, it is not difficult to write small functions to trip up a compiler that you are well in tune with or write a decent size application and there will be places where you can hand tune. This is all known and expected. Then opinion comes in where on my machine my blah is faster than blah so it should have made these instructions instead.
gcc is not a great compiler for performance, on some targets it has been getting worse for a number of major revs. It is pretty good at what it does, better than pretty good, it deals with a number of pre processors/languages has a common middle and a number of backends. Some backends get better optimization applied front to back others are just hanging on for the ride.There were a number of other compilers that could produce code that could easily outperform gcc.
These were mostly pay-for compilers. More than an individual would pay out of pocket: used car prices, sometimes recurring annually.
There are things that gcc can optimize that are simply amazing and times that it totally goes in the wrong direction. Same goes for clang, often they do similar jobs with similar output, sometimes do some impressive things sometimes just go off into the weeds. I now find it more fun to manipulate the optimizer to make it do good or bad things rater than worry about why didn't it do what I "think" it should have done on a particular occasion. If I need that code faster I take the compiled output and hand fix it and use it as an assembly function.
You get what you pay for with gcc, if you were to look deep in its bowels you will find it is barely held together with duct tape and bailing wire (llvm is catching up). But for a free tool it does a simply amazing job, it is so widely used that you can get free support just about anywhere. We are sadly well into a time where folks think that because gcc interprets the language in a certain way that is how the language is defined and sadly that is not remotely true. But so many folks don't try other compilers to find out what "implementation defined" really means.
Last and most important, it's open source, if you want to "fix" an optimization then just do it. Keep that fix for yourself, post it, or try to push it upstream.

Related

Branches in Inline Functions

I think I have a serious compiler mistrust. Do branches inside inline functions get optimized out if they have constant results?
For the example function:
#define MODE_FROM_X_TO_Y 0
#define MODE_FROM_Y_TO_X 1
inline void swapValues(int & x, int & y, int mode) {
switch(mode) {
case MODE_FROM_X_TO_Y:
y = x;
break;
case MODE_FROM_Y_TO_X:
x = y;
break;
}
}
Would:
swapValues(n,m,MODE_FROM_X_TO_Y);
be optimized as:
n = m;

First off, it won't even compile (until you add a return type).
Secondly, swap is an awesomely badly chosen name (since it doesn't do a swap, and conflicts with the std::swap name).
Thirdly, head over to http://gcc.godbolt.org/:
Live On Godbolt

Generally speaking, the answers to these questions are compiler dependent.
To get an answer to your question with your code, with your compiler (and compiler version), and compiler settings (e.g. optimisation flags) you will need to examine the code output by the compiler.
Branches in any code - not just within inlined functions - can potentially be "optimized out" if the compiler can detect that the same branch is always followed.
Some modern compilers are also smart enough to not inline a function declared inline if it evaluates another function as a better candidate for inlining. A number of modern compilers can do a better job making such decisions than a typical C/C++ programmer.

What, in short words, does the GCC option -fipa-pta do?

According to the GCC manual, the -fipa-pta optimization does:
-fipa-pta: Perform interprocedural pointer analysis and interprocedural modification and reference analysis. This option can cause excessive
memory and compile-time usage on large compilation units. It is not
enabled by default at any optimization level.
What I assume is that GCC tries to differentiate mutable and immutable data based on pointers and references used in a procedure. Can someone with more in-depth GCC knowledge explain what -fipa-pta does?

I think the word "interprocedural" is the key here.
I'm not intimately familiar with gcc's optimizer, but I've worked on optimizing compilers before. The following is somewhat speculative; take it with a small grain of salt, or confirm it with someone who knows gcc's internals.
An optimizing compiler typically performs analysis and optimization only within each individual function (or subroutine, or procedure, depending on the language). For example, given code like this contrived example:
double *ptr = ...;
void foo(void) {
...
*ptr = 123.456;
some_other_function();
printf("*ptr = %f\n", *ptr);
}
the optimizer will not be able to determine whether the value of *ptr has been changed by the call to some_other_function().
If interprocedural analysis is enabled, then the optimizer can analyze the behavior of some_other_function(), and it may be able to prove that it can't modify *ptr. Given such analysis, it can determine that the expression *ptr must still evaluate to 123.456, and in principle it could even replace the printf call with puts("ptr = 123.456");.
(In fact, with a small program similar to the above code snippet I got the same generated code with -O3 and -O3 -fipa-pta, so I'm probably missing something.)
Since a typical program contains a large number of functions, with a huge number of possible call sequences, this kind of analysis can be very expensive.

As quoted from this article:
The "-fipa-pta" optimization takes the bodies of the called functions into account when doing the analysis, so compiling
void __attribute__((noinline))
bar(int *x, int *y)
{
*x = *y;
}
int foo(void)
{
int a, b = 5;
bar(&a, &b);
return b + 10;
}
with -fipa-pta makes the compiler see that bar does not modify b, and the compiler optimizes foo by changing b+10 to 15
int foo(void)
{
int a, b = 5;
bar(&a, &b);
return 15;
}
A more relevant example is the “slow” code from the “Integer division is slow” blog post
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);
for (int i = 0; i < 1000000000; i++) {
volatile auto r = theIntDist(randGenerator);
}
Compiling this with -fipa-pta makes the compiler see that theIntDist is not modified within the loop, and the inlined code can thus be constant-folded in the same way as the “fast” version – with the result that it runs four times faster.

Pure/const functions in C++

I'm thinking of using pure/const functions more heavily in my C++ code. (pure/const attribute in GCC)
However, I am curious how strict I should be about it and what could possibly break.
The most obvious case are debug outputs (in whatever form, could be on cout, in some file or in some custom debug class). I probably will have a lot of functions, which don't have any side effects despite this sort of debug output. No matter if the debug output is made or not, this will absolutely have no effect on the rest of my application.
Or another case I'm thinking of is the use of some SmartPointer class which may do some extra stuff in global memory when being in debug mode. If I use such an object in a pure/const function, it does have some slight side effects (in the sense that some memory probably will be different) which should not have any real side effects though (in the sense that the behaviour is in any way different).
Similar also for mutexes and other stuff. I can think of many complex cases where it has some side effects (in the sense of that some memory will be different, maybe even some threads are created, some filesystem manipulation is made, etc) but has no computational difference (all those side effects could very well be left out and I would even prefer that).
So, to summarize, I want to mark functions as pure/const which are not pure/const in a strict sense. An easy example:
int foo(int) __attribute__((const));
int bar(int x) {
int sum = 0;
for(int i = 0; i < 100; ++i)
sum += foo(x);
return sum;
}
int foo_callcounter = 0;
int main() {
cout << "bar 42 = " << bar(42) << endl;
cout << "foo callcounter = " << foo_callcounter << endl;
}
int foo(int x) {
cout << "DEBUG: foo(" << x << ")" << endl;
foo_callcounter++;
return x; // or whatever
}
Note that the function foo is not const in a strict sense. Though, it doesn't matter what foo_callcounter is in the end. It also doesn't matter if the debug statement is not made (in case the function is not called).
I would expect the output:
DEBUG: foo(42)
bar 42 = 4200
foo callcounter = 1
And without optimisation:
DEBUG: foo(42) (100 times)
bar 42 = 4200
foo callcounter = 100
Both cases are totally fine because what only matters for my usecase is the return value of bar(42).
How does it work out in practice? If I mark such functions as pure/const, could it break anything (considering that the code is all correct)?
Note that I know that some compilers might not support this attribute at all. (BTW., I am collecting them here.) I also know how to make use of thes attributes in a way that the code stays portable (via #defines). Also, all compilers which are interesting to me support it in some way; so I don't care about if my code runs slower with compilers which do not.
I also know that the optimised code probably will look different depending on the compiler and even the compiler version.
Very relevant is also this LWN article "Implications of pure and constant functions", especially the "Cheats" chapter. (Thanks ArtemGr for the hint.)

I'm thinking of using pure/const functions more heavily in my C++ code.
That’s a slippery slope. These attributes are non-standard and their benefit is restricted mostly to micro-optimizations.
That’s not a good trade-off. Write clean code instead, don’t apply such micro-optimizations unless you’ve profiled carefully and there’s no way around it. Or not at all.
Notice that in principle these attributes are quite nice because they state implied assumptions of the functions explicitly for both the compiler and the programmer. That’s good. However, there are other methods of making similar assumptions explicit (including documentation). But since these attributes are non-standard, they have no place in normal code. They should be restricted to very judicious use in performance-critical libraries where the author tries to emit best code for every compiler. That is, the writer is aware of the fact that only GCC can use these attributes, and has made different choices for other compilers.

You could definitely break the portability of your code. And why would you want to implement your own smart pointer - learning experience apart? Aren't there enough of them available for you in (near) standard libraries?

I would expect the output:
I would expect the input:
int bar(int x) {
return foo(x) * 100;
}
Your code actually looks strange for me. As a maintainer I would think that either foo actually has side effects or more likely rewrite it immediately to the above function.
How does it work out in practice? If I mark such functions as pure/const, could it break anything (considering that the code is all correct)?
If the code is all correct then no. But the chances that your code is correct are small. If your code is incorrect then this feature can mask out bugs:
int foo(int x) {
globalmutex.lock();
// complicated calculation code
return -1;
// more complicated calculation
globalmutex.unlock();
return x;
}
Now given the bar from above:
int main() {
cout << bar(-1);
}
This terminates with __attribute__((const)) but deadlocks otherwise.
It also highly depends on the implementation. For example:
void f() {
for(;;)
{
globalmutex.unlock();
cout << foo(42) << '\n';
globalmutex.lock();
}
}
Where the compiler should move the call foo(42)? Is it allowed to optimize this code? Not in general! So unless the loop is really trivial you have no benefits of your feature. But if your loop is trivial you can easily optimize it yourself.
EDIT: as Albert requested a less obvious situation, here it comes:
F
or example if you implement operator << for an ostream, you use the ostream::sentry which locks the stream buffer. Suppose you call pure/const f after you released or before you locked it. Someone uses this operator cout << YourType() and f also uses cout << "debug info". According to you the compiler is free to put the invocation of f into the critical section. Deadlock occurs.

I would examine the generated asm to see what difference they make. (My guess would be that switching from C++ streams to something else would yield more of a real benefit, see: http://typethinker.blogspot.com/2010/05/are-c-iostreams-really-slow.html )

I think nobody knows this (with the exception of gcc programmers), simply because you rely on undefined and undocumented behaviour, which can change from version to version. But how about something like this:
#ifdef NDEBUG \
#define safe_pure __attribute__((pure)) \
#else \
#define safe_pure \
#endif
I know it's not exactly what you want, but now you can use the pure attribute without breaking the rules.
If you do want to know the answer, you may ask in the gcc forums (mailing list, whatever), they should be able to give you the exact answer.
Meaning of the code: When NDEBUG (symbol used in assert macros) is defined, we don't debug, have no side effects, can use pure attribute. When it is defined, we have side effects, so it won't use pure attribute.

Performance implications of &p[0] vs. p.get() in boost::scoped_array

The topic generically says it all. Basically in a situation like this:
boost::scoped_array<int> p(new int[10]);
Is there any appreciable difference in performance between doing: &p[0] and p.get()?
I ask because I prefer the first one, it has a more natural pointer like syntax. In fact, it makes it so you could replace p with a native pointer or array and not have to change anything else.
I am guessing since get is a one liner "return ptr;" that the compiler will inline that, and I hope that it is smart enough to to inline operator[] in such a way that it is able to not dereference and then immediately reference.
Anyone know?

The only way to know is to actually measure it!
But if you have the source of the boost:scoped_array you could llok at the code and see what it does. I am sure it is pretty similar.
T * scoped_array::get() const // never throws
{
return ptr;
}
T & scoped_array::operator[](std::ptrdiff_t i) const // never throws
{
BOOST_ASSERT(ptr != 0);
BOOST_ASSERT(i >= 0);
return ptr[i];
}
Write two versions of the code (one using get() the other using operator[]). Compile to assembley with optimizations turned on. See if your compiler actually manages to optimize away the ptr+0.

OK, I've done some basic tests as per Martin York's suggestions.
It seems that g++ (4.3.2) is actually pretty good about this. At both -O2 and -O3 optimization levels, it outputs slightly different but functionally equivalent assembly for both &p[0] and p.get().
At -Os as expected, it took the path of least complexity and emits a call to the operator[]. One thing to note is that the &p[0] version does cause g++ to emit a copy of the operator[] body, but it is never used, so there is a slight code bloat if you never use operator[] otherwise:
The tested code was this (with the #if both 0 and 1):
#include <boost/scoped_array.hpp>
#include <cstdio>
int main() {
boost::scoped_array<int> p(new int[10]);
#if 1
printf("%p\n", &p[0]);
#else
printf("%p\n", p.get());
#endif
}

Is this a question you're asking just for academic interest or is this for some current code you're writing?
In general, people recommend that code clarity is more important that speed, so unless your sure this is going to make a difference you should pick whichever option is clearer and matches your code base. FWIW, I personally think that the get() is clearer.

Which, if any, C++ compilers do tail-recursion optimization?

It seems to me that it would work perfectly well to do tail-recursion optimization in both C and C++, yet while debugging I never seem to see a frame stack that indicates this optimization. That is kind of good, because the stack tells me how deep the recursion is. However, the optimization would be kind of nice as well.
Do any C++ compilers do this optimization? Why? Why not?
How do I go about telling the compiler to do it?
For MSVC: /O2 or /Ox
For GCC: -O2 or -O3
How about checking if the compiler has done this in a certain case?
For MSVC, enable PDB output to be able to trace the code, then inspect the code
For GCC..?
I'd still take suggestions for how to determine if a certain function is optimized like this by the compiler (even though I find it reassuring that Konrad tells me to assume it)
It is always possible to check if the compiler does this at all by making an infinite recursion and checking if it results in an infinite loop or a stack overflow (I did this with GCC and found out that -O2 is sufficient), but I want to be able to check a certain function that I know will terminate anyway. I'd love to have an easy way of checking this :)
After some testing, I discovered that destructors ruin the possibility of making this optimization. It can sometimes be worth it to change the scoping of certain variables and temporaries to make sure they go out of scope before the return-statement starts.
If any destructor needs to be run after the tail-call, the tail-call optimization can not be done.

All current mainstream compilers perform tail call optimisation fairly well (and have done for more than a decade), even for mutually recursive calls such as:
int bar(int, int);
int foo(int n, int acc) {
return (n == 0) ? acc : bar(n - 1, acc + 2);
}
int bar(int n, int acc) {
return (n == 0) ? acc : foo(n - 1, acc + 1);
}
Letting the compiler do the optimisation is straightforward: Just switch on optimisation for speed:
For MSVC, use /O2 or /Ox.
For GCC, Clang and ICC, use -O3
An easy way to check if the compiler did the optimisation is to perform a call that would otherwise result in a stack overflow — or looking at the assembly output.
As an interesting historical note, tail call optimisation for C was added to the GCC in the course of a diploma thesis by Mark Probst. The thesis describes some interesting caveats in the implementation. It's worth reading.

As well as the obvious (compilers don't do this sort of optimization unless you ask for it), there is a complexity about tail-call optimization in C++: destructors.
Given something like:
int fn(int j, int i)
{
if (i <= 0) return j;
Funky cls(j,i);
return fn(j, i-1);
}
The compiler can't (in general) tail-call optimize this because it needs
to call the destructor of cls after the recursive call returns.
Sometimes the compiler can see that the destructor has no externally visible side effects (so it can be done early), but often it can't.
A particularly common form of this is where Funky is actually a std::vector or similar.

gcc 4.3.2 completely inlines this function (crappy/trivial atoi() implementation) into main(). Optimization level is -O1. I notice if I play around with it (even changing it from static to extern, the tail recursion goes away pretty fast, so I wouldn't depend on it for program correctness.
#include <stdio.h>
static int atoi(const char *str, int n)
{
if (str == 0 || *str == 0)
return n;
return atoi(str+1, n*10 + *str-'0');
}
int main(int argc, char **argv)
{
for (int i = 1; i != argc; ++i)
printf("%s -> %d\n", argv[i], atoi(argv[i], 0));
return 0;
}

Most compilers don't do any kind of optimisation in a debug build.
If using VC, try a release build with PDB info turned on - this will let you trace through the optimised app and you should hopefully see what you want then. Note, however, that debugging and tracing an optimised build will jump you around all over the place, and often you cannot inspect variables directly as they only ever end up in registers or get optimised away entirely. It's an "interesting" experience...

As Greg mentions, compilers won't do it in debug mode. It's ok for debug builds to be slower than a prod build, but they shouldn't crash more often: and if you depend on a tail call optimization, they may do exactly that. Because of this it is often best to rewrite the tail call as an normal loop. :-(

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js