Related
In the following example of templated function, is the central if inside the for loop guaranteed to be optimized out, leaving the used instructions only?
If this is not guaranteed to be optimized (in GCC 4, MSVC 2013 and llvm 8.0), what are the alternatives, using C++11 at most?
NOTE that this function does nothing usable, and I know that this specific function can be optimized in several ways and so on. But all I want to focus is on how the bool template argument works in generating code.
template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
float ret = (IsMin ? std::numeric_limits<float>::max() : -std::numeric_limits<float>::max());
for (int x = 0; x < arraySize; x++) {
// Is this code optimized by the compiler to skip the unnecessary if?
if (isMin) {
if (ret > vals[x]) ret = vals[x];
} else {
if (ret < vals[x]) ret = vals[x];
}
}
return val;
}
In theory no. The C++ standard permits compilers to be not just dumb, but downright hostile. It could inject code doing useless stuff for no reason, so long as the abstract machine behaviour remains the same.1
In practice, yes. Dead code elimination and constant branch detection are easy, and every single compiler I have ever checked eliminates that if branch.
Note that both branches are compiled before one is eliminated, so they both must be fully valid code. The output assembly behaves "as if" both branches exist, but the branch instruction (and unreachable code) is not an observable feature of the abstract machine behaviour.
Naturally if you do not optimize, the branch and dead code may be left in, so you can move the instruction pointer into the "dead code" with your debugger.
1 As an example, nothing prevents a compiler from implementing a+b as a loop calling inc in assembly, or a*b as a loop adding a repeatedly. This is a hostile act by the compiler on almost all platforms, but not banned by the standard.
There is no guarantee that it will be optimized away. There is a pretty good chance that it will be though since it is a compile time constant.
That said C++17 gives us if constexpr which will only compile the code that pass the check. If you want a guarantee then I would suggest you use this feature instead.
Before C++17 if you only want one part of the code to be compiled you would need to specialize the function and write only the code that pertains to that specialization.
Since you ask for an alternative in C++11 here is one :
float IterateOverArrayImpl(float* vals, int arraySize, std::false_type)
{
float ret = -std::numeric_limits<float>::max();
for (int x = 0; x < arraySize; x++) {
if (ret < vals[x])
ret = vals[x];
}
return ret;
}
float IterateOverArrayImpl(float* vals, int arraySize, std::true_type)
{
float ret = std::numeric_limits<float>::max();
for (int x = 0; x < arraySize; x++) {
if (ret > vals[x])
ret = vals[x];
}
return ret;
}
template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
return IterateOverArrayImpl(vals, arraySize, std::integral_constant<bool, IsMin>());
}
You can see it in live here.
The idea is to use function overloading to handle the test.
For constexpr functions the only option is to have recursive functions for anything but simple things. The problem with that is that recursive functions are expensive at run time (especially if you are going to be calling yourself a lot of times).
So is it possible to implement 2 functions one for constexpr and the other for normal use:
constexpr int fact(int x){ //Use this at compile time
return x == 0 ? 1 : fact(x-1)*x;
}
int fact(int x){ //Use this for real calls
int ret = 1;
for (int i = 1; i < x+1; i++){
ret *= i;
}
return ret;
}
And along the same lines can you make a special function for inline situations also?
Since C++14, the loop form is a valid constexpr as per (http://en.cppreference.com/w/cpp/language/constexpr), so the second form with constexpr added is valid.
Unfortunately not all compilers support this (The latest version of Visual C++ doesn't, but the latest Clang and GCC ones apparently do (but I am unable to test this)).
In which case you can either:
Rely on the compilers optimizations, and use the first version (you might want to test this for your specific compiler)
Give the two forms different names (such as fact_const for the constexpr function, and make sure you only use the constexpr version when it's arguments are also constexpr (I don't know how to actually check whether this is the case)
Wait till your compiler releases an update that supports this.
I found a question somewhat interesting, and went on an attempt to answer it. The author wants to compile -one- source file (which relies on template libraries) with AVX optimizations, and the rest of the project without those.
So, to see what would happen, I've created a test project like this:
main.cpp
#include <iostream>
#include <string>
#include "fn_normal.h"
#include "fn_avx.h"
int main(int argc, char* argv[])
{
int number = 10; // this will come from input, but let's keep it simple for now
int result;
if (std::string(argv[argc - 1]) == "--noavx")
result = FnNormal(number);
else
{
std::cout << "AVX selected\n";
result = FnAVX(number);
}
std::cout << "Double of " << number << " is " << result << std::endl;
return 0;
}
Files fn_normal.h and fn_avx.h contains declarations for functions FnNormal() and FnAVX() respectively, which are defined as follows:
fn_normal.cpp
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble(num);
}
fn_avx.cpp
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
And here's the template function definition:
double.h
template<typename T>
int RtDouble(T number)
{
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
}
Ultimately, I set Enhanced Instruction Set to AVX for the file fn_avx.cpp under "Properties-> C/C++ -> Code Generation", leaving it to Not Set for the other sources, thus it should default to SSE2.
I thought that by doing so, the compiler would instantiate the template once for each source that includes it (and avoid violating the One-Definition Rule by mangling the template function name or some other way), and thus calling the program with the --noavx parameter would make it run fine in cpus without avx support.
But the resulting program will actualy have only one machine-code version of the function, with avx instructions, and will fail on older cpus.
Disabling all other optimizations doesn't solve this issue. Also tried No Enhanced Instructions - /arch:IA32 instead of Not Set as well.
As I'm just now beginning to understand templates and such, could someone point to me the exact details for this behavior and what I could actually do to achieve my goal?
My compiler is MSVC 2013.
Additional info: the .obj files for both fn_normal.cpp and fn_avx.cpp are almost the same size in bytes. I've looked into the generated assembly listings and they are almost the same, with the important difference that the avx-enabled source replaces default sse's movss/mulss with vmovss and vmulss, respectively. But stepping throught the code in Visual Studio's disassembly view (Ctrl+Alt+D), confirms that fnNormal() indeed makes use of the avx specialized instructions.
The compiler will generate two objects (fn_avx.obj and fn_normal.obj), which are compiled with different instruction sets. As you said, outputting the disassembly for both verifies that this is being done correctly:
objdump -d fn_normal.obj:
...
movss -0x1f5c(%ebp,%eax,4),%xmm0
mulss -0x1f5c(%ebp,%ecx,4),%xmm0
mov -0x1f68(%ebp),%edx
mulss -0x1f5c(%ebp,%edx,4),%xmm0
mov -0x1f68(%ebp),%eax
movss %xmm0,-0xfb4(%ebp,%eax,4)
...
objdump -d fn_avx.obj:
...
vmovss -0x1f5c(%ebp,%eax,4),%xmm0
vmulss -0x1f5c(%ebp,%ecx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%edx
vmulss -0x1f5c(%ebp,%edx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%eax
vmovss %xmm0,-0xfb4(%ebp,%eax,4)
...
The look strikingly similar, because by default MSVC 2013 will assume SSE2 availability. If you change the instruction set to IA32, you'll get something with non-vector instructions. So, this is not an issue with the compiler/compilation unit.
The issue here, is RtDouble is defined in a header file as a non-specialized template (perfectly legal). The compiler assumes its definition across multiple translation units will be the same, but, by compiling with different options, that assumption is being violated. It's essentially no different than introducing a divergence with the preprocessor:
double.h:
template<typename T>
int RtDouble(T number)
{
#ifdef SUPER_BAD
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
#else
return 0;
#endif
}
fn_avx.cpp:
#include "fn_avx.h"
#define SUPER_BAD
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
The FnNormal then will just return 0 (and you can verify this with the the disassembly of the new fn_normal.obj). The linker happily chooses one, and does not warn you about either situation. The question then comes down to: should it? That would be extremely helpful in situations like this. However, it would also slow down linking, as it would need to do a comparison of all of the functions that could exist in multiple compilation units (eg. inline functions as well).
When I have faced a similar issue in my code, I choose a different function naming scheme for the optimized version vs. the non-optimized version. Using a template parameter to distinguish them would also work just as well (as suggested in #celtschk's answer).
Basically the compiler needs to minimize the space not mentioning that having the same template instantiated 2x could cause problems if there would be static members. So from what I know the compiler is processing the template either for every source code and then chooses one of the implementations, or it postpones the actual code generation to the link time. Either way it is a problem for this AVX thingy. I ended up solving it the old fashioned way - with some global definitions not depending on any templates or anything. For too complex applications this could be a huge problem though. Intel Compiler has a recently added pragma (I don't recall the exact name), that makes the function implemented right after it use just AVX instructions, which would solve the problem. How reliable it is, that I don't know.
I've worked around this problem successfully by forcing any templated functions that will be used with different compiler options in different source files to be inline. Just using the inline keyword is usually not sufficient, since the compiler will sometimes ignore it for functions larger than some threshold, so you have to force the compiler to do it.
In MSVC++:
template<typename T>
__forceinline int RtDouble(T number) {...}
GCC:
template<typename T>
inline __attribute__((always_inline)) int RtDouble(T number) {...}
Keep in mind you may have to forceinline any other functions that RtDouble may call within the same module in order to keep the compiler flags consistent in those functions as well. Also keep in mind that MSVC++ simply ignores __forceinline when optimizations are disabled, such as in debug builds, and in those cases this trick won't work, so expect different behavior in non-optimized builds. It can make things problematic to debug in any case, but it does indeed work so long as the compiler allows inlining.
I think the simplest solution is to let the compiler know that those functions are indeed intended to be different, by using a template parameter that does nothing but distinguish them:
File double.h:
template<bool avx, typename T>
int RtDouble(T number)
{
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
}
File fn_normal.cpp:
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble<false>(num);
}
File fn_avx.cpp:
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble<true>(num);
}
I'm was messing around with tail-recursive functions in C++, and I've run into a bit of a snag with the g++ compiler.
The following code results in a stack overflow when numbers[] is over a couple hundred integers in size. Examining the assembly code generated by g++ for the following reveals that twoSum_Helper is executing a recursive call instruction to itself.
The question is which of the following is causing this?
A mistake in the following that I am overlooking which prevents tail-recursion.
A mistake with my usage of g++.
A flaw in the detection of tail-recursive functions within the g++ compiler.
I am compiling with g++ -O3 -Wall -fno-stack-protector test.c on Windows Vista x64 via MinGW with g++ 4.5.0.
struct result
{
int i;
int j;
bool found;
};
struct result gen_Result(int i, int j, bool found)
{
struct result r;
r.i = i;
r.j = j;
r.found = found;
return r;
}
// Return 2 indexes from numbers that sum up to target.
struct result twoSum_Helper(int numbers[], int size, int target, int i, int j)
{
if (numbers[i] + numbers[j] == target)
return gen_Result(i, j, true);
if (i >= (size - 1))
return gen_Result(i, j, false);
if (j >= size)
return twoSum_Helper(numbers, size, target, i + 1, i + 2);
else
return twoSum_Helper(numbers, size, target, i, j + 1);
}
Tail call optimization in C or C++ is extremely limited, and pretty much a lost cause. The reason is that there generally is no safe way to tail-call from a function that passes a pointer or reference to any local variable (as an argument to the call in question, or in fact any other call in the same function) -- which of course is happening all over the place in C/C++ land, and is almost impossible to live without.
The problem you are seeing is probably related: GCC likely compiles returning a struct by actually passing the address of a hidden variable allocated on the caller's stack into which the callee copies it -- which makes it fall into the above scenario.
Try compilling with -O2 instead of -O3.
How do I check if gcc is performing tail-recursion optimization?
well, it doesn't work with O2 anyway. The only thing that seems to work is returning the result object into a reference that is given as a parameter.
but really, it's much easier to just remove the Tail call and use a loop instead. TCO is here to optimize tail call that are found when inlining or when performing agressive unrolling, but you shouldn't attempt to use recursion when handling large values anyway.
I can't get g++ 4.4.0 (under mingw) to perform tail recursion, even on this simple function:
static void f (int x)
{
if (x == 0) return ;
printf ("%p\n", &x) ; // or cout in C++, if you prefer
f (x - 1) ;
}
I've tried -O3, -O2, -fno-stack-protector, C and C++ variants. No tail recursion.
I would look at 2 things.
The return call in the if statement is going to have a branch target for the else in the stack frame for the current run of the function that needs to be resolved post call (which would mean any TCO attempt would not be able overwrite the executing stack frame thus negating the TCO)
The numbers[] array argument is a variable length data structure which could also prevent TCO because in TCO the same stack frame is used in one way or another. If the call is self referencing (like yours) then it will overwrite the stack defined variables (or locally defined) with the values/references of the new call. If the tail call is to another function then it will overwrite the entire stack frame with the new function (in a case where TCO can be done in A => B => C, TCO could make this look like A => C in memory during execution). I would try a pointer.
It has been a couple months since I have built anything in C++ so I didn't run any tests, but I think one/both of those are preventing the optimization.
Try changing your code to:
// Return 2 indexes from numbers that sum up to target.
struct result twoSum_Helper(int numbers[], int size, int target, int i, int j)
{
if (numbers[i] + numbers[j] == target)
return gen_Result(i, j, true);
if (i >= (size - 1))
return gen_Result(i, j, false);
if(j >= size)
i++; //call by value, changing i here does not matter
return twoSum_Helper(numbers, size, target, i, i + 1);
}
edit: removed unnecessary parameter as per comment from asker
// Return 2 indexes from numbers that sum up to target.
struct result twoSum_Helper(int numbers[], int size, int target, int i)
{
if (numbers[i] + numbers[i+1] == target || i >= (size - 1))
return gen_Result(i, i+1, true);
if(i+1 >= size)
i++; //call by value, changing i here does not matter
return twoSum_Helper(numbers, size, target, i);
}
Support of Tail Call Optimization (TCO) is limited in C/C++.
So, if the code relies on TCO to avoid stack overflow it may be better to rewrite it with a loop. Otherwise some auto test is needed to be sure that the code is optimized.
Typically TCO may be suppressed by:
passing pointers to objects on stack of recursive function to external functions (in case of C++ also passing such object by reference);
local object with non-trivial destructor even if the tail recursion is valid (the destructor is called before the tail return statement), for example Why isn't g++ tail call optimizing while gcc is?
Here TCO is confused by returning structure by value.
It can be fixed if the result of all recursive calls will be written to the same memory address allocated in other function twoSum (similarly to the answer https://stackoverflow.com/a/30090390/4023446 to Tail-recursion not happening)
struct result
{
int i;
int j;
bool found;
};
struct result gen_Result(int i, int j, bool found)
{
struct result r;
r.i = i;
r.j = j;
r.found = found;
return r;
}
struct result* twoSum_Helper(int numbers[], int size, int target,
int i, int j, struct result* res_)
{
if (i >= (size - 1)) {
*res_ = gen_Result(i, j, false);
return res_;
}
if (numbers[i] + numbers[j] == target) {
*res_ = gen_Result(i, j, true);
return res_;
}
if (j >= size)
return twoSum_Helper(numbers, size, target, i + 1, i + 2, res_);
else
return twoSum_Helper(numbers, size, target, i, j + 1, res_);
}
// Return 2 indexes from numbers that sum up to target.
struct result twoSum(int numbers[], int size, int target)
{
struct result r;
return *twoSum_Helper(numbers, size, target, 0, 1, &r);
}
The value of res_ pointer is constant for all recursive calls of twoSum_Helper.
It can be seen in the assembly output (the -S flag) that the twoSum_Helper tail recursion is optimized as a loop even with two recursive exit points.
Compile options: g++ -O2 -S (g++ version 4.7.2).
I have heard others complain, that tail recursion is only optimized with gcc and not g++.
Could you try using gcc.
Since the code of twoSum_Helper is calling itself it shouldn't come as a surprise that the assembly shows exactly that happening. That's the whole point of a recursion :-) So this hasn't got anything to do with g++.
Every recursion creates a new stack frame, and stack space is limited by default. You can increase the stack size (don't know how to do that on Windows, on UNIX the ulimit command is used), but that only defers the crash.
The real solution is to get rid of the recursion. See for example this question and this question.
This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
Can a recursive function be inline?
What are the trade offs of making recursive functions inline.
Recursive functions that can be optimised by tail-end recursion can certainly be inlined. If the last thing a function does is call itself, then it can be converted into a plain loop.
Arbitrary recursive functions can't be inlined for the same reason a snake can't swallow its own tail.
[Edit: just noticed that although your title says "be inlined", your actual question says "making functions inline". The two effectively have nothing to do with one another, they just have confusingly similar names. In modern compilers, the primary effect of inline is the thing that originally in C99 was (I think) just a necessary detail to make inline work at all: to permit multiple definitions of a symbol with external linkage. That's because modern compilers don't pay a whole lot of attention to the programmer's opinion of whether a function should be inlined. They do pay some, though, so the confusion of concepts persists. I've answered the question in the title, which is the decision the compiler makes, not the question in the body, which is the decision the programmer makes.]
Inlining is not necessarily an all-or-nothing deal. One strategy which compilers use to decide whether to inline, is to keep inlining function calls until the resulting code is "too big". "Big" is defined by some hopefully sensible heuristic.
So consider the following recursive function (which deliberately is not simply tail-recursive):
int triangle(int n) {
if (n == 1) return 1;
return n + triangle(n-1);
}
If it's called like this:
int t100() {
return triangle(100);
}
Then there's no particular reason in principle that the usual rules that the compiler uses for inlining shouldn't result in this:
int t100() {
// inline call to triangle(100)
int result;
if (100 == 1) { result = 1; } else {
// inline call to triangle(99)
int t99;
if (100 - 1 == 1) { t99 = 1; } else {
// inline call to triangle(98)
int t98;
if (100 - 1 - 1 == 1) { t98 = 1; } else {
// oops, "too big", no more inlining
t98 = triangle(100 - 1 - 1 - 1) + 98;
}
t99 = t98 + 99;
}
result = t99 + 100;
}
return result;
}
Obviously the optimiser will have a field day with that, so it's much "smaller" than it looks:
int t100() {
return triangle(97) + 297;
}
The code in triangle itself could be "unrolled" a few steps by a few levels of inlining, in exactly the same way, except that it doesn't have the benefits of constants:
int triangle(int n) {
if (n == 1) return 1;
if (n == 2) return 3;
if (n == 3) return 6;
return triangle(n-3) + 3*n - 3;
}
I doubt whether compilers actually do this, though, I don't think I've ever noticed it [Edit: MSVC does if you tell it to, thanks peterchen].
There's an obvious potential benefit in saving call overhead, but as against that people don't really expect recursive functions to get inlined, and there's no particular guarantee that the usual inlining heuristics will perform well with recursive functions (where there are two different places, the call site and the recursive call, that might be inlined, with different benefits in each case). Furthermore, it's difficult at compile time to estimate how deep the recursion will go, and the inline heuristics might like to take account of the call depth to make decisions. So it may be that the compiler just doesn't bother.
Functional language compilers are typically a lot more aggressive dealing with recursion than C or C++ compilers. The relevant trade-off there is that so many functions written in functional languages are recursive, that performance might be hopeless if the compiler couldn't optimise tail-recursion. So Lisp programmers typically rely on good optimisation of recursive functions, whereas C and C++ programmers typically don't.
If your compiler does not support it, you can try manually inlining instead...
int factorial(int n) {
int result = 1;
if (n-- == 0) {
return result;
} else {
result *= 1;
if (n-- == 0) {
return result;
} else {
result *= 2;
if (n-- == 0) {
return result;
} else {
result *= 3;
if (n-- == 0) {
return result;
} else {
result *= 4;
if (n-- == 0) {
return result;
} else {
// ...
}
}
}
}
}
}
See the problem yet?
Tail recursion (a special case of recursion) it's possible to be inlined by smart compilers.
Now, hold on. A tail-recursive function could be unrolled and inlined pretty easily. Apparently there are compilers that do this, but I am not aware of specifics.
Of course. Any function can be inlined if it makes sense to do it:
int f(int i)
{
if (i <= 0) return 1;
else return i * f(i - 1);
}
int main()
{
return f(10);
}
pseudo assembly (f is inlined in main):
main:
mov r0, #10 ; Pass 10 to f
f:
cmp r0, #0 ; arg <= 0? ...
bge 1l
mov r0, #1 ; ... is so, return 1
ret
1:
mov r0, -(sp) ; if not, save arg.
dec r0 ; pass arg - 1 to f
call f ; just because it's inlined doesn't mean I can't call it.
mul r0, (sp)+ ; compute the result
ret ; done.
;-)
When you call an ordinary function when you change command sequential execution order and jump(call or jmp) into some address where the function resides. Inlining mean that you place in all occurences of this function the commands of this function, so you don't have a one place where you could jump, also other types of optimisations can be used, like elemination of pushing/popping function parameters.
When you know, that the recursive chain will in normal cases be not so long, you could do inlining upto a predefined level (I don't know, if any existing compiler is intelligent enough for this today).
Inlining a recursive function is much like unrolling a loop. You will end up with much duplicate code -- but in some cases it could be worthwhile:
The number of recursive calls (the length of the chain) is normally short (in cases it gets longer than predefined, just do normal recursion)
The overhead for the functions calls is relatively big compared to the logic -- so do some "unrolling" for example five instances and end up doing a recursive call again -- this would lead to saving 80% of the call overhead.
Off course the tail-recursive special-case -- but this was mentioned by others.
Of course can be declared inline. The inline keyword is just a hint to the compiler. In many case the compiler just ignore it and depending on the compiler this could be one of this situatios.
Some compilers cna turn tail recursion into plain loops, and thus inline them normally.
Non-tail recursion could be inlined up to a given depth, usually decided by the compiler.
I've never encountered a practical application for that, as the cost of call isn't high enough anymore to offset the increase in code size.
[edit] (to clarify that: even though I like to toy with these things, and often check what code my compiler generates for "funny stuff" just out of curiosity, I haven't encountered a use case where any such unrolling helped significantly. This doesn't mean they don't exist or couldn't be constructed.
The only place where it would help is precalculating low iterations during compile time. However, in my experience this immensely increases compile times for often negligible runtime performance benefits.
Note that Visual Studio 2008 (and earlier) gives you quite some control over this:
#pragma inline_recursion(on)
#pragma inline_depth(N)
__forceinline
Be careful with the latter, it can easily overload the compiler :)
Inline means that on each place a call to a function marked as inline gets done, the compiler places a copy of the said function code there. This avoids function calling mechanisms, and it's usual argument stack pushing-poping, saving time in gazillion-calls-per-second situations. You see the consequences to static variables and stuff like that? all gone...
So, if you had an inlined recursive call, either your compiler is super smart and figures whether the number of copies is deterministic, of it will say "Cannot make it inline", because it wouldn't know when to stop.