Related
Wondering if the following surprises anyone, as it did me? Alex Allain's article here on using constexpr shows the following factorial example:
constexpr factorial (int n)
{
return n > 0 ? n * factorial( n - 1 ) : 1;
}
And states:
Now you can use factorial(2) and when the compiler sees it, it can
optimize away the call and make the calculation entirely at compile
time.
I tried this in VS2015 in Release mode with full optimizations on (/Ox) and stepped through the code in the debugger viewing the assembly and saw that the factorial calculation was not done at compilation.
Using GCC v5.4.0 with --std=C++14, I must use /O2 or /O3 before the calculation is performed at compile time. I was surprised thought that using just /O the calculation did not occur at compilation time.
Main main question is: Why is VS2015 not performing this calculation at compilation time?
It depends on the context of the function call.
For example, the following obviously could never be calculated at compile time:
int x;
std::cin >> x;
std::cout << factorial(x);
On the other hand, this context would require the answer at compile time:
class Foo {
int x[factorial(4)];
};
constexpr functions are only guaranteed to be evaluated at compile time if they are called from a constexpr context; otherwise it is up to the compiler to choose whether or not to eval at compile time (assuming such an optimization is possible, again, depending on the context).
You have to use it in const expression, as:
constexpr auto res = factorial(2);
else computation can be done at runtime.
constexpr is neither necessary nor sufficient to compile time evaluation of a function.
It's not sufficient, even aside from the fact that the arguments obviously also have to be constant expressions. Even if that is true, a conforming compiler does not have to evaluate it at compile time. It only has to be evaluated at compile time if it is in a constexpr context. Such as, assigning the result of the computation to a constexpr variable, or using the value as an array size, or as a non-type template parameter.
The other point, is that the compiler is completely capable of evaluating things at compile time, even without constexpr. There is a lot of confusion about this, and it's not clear why. compile time evaluation of constexpr functions fundamentally just boils down to constant propagation, and compilers have been doing this optimization since forever: https://godbolt.org/g/Sy214U.
int factorial(int n) {
if (n <= 1) return 1;
return n * factorial(n-1);
}
int foo() { return factorial(5); }
On gcc 6.3 with O3 (and 14) yields:
foo():
mov eax, 120
ret
In essence, outside of the specific case where you absolutely force compile time evaluation by assigning a constexpr function to another constexpr variable, compile time evaluation has more to do with the quality of your optimizer than the standard.
This question already has answers here:
Is Loop Hoisting still a valid manual optimization for C code?
(8 answers)
Closed 7 years ago.
My question is specifically about the gcc compiler.
Often, inside a loop, I have to use a value returned by a function that is constant during the whole loop.
I would like to know if it's better to prior store this constant return value inside a variable (let's imagine a long loop), or if a compiler like gcc is able to perform some optimizations to cache the constant value, because it would recognize it as constant threw the loop.
For example when I loop over chars in a string, I often write something like that:
bool find_something(string s, char something)
{
size_t sz = s.size();
for (size_t i = 0; i != sz; i++)
if (s[i] == something) return true;
return false;
}
but with a clever compiler, I could use the following (which is shorter and more clear):
bool find_something(string s, char something)
{
for (size_t i = 0; i != s.size(); i++)
if (s[i] == something) return true;
return false;
}
then the compiler could detect that the code inside the loop doesn't perform any change on the string object, and then would build a code that would cache the value returned by s.size(), instead of making a (slower) function call for each iteration.
Is there such optimization with gcc?
Generally there's nothing in your example that makes it impossible for the compiler to move the .size() computation before the loop. And in fact GCC 5.2.0 will produce exactly the same code for both implementations that you showed.
I would however strongly suggest against relying on optimizations like that (in really performance critical code) because a small change somewhere (GCCs optimizer, the implementation details of std::string, ...) could break GCCs ability to do this optimization.
However I don't see a point in writing the more verbose version in the usual 90% of code that's not really performance-critical.
Given a current C++ compiler though I would go for the even more concise:
bool find_something(std::string s, char something)
{
for (ch : s)
if (ch == something) return true;
return false;
}
Which BTW also yields very similar machine code with GCC 5.2.0.
The compiler has to know the object is not modified in a different thread. It can tell that the function won't change if the object doesn't change, but it can't tell that the object won't change from some other stimulus.
The compiler will unwind the call to the member with size if you include some form of whole-program optimization
It depends on whether or not the compiler can identify the function call as constant. Consider the following function that may reside in an external library which cannot be analyzed by the compiler.
int odd_size(string s) {
static int a = 0;
return a++;
}
This function will return distinct values regardless of the input parameter. The compiler can therfore not assume a constant return value even if the passed string object remains constant. No optimization will be applied.
On the other hand, if the compiler detects a constant function call, which may be the case in your example, it probably moves the constant expression out of the loop.
Older versions of gcc had an explicit option -floop-optimize which was responsible for that task. From the gcc-3.4.5 documentation:
-floop-optimize
Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well.
Enabled at levels -O, -O2, -O3, -Os.
I cannot find this option in current versions of gcc, but I'm quite sure that they include this type of optimization as well.
I think I have a serious compiler mistrust. Do branches inside inline functions get optimized out if they have constant results?
For the example function:
#define MODE_FROM_X_TO_Y 0
#define MODE_FROM_Y_TO_X 1
inline void swapValues(int & x, int & y, int mode) {
switch(mode) {
case MODE_FROM_X_TO_Y:
y = x;
break;
case MODE_FROM_Y_TO_X:
x = y;
break;
}
}
Would:
swapValues(n,m,MODE_FROM_X_TO_Y);
be optimized as:
n = m;
First off, it won't even compile (until you add a return type).
Secondly, swap is an awesomely badly chosen name (since it doesn't do a swap, and conflicts with the std::swap name).
Thirdly, head over to http://gcc.godbolt.org/:
Live On Godbolt
Generally speaking, the answers to these questions are compiler dependent.
To get an answer to your question with your code, with your compiler (and compiler version), and compiler settings (e.g. optimisation flags) you will need to examine the code output by the compiler.
Branches in any code - not just within inlined functions - can potentially be "optimized out" if the compiler can detect that the same branch is always followed.
Some modern compilers are also smart enough to not inline a function declared inline if it evaluates another function as a better candidate for inlining. A number of modern compilers can do a better job making such decisions than a typical C/C++ programmer.
I found a question somewhat interesting, and went on an attempt to answer it. The author wants to compile -one- source file (which relies on template libraries) with AVX optimizations, and the rest of the project without those.
So, to see what would happen, I've created a test project like this:
main.cpp
#include <iostream>
#include <string>
#include "fn_normal.h"
#include "fn_avx.h"
int main(int argc, char* argv[])
{
int number = 10; // this will come from input, but let's keep it simple for now
int result;
if (std::string(argv[argc - 1]) == "--noavx")
result = FnNormal(number);
else
{
std::cout << "AVX selected\n";
result = FnAVX(number);
}
std::cout << "Double of " << number << " is " << result << std::endl;
return 0;
}
Files fn_normal.h and fn_avx.h contains declarations for functions FnNormal() and FnAVX() respectively, which are defined as follows:
fn_normal.cpp
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble(num);
}
fn_avx.cpp
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
And here's the template function definition:
double.h
template<typename T>
int RtDouble(T number)
{
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
}
Ultimately, I set Enhanced Instruction Set to AVX for the file fn_avx.cpp under "Properties-> C/C++ -> Code Generation", leaving it to Not Set for the other sources, thus it should default to SSE2.
I thought that by doing so, the compiler would instantiate the template once for each source that includes it (and avoid violating the One-Definition Rule by mangling the template function name or some other way), and thus calling the program with the --noavx parameter would make it run fine in cpus without avx support.
But the resulting program will actualy have only one machine-code version of the function, with avx instructions, and will fail on older cpus.
Disabling all other optimizations doesn't solve this issue. Also tried No Enhanced Instructions - /arch:IA32 instead of Not Set as well.
As I'm just now beginning to understand templates and such, could someone point to me the exact details for this behavior and what I could actually do to achieve my goal?
My compiler is MSVC 2013.
Additional info: the .obj files for both fn_normal.cpp and fn_avx.cpp are almost the same size in bytes. I've looked into the generated assembly listings and they are almost the same, with the important difference that the avx-enabled source replaces default sse's movss/mulss with vmovss and vmulss, respectively. But stepping throught the code in Visual Studio's disassembly view (Ctrl+Alt+D), confirms that fnNormal() indeed makes use of the avx specialized instructions.
The compiler will generate two objects (fn_avx.obj and fn_normal.obj), which are compiled with different instruction sets. As you said, outputting the disassembly for both verifies that this is being done correctly:
objdump -d fn_normal.obj:
...
movss -0x1f5c(%ebp,%eax,4),%xmm0
mulss -0x1f5c(%ebp,%ecx,4),%xmm0
mov -0x1f68(%ebp),%edx
mulss -0x1f5c(%ebp,%edx,4),%xmm0
mov -0x1f68(%ebp),%eax
movss %xmm0,-0xfb4(%ebp,%eax,4)
...
objdump -d fn_avx.obj:
...
vmovss -0x1f5c(%ebp,%eax,4),%xmm0
vmulss -0x1f5c(%ebp,%ecx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%edx
vmulss -0x1f5c(%ebp,%edx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%eax
vmovss %xmm0,-0xfb4(%ebp,%eax,4)
...
The look strikingly similar, because by default MSVC 2013 will assume SSE2 availability. If you change the instruction set to IA32, you'll get something with non-vector instructions. So, this is not an issue with the compiler/compilation unit.
The issue here, is RtDouble is defined in a header file as a non-specialized template (perfectly legal). The compiler assumes its definition across multiple translation units will be the same, but, by compiling with different options, that assumption is being violated. It's essentially no different than introducing a divergence with the preprocessor:
double.h:
template<typename T>
int RtDouble(T number)
{
#ifdef SUPER_BAD
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
#else
return 0;
#endif
}
fn_avx.cpp:
#include "fn_avx.h"
#define SUPER_BAD
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
The FnNormal then will just return 0 (and you can verify this with the the disassembly of the new fn_normal.obj). The linker happily chooses one, and does not warn you about either situation. The question then comes down to: should it? That would be extremely helpful in situations like this. However, it would also slow down linking, as it would need to do a comparison of all of the functions that could exist in multiple compilation units (eg. inline functions as well).
When I have faced a similar issue in my code, I choose a different function naming scheme for the optimized version vs. the non-optimized version. Using a template parameter to distinguish them would also work just as well (as suggested in #celtschk's answer).
Basically the compiler needs to minimize the space not mentioning that having the same template instantiated 2x could cause problems if there would be static members. So from what I know the compiler is processing the template either for every source code and then chooses one of the implementations, or it postpones the actual code generation to the link time. Either way it is a problem for this AVX thingy. I ended up solving it the old fashioned way - with some global definitions not depending on any templates or anything. For too complex applications this could be a huge problem though. Intel Compiler has a recently added pragma (I don't recall the exact name), that makes the function implemented right after it use just AVX instructions, which would solve the problem. How reliable it is, that I don't know.
I've worked around this problem successfully by forcing any templated functions that will be used with different compiler options in different source files to be inline. Just using the inline keyword is usually not sufficient, since the compiler will sometimes ignore it for functions larger than some threshold, so you have to force the compiler to do it.
In MSVC++:
template<typename T>
__forceinline int RtDouble(T number) {...}
GCC:
template<typename T>
inline __attribute__((always_inline)) int RtDouble(T number) {...}
Keep in mind you may have to forceinline any other functions that RtDouble may call within the same module in order to keep the compiler flags consistent in those functions as well. Also keep in mind that MSVC++ simply ignores __forceinline when optimizations are disabled, such as in debug builds, and in those cases this trick won't work, so expect different behavior in non-optimized builds. It can make things problematic to debug in any case, but it does indeed work so long as the compiler allows inlining.
I think the simplest solution is to let the compiler know that those functions are indeed intended to be different, by using a template parameter that does nothing but distinguish them:
File double.h:
template<bool avx, typename T>
int RtDouble(T number)
{
// Side effect: generates avx instructions
const int N = 1000;
float a[N], b[N];
for (int n = 0; n < N; ++n)
{
a[n] = b[n] * b[n] * b[n];
}
return number * 2;
}
File fn_normal.cpp:
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble<false>(num);
}
File fn_avx.cpp:
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble<true>(num);
}
This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
Can a recursive function be inline?
What are the trade offs of making recursive functions inline.
Recursive functions that can be optimised by tail-end recursion can certainly be inlined. If the last thing a function does is call itself, then it can be converted into a plain loop.
Arbitrary recursive functions can't be inlined for the same reason a snake can't swallow its own tail.
[Edit: just noticed that although your title says "be inlined", your actual question says "making functions inline". The two effectively have nothing to do with one another, they just have confusingly similar names. In modern compilers, the primary effect of inline is the thing that originally in C99 was (I think) just a necessary detail to make inline work at all: to permit multiple definitions of a symbol with external linkage. That's because modern compilers don't pay a whole lot of attention to the programmer's opinion of whether a function should be inlined. They do pay some, though, so the confusion of concepts persists. I've answered the question in the title, which is the decision the compiler makes, not the question in the body, which is the decision the programmer makes.]
Inlining is not necessarily an all-or-nothing deal. One strategy which compilers use to decide whether to inline, is to keep inlining function calls until the resulting code is "too big". "Big" is defined by some hopefully sensible heuristic.
So consider the following recursive function (which deliberately is not simply tail-recursive):
int triangle(int n) {
if (n == 1) return 1;
return n + triangle(n-1);
}
If it's called like this:
int t100() {
return triangle(100);
}
Then there's no particular reason in principle that the usual rules that the compiler uses for inlining shouldn't result in this:
int t100() {
// inline call to triangle(100)
int result;
if (100 == 1) { result = 1; } else {
// inline call to triangle(99)
int t99;
if (100 - 1 == 1) { t99 = 1; } else {
// inline call to triangle(98)
int t98;
if (100 - 1 - 1 == 1) { t98 = 1; } else {
// oops, "too big", no more inlining
t98 = triangle(100 - 1 - 1 - 1) + 98;
}
t99 = t98 + 99;
}
result = t99 + 100;
}
return result;
}
Obviously the optimiser will have a field day with that, so it's much "smaller" than it looks:
int t100() {
return triangle(97) + 297;
}
The code in triangle itself could be "unrolled" a few steps by a few levels of inlining, in exactly the same way, except that it doesn't have the benefits of constants:
int triangle(int n) {
if (n == 1) return 1;
if (n == 2) return 3;
if (n == 3) return 6;
return triangle(n-3) + 3*n - 3;
}
I doubt whether compilers actually do this, though, I don't think I've ever noticed it [Edit: MSVC does if you tell it to, thanks peterchen].
There's an obvious potential benefit in saving call overhead, but as against that people don't really expect recursive functions to get inlined, and there's no particular guarantee that the usual inlining heuristics will perform well with recursive functions (where there are two different places, the call site and the recursive call, that might be inlined, with different benefits in each case). Furthermore, it's difficult at compile time to estimate how deep the recursion will go, and the inline heuristics might like to take account of the call depth to make decisions. So it may be that the compiler just doesn't bother.
Functional language compilers are typically a lot more aggressive dealing with recursion than C or C++ compilers. The relevant trade-off there is that so many functions written in functional languages are recursive, that performance might be hopeless if the compiler couldn't optimise tail-recursion. So Lisp programmers typically rely on good optimisation of recursive functions, whereas C and C++ programmers typically don't.
If your compiler does not support it, you can try manually inlining instead...
int factorial(int n) {
int result = 1;
if (n-- == 0) {
return result;
} else {
result *= 1;
if (n-- == 0) {
return result;
} else {
result *= 2;
if (n-- == 0) {
return result;
} else {
result *= 3;
if (n-- == 0) {
return result;
} else {
result *= 4;
if (n-- == 0) {
return result;
} else {
// ...
}
}
}
}
}
}
See the problem yet?
Tail recursion (a special case of recursion) it's possible to be inlined by smart compilers.
Now, hold on. A tail-recursive function could be unrolled and inlined pretty easily. Apparently there are compilers that do this, but I am not aware of specifics.
Of course. Any function can be inlined if it makes sense to do it:
int f(int i)
{
if (i <= 0) return 1;
else return i * f(i - 1);
}
int main()
{
return f(10);
}
pseudo assembly (f is inlined in main):
main:
mov r0, #10 ; Pass 10 to f
f:
cmp r0, #0 ; arg <= 0? ...
bge 1l
mov r0, #1 ; ... is so, return 1
ret
1:
mov r0, -(sp) ; if not, save arg.
dec r0 ; pass arg - 1 to f
call f ; just because it's inlined doesn't mean I can't call it.
mul r0, (sp)+ ; compute the result
ret ; done.
;-)
When you call an ordinary function when you change command sequential execution order and jump(call or jmp) into some address where the function resides. Inlining mean that you place in all occurences of this function the commands of this function, so you don't have a one place where you could jump, also other types of optimisations can be used, like elemination of pushing/popping function parameters.
When you know, that the recursive chain will in normal cases be not so long, you could do inlining upto a predefined level (I don't know, if any existing compiler is intelligent enough for this today).
Inlining a recursive function is much like unrolling a loop. You will end up with much duplicate code -- but in some cases it could be worthwhile:
The number of recursive calls (the length of the chain) is normally short (in cases it gets longer than predefined, just do normal recursion)
The overhead for the functions calls is relatively big compared to the logic -- so do some "unrolling" for example five instances and end up doing a recursive call again -- this would lead to saving 80% of the call overhead.
Off course the tail-recursive special-case -- but this was mentioned by others.
Of course can be declared inline. The inline keyword is just a hint to the compiler. In many case the compiler just ignore it and depending on the compiler this could be one of this situatios.
Some compilers cna turn tail recursion into plain loops, and thus inline them normally.
Non-tail recursion could be inlined up to a given depth, usually decided by the compiler.
I've never encountered a practical application for that, as the cost of call isn't high enough anymore to offset the increase in code size.
[edit] (to clarify that: even though I like to toy with these things, and often check what code my compiler generates for "funny stuff" just out of curiosity, I haven't encountered a use case where any such unrolling helped significantly. This doesn't mean they don't exist or couldn't be constructed.
The only place where it would help is precalculating low iterations during compile time. However, in my experience this immensely increases compile times for often negligible runtime performance benefits.
Note that Visual Studio 2008 (and earlier) gives you quite some control over this:
#pragma inline_recursion(on)
#pragma inline_depth(N)
__forceinline
Be careful with the latter, it can easily overload the compiler :)
Inline means that on each place a call to a function marked as inline gets done, the compiler places a copy of the said function code there. This avoids function calling mechanisms, and it's usual argument stack pushing-poping, saving time in gazillion-calls-per-second situations. You see the consequences to static variables and stuff like that? all gone...
So, if you had an inlined recursive call, either your compiler is super smart and figures whether the number of copies is deterministic, of it will say "Cannot make it inline", because it wouldn't know when to stop.