I made this small program to test whether gfortran does tail call elimination:
program tailrec
  implicit none
  print *, tailrecsum(5, 0)
contains
  recursive function tailrecsum (x, running_total) result (ret_val)
    integer, intent(in) :: x
    integer, intent(in) :: running_total
    integer :: ret_val
    if (x == 0) then
      ret_val = running_total
      return
    end if
    ret_val = tailrecsum (x-1, running_total + x)
  end function tailrecsum
end program
To check, I compiled it with the -S option to take a look at the instructions. Here are the lines for the tailrecsum function:
tailrecsum.3429:
.LFB1:
.cfi_startproc
movl (%rdi), %eax
testl %eax, %eax
jne .L2
movl (%rsi), %eax
ret
.p2align 4,,10
.p2align 3
.L2:
subq $24, %rsp
.cfi_def_cfa_offset 32
leal -1(%rax), %edx
addl (%rsi), %eax
leaq 8(%rsp), %rdi
leaq 12(%rsp), %rsi
movl %edx, 8(%rsp)
movl %eax, 12(%rsp)
call tailrecsum.3429
addq $24, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
At the end I see call tailrecsum.3429, and therefore I think there is no tail call elimination. The result is the same when I use -O2 or -O3 together with -foptimize-sibling-calls.
So, does gfortran not support this, or is it a problem with my code?
It does support it, but it is quite tricky to avoid the many subtle traps that defeat the tail call optimization.
It becomes simpler for the compiler to optimize tail calls if you pass the arguments by value. In that case there is no temporary to which the receiving procedure needs to have a pointer (address).
In fact, this modification is enough to get the tail call elimination and enable unlimited recursion:
recursive function tailrecsum (x, running_total) result (ret_val) bind(C)
  integer, value :: x
  integer, value :: running_total
  integer :: ret_val
  if (x == 0) then
    ret_val = running_total
    return
  end if
  ret_val = tailrecsum (x-1, running_total + x)
end function tailrecsum
Gfortran does not actually require the bind(C), because it implements value arguments as C-like pass by value. Intel Fortran does require it, because without bind(C) it creates a temporary and passes its address.
The details may differ on different architectures, depending on who is responsible for the cleanup of what.
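For comparison, here is a hedged sketch of what the value/bind(C) interface corresponds to on the C++ side (my own illustration, not generated from the Fortran code); most optimizing compilers turn this accumulator pattern into a plain loop.

// Rough C++ analogue of the value/bind(C) version above (my own sketch).
// Because x and running_total are passed by value, nothing in the caller's
// frame has to stay alive across the recursive call, so the compiler is
// free to replace the call with a jump.
#include <cstdio>

int tailrecsum(int x, int running_total)
{
    if (x == 0)
        return running_total;
    return tailrecsum(x - 1, running_total + x);  // call in tail position
}

int main()
{
    std::printf("%d\n", tailrecsum(5, 0));  // prints 15
    return 0;
}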
Consider this version:
program tailrec
  use iso_fortran_env
  implicit none
  integer(int64) :: acc, x
  acc = 0
  x = 500000000
  call tailrecsum(x, acc)
  print *, acc
contains
  recursive subroutine tailrecsum (x, running_total)
    integer(int64), intent(inout) :: x
    integer(int64), intent(inout) :: running_total
    integer(int64) :: ret_val
    if (x == 0) return
    running_total = running_total + x
    x = x - 1
    call tailrecsum (x, running_total)
  end subroutine tailrecsum
end program
With 500000000 iterations it would clearly blow the stack without TCO, but it does not:
> gfortran -O2 -frecursive tailrec.f90
> ./a.out
125000000250000000
You can examine what the compiler does more easily using -fdump-tree-optimized. Honestly, I didn't even bother trying to understand your assembly output; x86 assembly is simply too esoteric for me, and my simple brain can handle only certain RISCs.
You can see that there is still a lot going on after the call to the next iteration in your original version:
<bb 6>:
_25 = _5 + -3;
D.1931 = _25;
_27 = _18 + _20;
D.1930 = _27;
ret_val_28 = tailrecsum (&D.1931, &D.1930);
D.1930 ={v} {CLOBBER};
D.1931 ={v} {CLOBBER};
<bb 7>:
# _29 = PHI <_20(5), ret_val_28(6)>
<bb 8>:
# _22 = PHI <_11(4), _29(7)>
<bb 9>:
# _1 = PHI <ret_val_7(3), _22(8)>
return _1;
}
I am not an expert in GIMPLE, but the D.193x operations are definitely linked to the temporary expressions that are put on the stack for the call.
The PHI operations then find which version of the return value will be actually returned based on which branch was actually taken in the if statement (https://gcc.gnu.org/onlinedocs/gccint/SSA.html).
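If you have never looked at SSA form before, here is a tiny self-contained illustration of what a PHI node expresses (my own example, not the OP's code):

// Minimal illustration of a PHI node (invented names, not the OP's code).
// In SSA form every assignment gets a fresh name, so where two branches
// merge the compiler inserts a PHI that selects the value coming from
// whichever predecessor block was actually executed.
int pick(int x, int a, int b)
{
    int result;
    if (x == 0)
        result = a;      // value reaching the merge from the "then" block
    else
        result = b + 1;  // value reaching the merge from the "else" block
    // In GIMPLE this merge point becomes roughly: result = PHI <a(then), tmp(else)>
    return result;
}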
As I said, it is sometimes tricky to simplify your code into the right form that gfortran accepts for the tail call optimization.
Related
I have a function GetObj(int) that takes a parameter which is a vector index. Here is my C++ code:
std::shared_ptr<Obj> GetObj(int obj_id) {
    // obj_buffer_ is a double buffer, declared as
    // std::vector<std::shared_ptr<Obj>> obj_buffer_[2];
    // cur_buffer_ is an atomic<size_t>; its value is 0 or 1
    auto& vec_obj = obj_buffer_[cur_buffer_.load(std::memory_order_acquire)];
    // let's assume obj_id is never out of range at runtime,
    // so the program never actually returns nullptr
    if (obj_id < 0 || obj_id >= vec_obj.size()) {
        return nullptr;
    }
    // vec_obj[obj_id] is a shared_ptr<Obj>
    return vec_obj[obj_id];
}
I call this function 5 million times per second, and I used "perf top" to see which instruction costs the most. Here is the result.
Please ignore the function and variable names (GetShard is GetObj); I changed them to make the code easier to understand.
Here is my question; let's pay attention to the last instructions:
│ ↓ je 6f
0.12 │ lock addl $0x1,0x8(%rdx) ............... (1)
91.85 │ ┌──jmp 73 ............... (2)
│6f:│ addl $0x1,0x8(%rdx) ............... (3)
0.12 │73:└─→pop %rbp ............... (4)
│ ← retq ............... (5)
I found that instruction (1) increments the shared_ptr's reference counter by 1; the atomic increment locks the CPU bus. It then jumps to (4), pop %rbp, which pops the stack frame. These instructions are doing the "return" job.
But why does the jmp instruction take 90% of the CPU cycles? The function is called 5 million times per second. I could understand it if lock addl were slow, but why does the jmp take so long?
I have the following while-loop
uint32_t x = 0;
while(x*x < STOP_CONDITION) {
    if(CHECK_CONDITION) x++;
    // Do other stuff that modifies CHECK_CONDITION
}
The STOP_CONDITION is constant at run time, but not at compile time. Is there a more efficient way to maintain x*x, or do I really need to recompute it every time?
Note: According to the benchmark below, this code runs about 1-2% slower than this option. Please read the disclaimer included at the bottom!
In addition to Tamas Ionut's answer, if you want to maintain STOP_CONDITION as the actual stop condition and avoid the square root calculation, you could update the square using the mathematical identity
(x + 1)² = x² + 2x + 1
whenever you change x:
uint32_t x = 0;
uint32_t xSquare = 0;
while(xSquare < STOP_CONDITION) {
    if(CHECK_CONDITION) {
        xSquare += 2 * x + 1;
        x++;
    }
    // Do other stuff that modifies CHECK_CONDITION
}
Since the 2*x + 1 is just a bit shift and an increment, the compiler should be able to optimize this fairly well.
Disclaimer: Since you asked "how can I optimize this code" I answered with one particular way to possibly make it faster. Whether the double + increment is actually faster than a single integer multiplication should be tested in practice. Whether you should optimize the code is a different question. I assume you have already benchmarked the loop and found it to be a bottleneck, or that you have a theoretical interest in the question. If you are writing production code that you wish to optimize, first measure the performance and then optimize where needed (which is probably not the x*x in this loop).
What about:
uint32_t x = 0;
double bound = sqrt(STOP_CONDITION);
while(x < bound) {
    if(CHECK_CONDITION) x++;
    // Do other stuff that modifies CHECK_CONDITION
}
This way, you're getting rid of that extra computation.
I made a small benchmark of Tamas Ionut's and CompuChip's answers, and here are the results:
Tamas Ionut: 19.7068
The code of this method:
uint32_t x = 0;
double bound = sqrt(STOP_CONDITION);
while(x < bound) {
    if(CHECK_CONDITION) x++;
    // Do other stuff that modifies CHECK_CONDITION
}
CompuChip: 20.2056
The code of this method:
uint32_t x = 0;
uint32_t xSquare = 0;
while(xSquare < STOP_CONDITION) {
    if(CHECK_CONDITION) {
        xSquare += 2 * x + 1;
        x++;
    }
    // Do other stuff that modifies CHECK_CONDITION
}
with STOP_CONDITION = 1000000 and repeating the process 1000000 times
Environment:
Compiler : MSVC 2013
OS : Windows 8.1 - x64
Processor: Core i7-4510U @ 2.00 GHz
Release Mode - Maximize Speed (/O2)
I would say that in your case optimizing for readability is better than optimizing for performance, since we are talking about a very small performance gain.
The compiler can do a lot for you regarding performance, but readability lies in the responsibility of the programmer.
I believe Tamas Ionut's solution is better than CompuChip's because we only have x++ inside the loop. However, the comparison between a uint32_t and a double will kill the deal; it would be more efficient to use uint32_t for bound instead of double. That also sidesteps the overflow concern of the squaring approach, where x cannot exceed 65535 if x^2 is to fit in a uint32_t.
If we also do heavy work in the loop, the results obtained from both approaches should be very similar; however, Tamas Ionut's approach is simpler and easier to read.
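A hedged sketch of that uint32_t-bound variant (my own adaptation, not code from either answer; the adjustment loops only guard against std::sqrt landing just above or below the exact root):

// Compute the smallest bound with bound*bound >= stop_condition, so that
// "x*x < stop_condition" is exactly equivalent to "x < bound".
#include <cmath>
#include <cstdint>

uint32_t integer_bound(uint32_t stop_condition)
{
    uint32_t bound = static_cast<uint32_t>(std::sqrt(static_cast<double>(stop_condition)));
    while (static_cast<uint64_t>(bound) * bound < stop_condition)
        ++bound;                                        // sqrt rounded down too far
    while (bound > 0 && static_cast<uint64_t>(bound - 1) * (bound - 1) >= stop_condition)
        --bound;                                        // sqrt rounded up too far
    return bound;  // the loop then becomes: while (x < bound) { ... }
}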
Below is my code and the corresponding assembly code, obtained using clang 3.8.0 with the -O3 flag. It is very clear from the assembly that the first approach is more efficient.
using T = size_t;

void test1(const T stopCondition, bool checkCondition) {
    T x = 0;
    while (x < stopCondition) {
        if (checkCondition) {
            x++;
        }
        // Do something heavy here
    }
}

void test2(const T stopCondition, bool checkCondition) {
    T x = 0;
    T xSquare = 0;
    const T threshold = stopCondition * stopCondition;
    while (xSquare < threshold) {
        if (checkCondition) {
            xSquare += 2 * x + 1;
            x++;
        }
        // Do something heavy here
    }
}
(gdb) disassemble test1
Dump of assembler code for function _Z5test1mb:
0x0000000000400be0 <+0>: movzbl %sil,%eax
0x0000000000400be4 <+4>: mov %rax,%rcx
0x0000000000400be7 <+7>: neg %rcx
0x0000000000400bea <+10>: nopw 0x0(%rax,%rax,1)
0x0000000000400bf0 <+16>: add %rax,%rcx
0x0000000000400bf3 <+19>: cmp %rdi,%rcx
0x0000000000400bf6 <+22>: jb 0x400bf0 <_Z5test1mb+16>
0x0000000000400bf8 <+24>: retq
End of assembler dump.
(gdb) disassemble test2
Dump of assembler code for function _Z5test2mb:
0x0000000000400c00 <+0>: imul %rdi,%rdi
0x0000000000400c04 <+4>: test %sil,%sil
0x0000000000400c07 <+7>: je 0x400c2e <_Z5test2mb+46>
0x0000000000400c09 <+9>: xor %eax,%eax
0x0000000000400c0b <+11>: mov $0x1,%ecx
0x0000000000400c10 <+16>: test %rdi,%rdi
0x0000000000400c13 <+19>: je 0x400c42 <_Z5test2mb+66>
0x0000000000400c15 <+21>: data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c20 <+32>: add %rcx,%rax
0x0000000000400c23 <+35>: add $0x2,%rcx
0x0000000000400c27 <+39>: cmp %rdi,%rax
0x0000000000400c2a <+42>: jb 0x400c20 <_Z5test2mb+32>
0x0000000000400c2c <+44>: jmp 0x400c42 <_Z5test2mb+66>
0x0000000000400c2e <+46>: test %rdi,%rdi
0x0000000000400c31 <+49>: je 0x400c42 <_Z5test2mb+66>
0x0000000000400c33 <+51>: data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c40 <+64>: jmp 0x400c40 <_Z5test2mb+64>
0x0000000000400c42 <+66>: retq
End of assembler dump.
Suppose we have the following (nonsensical) code:
const int a = 0;
int c = 0;
for(int b = 0; b < 10000000; b++)
{
    if(a) c++;
    c += 7;
}
Variable 'a' equals zero, so the compiler can deduce at compile time that the statement 'if(a) c++;' will never be executed and will optimize it away.
My question: Does the same happen with lambda closures?
Check out another piece of code:
const int a = 0;
function<int()> lambda = [a]()
{
    int c = 0;
    for(int b = 0; b < 10000000; b++)
    {
        if(a) c++;
        c += 7;
    }
    return c;
};
Will the compiler know that 'a' is 0 and will it optimize the lambda?
Even more sophisticated example:
function<int()> generate_lambda(const int a)
{
    return [a]()
    {
        int c = 0;
        for(int b = 0; b < 10000000; b++)
        {
            if(a) c++;
            c += 7;
        }
        return c;
    };
}
function<int()> a_is_zero = generate_lambda(0);
function<int()> a_is_one = generate_lambda(1);
Will the compiler be smart enough to optimize the first lambda when it knows that 'a' is 0 at generation time?
Do gcc or llvm have this kind of optimization?
I'm asking because I wonder whether I should make such optimizations manually when I know that certain assumptions are satisfied at lambda generation time, or whether the compiler will do that for me.
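One way to do it manually (my own sketch, not from any of the answers; names are invented) is to make a a template parameter rather than a captured value, so the dead branch is removed by construction instead of relying on the optimizer seeing through the capture:

// With a as a template parameter, "if (a)" is a compile-time constant inside
// each instantiation, so the unreachable branch simply does not exist in the
// generated code.
template <int a>
int counted()
{
    int c = 0;
    for (int b = 0; b < 10000000; b++)
    {
        if (a) c++;   // folded away when a == 0
        c += 7;
    }
    return c;
}

// counted<0>() and counted<1>() are two distinct functions, each compiled
// with its own known value of a.
int a_is_zero() { return counted<0>(); }
int a_is_one()  { return counted<1>(); }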
Looking at the assembly generated by gcc5.2 -O2 shows that the optimization does not happen when using std::function:
#include <functional>

int main()
{
    const int a = 0;
    std::function<int()> lambda = [a]()
    {
        int c = 0;
        for(int b = 0; b < 10000000; b++)
        {
            if(a) c++;
            c += 7;
        }
        return c;
    };
    return lambda();
}
compiles to some boilerplate and
movl (%rdi), %ecx
movl $10000000, %edx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
cmpl $1, %ecx
sbbl $-1, %eax
addl $7, %eax
subl $1, %edx
jne .L3
rep; ret
which is the loop you wanted to see optimized away. (Live) But if you actually use a lambda (and not an std::function), the optimization does happen:
int main()
{
    const int a = 0;
    auto lambda = [a]()
    {
        int c = 0;
        for(int b = 0; b < 10000000; b++)
        {
            if(a) c++;
            c += 7;
        }
        return c;
    };
    return lambda();
}
compiles to
movl $70000000, %eax
ret
i.e. the loop was removed completely. (Live)
Afaik, you can expect a lambda to have zero overhead, but std::function is different and comes with a cost (at least in the current state of the optimizers, although people are apparently working on this), even if the code "inside the std::function" would have been optimized. (Take that with a grain of salt and try it if in doubt, since this will probably vary between compilers and versions; std::function's overhead can certainly be optimized away in some cases.)
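If the type erasure is the problem, one common workaround (a sketch of mine, assuming you control the call site; run_it is an invented name) is to keep the callable's concrete type, either with auto or by accepting it as a template parameter:

// Keeping the lambda's concrete type instead of erasing it behind
// std::function. Because the callable's type is visible, the call can be
// inlined and the constant folding from the plain-lambda example above
// remains possible.
template <typename F>
int run_it(F f)
{
    return f();
}

int use_it()
{
    const int a = 0;
    auto lambda = [a]() {
        int c = 0;
        for (int b = 0; b < 10000000; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
    return run_it(lambda);  // same chance of folding to 70000000 as a direct call
}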
As #MarcGlisse correctly pointed out, clang3.6 performs the desired optimization (equivalent to the second case above) even with std::function. (Live)
Bonus edit, thanks to @MarcGlisse again: If the function that contains the std::function is not called main, the optimization happening with gcc5.2 is somewhere between the gcc+main and clang cases, i.e. the function gets reduced to return 70000000; plus some extra code. (Live)
Bonus edit 2, this time mine: If you use -O3, gcc will, for some reason explained in Marco's answer, optimize the std::function to
cmpl $1, (%rdi)
sbbl %eax, %eax
andl $-10000000, %eax
addl $80000000, %eax
ret
and keep the rest as in the not_main case. So I guess the bottom line is that one will just have to measure when using std::function.
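A minimal measurement sketch along those lines (my own, with invented names; real benchmarking needs warm-up, repetitions, and care that the work is not optimized out):

#include <chrono>
#include <cstdio>
#include <functional>

int main()
{
    const int a = 0;
    auto lambda = [a]() {
        int c = 0;
        for (int b = 0; b < 10000000; b++) { if (a) c++; c += 7; }
        return c;
    };
    std::function<int()> erased = lambda;   // same code, behind type erasure

    volatile int sink = 0;  // discourages removing the calls entirely
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i) sink = sink + lambda();
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i) sink = sink + erased();
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("lambda:        %lld us\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    std::printf("std::function: %lld us\n",
                (long long)std::chrono::duration_cast<us>(t2 - t1).count());
    return 0;
}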
Neither gcc at -O3 nor MSVC 2015 in Release mode optimizes it away with this simple code, and the lambda actually gets called:
#include <functional>
#include <iostream>

int main()
{
    int a = 0;
    std::function<int()> lambda = [a]()
    {
        int c = 0;
        for(int b = 0; b < 10; b++)
        {
            if(a) c++;
            c += 7;
        }
        return c;
    };
    std::cout << lambda();
    return 0;
}
At -O3 this is what gcc generates for the lambda (code from godbolt)
lambda:
cmp DWORD PTR [rdi], 1
sbb eax, eax
and eax, -10
add eax, 80
ret
This is a contrived and optimized way to express the following:
If a is 0, the first comparison sets the carry flag CF; sbb then sets eax to all 32 bits 1 (i.e. -1), which and'ed with -10 yields -10 in eax, and adding 80 gives the result 70.
If a is anything other than 0, the first comparison leaves the carry flag clear, eax is set to zero, the and has no effect, and adding 80 gives the result 80.
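The same computation written out in C++ (only an illustration of what the three instructions do; the function and variable names are mine):

// cmp a, 1 / sbb eax, eax / and eax, -10 / add eax, 80 rewritten as C++.
int branchless_result(unsigned a)
{
    int eax = (a < 1u) ? -1 : 0;  // sbb: all ones when the cmp set the carry (a == 0)
    eax &= -10;                   // -10 when a == 0, otherwise 0
    eax += 80;                    //  70 when a == 0, otherwise 80
    return eax;
}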
It has to be noted (thanks Marc Glisse) that if the function is marked as cold (i.e. unlikely to be called), gcc does the right thing and optimizes the call away.
MSVC generates more verbose code but the comparison isn't skipped.
Clang is the only one that gets it right: the lambda's code isn't optimized any further than gcc's, but it is never actually called:
mov edi, std::cout
mov esi, 70
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Moral: Clang seems to get it right, but the optimization challenge is still open.
Very basic code:
Code#1
int i;
int counterONE = 0;
for (i = 0; i < 5; i++) {
    counterONE += 1;
}
cout << counterONE;
Code#2
int i, j;
int counterONE = 0, counterTWO = 0;
for (i = 0; i < 5; i++) {
    for (j = 0; j < 5; j++) {
        counterONE++;
    }
    counterTWO++;
}
cout << counterONE;
cout << endl << counterTWO;
For both of the codes my questions are:
How do those loops work? Are the stack frames maintained?
How is the internal memory maintained? Is there a queue?
Why does for look like a function(){} body, and how is that body resolved?
And please don't answer in short; I need a complete elaboration.
for is a simple loop that is translated to a "goto" (or something similar) in the machine code, in order to make some commands repeat themselves:
for (int i = 0; i < 5; i++) {
some code
}
some more code
will be translated to something like (very simplified)
R_x = 0 // R_x is some register
loop: check if R_x >= 5
if so, go to "after"
some code
increase R_x
go to loop
after: some more code
This code does not involve any recursion, and the importance of the stack here is negligible (only one is used, and only to store the automatic variables).
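The same translation written as C++ with an explicit goto, mirroring the pseudo machine code above (purely illustrative; the compiler does this with branches, not with your source):

void loop_as_goto()
{
    int i = 0;                   // R_x = 0
loop:
    if (i >= 5) goto after;      // "check if R_x >= 5, if so go to after"
    /* some code */
    i++;                         // "increase R_x"
    goto loop;                   // "go to loop"
after:
    ;                            // "some more code"
}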
The real answer is that for is not a function. It is a keyword,
introducing a special type of statement.
Don't be fooled by the parentheses. Parentheses in C++ are overloaded:
they may be an operator (the function call operator), or they may be
punctuation, which is part of the syntax. When they follow the keyword
for (or while or if or switch), they are punctuation, and not
the function call operator; many people, like myself, like to
differentiate the two uses by formatting them differently, putting a
space between the keyword and the opening parentheses when they are
punctation at the statement level, but using no space between the name
of a function and the ( operator. (Everyone I know who does this also
treats all of the formats for casting as if they were a function,
although technically...)
EDIT:
For what it's worth: you can overload the () operator. The overload will be considered in cases where the parentheses are operators (and in the context of a function style cast), but not when they are punctuation.
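A small example of the distinction (my own, just for illustration): in adder(3) the parentheses are the overloaded function call operator, while the parentheses after if are punctuation and never consult any overload.

struct Adder
{
    int base;
    int operator()(int x) const { return base + x; }  // overloaded ()
};

int demo()
{
    Adder adder{10};
    int y = adder(3);   // function call operator: y == 13
    if (y > 0)          // punctuation: part of the if statement's syntax
        return y;
    return 0;
}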
For both of the codes my question is: how do those loops work? Are stack frames maintained?
Let's look at what the compiler generates for your loop. I took your first snippet and built the following program with it [1]:
#include <stdio.h>

int main( void )
{
    int i;
    int counterONE=0;
    for(i=0;i<5;i++)
    {
        counterONE+=1;
    }
    return 0;
}
Here's the equivalent assembly code generated by gcc (using gcc -S) [2], annotated by me:
.file "loops.c"
.text
.globl main
.type main, #function
main: ;; function entry point
.LFB2:
pushq %rbp ;; save current frame pointer
.LCFI0:
movq %rsp, %rbp ;; make stack pointer new frame pointer
.LCFI1:
movl $0, -4(%rbp) ;; initialize counterONE to 0
movl $0, -8(%rbp) ;; initialize i to 0
jmp .L2 ;; jump to label L2 below
.L3:
addl $1, -4(%rbp) ;; add 1 to counterONE
addl $1, -8(%rbp) ;; add 1 to i
.L2:
cmpl $4, -8(%rbp) ;; compare i to the value 4
jle .L3 ;; if i is <= 4, jump to L3
movl $0, %eax ;; return 0
leave
ret
The only stack frame involved is the one created for the main function; no additional stack frames are created within the for loop itself. Even if you declared a variable local to the for loop, such as
for ( int i = 0; i < 5; i++ )
{
...
}
or
for ( i = 0; i < 5; i++ )
{
    int j;
    ...
}
a new stack frame will (most likely) not be created for the loop; any variables local to the loop will be created in the enclosing function's frame, although the variable will not be visible to code outside of the loop body.
How is memory maintained internally? Is there a queue?
No additional data structures are necessary. The only memory involved is the memory for i (which controls the execution of the loop) and counterONE, both of which are maintained on the stack [3]. They are referred to by their offset from the address stored in the frame pointer (for example, if %rbp contained the address 0x8000, then the memory for i would be stored at address 0x8000 - 8 == 0x7ff8 and the memory for counterONE would be stored at address 0x8000 - 4 == 0x7ffc).
Why does for look like a function(){} body, and how is the body resolved?
The language grammar tells the compiler how to interpret the code.
Here's the grammar for an iteration statement (taken from the online C 2011 draft):
(6.8.5) iteration-statement:
    while ( expression ) statement
    do statement while ( expression ) ;
    for ( expression_opt ; expression_opt ; expression_opt ) statement
    for ( declaration expression_opt ; expression_opt ) statement
Likewise, here's the grammar for a function call:
(6.5.2) postfix-expression:
...
postfix-expression ( argument-expression-list_opt )
...
and a function definition:
(6.9.1) function-definition:
declaration-specifiers declarator declaration-list_opt compound-statement
During parsing, the compiler breaks the source code up into tokens - keywords, identifiers, constants, string literals, and punctuators. The compiler then tries to match sequences of tokens against the grammar.
So, assuming the source file contains
for ( i = 0; i < 5; i++ )
the compiler will see the for keyword; based on the grammar, it knows to interpret the following ( i = 0; i < 5; i++ ) as a loop control body, rather than a function call or a function definition.
That's the 50,000 foot view, anyway; parsing is a pretty involved subject, and it's only part of what a compiler does. You might want to start here and follow the links.
Just know that this isn't something you're going to pick up in a weekend.
[1] Note - formatting really matters. If you want us to help you, you need to make your code as readable as possible. Sloppy formatting, inconsistent indenting, etc., makes it harder to find errors, and makes it less likely that someone will take time out to help.
[2] This isn't the complete listing, but the rest isn't relevant to your question.
[3] This explanation applies to commonly used architectures like x86, but be aware there are some old and/or oddball architectures that may do things differently.
g++ (4.7.2) and similar versions seem to evaluate constexpr surprisingly fast at compile time. On my machine it is in fact much faster than the compiled program at runtime.
Is there a reasonable explanation for that behavior?
Are there optimization techniques involved which are only
applicable at compile-time, that can be executed quicker than actual compiled code?
If so, which?
Here's my test program and the observed results.
#include <iostream>

constexpr int mc91(int n)
{
    return (n > 100)? n-10 : mc91(mc91(n+11));
}

constexpr double foo(double n)
{
    return (n>2)? (0.9999)*((unsigned int)(foo(n-1)+foo(n-2))%100):1;
}

constexpr unsigned ack( unsigned m, unsigned n )
{
    return m == 0
        ? n + 1
        : n == 0
            ? ack( m - 1, 1 )
            : ack( m - 1, ack( m, n - 1 ) );
}

constexpr unsigned slow91(int n) {
    return mc91(mc91(foo(n))%100);
}

int main(void)
{
    constexpr unsigned int compiletime_ack=ack(3,14);
    constexpr int compiletime_91=slow91(49);
    static_assert( compiletime_ack == 131069, "Must be evaluated at compile-time" );
    static_assert( compiletime_91 == 91, "Must be evaluated at compile-time" );
    std::cout << compiletime_ack << std::endl;
    std::cout << compiletime_91 << std::endl;
    std::cout << ack(3,14) << std::endl;
    std::cout << slow91(49) << std::endl;
    return 0;
}
compiletime:
time g++ constexpr.cpp -std=c++11 -fconstexpr-depth=10000000 -O3
real 0m0.645s
user 0m0.600s
sys 0m0.032s
runtime:
time ./a.out
131069
91
131069
91
real 0m43.708s
user 0m43.567s
sys 0m0.008s
Here mc91 is the usual McCarthy 91 function (as can be found on Wikipedia) and foo is just a useless function returning real values between about 1 and 100, with Fibonacci-like runtime complexity.
Both the slow calculation of 91 and the ackermann functions get evaluated with the same arguments by the compiler and the compiled program.
Surprisingly, just generating the code and running it through the compiler is faster than executing the compiled code itself.
At compile-time, redundant (identical) constexpr calls can be memoized, while run-time recursive behavior does not provide this.
If you change every recursive function such as...
constexpr unsigned slow91(int n) {
    return mc91(mc91(foo(n))%100);
}
... to a form that isn't constexpr, but does remember past calculations at runtime:
#include <unordered_map>
#include <boost/optional.hpp>

// key: the parameter n, value: the memoized result
std::unordered_map< int, boost::optional<unsigned> > results4;

unsigned slow91(int n) {
    boost::optional<unsigned> &ret = results4[n];
    if ( !ret )
    {
        ret = mc91(mc91(foo(n))%100);
    }
    return *ret;
}
You will get less surprising results.
compiletime:
time g++ test.cpp -std=c++11 -O3
real 0m1.708s
user 0m1.496s
sys 0m0.176s
runtime:
time ./a.out
131069
91
131069
91
real 0m0.097s
user 0m0.064s
sys 0m0.032s
Memoization
This is a very interesting "discovery" but the answer is probably more simple than you think it is.
Something declared constexpr can be evaluated at compile time if all values involved are known at compile time (and if the variable where the value is supposed to end up is declared constexpr as well). With that said, imagine the following pseudo-code:
f(x) = g(x)
g(x) = x + h(x,x)
h(x,y) = x + y
since every value is known at compile time the compiler can rewrite the above into the, equivalent, below:
f(x) = x + x + x
To put it in words, every function call has been removed and replaced with the expression itself. Also applicable is a method called memoization, where the results of previously calculated expressions are stored away, so you only need to do the hard work once.
If you know that g(5) = 15, why calculate it again? Instead, just replace g(5) with 15 every time it is needed. This is possible since a function declared constexpr isn't allowed to have side effects.
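To make the f/g/h pseudo-code above concrete, here is a small self-contained version (my own sketch) that the compiler folds entirely at compile time:

constexpr int h(int x, int y) { return x + y; }
constexpr int g(int x) { return x + h(x, x); }
constexpr int f(int x) { return g(x); }

// g(5) collapses to 15 and so does f(5); the assertion costs nothing at runtime.
static_assert(f(5) == 15, "evaluated entirely at compile time");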
Runtime
At runtime this is not happening (since we didn't tell the code to behave this way). The little guy running through your code needs to jump from f to g to h, then back from h to g, and from g to f, all while storing the return value of each function and passing it along to the next one.
Even if this guy is very, very tiny, and he doesn't need to jump very far, he still doesn't like jumping back and forth all the time; it takes a lot of effort, and with that, it takes time.
But in the OPs example, is it really calculated compile-time?
Yes, and to those who don't believe that the compiler actually calculates this and puts the results as constants in the finished binary, I will supply the relevant assembly instructions from the OP's code below (output of g++ -S -Wall -pedantic -fconstexpr-depth=1000000 -std=c++11).
main:
.LFB1200:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $131069, -4(%rbp)
movl $91, -8(%rbp)
movl $131069, %esi # one of the values from constexpr
movl $_ZSt4cout, %edi
call _ZNSolsEj
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
call _ZNSolsEPFRSoS_E
movl $91, %esi # the other value from our constexpr
movl $_ZSt4cout, %edi
call _ZNSolsEi
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
# ...
# a lot of jumping is taking place down here
# see the full output at http://codepad.org/Q8D7c41y