While playing about with optimisation settings, I noticed an interesting phenomenon: functions taking a variable number of arguments (...) never seemed to get inlined. (Obviously this behavior is compiler-specific, but I've tested on a couple of different systems.)
For example, compiling the following small programm:
#include <stdarg.h>
#include <stdio.h>
static inline void test(const char *format, ...)
{
va_list ap;
va_start(ap, format);
vprintf(format, ap);
va_end(ap);
}
int main()
{
test("Hello %s\n", "world");
return 0;
}
will seemingly always result in a (possibly mangled) test symbol appearing in the resulting executable (tested with Clang and GCC in both C and C++ modes on MacOS and Linux). If one modifies the signature of test() to take a plain string which is passed to printf(), the function is inlined from -O1 upwards by both compilers as you'd expect.
I suspect this is to do with the voodoo magic used to implement varargs, but how exactly this is usually done is a mystery to me. Can anybody enlighten me as to how compilers typically implement vararg functions, and why this seemingly prevents inlining?
At least on x86-64, the passing of var_args is quite complex (due to passing arguments in registers). Other architectures may not be quite so complex, but it is rarely trivial. In particular, having a stack-frame or frame pointer to refer to when getting each argument may be required. These sort of rules may well stop the compiler from inlining the function.
The code for x86-64 includes pushing all the integer arguments, and 8 sse registers onto the stack.
This is the function from the original code compiled with Clang:
test: # #test
subq $200, %rsp
testb %al, %al
je .LBB1_2
# BB#1: # %entry
movaps %xmm0, 48(%rsp)
movaps %xmm1, 64(%rsp)
movaps %xmm2, 80(%rsp)
movaps %xmm3, 96(%rsp)
movaps %xmm4, 112(%rsp)
movaps %xmm5, 128(%rsp)
movaps %xmm6, 144(%rsp)
movaps %xmm7, 160(%rsp)
.LBB1_2: # %entry
movq %r9, 40(%rsp)
movq %r8, 32(%rsp)
movq %rcx, 24(%rsp)
movq %rdx, 16(%rsp)
movq %rsi, 8(%rsp)
leaq (%rsp), %rax
movq %rax, 192(%rsp)
leaq 208(%rsp), %rax
movq %rax, 184(%rsp)
movl $48, 180(%rsp)
movl $8, 176(%rsp)
movq stdout(%rip), %rdi
leaq 176(%rsp), %rdx
movl $.L.str, %esi
callq vfprintf
addq $200, %rsp
retq
and from gcc:
test.constprop.0:
.cfi_startproc
subq $216, %rsp
.cfi_def_cfa_offset 224
testb %al, %al
movq %rsi, 40(%rsp)
movq %rdx, 48(%rsp)
movq %rcx, 56(%rsp)
movq %r8, 64(%rsp)
movq %r9, 72(%rsp)
je .L2
movaps %xmm0, 80(%rsp)
movaps %xmm1, 96(%rsp)
movaps %xmm2, 112(%rsp)
movaps %xmm3, 128(%rsp)
movaps %xmm4, 144(%rsp)
movaps %xmm5, 160(%rsp)
movaps %xmm6, 176(%rsp)
movaps %xmm7, 192(%rsp)
.L2:
leaq 224(%rsp), %rax
leaq 8(%rsp), %rdx
movl $.LC0, %esi
movq stdout(%rip), %rdi
movq %rax, 16(%rsp)
leaq 32(%rsp), %rax
movl $8, 8(%rsp)
movl $48, 12(%rsp)
movq %rax, 24(%rsp)
call vfprintf
addq $216, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
In clang for x86, it is much simpler:
test: # #test
subl $28, %esp
leal 36(%esp), %eax
movl %eax, 24(%esp)
movl stdout, %ecx
movl %eax, 8(%esp)
movl %ecx, (%esp)
movl $.L.str, 4(%esp)
calll vfprintf
addl $28, %esp
retl
There's nothing really stopping any of the above code from being inlined as such, so it would appear that it is simply a policy decision on the compiler writer. Of course, for a call to something like printf, it's pretty meaningless to optimise away a call/return pair for the cost of the code expansion - after all, printf is NOT a small short function.
(A decent part of my work for most of the past year has been to implement printf in an OpenCL environment, so I know far more than most people will ever even look up about format specifiers and various other tricky parts of printf)
Edit: The OpenCL compiler we use WILL inline calls to var_args functions, so it is possible to implement such a thing. It won't do it for calls to printf, because it bloats the code very much, but by default, our compiler inlines EVERYTHING, all the time, no matter what it is... And it does work, but we found that having 2-3 copies of printf in the code makes it REALLY huge (with all sorts of other drawbacks, including final code generation taking a lot longer due to some bad choices of algorithms in the compiler backend), so we had to add code to STOP the compiler doing that...
The variable arguments implementation generally have the following algorithm: Take the first address from the stack which is after the format string, and while parsing the input format string use the value at the given position as the required datatype. Now increment the stack parsing pointer with the size of the required datatype, advance in the format string and use the value at the new position as the required datatype ... and so on.
Some values automatically get converted (ie: promoted) to "larger" types (and this is more or less implementation dependant) such as char or short gets promoted to int and float to double.
Certainly, you do not need a format string, but in this case you need to know the type of the arguments passed in (such as: all ints, or all doubles, or the first 3 ints, then 3 more doubles ..).
So this is the short theory.
Now, to the practice, as the comment from n.m. above shows, gcc does not inline functions which have variable argument handling. Possibly there are pretty complex operations going on while handling the variable arguments which would increase the size of the code to an un-optimal size so it is simply not worth inlining these functions.
EDIT:
After doing a quick test with VS2012 I don't seem to be able to convince the compiler to inline the function with the variable arguments.
Regardless of the combination of flags in the "Optimization" tab of the project there is always a call totest and there is always a test method. And indeed:
http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx
says that
Even with __forceinline, the compiler cannot inline code in all circumstances. The compiler cannot inline a function if:
...
The function has a variable argument list.
The point of inlining is that it reduces function call overhead.
But for varargs, there is very little to be gained in general.
Consider this code in the body of that function:
if (blah)
{
printf("%d", va_arg(vl, int));
}
else
{
printf("%s", va_arg(vl, char *));
}
How is the compiler supposed to inline it? Doing that requires the compiler to push everything on the stack in the correct order anyway, even though there isn't any function being called. The only thing that's optimized away is a call/ret instruction pair (and maybe pushing/popping ebp and whatnot). The memory manipulations cannot be optimized away, and the parameters cannot be passed in registers. So it's unlikely that you'll gain anything notable by inlining varargs.
I do not expect that it would ever be possible to inline a varargs function, except in the most trivial case.
A varargs function that had no arguments, or that did not access any of its arguments, or that accessed only the fixed arguments preceding the variable ones could be inlined by rewriting it as an equivalent function that did not use varargs. This is the trivial case.
A varargs function that accesses its variadic arguments does so by executing code generated by the va_start and va_arg macros, which rely on the arguments being laid out in memory in some way. A compiler that performed inlining simply to remove the overhead of a function call would still need to create the data structure to support those macros. A compiler that attempted to remove all the machinery of function call would have to analyse and optimise away those macros as well. And it would still fail if the variadic function made a call to another function passing va_list as an argument.
I do not see a feasible path for this second case.
Related
so I get it that pre-increment is faster than post-increment as no copy of the value is made. But let's say I have this:
char * temp = "abc";
char c = 0;
Now if I want to assign 'a' to c and increment temp so that it now points to 'b' I would do it like this :
c = *temp++;
but pre-increment should be faster so i thought :
c = *temp;
++temp;
but it turns out *temp++ is faster according to my measurements.
Now I don't quite get it why and how, so if someone is willing to enlighten me, please do.
First, pre-increment is only potentially faster for the reason you state. But for basic types like pointers, in practice that's not the case, because the compiler can generate optimized code. Ref. Is there a performance difference between i++ and ++i in C++? for more details.
Second, to a decent optimizing compiler, there's no difference between the two alternatives you mentioned. It's very likely the same exact machine code will be generated for both cases (unless you disabled optimizations).
To illustrate this : on my compiler, the following code is generated when optimizations are disabled :
# c = *temp++;
movq temp(%rip), %rax
leaq 1(%rax), %rdx
movq %rdx, temp(%rip)
movzbl (%rax), %eax
movb %al, c(%rip)
# c = *temp;
movq temp(%rip), %rax
movzbl (%rax), %eax
movb %al, c(%rip)
# ++temp;
movq temp(%rip), %rax
addq $1, %rax
movq %rax, temp(%rip)
Notice there's an additional movq instruction in the latter, which could account for slower run time.
When enabling optimizations however, this turns into :
# c = *temp++;
movq temp(%rip), %rax
leaq 1(%rax), %rdx
movq %rdx, temp(%rip)
movzbl (%rax), %eax
movb %al, c(%rip)
# c = *temp;
# ++temp;
movq temp(%rip), %rax
movzbl (%rax), %edx
addq $1, %rax
movq %rax, temp(%rip)
movb %dl, c(%rip)
Other than a different order of the instructions, and the choice of using addq vs. leaq for the increment, there's no real difference between these two. If you do get (measurably) different performance between these two, then that's likely due to the specific cpu architecture (possibly a more optimal use of the pipeline eg.).
This sounds too much of a simple question to not be answered already somewhere, but I tried to look around and I couldn't find any simple answer. Take the following example:
class vec
{
double x;
double y;
};
inline void sum_x(vec & result, vec & a, vec & b)
{
result.x = a.x + b.x;
}
inline void sum(vec & result, vec & a, vec & b)
{
sum_x(result, a, b);
result.y = a.y + b.y;
}
What happens when I call sum and compile? Will both sum and sum_x be inlined, so that it will just translate to an inline assembly code to sum the two components?
This looks like a trivial example, but I am working with a vector class that has the dimensionality defined in a template, so iterating over operations on vectors looks a bit like this.
inline is just a hint to the compiler. Whether the compiler actually inlines the function or not is a different question. For gcc there is an always inline attribute to force this.
__attribute__((always_inline));
With always inlining you should achieve what you described (code generate as if it where written in one function).
However, with all the optimizations and transformations applied by compilers you can only be sure if you check the generated code (assembly)
Yes, inlining may be applied recursively.
The entire set of operations that you're performing here can be inlined at the call site.
Note that this has very little to do with your use of the inline keyword, which (other than its effect on the ODR — which can be very noticeable) is just a hint and nowadays mostly ignored for purposes of actually inlining. The functions will be inlined because your clever compiler can see that they are good candidates for it.
The only way you can actually tell whether it's doing this is to inspect the resulting assembly yourself.
It depends. inline is just a hint to the compiler that it might want to think about inlining that function. It's entirely possible for a compiler to inline both calls, but that's up to the implementation.
As an example, here's some prettified assembly output from GCC with and without those inlines of this simple program:
int main()
{
vec a;
vec b;
std::cin >> a.x;
std::cin >> a.y;
sum(b,a,a);
std::cout << b.x << b.y;
return 0;
}
With inlining:
main:
subq $40, %rsp
leaq 16(%rsp), %rsi
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >& std::basic_istream<char, std::char_traits<char> >::_M_extract<double>(double&)
leaq 24(%rsp), %rsi
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >& std::basic_istream<char, std::char_traits<char> >::_M_extract<double>(double&)
movsd 24(%rsp), %xmm0
movapd %xmm0, %xmm1
addsd %xmm0, %xmm1
movsd %xmm1, 8(%rsp)
movsd 16(%rsp), %xmm0
addsd %xmm0, %xmm0
movl std::cout, %edi
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
movsd 8(%rsp), %xmm0
movq %rax, %rdi
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
movl $0, %eax
addq $40, %rsp
ret
subq $8, %rsp
movl std::__ioinit, %edi
call std::ios_base::Init::Init()
movl $__dso_handle, %edx
movl std::__ioinit, %esi
movl std::ios_base::Init::~Init(), %edi
call __cxa_atexit
addq $8, %rsp
ret
Without:
sum_x(vec&, vec&, vec&):
movsd (%rsi), %xmm0
addsd (%rdx), %xmm0
movsd %xmm0, (%rdi)
ret
sum(vec&, vec&, vec&):
movsd (%rsi), %xmm0
addsd (%rdx), %xmm0
movsd %xmm0, (%rdi)
movsd 8(%rsi), %xmm0
addsd 8(%rdx), %xmm0
movsd %xmm0, 8(%rdi)
ret
main:
pushq %rbx
subq $48, %rsp
leaq 32(%rsp), %rsi
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >& std::basic_istream<char, std::char_traits<char> >::_M_extract<double>(double&)
leaq 40(%rsp), %rsi
movl std::cin, %edi
call std::basic_istream<char, std::char_traits<char> >& std::basic_istream<char, std::char_traits<char> >::_M_extract<double>(double&)
leaq 32(%rsp), %rdx
movq %rdx, %rsi
leaq 16(%rsp), %rdi
call sum(vec&, vec&, vec&)
movq 24(%rsp), %rbx
movsd 16(%rsp), %xmm0
movl std::cout, %edi
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
movq %rbx, 8(%rsp)
movsd 8(%rsp), %xmm0
movq %rax, %rdi
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
movl $0, %eax
addq $48, %rsp
popq %rbx
ret
subq $8, %rsp
movl std::__ioinit, %edi
call std::ios_base::Init::Init()
movl $__dso_handle, %edx
movl std::__ioinit, %esi
movl std::ios_base::Init::~Init(), %edi
call __cxa_atexit
addq $8, %rsp
ret
As you can see, GCC inlined both functions when asked to.
If your assembly is a bit rusty, simply note that sum is present and called in the second version, but not in the first.
As mentioned, the inline keyword is just a hint. However, compilers do an amazing job here (even without your hints), and they do inline recursively.
If you're really interested in this stuff, I recommend learning a bit about compiler design. I've been studying it recently and it blew my mind what complex beasts our production-quality compilers are today.
About inlining, this is one of the things that compilers tend to do an extremely good job at. It was so by necessity, since if you look at how we write code in C++, we often write accessor functions (methods) just to do nothing more than return the value of a single variable. C++'s popularity hinged in large part on the idea that we can write this kind of code utilizing concepts like information hiding without being forced into creating software that is slower than its C-like equivalent, so you often found optimizers as early as the 90s doing a really good job at inlining (and recursively).
For this next part, it's somewhat speculative as I'm somewhat assuming that what I've been reading and studying about compiler design is applicable towards the production-quality compilers we're using today. Who knows exactly what kind of advanced tricks they're all applying?
... but I believe compilers typically inline code before you get to the kind of machine code level. This is because one of the keys to an optimizer is efficient instruction selection and register allocation. To do that, it needs to know all the memory (variables) the code is going to be working with inside a procedure. It wants that in a form that is somewhat abstract where specific registers haven't been chosen yet but are ready to be assigned. So inlining is usually done at this intermediate representation stage, before you get to the kind of assembly realm of specific machine instructions and registers, so that the compiler can gather up all that information before it does its magical optimizations. It might even apply some heuristics here to kind of 'try' inlining or unrolling away branches of code prior to actually doing it.
A lot of linkers can even inline code, and I'm not sure how that works. I think when they can do that, the object code is actually still in an intermediate representation form, still somewhat abstracted away from specific machine-level instructions and registers. Then the linker can still move that code between object files and inline it, deferring that code generation/optimization process until after.
The following code does some copying from one array of zeroes interpreted as floats to another one, and prints timing of this operation. As I've seen many cases where no-op loops are just optimized away by compilers, including gcc, I was waiting that at some point of changing my copy-arrays program it will stop doing the copying.
#include <iostream>
#include <cstring>
#include <sys/time.h>
static inline long double currentTime()
{
timespec ts;
clock_gettime(CLOCK_MONOTONIC,&ts);
return ts.tv_sec+(long double)(ts.tv_nsec)*1e-9;
}
int main()
{
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
memset(data1,0,W*H*sizeof(float));
memset(data2,0,W*H*sizeof(float));
long double time1=currentTime();
for(int q=0;q<16;++q) // take more time
for(int k=0;k<W*H;++k)
data2[k]=data1[k];
long double time2=currentTime();
std::cout << (time2-time1)*1e+3 << " ms\n";
delete[] data1;
delete[] data2;
}
I compiled this with g++ 4.8.1 command g++ main.cpp -o test -std=c++0x -O3 -lrt. This program prints 6952.17 ms for me. (I had to set ulimit -s 2000000 for it to not crash.)
I also tried changing creation of arrays with new to automatic VLAs, removing memsets, but this doesn't change g++ behavior (apart from changing timings by several times).
It seems the compiler could prove that this code won't do anything sensible, so why didn't it optimize the loop away?
Anyway it isn't impossible (clang++ version 3.3):
clang++ main.cpp -o test -std=c++0x -O3 -lrt
The program prints 0.000367 ms for me... and looking at the assembly language:
...
callq clock_gettime
movq 56(%rsp), %r14
movq 64(%rsp), %rbx
leaq 56(%rsp), %rsi
movl $1, %edi
callq clock_gettime
...
while for g++:
...
call clock_gettime
fildq 32(%rsp)
movl $16, %eax
fildq 40(%rsp)
fmull .LC0(%rip)
faddp %st, %st(1)
.p2align 4,,10
.p2align 3
.L2:
movl $1, %ecx
xorl %edx, %edx
jmp .L5
.p2align 4,,10
.p2align 3
.L3:
movq %rcx, %rdx
movq %rsi, %rcx
.L5:
leaq 1(%rcx), %rsi
movss 0(%rbp,%rdx,4), %xmm0
movss %xmm0, (%rbx,%rdx,4)
cmpq $200000001, %rsi
jne .L3
subl $1, %eax
jne .L2
fstpt 16(%rsp)
leaq 32(%rsp), %rsi
movl $1, %edi
call clock_gettime
...
EDIT (g++ v4.8.2 / clang++ v3.3)
SOURCE CODE - ORIGINAL VERSION (1)
...
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
...
SOURCE CODE - MODIFIED VERSION (2)
...
const size_t W=20000;
const size_t H=10000;
float data1[W*H];
float data2[W*H];
...
Now the case that isn't optimized is (1) + g++
The code in this question has changed quite a bit, invalidating correct answers. This answer applies to the 5th version: as the code currently attempts to read uninitialized memory, an optimizer may reasonably assume that unexpected things are happening.
Many optimization steps have a similar pattern: there's a pattern of instructions that's matched to the current state of compilation. If the pattern matches at some point, the matched pattern is (parametrically) replaced by a more efficient version. A very simple example of such a pattern is the definition of a variable that's not subsequently used; the replacement in this case is simply a deletion.
These patterns are designed for correct code. On incorrect code, the patterns may simply fail to match, or they may match in entirely unintended ways. The first case leads to no optimization, the second case may lead to totally unpredictable results (certainly if the modified code if further optimized)
Why do you expect the compiler to optimise this? It’s generally really hard to prove that writes to arbitrary memory addresses are a “no-op”. In your case it would be possible, but it would require the compiler to trace the heap memory addresses through new (which is once again hard since these addresses are generated at runtime) and there really is no incentive for doing this.
After all, you tell the compiler explicitly that you want to allocate memory and write to it. How is the poor compiler to know that you’ve been lying to it?
In particular, the problem is that the heap memory could be aliased to lots of other stuff. It happens to be private to your process but like I said above, proving this is a lot of work for the compiler, unlike for function local memory.
The only way in which the compiler could know that this is a no-op is if it knew what memset does. In order for that to happen, the function must either be defined in a header (and it typically isn't), or it must be treated as a special intrinsic by the compiler. But barring those tricks, the compiler just sees a call to an unknown function which could have side effects and do different things for each of the two calls.
For illustrative purposes, let's say I want to implement a generic integer comparing function. I can think of a few approaches for the definition/invocation of the function.
(A) Function Template + functors
template <class Compare> void compare_int (int a, int b, const std::string& msg, Compare cmp_func)
{
if (cmp_func(a, b)) std::cout << "a is " << msg << " b" << std::endl;
else std::cout << "a is not " << msg << " b" << std::endl;
}
struct MyFunctor_LT {
bool operator() (int a, int b) {
return a<b;
}
};
And this would be a couple of calls to this function:
MyFunctor_LT mflt;
MyFunctor_GT mfgt; //not necessary to show the implementation
compare_int (3, 5, "less than", mflt);
compare_int (3, 5, "greater than", mflt);
(B) Function template + lambdas
We would call compare_int like this:
compare_int (3, 5, "less than", [](int a, int b) {return a<b;});
compare_int (3, 5, "greater than", [](int a, int b) {return a>b;});
(C) Function template + std::function
Same template implementation, invocation:
std::function<bool(int,int)> func_lt = [](int a, int b) {return a<b;}; //or a functor/function
std::function<bool(int,int)> func_gt = [](int a, int b) {return a>b;};
compare_int (3, 5, "less than", func_lt);
compare_int (3, 5, "greater than", func_gt);
(D) Raw "C-style" pointers
Implementation:
void compare_int (int a, int b, const std::string& msg, bool (*cmp_func) (int a, int b))
{
...
}
bool lt_func (int a, int b)
{
return a<b;
}
Invocation:
compare_int (10, 5, "less than", lt_func);
compare_int (10, 5, "greater than", gt_func);
With those scenarios laid out, we have in each case:
(A) Two template instances (two different parameters) will be compiled and allocated in memory.
(B) I would say also two template instances would be compiled. Each lambda is a different class. Correct me if I'm wrong, please.
(C) Only one template instance would be compiled, since template parameter is always te same: std::function<bool(int,int)>.
(D) Obviously we only have one instance.
Needless to day, it doesn't make a difference for such a naive example. But when working with dozens (or hundreds) of templates, and numerous functors, the compilation times and memory usage difference can be substantial.
Can we say that in many circumstances (i.e., when using too many functors with the same signature) std::function (or even function pointers) must be preferred over templates+raw functors/lambdas? Wrapping your functor or lambda with std::function may be very convenient.
I am aware that std::function (function pointer too) introduces an overhead. Is it worth it?
EDIT. I did a very simple benchmark using the following macros and a very common standard library function template (std::sort):
#define TEST(X) std::function<bool(int,int)> f##X = [] (int a, int b) {return (a^X)<(b+X);}; \
std::sort (v.begin(), v.end(), f##X);
#define TEST2(X) auto f##X = [] (int a, int b) {return (a^X)<(b^X);}; \
std::sort (v.begin(), v.end(), f##X);
#define TEST3(X) bool(*f##X)(int, int) = [] (int a, int b) {return (a^X)<(b^X);}; \
std::sort (v.begin(), v.end(), f##X);
Results are the following regarding size of the generated binaries (GCC at -O3):
Binary with 1 TEST macro instance: 17009
1 TEST2 macro instance: 9932
1 TEST3 macro instance: 9820
50 TEST macro instances: 59918
50 TEST2 macro instances: 94682
50 TEST3 macro instances: 16857
Even if I showed the numbers, it is a more qualitative than quantitative benchmark. As we were expecting, function templates based on the std::function parameter or function pointer scale better (in terms of size), as not so many instances are created. I did not measure runtime memory usage though.
As for the performance results (vector size is 1000000 of elements):
50 TEST macro instances: 5.75s
50 TEST2 macro instances: 1.54s
50 TEST3 macro instances: 3.20s
It is a notable difference, and we must not neglect the overhead introduced by std::function (at least if our algorithms consist of millions of iterations).
As others have already pointed out, lambdas and function objects are likely to be inlined, especially if the body of the function is not too long. As a consequence, they are likely to be better in terms of speed and memory usage than the std::function approach. The compiler can optimize your code more aggressively if the function can be inlined. Shockingly better. std::function would be my last resort for this reason, among other things.
But when working with dozens (or hundreds) of templates, and numerous
functors, the compilation times and memory usage difference can be
substantial.
As for the compilation times, I wouldn't worry too much about it as long as you are using simple templates like the one shown. (If you are doing template metaprogramming, yeah, then you can start worrying.)
Now, the memory usage: By the compiler during compilation or by the generated executable at run time? For the former, same holds as for the compilation time. For the latter: Inlined lamdas and function objects are the winners.
Can we say that in many circumstances std::function (or even
function pointers) must be preferred over templates+raw
functors/lambdas? I.e. wrapping your functor or lambda with
std::function may be very convenient.
I am not quite sure how to answer to that one. I cannot define "many circumstances".
However, one thing I can say for sure is that type erasure is a way to avoid / reduce code bloat due to templates, see Item 44: Factor parameter-independent code out of templates in Effective C++. By the way, std::function uses type erasure internally. So yes, code bloat is an issue.
I am aware that std::function (function pointer too) introduces an overhead. Is it worth it?
"Want speed? Measure." (Howard Hinnant)
One more thing: function calls through function pointers can be inlined (even across compilation units!). Here is a proof:
#include <cstdio>
bool lt_func(int a, int b)
{
return a<b;
}
void compare_int(int a, int b, const char* msg, bool (*cmp_func) (int a, int b)) {
if (cmp_func(a, b)) printf("a is %s b\n", msg);
else printf("a is not %s b\n", msg);
}
void f() {
compare_int (10, 5, "less than", lt_func);
}
This is a slightly modified version of your code. I removed all the iostream stuff because it makes the generated assembly cluttered. Here is the assembly of f():
.LC1:
.string "a is not %s b\n"
[...]
.LC2:
.string "less than"
[...]
f():
.LFB33:
.cfi_startproc
movl $.LC2, %edx
movl $.LC1, %esi
movl $1, %edi
xorl %eax, %eax
jmp __printf_chk
.cfi_endproc
Which means, that gcc 4.7.2 inlined lt_func at -O3. In fact, the generated assembly code is optimal.
I have also checked: I moved the implementation of lt_func into a separate source file and enabled link time optimization (-flto). GCC still happily inlined the call through the function pointer! It is nontrivial and you need a quality compiler to do that.
Just for the record, and that you can actually feel the overhead of the std::function approach:
This code:
#include <cstdio>
#include <functional>
template <class Compare> void compare_int(int a, int b, const char* msg, Compare cmp_func)
{
if (cmp_func(a, b)) printf("a is %s b\n", msg);
else printf("a is not %s b\n", msg);
}
void f() {
std::function<bool(int,int)> func_lt = [](int a, int b) {return a<b;};
compare_int (10, 5, "less than", func_lt);
}
yields this assembly at -O3 (approx. 140 lines):
f():
.LFB498:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
.cfi_lsda 0x3,.LLSDA498
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movl $1, %edi
subq $80, %rsp
.cfi_def_cfa_offset 96
movq %fs:40, %rax
movq %rax, 72(%rsp)
xorl %eax, %eax
movq std::_Function_handler<bool (int, int), f()::{lambda(int, int)#1}>::_M_invoke(std::_Any_data const&, int, int), 24(%rsp)
movq std::_Function_base::_Base_manager<f()::{lambda(int, int)#1}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<f()::{lambda(int, int)#1}> const&, std::_Manager_operation), 16(%rsp)
.LEHB0:
call operator new(unsigned long)
.LEHE0:
movq %rax, (%rsp)
movq 16(%rsp), %rax
movq $0, 48(%rsp)
testq %rax, %rax
je .L14
movq 24(%rsp), %rdx
movq %rax, 48(%rsp)
movq %rsp, %rsi
leaq 32(%rsp), %rdi
movq %rdx, 56(%rsp)
movl $2, %edx
.LEHB1:
call *%rax
.LEHE1:
cmpq $0, 48(%rsp)
je .L14
movl $5, %edx
movl $10, %esi
leaq 32(%rsp), %rdi
.LEHB2:
call *56(%rsp)
testb %al, %al
movl $.LC0, %edx
jne .L49
movl $.LC2, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
.LEHE2:
.L24:
movq 48(%rsp), %rax
testq %rax, %rax
je .L23
leaq 32(%rsp), %rsi
movl $3, %edx
movq %rsi, %rdi
.LEHB3:
call *%rax
.LEHE3:
.L23:
movq 16(%rsp), %rax
testq %rax, %rax
je .L12
movl $3, %edx
movq %rsp, %rsi
movq %rsp, %rdi
.LEHB4:
call *%rax
.LEHE4:
.L12:
movq 72(%rsp), %rax
xorq %fs:40, %rax
jne .L50
addq $80, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 16
popq %rbx
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L49:
.cfi_restore_state
movl $.LC1, %esi
movl $1, %edi
xorl %eax, %eax
.LEHB5:
call __printf_chk
jmp .L24
.L14:
call std::__throw_bad_function_call()
.LEHE5:
.L32:
movq 48(%rsp), %rcx
movq %rax, %rbx
testq %rcx, %rcx
je .L20
leaq 32(%rsp), %rsi
movl $3, %edx
movq %rsi, %rdi
call *%rcx
.L20:
movq 16(%rsp), %rax
testq %rax, %rax
je .L29
movl $3, %edx
movq %rsp, %rsi
movq %rsp, %rdi
call *%rax
.L29:
movq %rbx, %rdi
.LEHB6:
call _Unwind_Resume
.LEHE6:
.L50:
call __stack_chk_fail
.L34:
movq 48(%rsp), %rcx
movq %rax, %rbx
testq %rcx, %rcx
je .L20
leaq 32(%rsp), %rsi
movl $3, %edx
movq %rsi, %rdi
call *%rcx
jmp .L20
.L31:
movq %rax, %rbx
jmp .L20
.L33:
movq 16(%rsp), %rcx
movq %rax, %rbx
testq %rcx, %rcx
je .L29
movl $3, %edx
movq %rsp, %rsi
movq %rsp, %rdi
call *%rcx
jmp .L29
.cfi_endproc
Which approach would you like to choose when it comes to performance?
If you bind your lambda to a std::function then your code will run slower because it will no longer be inlineable, invoking goes through function pointer and creation of the function object possibly requires a heap allocation if the size of the lambda (= size of captured state) exceeds the small buffer limit (which is equal to the size of one or two pointers on GCC IIRC).
If you keep your lambda, e.g. auto a = []{};, then it will be as fast as an inlineable function (perhaps even faster because there's no conversion to function pointer when passed as an argument to a function.)
The object code generated by lambda and inlineable function objects will be zero when compiling with optimizations enabled (-O1 or higher in my tests). Occasionally the compiler may refuse to inline but that normally only happens when trying to inline big function bodies.
You can always look at the generated assembly if you want to be certain.
I will discuss what happens naively, and common optimizations.
(A) Function Template + functors
In this case, there will be one functor whose type and arguments fully describes what happens when you invoke (). There will be one instance of the template function per functor passed to it.
While the functor has a minimium size of 1 byte, and must technically be copied, the copy is a no-op (not even the 1 byte of space need be copied: a dumb compiler/low optimization settings may result in it being copied anyhow).
Optimizing the existance of that functor object away is an easy thing for compilers to do: the method is inline and has the same symbol name, so this will even happen over multiple compilation units. Inlining the call is also very easy.
If you had multiple function objects with the same implementation, havine one instance of the functor despite this takes a bit of effort, but some compilers can do it. Inlining the template functions may also be easy. In your toy example, as the inputs are known at compile time, branches can be evaluated at compile time, dead code eliminated, and everything reduced to a single std::cout call.
(B) Function template + lambdas
In this case, each lambda is its own type, and each closure instance is an undefined size instance of that lambda (usually 1 byte as it captures nothing). If an identical lambda is defined and used at different spots, these are different types. Each call location for the function object is a distinct instantiation of the template function.
Removing the existance of the 1 byte closures, assuming they are stateless, is easy. Inlining them is also easy. Removing duplicate instances of the template function with the same implementation but different signature is harder, but some compilers will do it. Inlining said functions is no harder than above. In your toy example, as the inputs are known at compile time, branches can be evaluated at compile time, dead code eliminated, and everything reduced to a single std::cout call.
(C) Function template + std::function
std::function is a type erasure object. An instance of std::function with a given signature has the same type as another. However, the constructor for std::function is templated on the type passed in. In your case, you are passing in a lambda -- so each location where you initialize a std::function with a lambda generates a distinct std::function constructor, which does unknown code.
A typical implementation of std::function will use the pImpl pattern to store a pointer to the abstract interface to a helper object that wraps your callable, and knows how to copy/move/call it with the std::function signature. One such callable type is created per type std::function is constructed from per std::function signature it is constructed to.
One instance of the function will be created, taking a std::function.
Some compilers can notice duplicate methods and use the same implementation for both, and maybe pull of a similar trick for (much) of their virtual function table (but not all, as dynamic casting requires they be different). This is less likely to happen than the earlier duplicate function elimination. The code in the duplicated methods used by the std::function helper is probably simpler than other duplicated functions, so that could be cheaper.
While the template function can be inlined, I am not aware of a C++ compiler that can optimize the existance of a std::function away, as they are usually implemented as library solutions consisting of relatively complex and opaque code to the compiler. So while theoretically it could be evaluated as all information is there, in practice the std::function will not be inlined into the template function, and no dead code elimination will occur. Both branches will make it into the resulting binary, together with a pile of std::function boilerplate for itself and its helper.
Calling a std::function is roughly as expensive as making a virtual method call -- or, as a rule of thumb, very roughly as expensive as two function pointer calls.
(D) Raw "C-style" pointers
A function is created, its address taken, this address is passed to compare_int. It then dereferences that pointer to find the actual function, and calls it.
Some compilers are good at noticing that the function pointer is created from a literal, then inlining the call here: not all compilers can do this, and no or few compilers can do this in the general case. If they cannot (because the initialization isn't from a literal, because the interface is in one compilation unit while the implementation is in another) there is a significant cost to following a function pointer to data -- the computer tends to be unable to cache the location it is going to, so there is a pipeline stall.
Note that you can call the raw "C-style" pointers with stateless lambdas, as stateless lambdas implicitly convert to function pointers. Also note that this example is strictly weaker than the other ones: it does not accept stateful functions. The version with the same power would be a C-style function that takes both the pair of ints and a void* state.
You should use auto keyword with lambda functions, not std::function. That way you get unique type and no runtime overhead of std::function.
Also, as dyp suggests, stateless (that is, no captures) lamba functions can be converted to function pointer.
In A, B and C you probably end up with a binary that does not contain a comparator, nor any templates. It will literally inline all of the comparison and probably even remove the non-true branch of the printing - in effect, it'll be "puts" calls for the actual result, without any checks left to do.
In D your compiler can't do that.
So rather, it's more optimal in this example. It's also more flexible - an std::function can hide members being stored, or it just being a plain C function, or it being a complex object. It even allows you to derive the implementation from aspects of the type - if you could do a more efficient comparison for POD types you can implement that and forget about it for the rest.
Think of it this way - A and B are higher abstract implementations that allow you to tell the compiler "This code is probably best implemented for each type separately, so do that for me when I use it". C is one that says "There'll be multiple comparison operators that are complex, but they'll all look like this so only make one implementation of the compare_int function". In D you tell it "Don't bother, just make me these functions. I know best." None is unequivocally better than the rest.
This question already has answers here:
Do temp variables slow down my program?
(5 answers)
Closed 5 years ago.
Let's say we have two functions:
int f();
int g();
I want to get the sum of f() and g().
First way:
int fRes = f();
int gRes = g();
int sum = fRes + gRes;
Second way:
int sum = f() + g();
Will be there any difference in performance in this two cases?
Same question for complex types instead of ints
EDIT
Do I understand right i should not worry about performance in such case (in each situation including frequently performed tasks) and use temporary variables to increase readability and to simplify the code ?
You can answer questions like this for yourself by compiling to assembly language (with optimization on, of course) and inspecting the output. If I flesh your example out to a complete, compilable program...
extern int f();
extern int g();
int direct()
{
return f() + g();
}
int indirect()
{
int F = f();
int G = g();
return F + G;
}
and compile it (g++ -S -O2 -fomit-frame-pointer -fno-exceptions test.cc; the last two switches eliminate a bunch of distractions from the output), I get this (further distractions deleted):
__Z8indirectv:
pushq %rbx
call __Z1fv
movl %eax, %ebx
call __Z1gv
addl %ebx, %eax
popq %rbx
ret
__Z6directv:
pushq %rbx
call __Z1fv
movl %eax, %ebx
call __Z1gv
addl %ebx, %eax
popq %rbx
ret
As you can see, the code generated for both functions is identical, so the answer to your question is no, there will be no performance difference. Now let's look at complex numbers -- same code, but s/int/std::complex<double>/g throughout and #include <complex> at the top; same compilation switches --
__Z8indirectv:
subq $72, %rsp
call __Z1fv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 48(%rsp)
movq 8(%rsp), %rax
movq %rax, 56(%rsp)
call __Z1gv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 32(%rsp)
movq 8(%rsp), %rax
movq %rax, 40(%rsp)
movsd 48(%rsp), %xmm0
addsd 32(%rsp), %xmm0
movsd 56(%rsp), %xmm1
addsd 40(%rsp), %xmm1
addq $72, %rsp
ret
__Z6directv:
subq $72, %rsp
call __Z1gv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 32(%rsp)
movq 8(%rsp), %rax
movq %rax, 40(%rsp)
call __Z1fv
movsd %xmm0, (%rsp)
movsd %xmm1, 8(%rsp)
movq (%rsp), %rax
movq %rax, 48(%rsp)
movq 8(%rsp), %rax
movq %rax, 56(%rsp)
movsd 48(%rsp), %xmm0
addsd 32(%rsp), %xmm0
movsd 56(%rsp), %xmm1
addsd 40(%rsp), %xmm1
addq $72, %rsp
ret
That's a lot more instructions and the compiler isn't doing a perfect optimization job, it looks like, but nonetheless the code generated for both functions is identical.
I think in the second way it is assigned to a temporary variable when the function returns a value anyway. However, it becomes somewhat significant when you need to use the values from f() and g() more than once case in which storing them to a variable instead of recalculating them each time can help.
If you have optimization turned off, there likely will be. If you have it turned on, they will likely result in identical code. This is especially true of you label the fRes and gRes as const.
Because it's legal for the compiler to elide the call to the copy constructor if fRes and gRes are complex types they will not differ in performance for complex types either.
Someone mentioned using fRes and gRes more than once. And of course, this is obviously potentially less optimal as you would have to call f() or g() more than once.
As you wrote it, there's only a subtle difference (which another answer addresses, that there's a sequence point in the one vs the other).
They would be different if you had done this instead:
int fRes;
int gRes;
fRes = f();
fRes = g();
int sum = fRes + gRes;
(Imagining that int as actually some other type with a non-trivial default constructor.)
In the case here, you invoke default constructors and then assignment operators, which is potentially more work.
It depends entirely on what optimizations the compiler performs. The two could compile to slightly different or exactly the same bytecode. Even if slightly different, you couldn't measure a statistically significant difference in time and space costs for those particular samples.
On my platform with full optimization turned on, a function returning the sum from both different cases compiled to exactly the same machine code.
The only minor difference between the two examples is that the first guarantees the order in which f() and g() are called, so in theory the second allows the compiler slightly more flexibility. Whether this ever makes a difference would depend on what f() and g() actually do and, perhaps, whether they can be inlined.
There is a slight difference between the two examples. In expression f() + g() there is no sequence point, whereas when the calls are made in different statements there are sequence points at the end of each statement.
The absence of a sequence point means the order these two functions are called is unspecified, they can be called in any order, which might help the compiler optimize it.