Wondering if the following surprises anyone, as it did me? Alex Allain's article here on using constexpr shows the following factorial example:
constexpr factorial (int n)
{
return n > 0 ? n * factorial( n - 1 ) : 1;
}
And states:
Now you can use factorial(2) and when the compiler sees it, it can
optimize away the call and make the calculation entirely at compile
time.
I tried this in VS2015 in Release mode with full optimizations on (/Ox) and stepped through the code in the debugger viewing the assembly and saw that the factorial calculation was not done at compilation.
Using GCC v5.4.0 with --std=C++14, I must use /O2 or /O3 before the calculation is performed at compile time. I was surprised thought that using just /O the calculation did not occur at compilation time.
Main main question is: Why is VS2015 not performing this calculation at compilation time?
It depends on the context of the function call.
For example, the following obviously could never be calculated at compile time:
int x;
std::cin >> x;
std::cout << factorial(x);
On the other hand, this context would require the answer at compile time:
class Foo {
int x[factorial(4)];
};
constexpr functions are only guaranteed to be evaluated at compile time if they are called from a constexpr context; otherwise it is up to the compiler to choose whether or not to eval at compile time (assuming such an optimization is possible, again, depending on the context).
You have to use it in const expression, as:
constexpr auto res = factorial(2);
else computation can be done at runtime.
constexpr is neither necessary nor sufficient to compile time evaluation of a function.
It's not sufficient, even aside from the fact that the arguments obviously also have to be constant expressions. Even if that is true, a conforming compiler does not have to evaluate it at compile time. It only has to be evaluated at compile time if it is in a constexpr context. Such as, assigning the result of the computation to a constexpr variable, or using the value as an array size, or as a non-type template parameter.
The other point, is that the compiler is completely capable of evaluating things at compile time, even without constexpr. There is a lot of confusion about this, and it's not clear why. compile time evaluation of constexpr functions fundamentally just boils down to constant propagation, and compilers have been doing this optimization since forever: https://godbolt.org/g/Sy214U.
int factorial(int n) {
if (n <= 1) return 1;
return n * factorial(n-1);
}
int foo() { return factorial(5); }
On gcc 6.3 with O3 (and 14) yields:
foo():
mov eax, 120
ret
In essence, outside of the specific case where you absolutely force compile time evaluation by assigning a constexpr function to another constexpr variable, compile time evaluation has more to do with the quality of your optimizer than the standard.
This is something I've always thought to be true but have never had any validation. Consider a very simple function:
int subtractFive(int num) {
return num -5;
}
If a call to this function uses a compile time constant such as
getElement(5);
A compiler with optimizations turned on will very likely inline this. What is unclear to me however, is if the num - 5 will be evaluated at runtime or compile time. Will expression simplification extend recursively through inlined functions in this manner? Or does it not transcend functions?
We can simply look at the generated assembly to find out. This code:
int subtractFive(int num) {
return num -5;
}
int main(int argc, char *argv[]) {
return subtractFive(argc);
}
compiled with g++ -O2 yields
leal -5(%rdi), %eax
ret
So the function call was indeed reduced to a single instruction. This optimization technique is known as inlining.
One can of course use the same technique to see how far a compiler will go with that, e.g. the slightly more complicated
int subtractFive(int num) {
return num -5;
}
int foo(int i) {
return subtractFive(i) * 5;
}
int main(int argc, char *argv[]) {
return foo(argc);
}
still gets compiled to
leal -25(%rdi,%rdi,4), %eax
ret
so here both functions where just eliminated at compile time. If the input to foo is known at compile time, the function call will (in this case) simply be replaced by the resulting constant at compile time (Live).
The compiler can also combine this inlining with constant folding, to replace the function call with its fully evaluated result if all arguments are compile time constants. For example,
int subtractFive(int num) {
return num -5;
}
int foo(int i) {
return subtractFive(i) * 5;
}
int main() {
return foo(7);
}
compiles to
mov eax, 10
ret
which is equivalent to
int main () {
return 10;
}
A compiler will always do this where it thinks it is a good idea, and it is (usually) way better in optimizing code on this low level than you are.
It's easy to do a little test; consider the following
int foo(int);
int bar(int x) { return x-5; }
int baz() { return foo(bar(5)); }
Compiling with g++ -O3 the asm output for function baz is
xorl %edi, %edi
jmp _Z3fooi
This code loads a 0 in the first parameter and then jumps into the code of foo. So the code from bar is completely disappeared and the computation of the value to pass to foo has been done at compile time.
In addition returning the value of calling the function became just a jump to the function code (this is called "tail call optimization").
A smart compiler will evaluate this at compile time and will replace the getElement(5) because it will never have a different result. None of the variables are considered volatile.
I have used constexpr to calculate hash codes in compile times. Code compiles correctly, runs correctly. But I dont know, if hash values are compile time or run time. If I trace code in runtime, I dont do into constexpr functions. But, those are not traced even for runtime values (calculate hash for runtime generated string - same methods).
I have tried to look into dissassembly, but I quite dont understand it
For debug purposes, my hash code is only string length, using this:
constexpr inline size_t StringLengthCExpr(const char * const str) noexcept
{
return (*str == 0) ? 0 : StringLengthCExpr(str + 1) + 1;
};
I have ID class created like this
class StringID
{
public:
constexpr StringID(const char * key);
private:
const unsigned int hashID;
}
constexpr inline StringID::StringID(const char * key)
: hashID(StringLengthCExpr(key))
{
}
If I do this in program main method
StringID id("hello world");
I got this disassembled code (part of it - there is a lot of more from inlined methods and other stuff in main)
;;; StringID id("hello world");
lea eax, DWORD PTR [-76+ebp]
lea edx, DWORD PTR [id.14876.0]
mov edi, eax
mov esi, edx
mov ecx, 4
mov eax, ecx
shr ecx, 2
rep movsd
mov ecx, eax
and ecx, 3
rep movsb
// another code
How can I tell from this, that "hash value" is a compile time. I donĀ“t see any constant like 11 moved to register. I am not quite good with ASM, so maybe it is correct, but I am not sure what to check or how to be sure, that "hash code" values are compile time and not computed in runtime from this code.
(I am using Visual Studio 2013 + Intel C++ 15 Compiler - VS Compiler is not supporting constexpr)
Edit:
If I change my code and do this
const int ix = StringLengthCExpr("hello world");
mov DWORD PTR [-24+ebp], 11 ;55.15
I have got the correct result
Even with this
change private hashID to public
StringID id("hello world");
// mov DWORD PTR [-24+ebp], 11 ;55.15
printf("%i", id.hashID);
// some other ASM code
But If I use private hashID and add Getter
inline uint32 GetHashID() const { return this->hashID; };
to ID class, then I got
StringID id("hello world");
//see original "wrong" ASM code
printf("%i", id.GetHashID());
// some other ASM code
The most convenient way is to use your constexpr in a static_assert statement. The code will not compile when it is not evaluated during compile time and the static_assert expression will give you no overhead during runtime (and no unnecessary generated code like with a template solution).
Example:
static_assert(_StringLength("meow") == 4, "The length should be 4!");
This also checks whether your function is computing the result correctly or not.
If you want to ensure that a constexpr function is evaluated at compile time, use its result in something which requires compile-time evaluation:
template <size_t N>
struct ForceCompileTimeEvaluation { static constexpr size_t value = N; };
constexpr inline StringID::StringID(const char * key)
: hashID(ForceCompileTimeEvaluation<StringLength(key)>::value)
{}
Notice that I've renamed the function to just StringLength. Name which start with an underscore followed by an uppercase letter, or which contain two consecutive underscores, are not legal in user code. They're reserved for the implementation (compiler & standard library).
In the future(c++20) you can use the consteval specifier to declare a function, which must be evaluated at compile time, thus requiring a constant expression context.
The consteval specifier declares a function or function template to be
an immediate function, that is, every call to the function must
(directly or indirectly) produce a compile time constant expression.
An example from cppreference(see consteval):
consteval int sqr(int n) {
return n*n;
}
constexpr int r = sqr(100); // OK
int x = 100;
int r2 = sqr(x); // Error: Call does not produce a constant
consteval int sqrsqr(int n) {
return sqr(sqr(n)); // Not a constant expression at this point, but OK
}
constexpr int dblsqr(int n) {
return 2*sqr(n); // Error: Enclosing function is not consteval and sqr(n) is not a constant
}
There are a few ways to force compile-time evaluation. But these aren't as flexible and easy to setup as what you'd expect when using constexpr. And they don't help you in finding if the compile-time constants are actually been used.
What you'd want for constexpr is to work where you expect it to beneficial. Therefor you try to meet its requirements. But Then you need to test if the code you expect to be generated at compile-time has been generated, and if the users actually consume the generated result or trigger the function at runtime.
I've found two ways to detect if a class or (member)function is using the compile-time or runtime evaluated path.
Using the property of constexpr functions returning true from the noexcept operator (bool noexcept( expression )) if evaluated at compile-time. Since the generated result will be a compile-time constant. This method is quite accessible and usable with Unit-testing.
(Be aware that marking these functions explicitly noexcept will break the test.)
Source: cppreference.com (2017/3/3)
Because the noexcept operator always returns true for a constant expression, it can be used to check if a particular invocation of a constexpr function takes the constant expression branch (...)
(less convenient) Using a debugger: By putting a break-point inside the function marked constexpr. Whenever the break-point isn't triggered, the compiler evaluated result was used. Not the easiest, but possible for incidental checking.
Soure: Microsoft documentation (2017/3/3)
Note: In the Visual Studio debugger, you can tell whether a constexpr function is being evaluated at compile time by putting a breakpoint inside it. If the breakpoint is hit, the function was called at run-time. If not, then the function was called at compile time.
I've found both these methods to be useful while experimenting with constexpr. Although I haven't done any testing with environments outside VS2017. And haven't been able to find an explicit statement supporting this behaviour in the current draft of the standard.
The following trick can help to check if the constexpr function has been evaluated during compile time only:
With gcc you can compile the source file with assembly listing + c sources; given that both the constexpr and its calls are in source file try.cpp
gcc -std=c++11 -O2 -Wa,-a,-ad try.cpp | c++filt >try.lst
If the constexpr function has been evaluated during run time time then you will see the compiled function and a call instruction (call function_name on x86) in the assembly listing try.lst (note that c++filt command has undecorated the linker names)
Interesting that it I always see a call if compiled without optimization (without -O2 or -O3 option).
Simply put it in constexpr variable.
constexpr StringID id("hello world");
constexpr int ix = StringLengthCExpr("hello world");
A constexpr variable is always a real constant expression. If it compiles, it is computed on compile time.
Thanks to C++11 we received the std::function family of functor wrappers. Unfortunately, I keep hearing only bad things about these new additions. The most popular is that they are horribly slow. I tested it and they truly suck in comparison with templates.
#include <iostream>
#include <functional>
#include <string>
#include <chrono>
template <typename F>
float calc1(F f) { return -1.0f * f(3.3f) + 666.0f; }
float calc2(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }
int main() {
using namespace std::chrono;
const auto tp1 = system_clock::now();
for (int i = 0; i < 1e8; ++i) {
calc1([](float arg){ return arg * 0.5f; });
}
const auto tp2 = high_resolution_clock::now();
const auto d = duration_cast<milliseconds>(tp2 - tp1);
std::cout << d.count() << std::endl;
return 0;
}
111 ms vs 1241 ms. I assume this is because templates can be nicely inlined, while functions cover the internals via virtual calls.
Obviously templates have their issues as I see them:
they have to be provided as headers which is not something you might not wish to do when releasing your library as a closed code,
they may make the compilation time much longer unless extern template-like policy is introduced,
there is no (at least known to me) clean way of representing requirements (concepts, anyone?) of a template, bar a comment describing what kind of functor is expected.
Can I thus assume that functions can be used as de facto standard of passing functors, and in places where high performance is expected templates should be used?
Edit:
My compiler is the Visual Studio 2012 without CTP.
In general, if you are facing a design situation that gives you a choice, use templates. I stressed the word design because I think what you need to focus on is the distinction between the use cases of std::function and templates, which are pretty different.
In general, the choice of templates is just an instance of a wider principle: try to specify as many constraints as possible at compile-time. The rationale is simple: if you can catch an error, or a type mismatch, even before your program is generated, you won't ship a buggy program to your customer.
Moreover, as you correctly pointed out, calls to template functions are resolved statically (i.e. at compile time), so the compiler has all the necessary information to optimize and possibly inline the code (which would not be possible if the call were performed through a vtable).
Yes, it is true that template support is not perfect, and C++11 is still lacking a support for concepts; however, I don't see how std::function would save you in that respect. std::function is not an alternative to templates, but rather a tool for design situations where templates cannot be used.
One such use case arises when you need to resolve a call at run-time by invoking a callable object that adheres to a specific signature, but whose concrete type is unknown at compile-time. This is typically the case when you have a collection of callbacks of potentially different types, but which you need to invoke uniformly; the type and number of the registered callbacks is determined at run-time based on the state of your program and the application logic. Some of those callbacks could be functors, some could be plain functions, some could be the result of binding other functions to certain arguments.
std::function and std::bind also offer a natural idiom for enabling functional programming in C++, where functions are treated as objects and get naturally curried and combined to generate other functions. Although this kind of combination can be achieved with templates as well, a similar design situation normally comes together with use cases that require to determine the type of the combined callable objects at run-time.
Finally, there are other situations where std::function is unavoidable, e.g. if you want to write recursive lambdas; however, these restrictions are more dictated by technological limitations than by conceptual distinctions I believe.
To sum up, focus on design and try to understand what are the conceptual use cases for these two constructs. If you put them into comparison the way you did, you are forcing them into an arena they likely don't belong to.
Andy Prowl has nicely covered design issues. This is, of course, very important, but I believe the original question concerns more performance issues related to std::function.
First of all, a quick remark on the measurement technique: The 11ms obtained for calc1 has no meaning at all. Indeed, looking at the generated assembly (or debugging the assembly code), one can see that VS2012's optimizer is clever enough to realize that the result of calling calc1 is independent of the iteration and moves the call out of the loop:
for (int i = 0; i < 1e8; ++i) {
}
calc1([](float arg){ return arg * 0.5f; });
Furthermore, it realises that calling calc1 has no visible effect and drops the call altogether. Therefore, the 111ms is the time that the empty loop takes to run. (I'm surprised that the optimizer has kept the loop.) So, be careful with time measurements in loops. This is not as simple as it might seem.
As it has been pointed out, the optimizer has more troubles to understand std::function and doesn't move the call out of the loop. So 1241ms is a fair measurement for calc2.
Notice that, std::function is able to store different types of callable objects. Hence, it must perform some type-erasure magic for the storage. Generally, this implies a dynamic memory allocation (by default through a call to new). It's well known that this is a quite costly operation.
The standard (20.8.11.2.1/5) encorages implementations to avoid the dynamic memory allocation for small objects which, thankfully, VS2012 does (in particular, for the original code).
To get an idea of how much slower it can get when memory allocation is involved, I've changed the lambda expression to capture three floats. This makes the callable object too big to apply the small object optimization:
float a, b, c; // never mind the values
// ...
calc2([a,b,c](float arg){ return arg * 0.5f; });
For this version, the time is approximately 16000ms (compared to 1241ms for the original code).
Finally, notice that the lifetime of the lambda encloses that of the std::function. In this case, rather than storing a copy of the lambda, std::function could store a "reference" to it. By "reference" I mean a std::reference_wrapper which is easily build by functions std::ref and std::cref. More precisely, by using:
auto func = [a,b,c](float arg){ return arg * 0.5f; };
calc2(std::cref(func));
the time decreases to approximately 1860ms.
I wrote about that a while ago:
http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059
As I said in the article, the arguments don't quite apply for VS2010 due to its poor support to C++11. At the time of the writing, only a beta version of VS2012 was available but its support for C++11 was already good enough for this matter.
With Clang there's no performance difference between the two
Using clang (3.2, trunk 166872) (-O2 on Linux), the binaries from the two cases are actually identical.
-I'll come back to clang at the end of the post. But first, gcc 4.7.2:
There's already a lot of insight going on, but I want to point out that the result of the calculations of calc1 and calc2 are not the same, due to in-lining etc. Compare for example the sum of all results:
float result=0;
for (int i = 0; i < 1e8; ++i) {
result+=calc2([](float arg){ return arg * 0.5f; });
}
with calc2 that becomes
1.71799e+10, time spent 0.14 sec
while with calc1 it becomes
6.6435e+10, time spent 5.772 sec
that's a factor of ~40 in speed difference, and a factor of ~4 in the values. The first is a much bigger difference than what OP posted (using visual studio). Actually printing out the value a the end is also a good idea to prevent the compiler to removing code with no visible result (as-if rule). Cassio Neri already said this in his answer. Note how different the results are -- One should be careful when comparing speed factors of codes that perform different calculations.
Also, to be fair, comparing various ways of repeatedly calculating f(3.3) is perhaps not that interesting. If the input is constant it should not be in a loop. (It's easy for the optimizer to notice)
If I add a user supplied value argument to calc1 and 2 the speed factor between calc1 and calc2 comes down to a factor of 5, from 40! With visual studio the difference is close to a factor of 2, and with clang there is no difference (see below).
Also, as multiplications are fast, talking about factors of slow-down is often not that interesting. A more interesting question is, how small are your functions, and are these calls the bottleneck in a real program?
Clang:
Clang (I used 3.2) actually produced identical binaries when I flip between calc1 and calc2 for the example code (posted below). With the original example posted in the question both are also identical but take no time at all (the loops are just completely removed as described above). With my modified example, with -O2:
Number of seconds to execute (best of 3):
clang: calc1: 1.4 seconds
clang: calc2: 1.4 seconds (identical binary)
gcc 4.7.2: calc1: 1.1 seconds
gcc 4.7.2: calc2: 6.0 seconds
VS2012 CTPNov calc1: 0.8 seconds
VS2012 CTPNov calc2: 2.0 seconds
VS2015 (14.0.23.107) calc1: 1.1 seconds
VS2015 (14.0.23.107) calc2: 1.5 seconds
MinGW (4.7.2) calc1: 0.9 seconds
MinGW (4.7.2) calc2: 20.5 seconds
The calculated results of all binaries are the same, and all tests were executed on the same machine. It would be interesting if someone with deeper clang or VS knowledge could comment on what optimizations may have been done.
My modified test code:
#include <functional>
#include <chrono>
#include <iostream>
template <typename F>
float calc1(F f, float x) {
return 1.0f + 0.002*x+f(x*1.223) ;
}
float calc2(std::function<float(float)> f,float x) {
return 1.0f + 0.002*x+f(x*1.223) ;
}
int main() {
using namespace std::chrono;
const auto tp1 = high_resolution_clock::now();
float result=0;
for (int i = 0; i < 1e8; ++i) {
result=calc1([](float arg){
return arg * 0.5f;
},result);
}
const auto tp2 = high_resolution_clock::now();
const auto d = duration_cast<milliseconds>(tp2 - tp1);
std::cout << d.count() << std::endl;
std::cout << result<< std::endl;
return 0;
}
Update:
Added vs2015. I also noticed that there are double->float conversions in calc1,calc2. Removing them does not change the conclusion for visual studio (both are a lot faster but the ratio is about the same).
Different isn't the same.
It's slower because it does things that a template can't do. In particular, it lets you call any function that can be called with the given argument types and whose return type is convertible to the given return type from the same code.
void eval(const std::function<int(int)>& f) {
std::cout << f(3);
}
int f1(int i) {
return i;
}
float f2(double d) {
return d;
}
int main() {
std::function<int(int)> fun(f1);
eval(fun);
fun = f2;
eval(fun);
return 0;
}
Note that the same function object, fun, is being passed to both calls to eval. It holds two different functions.
If you don't need to do that, then you should not use std::function.
You already have some good answers here, so I'm not going to contradict them, in short comparing std::function to templates is like comparing virtual functions to functions.
You never should "prefer" virtual functions to functions, but rather you use virtual functions when it fits the problem, moving decisions from compile time to run time. The idea is that rather than having to solve the problem using a bespoke solution (like a jump-table) you use something that gives the compiler a better chance of optimizing for you. It also helps other programmers, if you use a standard solution.
This answer is intended to contribute, to the set of existing answers, what I believe to be a more meaningful benchmark for the runtime cost of std::function calls.
The std::function mechanism should be recognized for what it provides: Any callable entity can be converted to a std::function of appropriate signature. Suppose you have a library that fits a surface to a function defined by z = f(x,y), you can write it to accept a std::function<double(double,double)>, and the user of the library can easily convert any callable entity to that; be it an ordinary function, a method of a class instance, or a lambda, or anything that is supported by std::bind.
Unlike template approaches, this works without having to recompile the library function for different cases; accordingly, little extra compiled code is needed for each additional case. It has always been possible to make this happen, but it used to require some awkward mechanisms, and the user of the library would likely need to construct an adapter around their function to make it work. std::function automatically constructs whatever adapter is needed to get a common runtime call interface for all the cases, which is a new and very powerful feature.
To my view, this is the most important use case for std::function as far as performance is concerned: I'm interested in the cost of calling a std::function many times after it has been constructed once, and it needs to be a situation where the compiler is unable to optimize the call by knowing the function actually being called (i.e. you need to hide the implementation in another source file to get a proper benchmark).
I made the test below, similar to the OP's; but the main changes are:
Each case loops 1 billion times, but the std::function objects are constructed only once. I've found by looking at the output code that 'operator new' is called when constructing actual std::function calls (maybe not when they are optimized out).
Test is split into two files to prevent undesired optimization
My cases are: (a) function is inlined (b) function is passed by an ordinary function pointer (c) function is a compatible function wrapped as std::function (d) function is an incompatible function made compatible with a std::bind, wrapped as std::function
The results I get are:
case (a) (inline) 1.3 nsec
all other cases: 3.3 nsec.
Case (d) tends to be slightly slower, but the difference (about 0.05 nsec) is absorbed in the noise.
Conclusion is that the std::function is comparable overhead (at call time) to using a function pointer, even when there's simple 'bind' adaptation to the actual function. The inline is 2 ns faster than the others but that's an expected tradeoff since the inline is the only case which is 'hard-wired' at run time.
When I run johan-lundberg's code on the same machine, I'm seeing about 39 nsec per loop, but there's a lot more in the loop there, including the actual constructor and destructor of the std::function, which is probably fairly high since it involves a new and delete.
-O2 gcc 4.8.1, to x86_64 target (core i5).
Note, the code is broken up into two files, to prevent the compiler from expanding the functions where they are called (except in the one case where it's intended to).
----- first source file --------------
#include <functional>
// simple funct
float func_half( float x ) { return x * 0.5; }
// func we can bind
float mul_by( float x, float scale ) { return x * scale; }
//
// func to call another func a zillion times.
//
float test_stdfunc( std::function<float(float)> const & func, int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func(x);
}
return y;
}
// same thing with a function pointer
float test_funcptr( float (*func)(float), int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func(x);
}
return y;
}
// same thing with inline function
float test_inline( int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func_half(x);
}
return y;
}
----- second source file -------------
#include <iostream>
#include <functional>
#include <chrono>
extern float func_half( float x );
extern float mul_by( float x, float scale );
extern float test_inline( int nloops );
extern float test_stdfunc( std::function<float(float)> const & func, int nloops );
extern float test_funcptr( float (*func)(float), int nloops );
int main() {
using namespace std::chrono;
for(int icase = 0; icase < 4; icase ++ ){
const auto tp1 = system_clock::now();
float result;
switch( icase ){
case 0:
result = test_inline( 1e9);
break;
case 1:
result = test_funcptr( func_half, 1e9);
break;
case 2:
result = test_stdfunc( func_half, 1e9);
break;
case 3:
result = test_stdfunc( std::bind( mul_by, std::placeholders::_1, 0.5), 1e9);
break;
}
const auto tp2 = high_resolution_clock::now();
const auto d = duration_cast<milliseconds>(tp2 - tp1);
std::cout << d.count() << std::endl;
std::cout << result<< std::endl;
}
return 0;
}
For those interested, here's the adaptor the compiler built to make 'mul_by' look like a float(float) - this is 'called' when the function created as bind(mul_by,_1,0.5) is called:
movq (%rdi), %rax ; get the std::func data
movsd 8(%rax), %xmm1 ; get the bound value (0.5)
movq (%rax), %rdx ; get the function to call (mul_by)
cvtpd2ps %xmm1, %xmm1 ; convert 0.5 to 0.5f
jmp *%rdx ; jump to the func
(so it might have been a bit faster if I'd written 0.5f in the bind...)
Note that the 'x' parameter arrives in %xmm0 and just stays there.
Here's the code in the area where the function is constructed, prior to calling test_stdfunc - run through c++filt :
movl $16, %edi
movq $0, 32(%rsp)
call operator new(unsigned long) ; get 16 bytes for std::function
movsd .LC0(%rip), %xmm1 ; get 0.5
leaq 16(%rsp), %rdi ; (1st parm to test_stdfunc)
movq mul_by(float, float), (%rax) ; store &mul_by in std::function
movl $1000000000, %esi ; (2nd parm to test_stdfunc)
movsd %xmm1, 8(%rax) ; store 0.5 in std::function
movq %rax, 16(%rsp) ; save ptr to allocated mem
;; the next two ops store pointers to generated code related to the std::function.
;; the first one points to the adaptor I showed above.
movq std::_Function_handler<float (float), std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_invoke(std::_Any_data const&, float), 40(%rsp)
movq std::_Function_base::_Base_manager<std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation), 32(%rsp)
call test_stdfunc(std::function<float (float)> const&, int)
In case you use a template instead of std::function in C++20 you can actually write your own concept with variadic templates for it (inspired by Hendrik Niemeyer's talk about C++20 concepts):
template<class Func, typename Ret, typename... Args>
concept functor = std::regular_invocable<Func, Args...> &&
std::same_as<std::invoke_result_t<Func, Args...>, Ret>;
You can then use it as functor<Ret, Args...> F> where Ret is the return value and Args... are the variadic input arguments. E.g. functor<double,int> F such as
template <functor<double,int> F>
auto CalculateSomething(F&& f, int const arg) {
return f(arg)*f(arg);
}
requires a functor as template argument which has to overload the () operator and has a double return value and a single input argument of type int. Similarly functor<double> would be a functor with double return type which does not take any input arguments.
Try it here!
You can also use it with variadic functions such as
template <typename... Args, functor<double, Args...> F>
auto CalculateSomething(F&& f, Args... args) {
return f(args...)*f(args...);
}
Try it here!
I found your results very interesting so I did a bit of digging to understand what is going on. First off as many others have said with out having the results of the computation effect the state of the program the compiler will just optimize this away. Secondly having a constant 3.3 given as an armament to the callback I suspect that there will be other optimizations going on. With that in mind I changed your benchmark code a little bit.
template <typename F>
float calc1(F f, float i) { return -1.0f * f(i) + 666.0f; }
float calc2(std::function<float(float)> f, float i) { return -1.0f * f(i) + 666.0f; }
int main() {
const auto tp1 = system_clock::now();
for (int i = 0; i < 1e8; ++i) {
t += calc2([&](float arg){ return arg * 0.5f + t; }, i);
}
const auto tp2 = high_resolution_clock::now();
}
Given this change to the code I compiled with gcc 4.8 -O3 and got a time of 330ms for calc1 and 2702 for calc2. So using the template was 8 times faster, this number looked suspects to me, speed of a power of 8 often indicates that the compiler has vectorized something. when I looked at the generated code for the templates version it was clearly vectoreized
.L34:
cvtsi2ss %edx, %xmm0
addl $1, %edx
movaps %xmm3, %xmm5
mulss %xmm4, %xmm0
addss %xmm1, %xmm0
subss %xmm0, %xmm5
movaps %xmm5, %xmm0
addss %xmm1, %xmm0
cvtsi2sd %edx, %xmm1
ucomisd %xmm1, %xmm2
ja .L37
movss %xmm0, 16(%rsp)
Where as the std::function version was not. This makes sense to me, since with the template the compiler knows for sure that the function will never change throughout the loop but with the std::function being passed in it could change, therefor can not be vectorized.
This led me to try something else to see if I could get the compiler to perform the same optimization on the std::function version. Instead of passing in a function I make a std::function as a global var, and have this called.
float calc3(float i) { return -1.0f * f2(i) + 666.0f; }
std::function<float(float)> f2 = [](float arg){ return arg * 0.5f; };
int main() {
const auto tp1 = system_clock::now();
for (int i = 0; i < 1e8; ++i) {
t += calc3([&](float arg){ return arg * 0.5f + t; }, i);
}
const auto tp2 = high_resolution_clock::now();
}
With this version we see that the compiler has now vectorized the code in the same way and I get the same benchmark results.
template : 330ms
std::function : 2702ms
global std::function: 330ms
So my conclusion is the raw speed of a std::function vs a template functor is pretty much the same. However it makes the job of the optimizer much more difficult.