Run-time overhead with boost.units? - c++

I'm seeing roughly 10% run-time overhead when using a clone of a constexpr-enhanced boost.units with float as the value type, compiled with clang at -O3. This is showing up in some of the more elaborate applications of a library that I've been working on. Given this situation, I have two questions that I'd really like to solve and would love help with:
Boost units is supposed to be a zero-overhead library so why am I seeing the overhead?
More importantly, besides not using boost.units, how can I get the overhead to go away?
Details...
I've been working on an interactive physics engine written in C++14. With the many different physical quantities and units it uses, I love using the compile-time enforced units and quantities that boost.units provides. Unfortunately enabling boost units seems to be coming with this run-time cost. The engine comes with a benchmark application that uses google's benchmark library to provide this insight and it takes some of the more elaborate simulations to see the overhead.
At present, due to the overhead, the engine builds by default without using boost units. By defining the right preprocessor macro name, the engine can be built with boost units. I achieved this switching using code like the following:
// #define USE_BOOST_UNITS
#if defined(USE_BOOST_UNITS)
...
#include <boost/units/systems/si/time.hpp>
...
#endif // defined(USE_BOOST_UNITS)
#if defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) boost::units::quantity<BoostDimension, float>
#define UNIT(Quantity, BoostUnit) Quantity{BoostUnit * float{1}}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) Quantity{BoostUnit * float{Ratio}}
#else // defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) float
#define UNIT(Quantity, BoostUnit) float{1}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) float{Ratio}
#endif // defined(USE_BOOST_UNITS)
using Time = QUANTITY(boost::units::si::time);
constexpr auto Second = UNIT(Time, boost::units::si::second);
What I did with the UNIT macro feels a bit suspect to me in that it takes a boost unit type and turns it into a value. It makes switching between using and not using boost.units easier, though, since either way expressions like 3.0f * Second compile without warning. Checking what clang and gcc do with such expressions appeared to confirm that they were smart enough to avoid run-time multiplying 3.0f * 1.0f and simply recognized the expression as 3.0f. Still, I wonder if that's the cause of the overhead or if it's something else that I've done.
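For concreteness, here's the kind of usage these macros enable either way (a sketch; this is the shape of expression whose generated code I checked):

constexpr auto gap = 3.0f * Second; // a Time when USE_BOOST_UNITS is defined, a plain float otherwise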
I've also wondered if maybe the problem is rooted in the constexpr enhancement code I'm using, or if the author(s) of that code had any idea about this overhead. Searching the internet, I found a mention of overhead with the normal boost.units library, so it seems safe to assume the enhanced units are not at fault. A suggestion that came out of my inquiring though (and my thanks go to GitHub user muggenhor for it) was the following:
I expect this is likely caused by the amount of inlining done by the compiler. Because of the wrapper functions for the operators this adds at least one function call that needs to be inlined per operation. For expressions depending on the result of sub-expressions this requires the sub-expressions to be inlined first. As a result I expect the minimum amount of inlining passes to be able to properly optimize your code to be equal to the depth of the produced expression tree...
This sounds like a pretty viable theory to me. Unfortunately, I don't know how to test it, and admittedly I'm more fond of digging into my own code at the moment than into clang/LLVM code. I've tried using -inline-threshold=10000 but that doesn't seem to make the overhead go away. To my understanding of clang at least, that doesn't specifically increase the number of inlining passes. Is there another command line argument that does? Or are there parameters within clang's sources that someone can point me to as a starting point for recompiling clang and trying the modified compiler?
Another theory I've had is that using float is the problem. I can rebuild my physics engine to use double instead and compare benchmark results between building with and without the boost.units support enabled. What I find when using double is that the overhead at least seems to decrease. I've wondered if maybe boost.units is somewhere using double even when I use float in its quantity template, and maybe that's causing the overhead.
Lastly, I built boost.units' performance example with the constexpr enhancements and ran it with both double and float. I got no reliable sign of any overhead, which seems to eliminate my theory that float is the problem.
Update With Data & Code
I've got some more isolated data and code on this, where it seems I'm seeing significantly more than 10% overhead...
Some benchmark data where Length is basically boost::units::si::length:
LesserLength/1000        953 ns        953 ns     724870
LesserFloat/1000         590 ns        590 ns    1093647
LesserDouble/1000        619 ns        618 ns    1198938
What the related code looks like:
static void LesserLength(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast<unsigned>(state.range()),
                                -100.0f * playrho::Meter, 100.0f * playrho::Meter);
    auto c = 0.0f * playrho::Meter;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            static_assert(std::is_same<decltype(b), const playrho::Length>::value, "not Length");
            const auto v = (a < b)? a: b;
            benchmark::DoNotOptimize(c = v);
        }
    }
}

static void LesserFloat(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast<unsigned>(state.range()),
                                -100.0f, 100.0f);
    auto c = 0.0f;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            const auto v = (a < b)? a: b;
            static_assert(std::is_same<decltype(v), const float>::value, "not float");
            benchmark::DoNotOptimize(c = v);
        }
    }
}

static void LesserDouble(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast<unsigned>(state.range()),
                                -100.0, 100.0);
    auto c = 0.0;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            const auto v = (a < b)? a: b;
            static_assert(std::is_same<decltype(v), const double>::value, "not double");
            benchmark::DoNotOptimize(c = v);
        }
    }
}
With this as a hint to me, I checked Godbolt with the following code to see what clang 5.0.0 and gcc 7.2 would generate:
#include <algorithm>
#include <boost/units/systems/si/length.hpp>
#include <boost/units/cmath.hpp>

using length = boost::units::quantity<boost::units::si::length, float>;

float f(float a, float b)
{
    return a < b? a: b;
}

length f(length a, length b)
{
    return a < b? a: b;
}
I see that the generated assembly looks quite different between the two functions and between clang and gcc. Here's a gist of the relevant assembly from clang (with the boost stuff here simply shown as length):
f(float, float): # @f(float, float)
minss xmm0, xmm1
ret
f(length, length):
movss xmm0, dword ptr [rdx] # xmm0 = mem[0],zero,zero,zero
ucomiss xmm0, dword ptr [rsi]
cmova rdx, rsi
mov eax, dword ptr [rdx]
mov dword ptr [rdi], eax
mov rax, rdi
ret
Shouldn't both of these compilers, with -O3 optimization, generate the same assembly for the length version as they do for the float version? Is the problem that they're not quite optimizing all the way down to the same code as for float? It seems like this is the problem, and if so that's progress, but I still want to figure out what can be done to really get zero overhead.
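One diagnostic sketch (assuming the constexpr-enhanced quantity keeps boost.units' value() and from_value() members, which the stock library provides) is to unwrap to the raw representation by hand and see whether the float-style codegen comes back:

length lesser(length a, length b)
{
    // Compare on the raw floats. If this still compiles to the memory
    // traffic shown above, the cost is in the calling convention for the
    // wrapper type rather than in the comparison operator itself.
    return length::from_value(a.value() < b.value() ? a.value() : b.value());
}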

Related

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array.
On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd with AVX512 or vpmovmskb with AVX2. When I use clang's bool vector extension in combination with __builtin_convertvector, I'm happy with the results:
uint64_t handvectorized(const bool* input_) noexcept {
    const bool* __restrict input = std::assume_aligned<64>(input_);
    using VecBool64 __attribute__((vector_size(64))) = char;
    using VecBitmaskT __attribute__((ext_vector_type(64))) = bool;
    auto input_vec = *reinterpret_cast<const VecBool64*>(input);
    auto result_vec = __builtin_convertvector(input_vec, VecBitmaskT);
    return reinterpret_cast<uint64_t&>(result_vec);
}
produces (godbolt):
vmovdqa64 zmm0, zmmword ptr [rdi]
vptestmb k0, zmm0, zmm0
kmovq rax, k0
vzeroupper
ret
However, I cannot get clang (or GCC, or ICX) to produce anything that uses vector mask extraction like this from (portable) scalar code.
For this implementation:
uint64_t loop(const bool* input_) noexcept {
    const bool* __restrict input = std::assume_aligned<64>(input_);
    uint64_t result = 0;
    for (size_t i = 0; i < 64; ++i) {
        if (input[i]) {
            result |= 1ull << i;
        }
    }
    return result;
}
clang produces a 64*8B = 512B lookup table and 39 instructions.
This implementation, and some other scalar implementations (branchless, inverse bit order, using std::bitset) that I've tried, can all be found on godbolt. None of them results in code close to the handwritten vector instructions.
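For reference, the branchless variant mentioned above is along these lines (a sketch, not the exact code in the godbolt link):

uint64_t branchless(const bool* input_) noexcept {
    const bool* __restrict input = std::assume_aligned<64>(input_);
    uint64_t result = 0;
    for (size_t i = 0; i < 64; ++i) {
        // bool holds 0 or 1, so this is the same OR reduction without a branch
        result |= static_cast<uint64_t>(input[i]) << i;
    }
    return result;
}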
Is there anything I'm missing or any reason the optimization doesn't work well here? Is there a scalar version I can write that produces reasonably vectorized code?
I'm wondering especially since the handvectorized version doesn't use any platform-specific intrinsics, and there isn't much to it: all it does is "load as vector" and "convert to bitmask". Perhaps clang simply doesn't detect the loop pattern? It just feels strange to me: a simple bitwise-OR reduction loop seems like a common pattern, and the documentation of the loop vectorizer explicitly lists reductions using OR as a supported feature.
Edit: Updated godbolt link with suggestions from the comments

Why don't C++ compilers do better constant folding?

I'm investigating ways to speed up a large section of C++ code, which has automatic derivatives for computing jacobians. This involves doing some amount of work in the actual residuals, but the majority of the work (based on profiled execution time) is in calculating the jacobians.
This surprised me, since most of the jacobians are propagated forward from 0s and 1s, so the amount of work should be 2-4x the function, not 10-12x. In order to model what a large amount of the jacobian work is like, I made a super minimal example with just a dot product (instead of sin, cos, sqrt and more that would be in a real situation) that the compiler should be able to optimize to a single return value:
#include <Eigen/Core>
#include <Eigen/Geometry>

using Array12d = Eigen::Matrix<double,12,1>;

double testReturnFirstDot(const Array12d& b)
{
    Array12d a;
    a.array() = 0.;
    a(0) = 1.;
    return a.dot(b);
}
Which should be the same as
double testReturnFirst(const Array12d& b)
{
    return b(0);
}
I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6, nor MSVC 19 was able to make any optimizations at all over the naive dot-product with a matrix full of 0s. Even with fast-math (https://godbolt.org/z/GvPXFy) the optimizations are very poor in GCC and Clang (they still involve multiplications and additions), and MSVC doesn't do any optimizations at all.
I don't have a background in compilers, but is there a reason for this? I'm fairly sure that in a large proportion of scientific computations being able to do better constant propagation/folding would make more optimizations apparent, even if the constant-fold itself didn't result in a speedup.
While I'm interested in explanations for why this isn't done on the compiler side, I'm also interested for what I can do on a practical side to make my own code faster when facing these kinds of patterns.
This is because Eigen explicitly vectorizes your code as 3 vmulpd, 2 vaddpd, and 1 horizontal reduction within the remaining 4-component registers (this assumes AVX; with SSE only you'll get 6 mulpd and 5 addpd). With -ffast-math, GCC and clang are allowed to remove the last 2 vmulpd and vaddpd (and this is what they do), but they cannot really replace the remaining vmulpd and horizontal reduction that have been explicitly generated by Eigen.
So what if you disable Eigen's explicit vectorization by defining EIGEN_DONT_VECTORIZE? Then you get what you expected (https://godbolt.org/z/UQsoeH) but other pieces of code might become much slower.
If you want to locally disable explicit vectorization and are not afraid of messing with Eigen's internals, you can introduce a DontVectorize option to Matrix and disable vectorization by specializing traits<> for this Matrix type:
static const int DontVectorize = 0x80000000;

namespace Eigen {
namespace internal {

template<typename _Scalar, int _Rows, int _Cols, int _MaxRows, int _MaxCols>
struct traits<Matrix<_Scalar, _Rows, _Cols, DontVectorize, _MaxRows, _MaxCols> >
    : traits<Matrix<_Scalar, _Rows, _Cols> >
{
    typedef traits<Matrix<_Scalar, _Rows, _Cols> > Base;
    enum {
        EvaluatorFlags = Base::EvaluatorFlags & ~PacketAccessBit
    };
};

}
}

using ArrayS12d = Eigen::Matrix<double,12,1,DontVectorize>;
Full example there: https://godbolt.org/z/bOEyzv
I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6, nor MSVC 19 was able to make any optimizations at all over the naive dot-product with a matrix full of 0s.
They have no other choice unfortunately. Since IEEE floats have signed zeros, adding 0.0 is not an identity operation:
-0.0 + 0.0 = 0.0 // Not -0.0!
Similarly, multiplying by zero does not always yield zero:
0.0 * Infinity = NaN // Not 0.0!
So the compilers simply cannot perform these constant folds in the dot product while retaining IEEE float compliance - for all they know, your input might contain signed zeros and/or infinities.
You will have to use -ffast-math to get these folds, but that may have undesired consequences. You can get more fine-grained control with specific flags (from http://gcc.gnu.org/wiki/FloatingPointMath). According to the above explanation, adding the following two flags should allow the constant folding:
-ffinite-math-only, -fno-signed-zeros
Indeed, you get the same assembly as with -ffast-math this way: https://godbolt.org/z/vGULLA. You only give up the signed zeros (probably irrelevant), NaNs and the infinities. Presumably, if you were to still produce them in your code, you would get undefined behavior, so weigh your options.
As for why your example is not optimized better even with -ffast-math: that is on Eigen. Presumably it vectorizes the matrix operations by hand, and hand-vectorized code is much harder for compilers to see through. A simple loop is properly optimized with these options: https://godbolt.org/z/OppEhY
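The "simple loop" behind that link is along these lines (a sketch, not the exact code):

double dot_first(const double* b)
{
    double a[12] = {1.0};  // remaining elements value-initialize to 0.0
    double sum = 0.0;
    for (int i = 0; i < 12; ++i)
        sum += a[i] * b[i];
    return sum;  // with -ffinite-math-only -fno-signed-zeros this can fold to b[0]
}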
One way to force a compiler to optimize multiplications by 0s and 1s is to manually unroll the loop. For simplicity let's use
#include <array>
#include <cstddef>
constexpr std::size_t n = 12;
using Array = std::array<double, n>;
Then we can implement a simple dot function using fold expressions (or recursion if they are not available):
#include <utility>

template<std::size_t... is>
double dot(const Array& x, const Array& y, std::index_sequence<is...>)
{
    return ((x[is] * y[is]) + ...);
}

double dot(const Array& x, const Array& y)
{
    return dot(x, y, std::make_index_sequence<n>{});
}
Now let's take a look at your function
double test(const Array& b)
{
    const Array a{1}; // = {1, 0, ...}
    return dot(a, b);
}
With -ffast-math gcc 8.2 produces:
test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
ret
clang 6.0.0 goes along the same lines:
test(std::array<double, 12ul> const&): # #test(std::array<double, 12ul> const&)
movsd xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
ret
For example, for
double test(const Array& b)
{
const Array a{1, 1}; // = {1, 1, 0...}
return dot(a, b);
}
we get
test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
addsd xmm0, QWORD PTR [rdi+8]
ret
Addendum: clang unrolls a for (std::size_t i = 0; i < n; ++i) ... loop without all these fold-expression tricks; gcc doesn't and needs some help.
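That loop version would read (a sketch, reusing Array and n from above):

double dot_loop(const Array& x, const Array& y)
{
    double r = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        r += x[i] * y[i];
    return r; // clang fully unrolls and folds this; gcc keeps the loop unaided
}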

Porting to Mac OS X error

I have a cross-platform audio processing app. It is written using the Qt and PortAudio libraries. I also use the Chaotic-Daw sources for some audio processing functions (the Vibrato effect and soft-knee dynamic range compression). The problem is that I cannot port my app from Windows to Mac OS X because I get compiler errors for the __asm parts (I use Mac OS X Yosemite and the Qt Creator 3.4.1 IDE):
/Users/admin/My projects/MySound/daw/basics/rosic_NumberManipulations.h:69: error: expected '(' after 'asm'
{
^
for such lines:
INLINE int floorInt(double x)
{
    const float round_towards_m_i = -0.5f;
    int i;
#ifndef LINUX
    __asm
    { // <========= error indicates that row
        fld x;
        fadd st, st (0);
        fadd round_towards_m_i;
        fistp i;
        sar i, 1;
    }
#else
    i = (int) floor(x);
#endif
    return (i);
}
How can I resolve this problem?
The code was clearly written for Microsoft's Visual C++ compiler, as that is the syntax it uses for inline assembly. It uses the Intel syntax and is rather simplistic, which makes it easy to write but hinders its optimization potential.
Clang and GCC both use a different format for inline assembly. In particular, they use the GNU AT&T syntax. It is more complicated to write, but much more expressive. The compiler error is basically Clang's way of telling you, "I can tell you're trying to write inline assembly, but you've formatted it all wrong!"
Therefore, to make this code compile, you will need to convert the MSVC-style inline assembly into GAS-format inline assembly. It might look like this:
int floorInt(double x)
{
    const float round_towards_m_i = -0.5f;
    int i;
    __asm__("fadd   %[x], %[x]   \n\t"
            "fadds  %[adj]       \n\t"
            "fistpl %[i]         \n\t"
            "sarl   $1, %[i]"
            : [i]   "=m" (i)               // store result in memory (as required by FISTP)
            : [x]   "t"  (x),              // load input onto top of x87 stack (equivalent to FLD)
              [adj] "m"  (round_towards_m_i)
            : "st");
    return (i);
}
But, because of the additional expressivity of the GAS style, we can offload more of the work to the built-in optimizer, which may yield even more optimal object code:
int floorInt(double x)
{
    const float round_towards_m_i = -0.5f;
    int i;
    x += x;                  // equivalent to the first FADD
    x += round_towards_m_i;  // equivalent to the second FADD
    __asm__("fistpl %[i]"
            : [i] "=m" (i)
            : [x] "t"  (x)
            : "st");
    return (i >> 1);         // equivalent to the final SAR
}
(Note that, technically, a signed right-shift like that done by the last line is implementation-defined in C and would normally be inadvisable. However, if you're using inline assembly, you have already made the decision to target a specific platform and can therefore rely on implementation-specific behavior. In this case, I know and it can easily be demonstrated that all C compilers will generate SAR instructions to do an arithmetic right-shift on signed integer values.)
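If you want the build to fail loudly on an implementation where that assumption doesn't hold, a one-line check is possible (my addition, not part of the original code):

static_assert((-2 >> 1) == -1, "expected an arithmetic right shift on signed int");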
That said, it appears that the authors of the code intended for the inline assembly to be used only when you are compiling for a platform other than LINUX (presumably, that would be Windows, on which they expected you to be using Microsoft's compiler). So you could get the code to compile simply by ensuring that you are defining LINUX, either on the command line or in your makefile.
I'm not sure why that decision was made; Clang and GCC are both going to generate the same inefficient code that MSVC does (assuming that you are targeting the older generation of x86 processors and unable to use SSE2 instructions). It is up to you: the code will run either way, but it will be slower without the use of inline assembly to force the use of this clever optimization.

How to conditionally set compiler optimization for template headers

I found a question somewhat interesting, and attempted to answer it. The author wants to compile one source file (which relies on template libraries) with AVX optimizations, and the rest of the project without them.
So, to see what would happen, I've created a test project like this:
main.cpp
#include <iostream>
#include <string>
#include "fn_normal.h"
#include "fn_avx.h"

int main(int argc, char* argv[])
{
    int number = 10; // this will come from input, but let's keep it simple for now
    int result;

    if (std::string(argv[argc - 1]) == "--noavx")
        result = FnNormal(number);
    else
    {
        std::cout << "AVX selected\n";
        result = FnAVX(number);
    }

    std::cout << "Double of " << number << " is " << result << std::endl;
    return 0;
}
The files fn_normal.h and fn_avx.h contain declarations for the functions FnNormal() and FnAVX() respectively, which are defined as follows:
fn_normal.cpp
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble(num);
}
fn_avx.cpp
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
And here's the template function definition:
double.h
template<typename T>
int RtDouble(T number)
{
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }
    return number * 2;
}
Ultimately, I set Enhanced Instruction Set to AVX for the file fn_avx.cpp under "Properties -> C/C++ -> Code Generation", leaving it at Not Set for the other sources, so they should default to SSE2.
I thought that by doing so, the compiler would instantiate the template once for each source that includes it (avoiding a One-Definition Rule violation by mangling the template function name or some other way), and thus calling the program with the --noavx parameter would make it run fine on CPUs without AVX support.
But the resulting program will actually have only one machine-code version of the function, with AVX instructions, and will fail on older CPUs.
Disabling all other optimizations doesn't solve this issue. I also tried No Enhanced Instructions - /arch:IA32 instead of Not Set.
As I'm just now beginning to understand templates and such, could someone point to me the exact details for this behavior and what I could actually do to achieve my goal?
My compiler is MSVC 2013.
Additional info: the .obj files for both fn_normal.cpp and fn_avx.cpp are almost the same size in bytes. I've looked into the generated assembly listings and they are almost the same, with the important difference that the avx-enabled source replaces default SSE's movss/mulss with vmovss and vmulss, respectively. But stepping through the code in Visual Studio's disassembly view (Ctrl+Alt+D) confirms that FnNormal() indeed makes use of the AVX specialized instructions.
The compiler will generate two objects (fn_avx.obj and fn_normal.obj), which are compiled with different instruction sets. As you said, outputting the disassembly for both verifies that this is being done correctly:
objdump -d fn_normal.obj:
...
movss -0x1f5c(%ebp,%eax,4),%xmm0
mulss -0x1f5c(%ebp,%ecx,4),%xmm0
mov -0x1f68(%ebp),%edx
mulss -0x1f5c(%ebp,%edx,4),%xmm0
mov -0x1f68(%ebp),%eax
movss %xmm0,-0xfb4(%ebp,%eax,4)
...
objdump -d fn_avx.obj:
...
vmovss -0x1f5c(%ebp,%eax,4),%xmm0
vmulss -0x1f5c(%ebp,%ecx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%edx
vmulss -0x1f5c(%ebp,%edx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%eax
vmovss %xmm0,-0xfb4(%ebp,%eax,4)
...
They look strikingly similar, because by default MSVC 2013 will assume SSE2 availability. If you change the instruction set to IA32, you'll get something with non-vector instructions. So, this is not an issue with the compiler/compilation unit.
The issue here is that RtDouble is defined in a header file as a non-specialized template (perfectly legal). The compiler assumes its definition across multiple translation units will be the same, but, by compiling with different options, that assumption is violated. It's essentially no different than introducing a divergence with the preprocessor:
double.h:
template<typename T>
int RtDouble(T number)
{
#ifdef SUPER_BAD
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }
    return number * 2;
#else
    return 0;
#endif
}
fn_avx.cpp:
#include "fn_avx.h"
#define SUPER_BAD
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
FnNormal then will just return 0 (and you can verify this with the disassembly of the new fn_normal.obj). The linker happily chooses one, and does not warn you about either situation. The question then comes down to: should it? That would be extremely helpful in situations like this. However, it would also slow down linking, as it would need to compare all of the functions that could exist in multiple compilation units (e.g. inline functions).
When I faced a similar issue in my code, I chose a different function naming scheme for the optimized version vs. the non-optimized version. Using a template parameter to distinguish them would also work just as well (as suggested in #celtschk's answer).
Basically, the compiler needs to minimize the space, not to mention that having the same template instantiated twice could cause problems if there were static members. So, from what I know, the compiler either processes the template for every source file and then chooses one of the implementations, or it postpones the actual code generation to link time. Either way it is a problem for this AVX thingy. I ended up solving it the old-fashioned way - with some global definitions not depending on any templates or anything. For too complex applications this could be a huge problem though. Intel Compiler has a recently added pragma (I don't recall the exact name) that makes the function implemented right after it use just AVX instructions, which would solve the problem. How reliable it is, I don't know.
I've worked around this problem successfully by forcing any templated functions that will be used with different compiler options in different source files to be inline. Just using the inline keyword is usually not sufficient, since the compiler will sometimes ignore it for functions larger than some threshold, so you have to force the compiler to do it.
In MSVC++:
template<typename T>
__forceinline int RtDouble(T number) {...}
GCC:
template<typename T>
inline __attribute__((always_inline)) int RtDouble(T number) {...}
Keep in mind you may have to forceinline any other functions that RtDouble may call within the same module in order to keep the compiler flags consistent in those functions as well. Also keep in mind that MSVC++ simply ignores __forceinline when optimizations are disabled, such as in debug builds, and in those cases this trick won't work, so expect different behavior in non-optimized builds. It can make things problematic to debug in any case, but it does indeed work so long as the compiler allows inlining.
I think the simplest solution is to let the compiler know that those functions are indeed intended to be different, by using a template parameter that does nothing but distinguish them:
File double.h:
template<bool avx, typename T>
int RtDouble(T number)
{
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }
    return number * 2;
}
File fn_normal.cpp:
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble<false>(num);
}
File fn_avx.cpp:
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble<true>(num);
}

std::function vs template

Thanks to C++11 we received the std::function family of functor wrappers. Unfortunately, I keep hearing only bad things about these new additions. The most popular is that they are horribly slow. I tested it and they truly suck in comparison with templates.
#include <iostream>
#include <functional>
#include <string>
#include <chrono>

template <typename F>
float calc1(F f) { return -1.0f * f(3.3f) + 666.0f; }

float calc2(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

int main() {
    using namespace std::chrono;
    const auto tp1 = high_resolution_clock::now();
    for (int i = 0; i < 1e8; ++i) {
        calc1([](float arg){ return arg * 0.5f; });
    }
    const auto tp2 = high_resolution_clock::now();
    const auto d = duration_cast<milliseconds>(tp2 - tp1);
    std::cout << d.count() << std::endl;
    return 0;
}
111 ms vs 1241 ms. I assume this is because templates can be nicely inlined, while std::function covers its internals via virtual calls.
Obviously templates have their issues as I see them:
they have to be provided as headers, which is something you might not wish to do when releasing your library as closed code,
they may make the compilation time much longer unless extern template-like policy is introduced,
there is no (at least known to me) clean way of representing requirements (concepts, anyone?) of a template, bar a comment describing what kind of functor is expected.
Can I thus assume that std::function can be used as the de facto standard for passing functors, and that in places where high performance is expected templates should be used?
Edit:
My compiler is Visual Studio 2012 without the CTP.
In general, if you are facing a design situation that gives you a choice, use templates. I stressed the word design because I think what you need to focus on is the distinction between the use cases of std::function and templates, which are pretty different.
In general, the choice of templates is just an instance of a wider principle: try to specify as many constraints as possible at compile-time. The rationale is simple: if you can catch an error, or a type mismatch, even before your program is generated, you won't ship a buggy program to your customer.
Moreover, as you correctly pointed out, calls to template functions are resolved statically (i.e. at compile time), so the compiler has all the necessary information to optimize and possibly inline the code (which would not be possible if the call were performed through a vtable).
Yes, it is true that template support is not perfect, and C++11 is still lacking support for concepts; however, I don't see how std::function would save you in that respect. std::function is not an alternative to templates, but rather a tool for design situations where templates cannot be used.
One such use case arises when you need to resolve a call at run-time by invoking a callable object that adheres to a specific signature, but whose concrete type is unknown at compile-time. This is typically the case when you have a collection of callbacks of potentially different types, but which you need to invoke uniformly; the type and number of the registered callbacks is determined at run-time based on the state of your program and the application logic. Some of those callbacks could be functors, some could be plain functions, some could be the result of binding other functions to certain arguments.
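A sketch of that callback scenario (the names here are illustrative, not from the question):

#include <functional>
#include <vector>

void on_plain(int) {}                              // plain function
struct OnFunctor { void operator()(int) const {} }; // functor

int main() {
    std::vector<std::function<void(int)>> callbacks;
    callbacks.emplace_back(on_plain);
    callbacks.emplace_back(OnFunctor{});
    callbacks.emplace_back([](int) {});            // lambda
    for (auto& cb : callbacks) cb(42);             // uniform invocation at run-time
}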
std::function and std::bind also offer a natural idiom for enabling functional programming in C++, where functions are treated as objects and get naturally curried and combined to generate other functions. Although this kind of combination can be achieved with templates as well, a similar design situation normally comes together with use cases that require to determine the type of the combined callable objects at run-time.
Finally, there are other situations where std::function is unavoidable, e.g. if you want to write recursive lambdas; however, these restrictions are more dictated by technological limitations than by conceptual distinctions I believe.
To sum up, focus on design and try to understand what are the conceptual use cases for these two constructs. If you put them into comparison the way you did, you are forcing them into an arena they likely don't belong to.
Andy Prowl has nicely covered design issues. This is, of course, very important, but I believe the original question concerns more performance issues related to std::function.
First of all, a quick remark on the measurement technique: the 111ms obtained for calc1 has no meaning at all. Indeed, looking at the generated assembly (or debugging the assembly code), one can see that VS2012's optimizer is clever enough to realize that the result of calling calc1 is independent of the iteration and moves the call out of the loop:
for (int i = 0; i < 1e8; ++i) {
}
calc1([](float arg){ return arg * 0.5f; });
Furthermore, it realises that calling calc1 has no visible effect and drops the call altogether. Therefore, the 111ms is the time that the empty loop takes to run. (I'm surprised that the optimizer has kept the loop.) So, be careful with time measurements in loops. This is not as simple as it might seem.
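One minimal guard against this (my sketch; google benchmark's DoNotOptimize, used in the boost.units question above, serves the same purpose) is to make the result observable:

volatile float sink;
for (int i = 0; i < 1e8; ++i) {
    sink = calc1([](float arg){ return arg * 0.5f; }); // the store keeps the call in the loop
}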
As has been pointed out, the optimizer has more trouble understanding std::function and doesn't move the call out of the loop. So 1241ms is a fair measurement for calc2.
Notice that std::function is able to store different types of callable objects. Hence, it must perform some type-erasure magic for the storage. Generally, this implies a dynamic memory allocation (by default through a call to new). It's well known that this is a quite costly operation.
The standard (20.8.11.2.1/5) encourages implementations to avoid the dynamic memory allocation for small objects, which, thankfully, VS2012 does (in particular, for the original code).
To get an idea of how much slower it can get when memory allocation is involved, I've changed the lambda expression to capture three floats. This makes the callable object too big to apply the small object optimization:
float a, b, c; // never mind the values
// ...
calc2([a,b,c](float arg){ return arg * 0.5f; });
For this version, the time is approximately 16000ms (compared to 1241ms for the original code).
Finally, notice that the lifetime of the lambda encloses that of the std::function. In this case, rather than storing a copy of the lambda, std::function could store a "reference" to it. By "reference" I mean a std::reference_wrapper, which is easily built by the functions std::ref and std::cref. More precisely, by using:
auto func = [a,b,c](float arg){ return arg * 0.5f; };
calc2(std::cref(func));
the time decreases to approximately 1860ms.
I wrote about that a while ago:
http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059
As I said in the article, the arguments don't quite apply for VS2010 due to its poor support to C++11. At the time of the writing, only a beta version of VS2012 was available but its support for C++11 was already good enough for this matter.
With Clang there's no performance difference between the two
Using clang (3.2, trunk 166872) (-O2 on Linux), the binaries from the two cases are actually identical.
I'll come back to clang at the end of the post. But first, gcc 4.7.2:
There's already a lot of insight going on, but I want to point out that the results of the calculations of calc1 and calc2 are not the same, due to inlining etc. Compare for example the sum of all results:
float result = 0;
for (int i = 0; i < 1e8; ++i) {
    result += calc2([](float arg){ return arg * 0.5f; });
}
with calc2 that becomes
1.71799e+10, time spent 0.14 sec
while with calc1 it becomes
6.6435e+10, time spent 5.772 sec
that's a factor of ~40 in speed difference, and a factor of ~4 in the values. The first is a much bigger difference than what the OP posted (using Visual Studio). Actually printing out the value at the end is also a good idea to prevent the compiler from removing code with no visible result (the as-if rule). Cassio Neri already said this in his answer. Note how different the results are: one should be careful when comparing speed factors of codes that perform different calculations.
Also, to be fair, comparing various ways of repeatedly calculating f(3.3) is perhaps not that interesting. If the input is constant it should not be in a loop. (It's easy for the optimizer to notice)
If I add a user-supplied value argument to calc1 and calc2, the speed factor between them comes down to 5, from 40! With Visual Studio the difference is close to a factor of 2, and with clang there is no difference (see below).
Also, as multiplications are fast, talking about factors of slow-down is often not that interesting. A more interesting question is, how small are your functions, and are these calls the bottleneck in a real program?
Clang:
Clang (I used 3.2) actually produced identical binaries when I flip between calc1 and calc2 for the example code (posted below). With the original example posted in the question both are also identical but take no time at all (the loops are just completely removed as described above). With my modified example, with -O2:
Number of seconds to execute (best of 3):
clang: calc1: 1.4 seconds
clang: calc2: 1.4 seconds (identical binary)
gcc 4.7.2: calc1: 1.1 seconds
gcc 4.7.2: calc2: 6.0 seconds
VS2012 CTPNov calc1: 0.8 seconds
VS2012 CTPNov calc2: 2.0 seconds
VS2015 (14.0.23.107) calc1: 1.1 seconds
VS2015 (14.0.23.107) calc2: 1.5 seconds
MinGW (4.7.2) calc1: 0.9 seconds
MinGW (4.7.2) calc2: 20.5 seconds
The calculated results of all binaries are the same, and all tests were executed on the same machine. It would be interesting if someone with deeper clang or VS knowledge could comment on what optimizations may have been done.
My modified test code:
#include <functional>
#include <chrono>
#include <iostream>

template <typename F>
float calc1(F f, float x) {
    return 1.0f + 0.002*x + f(x*1.223);
}

float calc2(std::function<float(float)> f, float x) {
    return 1.0f + 0.002*x + f(x*1.223);
}

int main() {
    using namespace std::chrono;
    const auto tp1 = high_resolution_clock::now();
    float result = 0;
    for (int i = 0; i < 1e8; ++i) {
        result = calc1([](float arg){
            return arg * 0.5f;
        }, result);
    }
    const auto tp2 = high_resolution_clock::now();
    const auto d = duration_cast<milliseconds>(tp2 - tp1);
    std::cout << d.count() << std::endl;
    std::cout << result << std::endl;
    return 0;
}
Update:
Added VS2015. I also noticed that there are double->float conversions in calc1 and calc2. Removing them does not change the conclusion for Visual Studio (both are a lot faster but the ratio is about the same).
Different isn't the same.
It's slower because it does things that a template can't do. In particular, it lets you call any function that can be called with the given argument types and whose return type is convertible to the given return type from the same code.
void eval(const std::function<int(int)>& f) {
    std::cout << f(3);
}

int f1(int i) {
    return i;
}

float f2(double d) {
    return d;
}

int main() {
    std::function<int(int)> fun(f1);
    eval(fun);
    fun = f2;
    eval(fun);
    return 0;
}
Note that the same function object, fun, is being passed to both calls to eval. It holds two different functions.
If you don't need to do that, then you should not use std::function.
You already have some good answers here, so I'm not going to contradict them. In short, comparing std::function to templates is like comparing virtual functions to functions.
You should never "prefer" virtual functions to functions; rather, you use virtual functions when they fit the problem, moving decisions from compile time to run time. The idea is that rather than having to solve the problem using a bespoke solution (like a jump-table), you use something that gives the compiler a better chance of optimizing for you. It also helps other programmers, if you use a standard solution.
This answer is intended to contribute, to the set of existing answers, what I believe to be a more meaningful benchmark for the runtime cost of std::function calls.
The std::function mechanism should be recognized for what it provides: any callable entity can be converted to a std::function of appropriate signature. Suppose you have a library that fits a surface to a function defined by z = f(x,y): you can write it to accept a std::function<double(double,double)>, and the user of the library can easily convert any callable entity to that, be it an ordinary function, a method of a class instance, a lambda, or anything that is supported by std::bind.
Unlike template approaches, this works without having to recompile the library function for different cases; accordingly, little extra compiled code is needed for each additional case. It has always been possible to make this happen, but it used to require some awkward mechanisms, and the user of the library would likely need to construct an adapter around their function to make it work. std::function automatically constructs whatever adapter is needed to get a common runtime call interface for all the cases, which is a new and very powerful feature.
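That interface might look like this (a sketch with hypothetical names; the stand-in body is only there to make it self-contained):

#include <functional>

double fit_surface(const std::function<double(double, double)>& f) {
    return f(0.0, 0.0); // stand-in body; a real library would sample f across the surface
}

double plane(double x, double y) { return x + y; }

void demo() {
    fit_surface(plane);                                    // ordinary function
    fit_surface([](double x, double y) { return x * y; }); // lambda: no library recompile needed
}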
In my view, this is the most important use case for std::function as far as performance is concerned: I'm interested in the cost of calling a std::function many times after it has been constructed once, and it needs to be a situation where the compiler is unable to optimize the call by knowing the function actually being called (i.e. you need to hide the implementation in another source file to get a proper benchmark).
I made the test below, similar to the OP's; but the main changes are:
Each case loops 1 billion times, but the std::function objects are constructed only once. I've found by looking at the output code that 'operator new' is called when constructing actual std::function objects (maybe not when they are optimized out).
Test is split into two files to prevent undesired optimization
My cases are: (a) function is inlined (b) function is passed by an ordinary function pointer (c) function is a compatible function wrapped as std::function (d) function is an incompatible function made compatible with a std::bind, wrapped as std::function
The results I get are:
case (a) (inline) 1.3 nsec
all other cases: 3.3 nsec.
Case (d) tends to be slightly slower, but the difference (about 0.05 nsec) is absorbed in the noise.
The conclusion is that std::function has comparable overhead (at call time) to a function pointer, even when there's a simple 'bind' adaptation to the actual function. The inline case is 2 ns faster than the others, but that's an expected tradeoff since it is the only case that is 'hard-wired' at compile time.
When I run johan-lundberg's code on the same machine, I'm seeing about 39 nsec per loop, but there's a lot more in the loop there, including the actual constructor and destructor of the std::function, which is probably fairly high since it involves a new and delete.
-O2 gcc 4.8.1, to x86_64 target (core i5).
Note, the code is broken up into two files, to prevent the compiler from expanding the functions where they are called (except in the one case where it's intended to).
----- first source file --------------
#include <functional>

// simple funct
float func_half( float x ) { return x * 0.5; }

// func we can bind
float mul_by( float x, float scale ) { return x * scale; }

//
// func to call another func a zillion times.
//
float test_stdfunc( std::function<float(float)> const & func, int nloops ) {
    float x = 1.0;
    float y = 0.0;
    for(int i = 0; i < nloops; i++ ){
        y += x;
        x = func(x);
    }
    return y;
}

// same thing with a function pointer
float test_funcptr( float (*func)(float), int nloops ) {
    float x = 1.0;
    float y = 0.0;
    for(int i = 0; i < nloops; i++ ){
        y += x;
        x = func(x);
    }
    return y;
}

// same thing with inline function
float test_inline( int nloops ) {
    float x = 1.0;
    float y = 0.0;
    for(int i = 0; i < nloops; i++ ){
        y += x;
        x = func_half(x);
    }
    return y;
}
----- second source file -------------
#include <iostream>
#include <functional>
#include <chrono>

extern float func_half( float x );
extern float mul_by( float x, float scale );
extern float test_inline( int nloops );
extern float test_stdfunc( std::function<float(float)> const & func, int nloops );
extern float test_funcptr( float (*func)(float), int nloops );

int main() {
    using namespace std::chrono;

    for(int icase = 0; icase < 4; icase++ ){
        const auto tp1 = high_resolution_clock::now();
        float result;
        switch( icase ){
        case 0:
            result = test_inline( 1e9 );
            break;
        case 1:
            result = test_funcptr( func_half, 1e9 );
            break;
        case 2:
            result = test_stdfunc( func_half, 1e9 );
            break;
        case 3:
            result = test_stdfunc( std::bind( mul_by, std::placeholders::_1, 0.5 ), 1e9 );
            break;
        }
        const auto tp2 = high_resolution_clock::now();
        const auto d = duration_cast<milliseconds>(tp2 - tp1);
        std::cout << d.count() << std::endl;
        std::cout << result << std::endl;
    }
    return 0;
}
For those interested, here's the adaptor the compiler built to make 'mul_by' look like a float(float) - this is 'called' when the function created as bind(mul_by,_1,0.5) is called:
movq (%rdi), %rax ; get the std::func data
movsd 8(%rax), %xmm1 ; get the bound value (0.5)
movq (%rax), %rdx ; get the function to call (mul_by)
cvtpd2ps %xmm1, %xmm1 ; convert 0.5 to 0.5f
jmp *%rdx ; jump to the func
(so it might have been a bit faster if I'd written 0.5f in the bind...)
Note that the 'x' parameter arrives in %xmm0 and just stays there.
Here's the code in the area where the function is constructed, prior to calling test_stdfunc - run through c++filt :
movl $16, %edi
movq $0, 32(%rsp)
call operator new(unsigned long) ; get 16 bytes for std::function
movsd .LC0(%rip), %xmm1 ; get 0.5
leaq 16(%rsp), %rdi ; (1st parm to test_stdfunc)
movq mul_by(float, float), (%rax) ; store &mul_by in std::function
movl $1000000000, %esi ; (2nd parm to test_stdfunc)
movsd %xmm1, 8(%rax) ; store 0.5 in std::function
movq %rax, 16(%rsp) ; save ptr to allocated mem
;; the next two ops store pointers to generated code related to the std::function.
;; the first one points to the adaptor I showed above.
movq std::_Function_handler<float (float), std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_invoke(std::_Any_data const&, float), 40(%rsp)
movq std::_Function_base::_Base_manager<std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation), 32(%rsp)
call test_stdfunc(std::function<float (float)> const&, int)
In case you use a template instead of std::function in C++20 you can actually write your own concept with variadic templates for it (inspired by Hendrik Niemeyer's talk about C++20 concepts):
template<class Func, typename Ret, typename... Args>
concept functor = std::regular_invocable<Func, Args...> &&
                  std::same_as<std::invoke_result_t<Func, Args...>, Ret>;
You can then use it as functor<Ret, Args...> F, where Ret is the return value and Args... are the variadic input arguments, e.g. functor<double,int> F such as
template <functor<double,int> F>
auto CalculateSomething(F&& f, int const arg) {
    return f(arg)*f(arg);
}
requires a functor as template argument which has to overload the () operator and has a double return value and a single input argument of type int. Similarly functor<double> would be a functor with double return type which does not take any input arguments.
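Continuing the sketch, the no-argument form would be used like this (my example, not from the talk):

template <functor<double> F>
auto CalculateConstant(F&& f) {
    return f() * 2.0; // f is any callable taking no arguments and returning double
}
// e.g. CalculateConstant([]{ return 3.14; });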
You can also use it with variadic functions such as
template <typename... Args, functor<double, Args...> F>
auto CalculateSomething(F&& f, Args... args) {
    return f(args...)*f(args...);
}
I found your results very interesting, so I did a bit of digging to understand what is going on. First off, as many others have said, without the results of the computation affecting the state of the program, the compiler will just optimize this away. Secondly, with a constant 3.3 given as an argument to the callback, I suspect that there will be other optimizations going on. With that in mind I changed your benchmark code a little bit.
template <typename F>
float calc1(F f, float i) { return -1.0f * f(i) + 666.0f; }

float calc2(std::function<float(float)> f, float i) { return -1.0f * f(i) + 666.0f; }

int main() {
    float t = 0;
    const auto tp1 = high_resolution_clock::now();
    for (int i = 0; i < 1e8; ++i) {
        t += calc2([&](float arg){ return arg * 0.5f + t; }, i);
    }
    const auto tp2 = high_resolution_clock::now();
}
Given this change to the code, I compiled with gcc 4.8 -O3 and got a time of 330ms for calc1 and 2702ms for calc2. So using the template was 8 times faster. This number looked suspect to me; a speedup by a power of 8 often indicates that the compiler has vectorized something. When I looked at the generated code for the template version, it was clearly vectorized:
.L34:
cvtsi2ss %edx, %xmm0
addl $1, %edx
movaps %xmm3, %xmm5
mulss %xmm4, %xmm0
addss %xmm1, %xmm0
subss %xmm0, %xmm5
movaps %xmm5, %xmm0
addss %xmm1, %xmm0
cvtsi2sd %edx, %xmm1
ucomisd %xmm1, %xmm2
ja .L37
movss %xmm0, 16(%rsp)
The std::function version, by contrast, was not. This makes sense to me: with the template, the compiler knows for sure that the function will never change throughout the loop, but with the std::function being passed in it could change, so the loop cannot be vectorized.
This led me to try something else, to see if I could get the compiler to perform the same optimization on the std::function version. Instead of passing a function in, I make the std::function a global variable and have that called.
std::function<float(float)> f2 = [](float arg){ return arg * 0.5f; };

float calc3(float i) { return -1.0f * f2(i) + 666.0f; }

int main() {
    float t = 0;
    const auto tp1 = high_resolution_clock::now();
    for (int i = 0; i < 1e8; ++i) {
        t += calc3(i);
    }
    const auto tp2 = high_resolution_clock::now();
}
With this version we see that the compiler has now vectorized the code in the same way and I get the same benchmark results.
template : 330ms
std::function : 2702ms
global std::function: 330ms
So my conclusion is that the raw speed of a std::function vs a template functor is pretty much the same. However, the std::function makes the job of the optimizer much more difficult.