I'm trying to write an async logger which accepts variadic arguments that are then strung together by a variadic stringer and pushed onto a single-producer single-consumer queue.
I'm stuck on the enqueue function of my Log struct, which looks as follows:
template <typename T>
std::string Log::stringer(T const & t){
return boost::lexical_cast<std::string>(t);
}
template<typename T, typename ... Args>
std::string Log::stringer(T const & t, Args const & ... args){
return stringer(t) + stringer(args...);
}
template<typename T, typename ... Args>
void Log::enqueue(T & t, Args & ... args){
boost::function<std::string()> f
= boost::bind(&Log::stringer<T &, Args & ...>,this,
boost::ref(t),
boost::forward<Args>(args)...);
/// the above statement fails to compile, though if I use 'auto f' it works ->
/// but then it is unclear to me what the signature of f really is?
// at this point I would like to post the functor f onto my asio::io_service,
// but I am not able to, because it's not clear to me what the type of f is.
// I think it should be of type boost::function<std::string()>
}
Inside main(), I call
Log t_log;
t_log.enqueue("hello"," world");
My suggestion for the function you ask about:
template <typename T, typename... Args> void enqueue(T &t, Args const&... args) {
this->io_service->post([=]{
auto s = stringer(t, args...);
//std::fprintf(stderr, "%s\n", s.c_str());
});
}
This works with GCC and Clang (GCC 4.9 or later because of a known issue with captured variadic packs).
But really, I'd reconsider the design at hand, and certainly start a lot simpler until you know what areas deserve further optimization.
Questionables
There are many things I don't understand about this code:
Why are the arguments being taken by non-const reference?
Why are you subsequently using std::forward<> on them? (You already know the value category, and it's not going to change.)
Why are you passing the stringization to an io_service?
the queue is going to introduce locking (kind of refuting the lockfree queue) and
the stringization is going to have its result ignored...
Why would you use boost::function here? This incurs a (another) dynamic allocation and an indirect dispatch... Just post f
Why are the arguments bound by reference in the first place? If you're going to process the arguments on a different thread, this leads to Undefined Behaviour. E.g. imagine the caller doing
char const msg[] = "my message"; // perhaps some sprintf output
l.enqueue(cat.c_str(), msg);
The c_str() is stale after the enqueue returned and msg goes out of scope soon, or gets overwritten with other data.
Why are you using bind approaches when you clearly have c++11 support (because you used std::forward<> and attributes)?
Why are you using a lockfree queue? (Do you anticipate logging constantly at max CPU? In that case logging is the core functionality of your application and you should probably think this through a bit (a lot) more rigorously, e.g. write into preallocated alternating buffers and decide on a max backlog, etc.)
In all other cases, you probably want at most one single thread running on a lockfree queue. Even this is likely overkill (spinning a thread constantly is expensive). Instead, you could gracefully fall back to yields/synchronization when there has been nothing to do for n cycles, as in the sketch below.
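A minimal sketch of that fallback, with assumed names (queue and shutdown stand in for whatever your logger uses); the spin threshold is arbitrary:
#include <atomic>
#include <string>
#include <thread>
#include <boost/lockfree/spsc_queue.hpp>

// Spin through a bounded number of empty polls, then yield so an idle
// logger thread doesn't burn a whole core.
void consume(boost::lockfree::spsc_queue<std::string>& queue, std::atomic<bool>& shutdown)
{
    std::string s;
    unsigned idle = 0;
    while (!shutdown) {
        if (queue.pop(s)) {
            idle = 0;
            // ... write the line out ...
        } else if (++idle > 1000) {    // threshold picked arbitrarily
            std::this_thread::yield(); // or block on a condition variable
        }
    }
}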
You can just bind to a shared_ptr. This is a lot safer and more convenient than binding to .get()
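For example (Worker is a hypothetical stand-in for whatever you are binding to):
#include <boost/bind.hpp>
#include <boost/make_shared.hpp>

struct Worker { void run() {} };

int main() {
    boost::shared_ptr<Worker> w = boost::make_shared<Worker>();
    auto safe   = boost::bind(&Worker::run, w);       // copies the shared_ptr, so Worker stays alive
    auto unsafe = boost::bind(&Worker::run, w.get()); // raw pointer: dangles once w is released
    safe();
}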
In my sample below I've just removed the need for scoped_ptrs by not allocating everything from the heap (why was that?). (You can use boost::optional<work> if you needed work.)
The explicit memory-order load/stores give me bad vibes too. The way they're written would make sense only if exactly two threads are involved in the flag, but this is in no way apparent to me at the moment (threads are created all around).
On most platforms there will be no difference, and in light of the above, the presence of explicit memory ordering stands out as a clear code smell.
The same thing applies to the attempts to forcibly inline certain functions. You can trust your compiler, and you should probably refrain from second-guessing it until you know you have a bottleneck caused by suboptimal generated code.
Since you intend to give threads thread affinity, do use thread locals. Either use GCC/MSVC extensions in C++03 (__thread) or use c++11 thread_local, e.g. in pop()
thread_local std::string s;
s.reserve(1000);
s.resize(0);
This enormously reduces the number of allocations (at the cost of making pop() non-reentrant, which is not required).
I later noticed this pop() is limited to a single thread
What is the use of having that lockfree queue if all you do is ... spinlock manually around it?
void push(std::string const &s) {
while (std::atomic_flag_test_and_set_explicit(&this->lock, std::memory_order_acquire))
;
while (!this->q->push(s))
;
std::atomic_flag_clear_explicit(&this->lock, std::memory_order_release);
}
Cleanup Suggestion
Live On Coliru
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/atomic.hpp>
#include <boost/lockfree/spsc_queue.hpp>
#include <boost/thread/thread.hpp>
#include <cassert>    // assert in the destructor
#include <cstdio>     // std::fprintf / std::fflush
#include <cstdint>    // int32_t
#include <string>
#include <sys/time.h> // gettimeofday
/*
* safe for use from a single thread only
*/
template <unsigned line_maxchars = 1000>
class Log {
public:
Log(std::string const &logFileName, int32_t queueSize)
: fp(stderr), // std::fopen(logFileName.c_str(),"w")
_shutdown(false),
_thread(&Log::pop, this),
_queue(queueSize)
{ }
void pop() {
std::string s;
s.reserve(line_maxchars);
struct timeval ts;
while (!_shutdown) {
while (_queue.pop(s)) {
gettimeofday(&ts, NULL);
std::fprintf(fp, "%li.%06li %s\n", ts.tv_sec, ts.tv_usec, s.c_str());
}
std::fflush(fp); // RECONSIDER HERE?
}
while (_queue.pop(s)) {
gettimeofday(&ts, NULL);
std::fprintf(fp, "%li.%06li %s\n", ts.tv_sec, ts.tv_usec, s.c_str());
}
}
template <typename S, typename T> void stringer(S& stream, T const &t) {
stream << t;
}
template <typename S, typename T, typename... Args>
void stringer(S& stream, T const &t, Args const &... args) {
stringer(stream, t);
stringer(stream, args...);
}
template <typename T, typename... Args> void enqueue(T &t, Args const&... args) {
thread_local char buffer[line_maxchars] = {};
boost::iostreams::array_sink as(buffer);
boost::iostreams::stream<boost::iostreams::array_sink> stream(as);
stringer(stream, t, args...);
auto output = as.output_sequence();
push(std::string(output.first, output.second));
}
void push(std::string const &s) {
while (!_queue.push(s));
}
~Log() {
_shutdown = true;
_thread.join();
assert(_queue.empty());
std::fflush(fp);
std::fclose(fp);
fp = NULL;
}
private:
FILE *fp;
boost::atomic_bool _shutdown;
boost::thread _thread;
boost::lockfree::spsc_queue<std::string> _queue;
};
#include <chrono>
#include <iostream>
int main() {
using namespace std::chrono;
auto start = high_resolution_clock::now();
{
Log<> l("/tmp/junk.log", 1024);
for (int64_t i = 0; i < 10; ++i) {
l.enqueue("hello ", i, " world");
}
}
std::cout << duration_cast<microseconds>(high_resolution_clock::now() - start).count() << "μs\n";
}
As you can see, I've reduced the code by a third. I've documented the fact that it's only safe for use from a single thread.
Asio is gone. Lexical cast is gone. Things have meaningful names. No more memory order fiddling. No more thread affinity fiddling. No more inline envy. No more tedious string allocations.
The things you'd likely benefit from the most are to:
make the array_sinks/buffers pooled and stored in the queue by reference
not flush on every log (a sketch of batched flushing follows)
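A sketch of the second point, with assumed names (drain stands in for the drain loop in pop() above):
#include <cstddef>
#include <cstdio>
#include <string>

// Flush in batches of flush_every lines instead of after every line,
// plus once at the end so nothing is lost at shutdown.
template <typename Queue>
void drain(Queue& queue, std::FILE* fp, std::size_t flush_every = 64) {
    std::string s;
    std::size_t since_flush = 0;
    while (queue.pop(s)) {
        std::fprintf(fp, "%s\n", s.c_str());
        if (++since_flush >= flush_every) { std::fflush(fp); since_flush = 0; }
    }
    std::fflush(fp); // final flush
}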
Related
I have written a forward iterator that iterates over the nodes of a graph in order of a (preorder/postorder/inorder) DFS spanning tree. Since it is quite complicated compared to writing a simple DFS and calling a callback for each encountered node, I thought I could use C++20 coroutines to simplify the code of the iterator.
However, C++20 coroutines are not copyable (much less so if they are stateful) but iterators should better be copyable!
Is there any way I could still use some coroutine-like code for my iterator?
Note: I want all iterators to iterate independently from each other. If I copy an iterator and then call ++ on it, then only the iterator should have advanced, but not its copy.
I figured that, with the Miro Knejp "goto hack", something resembling copyable coroutines is possible, as follows (this toy example just alternates "+1" and "*2" steps up to a certain value, but it illustrates the point).
(1) this is just a simple wrapper for the actual function
template<class Func, class Data>
struct CopyableCoroutine {
Data data;
Func func;
CopyableCoroutine(const Data& _data, const Func& _func): data(_data), func(_func) {}
bool done() const { return data.done(); }
template<class... Args> decltype(auto) operator()(Args&&... args) {
return func(data, std::forward<Args>(args)...);
}
};
(2) this is where the magic happens
struct my_stack {
int n;
int next_label_idx = 0;
bool done() const { return n > 30; }
};
int weird_count(my_stack& coro_data) {
static constexpr void* jump_table[] = {&&beginning, &&before_add, &&after_add, &&the_end};
goto* jump_table[coro_data.next_label_idx];
beginning:
while(coro_data.n <= 32) {
coro_data.next_label_idx = 1;
return coro_data.n;
before_add:
coro_data.next_label_idx = 2;
++coro_data.n; // odd steps: add one
return coro_data.n;
after_add:
coro_data.next_label_idx = 3;
coro_data.n *= 2; // even steps: multiply 2
}
the_end:
return -1;
}
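A small driver for the two pieces above (GCC/Clang only, because of the computed goto). Since all state lives in my_stack, a copy taken mid-run resumes independently from the copied state:
#include <iostream>

int main() {
    CopyableCoroutine<int (*)(my_stack&), my_stack> counter(my_stack{1}, &weird_count);
    std::cout << counter() << ' ' << counter() << '\n'; // original advances: 1 2
    auto snapshot = counter;                            // copies n and next_label_idx
    std::cout << counter() << ' ' << counter() << '\n'; // original keeps going: 4 5
    std::cout << snapshot() << '\n';                    // the copy resumes from the old state: 4
}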
NOTE: unfortunately, this "dirty hack" requires some extra work, and I would be super happy if that could be avoided somehow. I'd really love to see native C++ coroutines that can be copied if the user promises that their stack is copyable.
This code is just for illustrating the question.
#include <functional>
struct MyCallBack {
void Fire() {
}
};
int main()
{
MyCallBack cb;
std::function<void(void)> func = std::bind(&MyCallBack::Fire, &cb);
}
Experiments with valgrind show that the line assigning to func dynamically allocates about 24 bytes with gcc 7.1.1 on Linux.
In the real code, I have a few handfuls of different structs all with a void(void) member function that gets stored in ~10 million std::function<void(void)>.
Is there any way I can avoid memory being dynamically allocated when doing std::function<void(void)> func = std::bind(&MyCallBack::Fire, &cb); ? (Or otherwise assigning these member function to a std::function)
Unfortunately, allocator support for std::function was dropped in C++17.
Now the accepted solution to avoid dynamic allocations inside std::function is to use lambdas instead of std::bind. That does work, at least in GCC - it has enough static space to store the lambda in your case, but not enough space to store the binder object.
std::function<void()> func = [&cb]{ cb.Fire(); };
// sizeof lambda is sizeof(MyCallBack*), which is small enough
As a general rule, with most implementations, and with a lambda which captures only a single pointer (or a reference), you will avoid dynamic allocations inside std::function with this technique (it is also a generally better approach, as the other answer suggests).
Keep in mind that for this to work you need a guarantee that the lambda will outlive the std::function. Obviously, that is not always possible, and sometimes you have to capture state by (large) copy. If that happens, there is currently no way to eliminate dynamic allocations in std::function, other than tinkering with the STL yourself (obviously not recommended in the general case, but it can be done in some specific cases).
As an addendum to the already existent and correct answer, consider the following:
MyCallBack cb;
std::cerr << sizeof(std::bind(&MyCallBack::Fire, &cb)) << "\n";
auto a = [&] { cb.Fire(); };
std::cerr << sizeof(a);
This program prints 24 and 8 for me, with both gcc and clang. I don't exactly know what bind is doing here (my understanding is that it's a fantastically complicated beast), but as you can see, it's almost absurdly inefficient here compared to a lambda.
As it happens, std::function is guaranteed to not allocate if constructed from a function pointer, which is also one word in size. So constructing a std::function from this kind of lambda, which only needs to capture a pointer to an object and should also be one word, should in practice never allocate.
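To illustrate that guarantee, a minimal check (nothing here is specific to one implementation):
#include <functional>

void free_fn() {}

int main() {
    // Construction from a plain function pointer must not throw, which in
    // practice means the pointer is stored inline, with no allocation.
    std::function<void()> f = free_fn;
    f();
}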
Run this little hack and it probably will print the amount of bytes you can capture without allocating memory:
#include <iostream>
#include <functional>
#include <cstring>
void h(std::function<void(void*)>&& f, void* g)
{
f(g);
}
template<size_t number_of_size_t>
void do_test()
{
size_t a[number_of_size_t];
std::memset(a, 0, sizeof(a));
a[0] = sizeof(a);
std::function<void(void*)> g = [a](void* ptr) {
if (&a != ptr)
std::cout << "malloc was called when capturing " << a[0] << " bytes." << std::endl;
else
std::cout << "No allocation took place when capturing " << a[0] << " bytes." << std::endl;
};
h(std::move(g), &g);
}
int main()
{
do_test<1>();
do_test<2>();
do_test<3>();
do_test<4>();
}
With gcc version 8.3.0 this prints
No allocation took place when capturing 8 bytes.
No allocation took place when capturing 16 bytes.
malloc was called when capturing 24 bytes.
malloc was called when capturing 32 bytes.
Many std::function implementations will avoid allocations and use space inside the function class itself rather than allocating if the callback it wraps is "small enough" and has trivial copying. However, the standard does not require this, only suggests it.
On g++, a non-trivial copy constructor on a function object, or data exceeding 16 bytes, is enough to cause it to allocate. But if your function object has no data and uses the builtin copy constructor, then std::function won't allocate.
Also, if you use a function pointer or a member function pointer, it won't allocate.
While not directly part of your question, it is part of your example.
Do not use std::bind. In virtually every case, a lambda is better: smaller, better inlining, can avoid allocations, better error messages, faster compiles, the list goes on. If you want to avoid allocations, you must also avoid bind.
I propose a custom class for your specific usage.
While it's true that you shouldn't re-implement existing library functionality, because the library version will be far more tested and optimized, that advice applies to the general case. If you have a particular situation like the one in your example, and the standard implementation doesn't suit your needs, you can explore implementing a version tailored to your specific use case, which you can measure and tweak as necessary.
So I have created a class akin to std::function<void (void)> that works only for methods and has all the storage in place (no dynamic allocations).
I have lovingly called it Trigger (inspired by your Fire method name). Please do give it a more suited name if you want to.
#include <new>          // placement new
#include <type_traits>  // std::aligned_storage_t
// helper alias for method
// can be used in user code
template <class T>
using Trigger_method = auto (T::*)() -> void;
namespace detail
{
// Polymorphic classes needed for type erasure
struct Trigger_base
{
virtual ~Trigger_base() noexcept = default;
virtual auto placement_clone(void* buffer) const noexcept -> Trigger_base* = 0;
virtual auto call() -> void = 0;
};
template <class T>
struct Trigger_actual : Trigger_base
{
T& obj;
Trigger_method<T> method;
Trigger_actual(T& obj, Trigger_method<T> method) noexcept : obj{obj}, method{method}
{
}
auto placement_clone(void* buffer) const noexcept -> Trigger_base* override
{
return new (buffer) Trigger_actual{obj, method};
}
auto call() -> void override
{
return (obj.*method)();
}
};
// in Trigger (below) we need to allocate enough storage
// for any Trigger_actual template instantiation
// since all templates basically contain 2 pointers
// we assume (and test it with static_asserts)
// that all will have the same size
// we will use Trigger_actual<Trigger_test_size>
// to determine the size of all Trigger_actual templates
struct Trigger_test_size {};
}
struct Trigger
{
std::aligned_storage_t<sizeof(detail::Trigger_actual<detail::Trigger_test_size>)>
trigger_actual_storage_;
// vital. We cannot just cast `&trigger_actual_storage_` to `Trigger_base*`
// because there is no guarantee by the standard that
// the base pointer will point to the start of the derived object
// so we need to store separately the base pointer
detail::Trigger_base* base_ptr = nullptr;
template <class X>
Trigger(X& x, Trigger_method<X> method) noexcept
{
static_assert(sizeof(trigger_actual_storage_) >=
sizeof(detail::Trigger_actual<X>));
static_assert(alignof(decltype(trigger_actual_storage_)) %
alignof(detail::Trigger_actual<X>) == 0);
base_ptr = new (&trigger_actual_storage_) detail::Trigger_actual<X>{x, method};
}
Trigger(const Trigger& other) noexcept
{
if (other.base_ptr)
{
base_ptr = other.base_ptr->placement_clone(&trigger_actual_storage_);
}
}
auto operator=(const Trigger& other) noexcept -> Trigger&
{
if (this != &other) // guard against self-assignment before destroying
{
destroy_actual();
if (other.base_ptr)
{
base_ptr = other.base_ptr->placement_clone(&trigger_actual_storage_);
}
}
return *this;
}
~Trigger() noexcept
{
destroy_actual();
}
auto destroy_actual() noexcept -> void
{
if (base_ptr)
{
base_ptr->~Trigger_base();
base_ptr = nullptr;
}
}
auto operator()() const
{
if (!base_ptr)
{
// deal with this situation (error or just ignore and return)
}
base_ptr->call();
}
};
Usage:
struct X
{
auto foo() -> void;
};
auto test()
{
X x;
Trigger f{x, &X::foo};
f();
}
Warning: only tested for compilation errors.
You need to thoroughly test it for correctness.
You need to profile it and see if it has a better performance than other solutions. The advantage of this is because it's in house cooked you can make tweaks to the implementation to increase performance on your specific scenarios.
As #Quuxplusone mentioned in their answer-as-a-comment, you can use inplace_function here. Include the header in your project, and then use like this:
#include "inplace_function.h"
struct big { char foo[20]; };
static stdext::inplace_function<void(), 8> inplacefunc;
static std::function<void()> stdfunc;
int main() {
static_assert(sizeof(inplacefunc) == 16);
static_assert(sizeof(stdfunc) == 32);
inplacefunc = []() {};
// fine
struct big a;
inplacefunc = [a]() {};
// test.cpp:15:24: required from here
// inplace_function.h:237:33: error: static assertion failed: inplace_function cannot be constructed from object with this (large) size
// 237 | static_assert(sizeof(C) <= Capacity,
// | ~~~~~~~~~~^~~~~~~~~~~
// inplace_function.h:237:33: note: the comparison reduces to ‘(20 <= 8)’
}
I'm working on implementing fibers using coroutines written in assembler. The coroutines work by using cocall to change stacks.
I'd like to expose this in C++ with a higher-level interface, as the cocall assembly can only handle a single void* argument.
In order to handle template lambdas, I've experimented with converting them to a void* and found that while it compiles and works, I was left wondering if it was safe to do so, assuming ownership semantics of the stack (which are preserved by fibers).
template <typename FunctionT>
struct Coentry
{
static void coentry(void * arg)
{
// Is this safe?
FunctionT * function = reinterpret_cast<FunctionT *>(arg);
(*function)();
}
static void invoke(FunctionT function)
{
coentry(reinterpret_cast<void *>(&function));
}
};
template <typename FunctionT>
void coentry(FunctionT function)
{
Coentry<FunctionT>::invoke(function);
}
int main(int argc, const char * argv[]) {
auto f = [&]{
std::cerr << "Hello World!" << std::endl;
};
coentry(f);
}
Is this safe and additionally, is it efficient? By converting to a void* am I forcing the compiler to choose a less efficient representation?
Additionally, by invoking coentry(void*) on a different stack, but the original invoke(FunctionT) has returned, is there a chance that the stack might be invalid to resume? (would be similar to, say invoking within a std::thread I guess).
Everything done above is defined behaviour. The only performance hit is that inlining something aliased through a void pointer could be slightly harder.
However, the lambda is an actual value, and if stored in automatic storage only lasts as long as the stored-in stack frame does.
You can fix this a number of ways. std::function is one, another is to store the lambda in a shared_ptr<void> or unique_ptr<void, void(*)(void*)>. If you do not need type erasure, you can even store the lambda in a struct with deduced type.
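A minimal sketch of the unique_ptr variant (erase_to_void is a made-up name): the lambda is moved to the heap, so it survives the stack frame that created it, and the captureless deleter lambda decays to the required function pointer:
#include <memory>
#include <utility>

template <typename F>
std::unique_ptr<void, void (*)(void*)> erase_to_void(F f) {
    return std::unique_ptr<void, void (*)(void*)>(
        new F(std::move(f)),                          // heap copy owns the callable
        [](void* p) { delete static_cast<F*>(p); });  // deleter restores the type
}
// .get() on the result is then a void* you can hand to coentry.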
The first two are easy (the sketch above shows the unique_ptr variant). The third:
template <typename FunctionT>
struct Coentry {
FunctionT f;
static void coentry(void * arg)
{
auto* self = reinterpret_cast<Coentry*>(arg);
(self->f)();
}
Coentry(FunctionT fin) : f(std::move(fin)) {}
};
template<class FunctionT>
Coentry<FunctionT> make_coentry( FunctionT f ){ return {std::move(f)}; }
Now keep your Coentry around until the task completes.
The details of how you manage lifetime depend on the structure of the rest of your problem.
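For instance (start_fiber is an assumed stand-in for whatever wraps your cocall assembly):
#include <iostream>

void start_fiber(void (*entry)(void*), void* arg); // assumed assembler-backed API

void example() {
    auto task = make_coentry([] { std::cerr << "Hello World!" << std::endl; });
    start_fiber(&decltype(task)::coentry, &task);
    // ... keep task alive until the fiber has finished with it ...
}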
I'm having issues with getting a partially-qualified function object to call later, with variable arguments, in another thread.
In GCC I've been using a macro and typedef I made, but now I'm finishing up my project and trying to clear up warnings.
#define Function_Cast(func_ref) (SubscriptionFunction*) func_ref
typedef void(SubscriptionFunction(void*, std::shared_ptr<void>));
Using the Function_Cast macro like below results in "warning: casting between pointer-to-function and pointer-to-object is conditionally-supported"
Subscriber* init_subscriber = new Subscriber(this, Function_Cast(&BaseLoaderStaticInit::init), false);
All I really need is a pointer that I can make a std::bind<function_type> object of. How is this usually done?
Also, this conditionally-supported thing is really annoying. I know that on x86 my code will work fine, and I'm aware of the limitations of relying on the assumption that sizeof(void*) == sizeof(this*) holds for every this*.
Also, is there a way to make clang treat function pointers like data pointers so that my code will compile? I'm interested to see how bad it fails (if it does).
Relevant Code:
#define Function_Cast(func_ref) (SubscriptionFunction*) func_ref
typedef void(SubscriptionFunction(void*, std::shared_ptr<void>));
typedef void(CallTypeFunction(std::shared_ptr<void>));
Subscriber(void* owner, SubscriptionFunction* func, bool serialized = true) {
this->_owner = owner;
this->_serialized = serialized;
this->method = func;
call = std::bind(&Subscriber::_std_call, this, std::placeholders::_1);
}
void _std_call(std::shared_ptr<void> arg) { method(_owner, arg); }
The problem here is that you are trying to use a member-function pointer in place of a function pointer, because you know that, under-the-hood, it is often implemented as function(this, ...).
struct S {
void f() {}
};
using fn_ptr = void(*)(S*);
void call(S* s, fn_ptr fn)
{
fn(s);
delete s;
}
int main() {
call(new S, (fn_ptr)&S::f);
}
http://ideone.com/fork/LJiohQ
But there's no guarantee this will actually work and obvious cases (virtual functions) where it probably won't.
Member functions are intended to be passed like this:
void call(S* s, void (S::*fn)())
and invoked like this:
(s->*fn)();
http://ideone.com/bJU5lx
How people work around this when they want to support different types is to use a trampoline, which is a non-member function. You can do this with either a static [member] function or a lambda:
auto sub = new Subscriber(this, [](auto* s){ s->init(); });
or if you'd like type safety at your call site, a templated constructor:
template<typename T>
Subscriber(T* t, void(T::*fn)(), bool x);
http://ideone.com/lECOp6
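One possible shape for that constructor, assuming Subscriber stores a std::function internally instead of the raw SubscriptionFunction pointer (a sketch, not the question's exact layout):
#include <functional>
#include <memory>

class Subscriber {
public:
    std::function<void(std::shared_ptr<void>)> call;

    template <typename T>
    Subscriber(T* owner, void (T::*fn)(std::shared_ptr<void>), bool serialized = true)
        // the lambda trampoline captures the typed pair, so no casts are needed
        : call([owner, fn](std::shared_ptr<void> arg) { (owner->*fn)(arg); }),
          _serialized(serialized) {}

private:
    bool _serialized;
};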
If your Subscriber constructor takes a std::function<void(void)> rather than a function pointer, you can pass a capturing lambda and eliminate the need to take a void*:
new Subscriber([this](){ init(); }, false);
It's normally done something like this:
#include <functional>
#include <memory>
struct subscription
{
// RAII unsubscribe stuff in destructor here....
};
struct subscribable
{
subscription subscribe(std::function<void()> closure, std::weak_ptr<void> sentinel)
{
// perform the subscription
return subscription {
// some id so you can unsubscribe;
};
}
void notify_subscriber(std::function<void()> const& closure,
std::weak_ptr<void> const & sentinel)
{
if (auto locked = sentinel.lock())
{
closure();
}
}
};
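Hypothetical usage of the above (requires C++17 for weak_from_this): the weak_ptr sentinel expires together with the listener, so notifications arriving after its destruction are skipped:
#include <functional>
#include <memory>

struct listener : std::enable_shared_from_this<listener> {
    subscription sub;
    void attach(subscribable& bus) {
        // capture this raw; the sentinel guards against use-after-free
        sub = bus.subscribe([this] { on_event(); }, weak_from_this());
    }
    void on_event() {}
};

int main() {
    subscribable bus;
    auto l = std::make_shared<listener>();
    l->attach(bus); // safe: l is owned by a shared_ptr, so weak_from_this works
}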
This was already touched on in Why C++ lambda is slower than ordinary function when called multiple times? and C++0x Lambda overhead.
But I think my example is a bit different from the discussion in the former and contradicts the result in the latter.
While searching for a bottleneck in my code I found a recursive template function that processes a variadic argument list with a given processor function, e.g. copying the values into a buffer.
template <typename T>
void ProcessArguments(std::function<void(const T &)> process)
{}
template <typename T, typename HEAD, typename ... TAIL>
void ProcessArguments(std::function<void(const T &)> process, const HEAD &head, const TAIL &... tail)
{
process(head);
ProcessArguments(process, tail...);
}
I compared the runtime of a program that uses this code together with a lambda function as well as a global function that copies the arguments into a global buffer using a moving pointer:
int buffer[10];
int main(int argc, char **argv)
{
int *p = buffer;
for (unsigned long int i = 0; i < 10E6; ++i)
{
p = buffer;
ProcessArguments<int>([&p](const int &v) { *p++ = v; }, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}
}
Compiled with g++ 4.6 and -O3 and measured with the time tool, this takes more than 6 seconds on my machine, while
int buffer[10];
int *p = buffer;
void CopyIntoBuffer(const int &value)
{
*p++ = value;
}
int main(int argc, char **argv)
{
int *p = buffer;
for (unsigned long int i = 0; i < 10E6; ++i)
{
p = buffer;
ProcessArguments<int>(CopyIntoBuffer, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}
return 0;
}
takes about 1.4 seconds.
I do not understand what is going on behind the scenes that explains the time overhead, and I wonder if I can change something so that I can use lambda functions without paying for it at runtime.
The problem here is your usage of std::function.
You pass it by copy, and therefore copy its contents (and do so recursively as you unwind the parameters).
Now, for a pointer to function, the contents are, well, just the pointer to the function.
For a lambda, the contents are at least a pointer to a function plus the reference you captured. That is twice as much to copy. Plus, because of std::function's type erasure, copying the data will most likely be slower (not inlined).
There are several options here, and the best would probably be to pass not a std::function but a template parameter instead. The benefits are that your call is more likely to be inlined, no type erasure happens via std::function, and no copying happens. Like this:
template <typename TFunc>
void ProcessArguments(const TFunc& process)
{}
template <typename TFunc, typename HEAD, typename ... TAIL>
void ProcessArguments(const TFunc& process, const HEAD &head, const TAIL &... tail)
{
process(head);
ProcessArguments(process, tail...);
}
The second option is doing the same, but passing process by copy. Now copying does happen, but it is still neatly inlined.
What's equally important is that the body of process can also be inlined, especially for a lambda. Depending on the size of the lambda object and the complexity of copying it, passing by copy may or may not be faster than passing by reference. It may be faster because the compiler may have a harder time reasoning about the reference than about a local copy.
template <typename TFunc>
void ProcessArguments(TFunc process)
{}
template <typename TFunc, typename HEAD, typename ... TAIL>
void ProcessArguments(TFunc process, const HEAD &head, const TAIL &... tail)
{
process(head);
ProcessArguments(process, tail...);
}
The third option is, well, passing the std::function<> by reference. This way you at least avoid copying, but the calls will not be inlined. A sketch follows.
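Spelled out, the third option keeps the original std::function signature but takes it by const reference, so it is constructed once at the call site and never copied during the recursion:
#include <functional>

template <typename T>
void ProcessArguments(const std::function<void(const T&)>&) {}

template <typename T, typename HEAD, typename... TAIL>
void ProcessArguments(const std::function<void(const T&)>& process,
                      const HEAD& head, const TAIL&... tail)
{
    process(head);
    ProcessArguments<T>(process, tail...);
}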
Here are some perf results (using ideone's C++11 compiler). Note that, as expected, the inlined lambda body gives the best performance:
Original function:                     0.483035s
Original lambda:                       1.94531s
Function via template copy:            0.094748s
Lambda via template copy:              0.0264867s  (best)
Function via template reference:       0.0892594s
Lambda via template reference:         0.0264201s  (best)
Function via std::function reference:  0.0891776s
Lambda via std::function reference:    0.09s