Forcing inlining of callback (lambda) in C++17 in library - c++

I'm building a runtime system that allows a programmer to specify a callback that is invoked at particular points. I'm using clang 7.0.1 / -std=c++17. The callback is registered in the runtime by storing a lambda as a std::function. When the runtime later calls the std::function callback, it passes 6 arguments (a necessity given the generality of the runtime). Note that the std::function is being created in an application but is used by a statically-linked library, which is compiled separately. However, I'm using LTO (via -flto and LLD 7.0.1) so I was hoping it would be able to still do this optimization. I'm new to some of this stuff so hopefully this is possible.
When I compile with -O3 and specify __attribute__((flatten)) on the calling function declaration, the lambda is not inlined. I can see when I run my system using perf events that the function isn't being inlined:
return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
mov -0x90(%rbp),%rdi
lea -0x48(%rbp),%rsi
mov %rbx,%rdx
mov %r15,%rbx
callq *0x180(%r15)
...
This call is taking a non-trivial amount of time and it seems like something that should be inlinable; there are only a few callsites in total. I've certainly seen lambdas inlined before but I'm not sure if my approach of using a functor (via std::function) somehow disqualifies inlining.
Is forcing an inline possible? Let me know if more info is needed here.
EDIT:
Thanks for all of the very useful information. I realize now that the way I've set up my runtime, it's not giving the compiler a chance to inline the callback. The comments make it clear why that is the case. There were some allusions to alternative approaches that may be inlinable. Given that 1) I'm in control of both the application and runtime source (and programming models / APIs); 2) I'm compiling both the library and application at once (and can even make them a unified build process), are there alternative approaches that I could take here that would potentially allow an inline to occur? Maybe templates and lambdas (not std::functions)? I'm new to this area and am all ears if anyone has ideas on how to effectively give the compiler what it needs to inline. Worst case scenario, I can even build a custom version of the library (as a proof of concept) for each application if that opens up any possibilities...

The whole point of std::function is to have a common type that can hold an arbitrary callable for a certain signature while, at the same time, allowing that arbitrary callable to be invoked through a common interface no matter what kind of thing the callable actually happened to be. Thus, if you think about it, std::function inherently requires an indirection of some sort.
Which code needs to run for calling an std::function depends not just on the type, but on the particular value of the std::function. This makes std::function (at least the call to the stored callable) inherently not inlineable. The code generated for the function calling your callback has to be able to handle any std::function you may possibly throw at it.
The only way a compiler could potentially offer something like inlining for std::function would be if it was somehow able to figure out that your function calling the callback is most of the time only going to be used with std::function objects holding a particular value, and then generate a clone of the function calling the callback for that specific case. This would either require an almost unrealistically clairvoyant compiler to arrive at in general, or a lot of magic hardwired into the compiler just for std::function specifically. It's not completely impossible in theory. But I've never witnessed any compiler actually being able to do anything like that.
In my experience, optimizers are just not really able to see through std::function. And I would not expect that to change anytime soon, as getting any meaningful optimization there would seem to require huge amounts of effort for a rather questionable benefit. std::function is just heavy machinery to begin with. You simply pay for what you use there. If you can't pay the price, don't use std::function…

Related

Is there a useful "reference" that can be acquired from a std::function back to the stored function/lambda

In C, if one wants to know/acquire a useful reference back to a callback or some other function, it can be done quite easily by casting the function to a void*. Later, in debug for example, the pointer can be examined and traced back to the original function (again as an example, via the compiler map output or even the in-editor debugger).
This sort of information is very useful when using "breadcrumbs" - e.g. a circular buffer of void*s - to debug the flow of an application.
In C++ with std::function, it is possible to get a raw pointer via the target<func_type>() member function, however this only works if you know ahead of time the precise type of the stored target (e.g. a plain C function). When storing lambdas this is no longer the case: the stored target is the lambda's closure type, no longer a simple void() for a std::function<void()>, and target() will therefore return nullptr.
The above being the case, what is the next best reference one can get to the stored target such that it can be used, preferably, after the execution of the application has finished or while execution is paused with the debugger attached?
Alternatively, can the target be acquired via some template magic, while still retaining the semantics/usability of std::function within the library code? This would need to include capturing lambdas and pure C functions.
A few notes:
I am not asking how to do debugging
I am not asking how debug information could be captured from within the stored target - imagine this is a library rather than client code
A std::function implements a template method called target() that retrieves a pointer to the underlying type. That's pretty much all you can get from a std::function.
Note that you must know the underlying, erased type that's in the std::function. This is fundamental to C++: the type of all objects must be known at compile time. There are no workarounds to this, there are no magic types that get instantiated at runtime, depending on some runtime conditions. You cannot declare a pointer of some kind, and have it pop into existence, at runtime, as a pointer to a function, or a pointer to a lambda, or some other thing, depending on what's in the std::function. This goes down to immutable fundamental principles of what C++ is, and how it works.
or while execution is paused with the debugger attached
Nothing special needs to be done to that effect. If paused before invoking a std::function, your debugger shouldn't have any issues stepping into the type-erased function call, step by step, and show you where it comes out. Whether or not your debugger can display the virtual objects tucked away in the std::function depends entirely on your debugger's capabilities, but the underlying type information is there, and should be inspectable in a debugger.

Overhead with std::function

I have seen many instances where people have advised against using std::function<> because it is a heavyweight mechanism. Could someone please explain why that is so?
std::function is a type erasure class.
It takes whatever it is constructed from, and erases everything except:
Invoke with the signature in question (with possible implicit casting)
Destroy
Copy
Cast back to exact original type
and possibly
Move
This involves some overhead. A typical decent-quality std::function will have small object optimization (like small string optimization), avoiding a heap allocation when the amount of memory used is small.
A function pointer will fit in there.
However, there is still overhead. If you initialize a std::function with a compatible function pointer, then instead of the function pointer being called directly, invoking the std::function does a virtual function table lookup, or invokes some other function, which then invokes the function pointer.
With a vtable implementation, that is a possible cache miss, an instruction cache miss, then another instruction cache miss. With a function pointer, the pointer is probably stored locally, and it is called directly, resulting in one possible instruction cache miss.
On top of this, in practice compilers understand function pointers better than std::functions: a number of compilers can figure out that the pointer is a constant value during inlining or whole program optimization. I have never seen one that pulls that off with std::function.
For larger objects (say larger than sizeof(std::string) in one implementation), a heap allocation is also done by the std::function. This is another cost. For function pointers and reference wrappers, SOO is guaranteed by the standard.
Directly storing the lambda without storing it in a std::function is even better than a function pointer: in that case, the code being run is implicit in the type of the lambda. This makes it trivial for code to work out what is going to happen when it is called, and inlining easy for the compiler.
Only do type erasure when you need to.
Under the hood, std::function typically uses type erasure (one simplified explanation for how it may be implemented is here). The cost of storing your function object inside the std::function object may involve a heap allocation. The cost of invoking your function object is typically an indirection through a pointer plus a virtual function call. Also, while compilers are getting better at this, the virtual function call usually inhibits inlining of your function.
That being said, I recommend using std::function unless you know via measurements that the cost is too high (typically when you cannot afford heap allocations, your function will be called many times in a place that requires very low latency, etc.), as it is better to write straightforward code than to prematurely optimize.
Depending on the implementation, std::function will add some overhead due to the use of type erasure. There have been other implementations, such as Don Clugston's fast delegate, with a C++11 implementation here. Please note that it uses UB to make the fastest possible delegate, but is still extremely portable.
If you want type erasure it's the right tool for the job and almost certainly not your bottleneck and not something you could write faster anyway.
However sometimes it can be all too tempting to use type erasure when it really isn't required. That's where to draw the line. For example, if all you want to do is keep hold of a lambda locally then it's probably not the right tool and you should just use:
auto l = [](){};
Likewise for function pointers you don't plan to type erase - just use a function pointer type.
You also don't need type erasure for templates from <algorithm> or your own equivalents, because there's simply no need for heterogeneous functor types to coexist.
It's not so.
To put it simply, it's not too heavyweight unless you profiled your program and showed that it is too heavyweight. Since evidently you did not (otherwise you would know the answer to this question), we can safely conclude that it is in fact not too heavyweight at all.
You should always profile, before concluding that it's too slow.

Is there a reason some functions don't take a void*?

Many functions accept a function pointer as an argument. atexit and call_once are excellent examples. If these higher level functions accepted a void* argument, such as atexit(&myFunction, &argumentForMyFunction), then I could easily wrap any functor I pleased by passing a function pointer and a block of data to provide statefulness.
As is, there are many cases where I wish I could register a callback with arguments, but the registration function does not allow me to pass any arguments through. atexit only accepts one argument: a function taking 0 arguments. I cannot register a function to clean up after my object, I must register a function which cleans up after all objects of a class, and force my class to maintain a list of all objects needing cleanup.
I always viewed this as an oversight; there seemed no valid reason why you wouldn't allow a measly 4 or 8 byte pointer to be passed along, unless you were on an extremely limited microcontroller. I always assumed they simply didn't realize how important that extra argument could be until it was too late to redefine the spec. In the case of call_once, the POSIX version accepts no arguments, but the C++11 version accepts a functor (which is virtually equivalent to passing a function and an argument, only the compiler does some of the work for you).
Is there any reason why one would choose not to allow that extra argument? Is there an advantage to accepting only "void functions with 0 arguments"?
I think atexit is just a special case, because whatever function you pass to it is supposed to be called only once. Therefore whatever state it needs to do its job can just be kept in global variables. If atexit were being designed today, it would probably take a void* in order to enable you to avoid using global variables, but that wouldn't actually give it any new functionality; it would just make the code slightly cleaner in some cases.
For many APIs, though, callbacks are allowed to take additional arguments, and not allowing them to do so would be a severe design flaw. For example, pthread_create does let you pass a void*, which makes sense because otherwise you'd need a separate function for each thread, and it would be totally impossible to write a program that spawns a variable number of threads.
Quite a number of the interfaces taking function pointers that lack a pass-through argument are simply coming from a different time. However, their signatures can't be changed without breaking existing code. It is sort of a misdesign, but that's easy to say in hindsight. The overall programming style has moved on to have limited uses of functional programming within generally non-functional programming languages. Also, at the time many of these interfaces were created, storing any extra data even on "normal" computers implied an observable extra cost: aside from the extra storage used, the extra argument also needs to be passed even when it isn't used. Sure, atexit() is hardly bound to be a performance bottleneck seeing that it is called just once, but if you'd pass an extra pointer everywhere, you'd surely also have one for qsort()'s comparison function.
Specifically for something like atexit() it is reasonably straightforward to use a custom global object with which the function objects to be invoked upon exit are registered: just register a single function with atexit() that calls all of the functions registered with said global object. Also note that atexit() is only guaranteed to register up to 32 functions, although implementations may support more. It therefore seems ill-advised to use atexit() as a registry for per-object clean-up functions, rather than for one function which calls the registered clean-up functions, as other libraries may have a need to register functions, too.
That said, I can't imagine why atexit() is particularly useful in C++, where objects are automatically destroyed upon program termination anyway. Of course, this approach assumes that all objects are somehow held, but that's normally necessary anyway in some form or the other and typically done using appropriate RAII objects.

Should I use std::function or a function pointer in C++?

When implementing a callback function in C++, should I still use the C-style function pointer:
void (*callbackFunc)(int);
Or should I make use of std::function:
std::function< void(int) > callbackFunc;
In short, use std::function unless you have a reason not to.
Function pointers have the disadvantage of not being able to capture some context. You won't be able to, for example, pass a lambda as a callback if it captures some context variables (but it will work if it doesn't capture any). Calling a non-static member function of an object is thus also not possible, since the object (this-pointer) needs to be captured.(1)
std::function (since C++11) is primarily for storing a function (passing it around doesn't require it to be stored). Hence if you want to store the callback, for example in a member variable, it's probably your best choice. But even if you don't store it, it's a good "first choice", although it has the disadvantage of introducing some (very small) overhead when being called (so in a very performance-critical situation it might be a problem, but in most it should not). It is very "universal": if you care a lot about consistent and readable code and don't want to think about every choice you make (i.e. want to keep it simple), use std::function for every function you pass around.
Think about a third option: If you're about to implement a small function which then reports something via the provided callback function, consider a template parameter, which can then be any callable object, i.e. a function pointer, a functor, a lambda, a std::function, ... The drawback here is that your (outer) function becomes a template and hence needs to be implemented in the header. On the other hand you get the advantage that the call to the callback can be inlined, as the client code of your (outer) function "sees" the call to the callback with the exact type information available.
Example for the version with the template parameter (write & instead of && for pre-C++11):
template <typename CallbackFunction>
void myFunction(..., CallbackFunction && callback) {
...
callback(...);
...
}
As you can see in the following table, all of them have their advantages and disadvantages:
|                                       | function ptr | std::function | template param |
|---------------------------------------|--------------|---------------|----------------|
| can capture context variables         | no(1)        | yes           | yes            |
| no call overhead (see comments)       | yes          | no            | yes            |
| can be inlined (see comments)         | no           | no            | yes            |
| can be stored in a class member       | yes          | yes           | no(2)          |
| can be implemented outside of header  | yes          | yes           | no             |
| supported without C++11 standard      | yes          | no(3)         | yes            |
| nicely readable (my opinion)          | no           | yes           | (yes)          |
(1) Workarounds exist to overcome this limitation, for example passing the additional data as further parameters to your (outer) function: myFunction(..., callback, data) will call callback(data). That's the C-style "callback with arguments", which is possible in C++ (and by the way heavily used in the WIN32 API) but should be avoided because we have better options in C++.
(2) Unless we're talking about a class template, i.e. the class in which you store the function is a template. But that would mean that on the client side the type of the function decides the type of the object which stores the callback, which is almost never an option for actual use cases.
(3) For pre-C++11, use boost::function
void (*callbackFunc)(int); may be a C style callback function, but it is a horribly unusable one of poor design.
A well designed C style callback looks like void (*callbackFunc)(void*, int); -- it has a void* to allow the code that does the callback to maintain state beyond the function. Not doing this forces the caller to store state globally, which is impolite.
std::function< int(int) > ends up being slightly more expensive than int(*)(void*, int) invocation in most implementations. It is however harder for some compilers to inline. There are std::function clone implementations that rival function pointer invocation overheads (see 'fastest possible delegates' etc) that may make their way into libraries.
Now, clients of a callback system often need to set up resources and dispose of them when the callback is created and removed, and to be aware of the lifetime of the callback. void(*callback)(void*, int) does not provide this.
Sometimes this is available via code structure (the callback has limited lifetime) or through other mechanisms (unregister callbacks and the like).
std::function provides a means for limited lifetime management (the last copy of the object goes away when it is forgotten).
In general, I'd use a std::function unless performance concerns manifest. If they did, I'd first look for structural changes (instead of a per-pixel callback, how about generating a scanline processor based off of the lambda you pass me?), which should be enough to reduce function-call overhead to trivial levels. Then, if the problem persists, I'd write a delegate based off fastest possible delegates, and see if the performance problem goes away.
I would mostly only use function pointers for legacy APIs, or for creating C interfaces for communicating between different compilers generated code. I have also used them as internal implementation details when I am implementing jump tables, type erasure, etc: when I am both producing and consuming it, and am not exposing it externally for any client code to use, and function pointers do all I need.
Note that you can write wrappers that turn a std::function<int(int)> into a int(void*,int) style callback, assuming there is proper callback lifetime management infrastructure. So as a smoke test for any C-style callback lifetime management system, I'd make sure that wrapping a std::function works reasonably well.
Use std::function to store arbitrary callable objects. It allows the user to provide whatever context is needed for the callback; a plain function pointer does not.
If you do need to use plain function pointers for some reason (perhaps because you want a C-compatible API), then you should add a void * user_context argument so it's at least possible (albeit inconvenient) for it to access state that's not directly passed to the function.
The only reason to avoid std::function is support of legacy compilers that lack support for this template, which was introduced in C++11.
If supporting pre-C++11 language is not a requirement, using std::function gives your callers more choice in implementing the callback, making it a better option compared to "plain" function pointers. It offers the users of your API more choice, while abstracting out the specifics of their implementation for your code that performs the callback.
std::function may bring virtual-method-table (VMT) dispatch into the code in some cases, which has some impact on performance.
The other answers answer based on technical merits. I'll give you an answer based on experience.
As a very heavy X-Windows developer who always worked with function pointer callbacks with void* pvUserData arguments, I started using std::function with some trepidation.
But I found that, combined with the power of lambdas and the like, it has freed up my work considerably: I can, at a whim, throw multiple arguments in, re-order them, ignore parameters the caller wants to supply but I don't need, etc. It really makes development feel looser and more responsive, saves me time, and adds clarity.
On this basis I'd recommend that anyone try using std::function any time they'd normally have a callback. Try it everywhere, for like six months, and you may find you hate the idea of going back.
Yes there's some slight performance penalty, but I write high-performance code and I'm willing to pay the price. As an exercise, time it yourself and try to figure out whether the performance difference would ever matter, with your computers, compilers and application space.

What are the caveats to using inline optimization in C++ functions?

What would be the benefits of inlining different types of function, and what are the issues I would need to watch out for when developing around them? I am not very experienced with a profiler, but in many different algorithmic applications inlining seems to increase the speed as much as 8 times over; if you can give any pointers, that'd be of great use to me.
Inline functions are often overused, and the consequences are significant. Inline indicates to the compiler that a function may be considered for inline expansion. If the compiler chooses to inline a function, the function is not called, but copied into place. The performance gain comes in avoiding the function call, stack frame manipulation, and the function return. The gains can be considerable.
Beware that they can increase program size. They can increase execution time by reducing the caller's locality of reference. When sizes increase, the caller's inner loop may no longer fit in the processor cache, causing unnecessary cache misses and the consequent performance hit. Inline functions also increase build times - if inline functions change, the world must be recompiled. Some guidelines:
Avoid inlining functions until profiling indicates which functions could benefit from inline.
Consider using your compiler's option for auto-inlining after profiling both with and without auto-inlining.
Only inline functions where the function call overhead is large relative to the function's code. In other words, inlining large functions or functions that call other (possibly inlined) functions is not a good idea.
The most important pointer is that you should in almost all cases let the compiler do its thing and not worry about it.
The compiler is free to perform inline expansion of a function even if you do not declare it inline, and it is free not to perform inline expansion even if you do declare it inline. It's entirely up to the compiler, which is okay, because in most cases it knows far better than you do when a function should be expanded inline.
One of the reason the compiler does a better job inlining than the programmer is because the cost/benefit tradeoff is actually decided at the lowest level of machine abstraction: how many assembly instructions make up the function that you want to inline. Consider the ratio between the execution time of a typical non-branching assembly instruction versus a function call. This ratio is predictable to the machine code generator, so that's why the compiler can use that information to guide inlining.
The high level compiler will often try to take care of another opportunity for inlining: when a function B is only called from function A and never called from elsewhere. This inlining is not done for performance reasons (assuming A and B are not small functions), but is useful in reducing linking time by reducing the total number of "functions" that need to be generated.
Added examples
An example of where the compiler performs massive inlining (with massive speedup) is in the compilation of the STL containers. The STL container classes are written to be highly generic, and in return each "function" only performs a tiny bit of operation. When inlining is disabled, for example when compiling in debug mode, the speed of STL containers drops considerably.
A second example would be when the callee function contains certain instructions that require the stack to be undisturbed between the caller and callee. This happens with SIMD instructions using intrinsics. Fortunately, the compilers are smart enough to automatically inline these callee functions because they can inspect whether SIMD assembly instructions are emitted and inline them to make sure the stack is undisturbed.
The bottom line
Unless you are familiar with low-level profiling and are good at assembly programming/optimization, it is better to let the compiler do the job. The STL is a special case in which it might make sense to enable inlining (with a switch) even in debug mode.
The main benefits of inlining a function are that you remove the calling overhead and allow the compiler to optimize across the call boundaries. Generally, the more freedom you give the optimizer, the better your program will perform.
The downside is that the function no longer exists. A debugger won't be able to tell you're inside of it, and no outside code can call it. You also can't replace its definition at run time, as the function body exists in many different locations.
Furthermore, the size of your binary is increased.
Generally, you should declare a function static if it has no external callers, rather than marking it inline. Only let a function be inlined if you're sure there are no negative side effects.
Function call overhead is pretty small. A more significant advantage of inline functions is the ability to use "by reference" variables directly without needing an extra level of pointer indirection. A function which makes heavy use of parameters passed by reference may benefit greatly if its parameters devolve to simple variables or fields.