C-like callback handling: which algorithm performs faster? - c++

I have an array of callbacks like this: void (*callbacks[n])(void* sender), and I'm wondering which of these two approaches will perform faster:
//Method A
#include <stdio.h>

void nullcallback(void* sender) {}
void callbacka(void* sender)
{
    printf("Hello ");
}
void callbackb(void* sender)
{
    printf("world\n");
}
int main()
{
    void (*callbacks[5])(void* sender);
    unsigned i;
    for (i = 0; i < 5; ++i)
        callbacks[i] = nullcallback;
    callbacks[2] = callbacka;
    callbacks[4] = callbackb;
    for (i = 0; i < 5; ++i)
        callbacks[i](NULL);
}
or
//Method B
#include <stdio.h>

void callbacka(void* sender)
{
    printf("Hello ");
}
void callbackb(void* sender)
{
    printf("world\n");
}
int main()
{
    void (*callbacks[5])(void* sender);
    unsigned i;
    for (i = 0; i < 5; ++i)
        callbacks[i] = NULL;
    callbacks[2] = callbacka;
    callbacks[4] = callbackb;
    for (i = 0; i < 5; ++i)
        if (callbacks[i])
            callbacks[i](NULL);
}
Some conditions:
Does it matter whether I know that most of my callbacks are valid?
Does it make a difference if I compile my code with a C or a C++ compiler?
Does the target platform (Windows, Linux, Mac, iOS, Android) change anything in the results? (The whole reason for this callback array is to manage callbacks in a game.)

You'd have to look at the assembler code for that. On my platform (gcc, 32-bit) I found that the compiler is not able to optimize the call to nullcallback away. But if I improve your method A to the following:
int main(void) {
    static void (*const callbacks[5])(void* sender) = {
        [0] = nullcallback,
        [1] = nullcallback,
        [2] = callbacka,
        [3] = nullcallback,
        [4] = callbackb,
    };
    for (unsigned i = 0; i < 5; ++i)
        callbacks[i](0);
}
the compiler is able to unroll the loop and optimize the calls. The result is just:
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $0, (%esp)
call callbacka
movl $0, (%esp)
call callbackb
xorl %eax, %eax
leave
ret
.size main, .-main

This totally depends on your actual situation. If possible I would prefer method A, because it is simply easier to read and produces cleaner code, in particular if your function has a return value:
ret = callbacks[UPDATE_EVENT](sender);
// is nicer than
if (callbacks[UPDATE_EVENT])
    ret = callbacks[UPDATE_EVENT](sender);
else
    ret = 0;
Of course method A becomes tedious when you have not just one function signature but, say, 100 different signatures, and for each of them you have to write a null function.
As for performance, it depends on whether nullcallback() is a rare case or not. If it is rare, method A is obviously faster. If not, method B could be slightly faster, but that depends on many factors: which platform you use, how many arguments your functions have, etc. In any case, if your callbacks are doing "real work", i.e. not just some simple calculations, it shouldn't matter at all.
Where your method B could really be faster is when you call the callback not just for one sender but for very many:
extern void *senders[SENDERS_COUNT]; // SENDERS_COUNT is a large number
if (callbacks[UPDATE_EVENT])
{
    for (int i = 0; i < SENDERS_COUNT; i++)
        callbacks[UPDATE_EVENT](senders[i]);
}
Here the entire loop is skipped when there is no valid callback. This tweak can also be done with method A if the nullcallback() address is known, i.e. not defined only in some other module.
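For illustration, here is a hedged sketch of that tweak applied to method A; the names dispatch_all, on_update, and the call counter are hypothetical, and it assumes nullcallback's address is visible in the same translation unit:

```cpp
typedef void (*callback_t)(void* sender);

static int calls = 0;               // hypothetical, just to observe dispatches
void nullcallback(void*) {}         // shared do-nothing default
void on_update(void*) { ++calls; }  // hypothetical "real" callback

enum { UPDATE_EVENT = 0, EVENT_COUNT = 4 };
callback_t callbacks[EVENT_COUNT] = { nullcallback, nullcallback,
                                      nullcallback, nullcallback };

void dispatch_all(void** senders, int n) {
    // Same "skip the whole loop" idea as method B, but comparing against
    // the known nullcallback address instead of testing for NULL.
    if (callbacks[UPDATE_EVENT] != nullcallback)
        for (int i = 0; i < n; ++i)
            callbacks[UPDATE_EVENT](senders[i]);
}
```

This keeps method A's unconditional-call style in the common case while still letting the dispatcher skip a dead loop.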

You could optimize your code further by simply zero-initializing the array to start with like:
void (*callbacks[5])(void* sender) = { 0 };
Then you've completely eliminated the need for your for-loop to set each pointer to NULL. You now just have to make assignments for callbacka and callbackb.
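A minimal sketch of the zero-initialized variant; run_callbacks and the call counter are hypothetical names added here for illustration:

```cpp
typedef void (*callback_t)(void* sender);

static int calls = 0;               // hypothetical, counts how many slots fired
void callbacka(void*) { ++calls; }
void callbackb(void*) { ++calls; }

int run_callbacks() {
    callback_t callbacks[5] = { 0 };   // all five slots start out NULL
    callbacks[2] = callbacka;
    callbacks[4] = callbackb;
    for (unsigned i = 0; i < 5; ++i)
        if (callbacks[i])              // method B's guard still applies
            callbacks[i](0);
    return calls;
}
```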

For the general case method B is preferred, but for function pointer LUTs where NULL is the exception, method A is microscopically faster.
The primary example is the Linux system call table: NULL entries should only be hit in rare circumstances, such as running binaries built for newer systems, or programmer error. System calls occur often enough that nanosecond or even picosecond improvements can help.
Other instances where it may prove worthwhile are opcode LUTs inside emulators such as MAME.

Related

C++ Nullptr vs Null-Object for potential noop function arguments?

TL;DR : Should we use fn(Interface* pMaybeNull) or fn(Interface& maybeNullObject) -- specifically in the case of "optional" function arguments of a virtual/abstract base class?
Our code base contains various forms of the following pattern:
struct CallbackBase {
    virtual ~CallbackBase() = default;
    virtual void Hello(/*omitted ...*/) = 0;
};
...
void DoTheThing(..., CallbackBase* pOpt) {
    ...
    if (pOpt) { pOpt->Hello(...); }
}
where the usage site would look like:
... {
    auto greet = ...;
    ...
    DoTheThing(..., &greet);
    // or if no callback is required from call site:
    DoTheThing(..., nullptr);
}
It has been proposed that, going forward, we should use a form of the Null-Object pattern, like so:
struct NoopCall : public CallbackBase {
    virtual void Hello(/*omitted ...*/) { /*noop*/ }
};
void DoTheThing2(..., CallbackBase& opt) {
    ...
    opt.Hello(...);
}
... {
    NoopCall noop;
    // if no callback is required from call site:
    DoTheThing2(..., noop);
}
Note: Search variations yield lots of results about the Null-Object pattern (many not in the C++ space) and a lot of very basic treatment of pointers vs. references; and if you include the word "optional" (as in: the parameter is optional) you obviously get a lot of hits about std::optional, which afaik is unsuitable for this virtual interface use case.
I couldn't find a decent comparison of the two variants present here, so here goes:
Given C++17/C++20 and a halfway modern compiler, is there any expected difference in the runtime characteristics of the two approaches? (this factor being just a corollary to the overall design choice.)
The "Null Object" approach certainly "seems" more modern and safer to me -- is there anything in favor of the pointer approach?
Note:
I think it is orthogonal to the question posed, whether it stands as posted, or uses a variant of overloading or default arguments.
That is, the question should be valid, regardless of:
//a
void DoTheThing(arg);
// vs b
void DoTheThing(arg=nullthing);
// vs c
void DoTheThing(arg); // overload1
void DoTheThing(); // overload0 (calling 1 internally)
Performance:
I inspected the code on godbolt and while MSVC shows "the obvious", the gcc output is interesting (see below).
// Gist for a MCVE.
"The obvious" is that the version with the Noop object contains an unconditional virtual call to Hello and the pointer version has an additional pointer test, eliding the call if the pointer is null.
So, if the function is "always" called with a valid callback, the pointer version is a pessimization, paying an additional null check.
If the function is "never" called with a valid callback, the NullObject version is a (worse) pessimization, paying a virtual call that does nothing.
However, the object version in the gcc code contains this:
WithObject(int, CallbackBase&):
...
mov rax, QWORD PTR [rsi]
...
mov rax, QWORD PTR [rax+16]
(!) cmp rax, OFFSET FLAT:NoopCaller::Hello(HelloData const&)
jne .L31
.L25:
...
.L31:
mov rdi, rsi
mov rsi, rsp
call rax
jmp .L25
And while my understanding of assembly is certainly near non-existent, this looks like gcc is comparing the call pointer against the NoopCaller::Hello function, and eliding the call in that case!
Conclusion
In general, the pointer version should produce better code on the micro level. However, compiler optimizations might make any difference nearly unobservable.
Think about using the pointer version if you have a very hot path where the callback is null.
Use the null object version otherwise, as it is arguably safer and more maintainable.

Do branch likelihood hints carry through function calls?

I've come across a few scenarios where I want to say a function's return value is likely inside the body of a function, not the if statement that will call it.
For example, say I want to port code from using a LIKELY macro to using the new [[likely]] annotation. But these go in syntactically different places:
#define LIKELY(...) __builtin_expect(!!(__VA_ARGS__),0)
if(LIKELY(x)) { ... }
vs
if(x) [[likely]] { ... }
There's no easy way to redefine the LIKELY macro to use the annotation. Would defining a function like
inline bool likely(bool x) {
    if(x) [[likely]] return true;
    else return false;
}
propagate the hint out to an if? Like in
if(likely(x)) { ... }
Similarly, in generic code, it can be difficult to directly express algorithmic likelihood information in the actual if statement, even if this information is known elsewhere. For example, a copy_if where the predicate is almost always false. As far as I know, there is no way to express that using attributes, but if branch weight info can propagate through functions, this is a solved problem.
So far I haven't been able to find documentation about this and I don't know a good setup to test this by looking at the outputted assembly.
The story appears to be mixed for different compilers.
On GCC, I think your inline likely function works, or at least has some effect. Using Compiler Explorer to test differences on this code:
inline bool likely(bool x) {
    if(x) [[likely]] return true;
    else return false;
}
//#define LIKELY(x) likely(x)
#define LIKELY(x) x
int f(int x) {
    if (LIKELY(!x)) {
        return -3548;
    }
    else {
        return x + 1;
    }
}
This function f adds 1 to x and returns it, unless x is 0, in which case it returns -3548. The LIKELY macro, when it's active, indicates to the compiler that the case where x is zero is more common.
This version, with no change, produces this assembly under GCC 10 -O1:
f(int):
test edi, edi
je .L3
lea eax, [rdi+1]
ret
.L3:
mov eax, -3548
ret
With the #define changed to the inline function with the [[likely]], we get:
f(int):
lea eax, [rdi+1]
test edi, edi
mov edx, -3548
cmove eax, edx
ret
That's a conditional move instead of a conditional jump. A win, I guess, albeit for a simple example.
This indicates that branch weights propagate through inline functions, which makes sense.
On clang, however, there is only limited support for the likely and unlikely attributes, and where support exists it does not seem to propagate through inline function calls, according to @Peter Cordes's report.
There is, however, a hacky macro solution that I think also works:
#define EMPTY()
#define LIKELY(x) x) [[likely]] EMPTY(
Then anything like
if ( LIKELY(x) ) {
becomes like
if ( x) [[likely]] EMPTY( ) {
which then becomes
if ( x) [[likely]] {
Example: https://godbolt.org/z/nhfehn
Note however that this probably only works in if-statements, or in other cases that the LIKELY is enclosed in parentheses.
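A quick sketch of the macro in use; the function f and its return values are made up for illustration, and [[likely]] needs C++20 (older compilers just ignore the unknown attribute with a warning):

```cpp
#define EMPTY()
#define LIKELY(x) x) [[likely]] EMPTY(

int f(int x) {
    // Expands to: if ( x == 0) [[likely]] { ... }
    if ( LIKELY(x == 0) ) {
        return -3548;
    }
    return x + 1;
}
```

Note that the call must be spelled with the enclosing parentheses, `if ( LIKELY(...) )`, for the expansion to balance.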
gcc 10.2 at least is able to make this deduction (with -O2).
If we consider the following simple program:
void foo();
void bar();
void baz(int x) {
    if (x == 0)
        foo();
    else
        bar();
}
then it compiles to:
baz(int):
test edi, edi
jne .L2
jmp foo()
.L2:
jmp bar()
However if we add [[likely]] on the else clause, the generated code changes to
baz(int):
test edi, edi
je .L4
jmp bar()
.L4:
jmp foo()
so that the not-taken case of the conditional branch corresponds to the "likely" case.
Now if we pull the comparison out into an inline function:
void foo();
void bar();
inline bool is_zero(int x) {
    if (x == 0)
        return true;
    else
        return false;
}
void baz(int x) {
    if (is_zero(x))
        foo();
    else
        bar();
}
we are again back to the original generated code, taking the branch in the bar() case. But if we add [[likely]] on the else clause in is_zero, we see the branch reversed again.
clang 10.0.1 however does not demonstrate this behavior and seems to ignore [[likely]] altogether in all versions of this example.
Yes, it will probably inline, but this is quite pointless.
The __builtin_expect will continue to work even after you upgrade to a compiler that supports those C++ 20 attributes. You can refactor them later, but it will be for purely aesthetic reasons.
Also, your implementation of the LIKELY macro is erroneous (it is actually UNLIKELY); the correct implementations are below.
#define LIKELY( x ) __builtin_expect( !! ( x ), 1 )
#define UNLIKELY( x ) __builtin_expect( !! ( x ), 0 )
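A small sketch of the corrected macros in use; the function parse_len and its values are illustrative, and __builtin_expect is GCC/Clang-specific:

```cpp
#define LIKELY( x ) __builtin_expect( !! ( x ), 1 )
#define UNLIKELY( x ) __builtin_expect( !! ( x ), 0 )

int parse_len(int n) {
    if (UNLIKELY(n < 0))   // error path, hinted as rare
        return -1;
    return n + 1;          // hot path, laid out as the fall-through case
}
```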

Can C++ templates be used for conditional code inclusion?

So that:
template <bool Mode>
void doIt()
{
    //many lines
    template_if(Mode)
    {
        doSomething(); // and not waste resources on if
    }
    //many other lines
}
I know there is std::enable_if, which can be used to enable a function conditionally, but I do not think I can use such an option here.
Essentially, what I need is a template construct that acts like an #ifdef macro.
Before trying something complex it's often worth checking if the simple solution already achieves what you want.
The simplest thing I can think of is to just use an if:
#include <iostream>
void doSomething()
{
    std::cout << "doing it!" << std::endl;
}
template <bool Mode>
void doIt()
{
    //many lines
    if(Mode)
    {
        doSomething(); // and not waste resources on if
    }
    //many other lines
}
void dont()
{
    doIt<false>();
}
void actuallyDoIt()
{
    doIt<true>();
}
So what does that give:
gcc 5.3 with no optimizations enabled gives:
void doIt<false>():
pushq %rbp
movq %rsp, %rbp
nop
popq %rbp
ret
void doIt<true>():
pushq %rbp
movq %rsp, %rbp
call doSomething()
nop
popq %rbp
ret
Note no doSomething() call in the false case just the bare work of the doIt function call. Turning optimizations on would eliminate even that.
So we already get what we want and are not wasting anything in the if. It's probably good to leave it at that rather than adding any unneeded complexity.
It can sort of be done.
If the code inside your "if" is syntactically and semantically valid for the full set of template arguments that you intend to provide, then you can basically just write an if statement. Thanks to basic optimisations, if (someConstant) { .. } is not going to survive compilation when someConstant is false. And that's that.
However, if the conditional code is actually not valid when the condition isn't met, then you can't do this. That's because class templates and function templates are instantiated ... in full. Your entire function body is instantiated so it all has to be valid. There's no such thing as instantiating an arbitrary block of code.†
So, in that case, you'd have to go back to messy old function specialisation with enable_if or whatever.
† C++17 is likely to have if constexpr, which essentially gives you exactly this. But that's future talk.
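That future has since arrived; under C++17, a sketch of the if constexpr version of the question's doIt might look like this (the integer "work" is a hypothetical stand-in for doSomething so the effect is observable):

```cpp
template <bool Mode>
int doIt() {
    int work = 0;
    //many lines
    if constexpr (Mode) {
        // This branch is discarded at compile time when Mode is false,
        // so the code here only needs to be valid when Mode is true.
        work = 42;
    }
    //many other lines
    return work;
}
```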
You could specialize your template so that your code is only used when the template parameter is true:
template <bool Cond> struct condition;
template <> struct condition<false> {
    static void do_something() {}
};
template <> struct condition<true> {
    static void do_something() {
        // Actual code
    }
};
// Usage:
condition<true>::do_something();
condition<compiletime_constant>::do_something();

Can I persuade GCC to inline a deferred call through a stored function pointer?

Naturally, C++ compilers can inline function calls made from within a function template, when the inner function call is directly known in that scope (ref).
#include <iostream>
void holyheck()
{
    std::cout << "!\n";
}
template <typename F>
void bar(F foo)
{
    foo();
}
int main()
{
    bar(holyheck);
}
Now what if I'm passing holyheck into a class, which stores the function pointer (or equivalent) and later invokes it? Do I have any hope of getting this inlined? How?
template <typename F>
struct Foo
{
    Foo(F f) : f(f) {}
    void calledLater() { f(); }
private:
    F f;
};
void sendMonkeys();
void sendTissues();
int main()
{
    Foo<void(*)()> f(sendMonkeys);
    Foo<void(*)()> g(sendTissues);
    // lots of interaction with f and g, not shown here
    f.calledLater();
    g.calledLater();
}
My type Foo is intended to isolate a ton of logic; it will be instantiated a few times. The specific function invoked from calledLater is the only thing that differs between instantiations (though it never changes during the lifetime of a Foo), so half of the purpose of Foo is to abide by DRY. (The rest of its purpose is to keep this mechanism isolated from other code.)
But I don't want to introduce the overhead of an actual additional function call in doing so, because this is all taking place in a program bottleneck.
I don't speak ASM so analysing the compiled code isn't much use to me.
My instinct is that I have no chance of inlining here.
If you don't really need to use a function pointer, then a functor should make the optimisation trivial:
struct CallSendMonkeys {
    void operator()() {
        sendMonkeys();
    }
};
struct CallSendTissues {
    void operator()() {
        sendTissues();
    }
};
(Of course, C++11 has lambdas, but you tagged your question C++03.)
By having different instantiations of Foo with these classes, and having no internal state in these classes, f() does not depend on how f was constructed, so it's not a problem if a compiler can't tell that it remains unmodified.
With your example, which after some fiddling to make it compile looks like this:
template <typename F>
struct Foo
{
    Foo(F f) : f(f) {}
    void calledLater() { f(); }
private:
    F f;
};
void sendMonkeys();
void sendTissues();
int main()
{
    Foo<__typeof__(&sendMonkeys)> f(sendMonkeys);
    Foo<__typeof__(&sendTissues)> g(sendTissues);
    // lots of interaction with f and g, not shown here
    f.calledLater();
    g.calledLater();
}
clang++ (3.7 as of a few weeks back which means I'd expect clang++3.6 to do this, as it's only a few weeks older in source-base) generates this code:
.text
.file "calls.cpp"
.globl main
.align 16, 0x90
.type main,#function
main: # #main
.cfi_startproc
# BB#0: # %entry
pushq %rax
.Ltmp0:
.cfi_def_cfa_offset 16
callq _Z11sendMonkeysv
callq _Z11sendTissuesv
xorl %eax, %eax
popq %rdx
retq
.Ltmp1:
.size main, .Ltmp1-main
.cfi_endproc
Of course, without a definition of sendMonkeys and sendTissues, we can't really inline any further.
If we implement them like this:
void request(const char *);
void sendMonkeys() { request("monkeys"); }
void sendTissues() { request("tissues"); }
the assembler code becomes:
main: # #main
.cfi_startproc
# BB#0: # %entry
pushq %rax
.Ltmp2:
.cfi_def_cfa_offset 16
movl $.L.str, %edi
callq _Z7requestPKc
movl $.L.str1, %edi
callq _Z7requestPKc
xorl %eax, %eax
popq %rdx
retq
.L.str:
.asciz "monkeys"
.size .L.str, 8
.type .L.str1,#object # #.str1
.L.str1:
.asciz "tissues"
.size .L.str1, 8
Which, if you can't read assembler code, is request("tissues") and request("monkeys") inlined, as expected.
I'm simply amazed that g++ 4.9.2 doesn't do the same thing (I got this far and expected to continue with "and g++ does the same, I'm not going to post the code for it"). [It does inline sendTissues and sendMonkeys, but doesn't go the next step and inline request as well.]
Of course, it's entirely possible to make tiny changes to this and NOT get the code inlined - such as adding some conditions that depend on variables that the compiler can't determine at compile-time.
Edit:
I did add a string and an integer to Foo and updated these with an external function, at which point the inlining went away for both clang and gcc. Using JUST an integer and calling an external function, it does inline the code.
In other words, it really depends on what the code is in the section
// lots of interaction with f and g, not shown here. And I think you (Lightness) have been around here long enough to know that for 80%+ of the questions, it's the code that isn't posted in the question that is the most important part for the actual answer ;)
To make your original approach work, use
template <void (&Func)()>
struct Foo
{
    void calledLater() { Func(); }
};
In general, I've had better luck getting gcc to inline things by using function references rather than function pointers.
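A self-contained sketch of that approach; the counters here are hypothetical stand-ins for the bodies of sendMonkeys/sendTissues so the effect is observable:

```cpp
static int monkeys = 0, tissues = 0;   // hypothetical observable effects
void sendMonkeys() { ++monkeys; }
void sendTissues() { ++tissues; }

template <void (&Func)()>
struct Foo {
    // Func is a compile-time constant: no function pointer is stored in the
    // object at all, so the call is trivially inlinable.
    void calledLater() { Func(); }
};

void run() {
    Foo<sendMonkeys> f;
    Foo<sendTissues> g;
    f.calledLater();
    g.calledLater();
}
```

Note that each callee now produces a distinct instantiation of Foo, which is exactly what lets the compiler resolve the call statically.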

C++ pimpl idiom wastes an instruction vs. C style?

(Yes, I know that one machine instruction usually doesn't matter. I'm asking this question because I want to understand the pimpl idiom, and use it in the best possible way; and because sometimes I do care about one machine instruction.)
In the sample code below, there are two classes, Thing and
OtherThing. Users would include "thing.hh".
Thing uses the pimpl idiom to hide its implementation.
OtherThing uses a C style – non-member functions that return and take pointers. This style produces slightly better machine code. I'm wondering: is there a way to use the C++ style – i.e., make the functions member functions – and still save the machine instruction? I like the C++ style because it doesn't pollute the namespace outside the class.
Note: I'm only looking at calling member functions (in this case, calc). I'm not looking at object allocation.
Below are the files, commands, and the machine code, on my Mac.
thing.hh:
class ThingImpl;
class Thing
{
    ThingImpl *impl;
public:
    Thing();
    int calc();
};
class OtherThing;
OtherThing *make_other();
int calc(OtherThing *);
thing.cc:
#include "thing.hh"
struct ThingImpl
{
    int x;
};
Thing::Thing()
{
    impl = new ThingImpl;
    impl->x = 5;
}
int Thing::calc()
{
    return impl->x + 1;
}
struct OtherThing
{
    int x;
};
OtherThing *make_other()
{
    OtherThing *t = new OtherThing;
    t->x = 5;
    return t;
}
int calc(OtherThing *t)
{
    return t->x + 1;
}
main.cc (just to test the code actually works...)
#include "thing.hh"
#include <cstdio>
int main()
{
    Thing *t = new Thing;
    printf("calc: %d\n", t->calc());
    OtherThing *t2 = make_other();
    printf("calc: %d\n", calc(t2));
}
Makefile:
all: main
thing.o : thing.cc thing.hh
	g++ -fomit-frame-pointer -O2 -c thing.cc
main.o : main.cc thing.hh
	g++ -fomit-frame-pointer -O2 -c main.cc
main: main.o thing.o
	g++ -O2 -o $@ $^
clean:
	rm *.o
	rm main
Run make and then look at the machine code. On the mac I use otool -tv thing.o | c++filt. On linux I think it's objdump -d thing.o. Here is the relevant output:
Thing::calc():
0000000000000000 movq (%rdi),%rax
0000000000000003 movl (%rax),%eax
0000000000000005 incl %eax
0000000000000007 ret
calc(OtherThing*):
0000000000000010 movl (%rdi),%eax
0000000000000012 incl %eax
0000000000000014 ret
Notice the extra instruction because of the pointer indirection. The first function looks up two fields (impl, then x), while the second only needs to get x. What can be done?
One instruction is rarely a thing to spend much time worrying over. Firstly, the compiler may cache the pImpl in a more complex use case, thus amortising the cost in a real-world scenario. Secondly, pipelined architectures make it almost impossible to predict the real cost in clock cycles. You'll get a much more realistic idea of the cost if you run these operations in a loop and time the difference.
Not too hard, just use the same technique inside your class. Any halfway decent optimizer will inline the trivial wrapper.
class ThingImpl;
class Thing
{
    ThingImpl *impl;
    static int calc(ThingImpl*);
public:
    Thing();
    int calc() { return calc(impl); }
};
There's the nasty way, which is to replace the pointer to ThingImpl with a big-enough array of unsigned chars, and then use placement new, reinterpret_cast, and an explicit destructor call to manage the ThingImpl object in place.
Or you could just pass the Thing around by value, since it should be no larger than the pointer to the ThingImpl, though it may require a little more than that (reference counting of the ThingImpl would defeat the optimisation, so you need some way of flagging the 'owning' Thing, which might require extra space on some architectures).
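A sketch of the "big-enough array" variant mentioned above, using C++11 alignas for brevity; note that the buffer's size and alignment must match ThingImpl, so ThingImpl ends up visible to the header, which partially defeats the pimpl compilation firewall:

```cpp
#include <new>

struct ThingImpl { int x; };   // must be visible here so sizeof/alignof work,
                               // unlike a true pimpl forward declaration

class Thing {
    alignas(ThingImpl) unsigned char buf[sizeof(ThingImpl)];
    ThingImpl* impl() { return reinterpret_cast<ThingImpl*>(buf); }
public:
    Thing()  { new (buf) ThingImpl{5}; }   // placement-new into the buffer
    ~Thing() { impl()->~ThingImpl(); }     // explicit destructor call
    int calc() { return impl()->x + 1; }   // no second pointer chase to the heap
};
```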
I disagree with your usage: you are not comparing the same two things.
#include "thing.hh"
#include <cstdio>
int main()
{
    Thing *t = new Thing; // 1
    printf("calc: %d\n", t->calc());
    OtherThing *t2 = make_other(); // 2
    printf("calc: %d\n", calc(t2));
}
At (1) you have in fact two calls to new: one explicit, and one implicit (done by the constructor of Thing).
At (2) you have one call to new, implicit (inside make_other()).
You should allocate Thing on the stack; it would probably not remove the double-dereference instruction, but it could change its cost (remove a cache miss).
However, the main point is that Thing manages its memory on its own, so you can't forget to delete the actual memory, while you definitely can with the C-style method.
I would argue that automatic memory handling is worth an extra memory instruction, specifically because, as has been said, the dereferenced value will probably be cached if you access it more than once, so it amounts to almost nothing.
Correctness is more important than performance.
Let the compiler worry about it. It knows far more about what is actually faster or slower than we do. Especially on such a minute scale.
Having items in classes has far, far more benefits than just encapsulation. PIMPL's a great idea, if you've forgotten how to use the private keyword.