Template version and non-template version of the same function - c++

Consider the following simple function
void foo_rt(int n) {
for(int i=0; i<n; ++i) {
// ... do something relatively cheap ...
}
}
If I know the parameter n at compiletime, I can write a template version of the same function:
template<int n>
void foo_ct() {
for(int i=0; i<n; ++i) {
// ... do something relatively cheap ...
}
}
This allows the compiler to do things like loop unrolling, which increases speed.
But assume now that I sometimes know n at compiletime and sometimes only at runtime. How can I implement this without maintaining two versions of the function? I was thinking something along the lines:
inline void foo(int n) {
for(int i=0; i<n; ++i) {
// ... do something relatively cheap ...
}
}
// Runtime version
void foo_rt(int n) { foo(n); }
// Compiletime version
template<int n>
void foo_ct() { foo(n); }
But I am not sure if all compilers are smart enough to deal with this. Is there a better way?
EDIT:
Clearly, one solution that will work is to use macros, but this I really want to avoid:
#define foo_body \
{ \
for(int i=0; i<n; ++i) { \
// ... do something relatively cheap ... \
} \
}
// Runtime version
void foo_rt(int n) foo_body
// Compiletime version
template<int n>
void foo_ct() foo_body

I've done this before, using an integral_variable type and std::integral_constant. This looks like a lot of code, but if you look again, it's actually only a series of four very simple pieces, one of which is merely demo code.
#include <type_traits>
//type for acting like integral_constant but with a variable
template<class underlying>
struct integral_variable {
const underlying value;
integral_variable(underlying v) :value(v) {}
};
//generic function
template<class value>
void foo(value n) {
for(int i=0; i<n.value; ++i) {
// ... do something relatively cheap ...
}
}
//optional: specialize so callers don't have to do casts
void foo_rt(int n) { return foo(integral_variable<int>(n)); }
template<int n>
void foo_ct() { return foo(std::integral_constant<unsigned, n>()); }
//notice it even handles different underlying types. Doesn't care.
//usage is simple
int main() {
foo_rt(3);
foo_ct<17>();
}

Much as I admire the DRY principle, I don't think there's a way around writing it twice.
Even though the code is the same, these are two very different operations -- working with a known value versus working with an unknown value.
You want to put the known one on a fast track to optimization that the unknown one may not qualify for.
What I would do is factor out all the code that does not depend on n into another function (which hopefully is the entire body of your for loop), and then have both your templated and non-templated versions call that within their loops. That way, the only thing you're repeating is the structure of the for loop, which I wouldn't consider a big deal.
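A minimal sketch of that factoring, assuming the loop body depends only on the index. The names body and sink are hypothetical and just stand in for the "relatively cheap" work from the question:

```cpp
// Hypothetical accumulator so the loop body has an observable effect.
inline int& sink() {
    static int value = 0;
    return value;
}

// Everything that does not depend on n lives here, written once.
inline void body(int i) {
    sink() += i;
}

// Runtime version: n is only known when the call happens.
inline void foo_rt(int n) {
    for (int i = 0; i < n; ++i) body(i);
}

// Compile-time version: n is a template argument, so the trip count
// is a constant the optimizer can see.
template <int n>
void foo_ct() {
    for (int i = 0; i < n; ++i) body(i);
}
```

Only the for statement is repeated; whether foo_ct actually gets unrolled is still up to the compiler.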

If a value is known at compile time, routing it through a template as a template parameter doesn't make it any more known at compile time. I think it's very unlikely that there are any compilers out there that will inline and optimize a function simply because the variable is a template parameter rather than some other kind of compile-time constant.
Depending on your compiler you may not even need two versions of the function. An optimizing compiler may well just be able to optimize a function called with constant expression parameters. For example:
extern volatile int *I;
void foo(int n) {
for (int i=0;i<n;++i)
*I = i;
}
int main(int argc,char *[]) {
foo(4);
foo(argc);
}
My compiler turns this into an inlined, unrolled loop from 0 to 3, followed by an inlined loop on argc:
main: # #main
# BB#0: # %entry
movq I(%rip), %rax
movl $0, (%rax)
movl $1, (%rax)
movl $2, (%rax)
movl $3, (%rax)
testl %ecx, %ecx
jle .LBB1_3
# BB#1: # %for.body.lr.ph.i
xorl %eax, %eax
movq I(%rip), %rdx
.align 16, 0x90
.LBB1_2: # %for.body.i4
# =>This Inner Loop Header: Depth=1
movl %eax, (%rdx)
incl %eax
cmpl %eax, %ecx
jne .LBB1_2
.LBB1_3: # %_Z3fooi.exit5
xorl %eax, %eax
ret
To get such optimizations you either need to ensure the definition is available to all translation units (e.g., by defining the function as inline in a header file), or have a compiler that does link-time optimization.
If you use this and you're really depending on some things being computed at compile time then you should have automated tests to verify it gets done.
C++11 provides constexpr, which allows you to write a function that can be computed at compile time when given constant-expression arguments, or to guarantee that a value is computed at compile time. There are restrictions on what can go in a constexpr function, which may make it difficult to implement your function as a constexpr, but the allowed language is apparently Turing complete. One issue is that, while the restrictions guarantee that the computation can be done at compile time if given constexpr parameters, those restrictions may result in an inefficient implementation for when the parameters are not constexpr.
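As a rough illustration of both points, here is a C++11-style constexpr function. sum_below is a hypothetical example, not the OP's function; the single-return-expression rule of C++11 forces recursion where a loop would be more natural:

```cpp
// C++11-style constexpr: a single return expression, so recursion
// replaces the loop. sum_below is a hypothetical example function.
constexpr int sum_below(int n) {
    return n <= 0 ? 0 : (n - 1) + sum_below(n - 1);
}

// Using the result in a constant expression forces compile-time evaluation.
static_assert(sum_below(4) == 6, "evaluated by the compiler");

// The same function also accepts a value that is only known at run time;
// in that case it is an ordinary (recursive) call.
inline int sum_below_runtime(int n) {
    return sum_below(n);
}
```

Later standards (C++14 onwards) relax the restrictions enough to allow loops inside constexpr functions.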

Why not something like:
template<typename getter>
void f(getter g)
{
    for (int i = 0; i < g.get(); i++) { blah(); }
}
// Compile-time count: get() is a constant expression.
struct getter1 { constexpr int get() const { return 1; } };
// Run-time count: the value is stored at construction.
struct getterN {
    explicit getterN(int n) : n_(n) {}
    int get() const { return n_; }
    int n_;
};
f<getter1>(getter1());
f<getterN>(getterN(100));

I would point out that if you have:
// Runtime version
void foo_rt(int n){ foo(n);}
... and this works for you, then you do in fact know the type at compile time. At least, you know a type that it is covariant with, and that's all you need to know. You can just use the templated version. If need be, you can specify the type at the call site, like this:
foo_rt<int>()

Related

Why can't GCC assume that std::vector::size won't change in this loop?

I claimed to a coworker that if (i < input.size() - 1) print(0); would get optimized in this loop so that input.size() is not read in every iteration, but it turns out that this is not the case!
void print(int x) {
std::cout << x << std::endl;
}
void print_list(const std::vector<int>& input) {
int i = 0;
for (size_t i = 0; i < input.size(); i++) {
print(input[i]);
if (i < input.size() - 1) print(0);
}
}
According to the Compiler Explorer with gcc options -O3 -fno-exceptions we are actually reading input.size() each iteration and using lea to perform a subtraction!
movq 0(%rbp), %rdx
movq 8(%rbp), %rax
subq %rdx, %rax
sarq $2, %rax
leaq -1(%rax), %rcx
cmpq %rbx, %rcx
ja .L35
addq $1, %rbx
Interestingly, in Rust this optimization does occur. It looks like i gets replaced with a variable j that is decremented each iteration, and the test i < input.size() - 1 is replaced with something like j > 0.
fn print(x: i32) {
println!("{}", x);
}
pub fn print_list(xs: &Vec<i32>) {
for (i, x) in xs.iter().enumerate() {
print(*x);
if i < xs.len() - 1 {
print(0);
}
}
}
In the Compiler Explorer the relevant assembly looks like this:
cmpq %r12, %rbx
jae .LBB0_4
I checked and I am pretty sure r12 is xs.len() - 1 and rbx is the counter. Earlier there is an add for rbx and a mov outside of the loop into r12.
Why is this? It seems like if GCC is able to inline the size() and operator[] as it did, it should be able to know that size() does not change. But maybe GCC's optimizer judges that it is not worth pulling it out into a variable? Or maybe there is some other possible side effect that would make this unsafe--does anyone know?
The non-inline function call to cout.operator<<(int) is a black box for the optimizer (because the library is just written in C++ and all the optimizer sees is a prototype; see discussion in comments). It has to assume any memory that could possibly be pointed to by a global var has been modified.
(Or the std::endl call. BTW, why force a flush of cout at that point instead of just printing a '\n'?)
e.g. for all it knows, std::vector<int> &input is a reference to a global variable, and one of those function calls modifies that global var. (Or there's a global vector<int> *ptr somewhere, or there's a function that returns a pointer to a static vector<int> in some other compilation unit, or some other way that a function could get a reference to this vector without being passed a reference to it by us.)
If you had a local variable whose address had never been taken, the compiler could assume that non-inline function calls couldn't mutate it. Because there'd be no way for any global variable to hold a pointer to this object. (This is called Escape Analysis). That's why the compiler can keep size_t i in a register across function calls. (int i can just get optimized away because it's shadowed by size_t i and not used otherwise).
It could do the same with a local vector (i.e. for the base, end_size and end_capacity pointers.)
ISO C99 has a solution for this problem: int *restrict foo. Many C++ compilers support int *__restrict foo to promise that memory pointed to by foo is only accessed via that pointer. This is most commonly useful in functions that take 2 arrays, where you want to promise the compiler they don't overlap, so it can auto-vectorize without generating code to check for that and run a fallback loop.
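A small sketch of the two-array case. add_arrays is a hypothetical function; __restrict is a compiler extension (GCC, Clang and MSVC accept this spelling), not ISO C++, and passing overlapping arrays breaks the promise and is undefined behaviour:

```cpp
#include <cstddef>

// __restrict promises the compiler that dst and src never overlap,
// so it does not need a runtime overlap check before vectorizing.
inline void add_arrays(int* __restrict dst, const int* __restrict src,
                       std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] += src[i];
    }
}
```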
The OP comments:
In Rust a non-mutable reference is a global guarantee that no one else is mutating the value you have a reference to (equivalent to C++ restrict)
That explains why Rust can make this optimization but C++ can't.
Optimizing your C++
Obviously you should use auto size = input.size(); once at the top of your function so the compiler knows it's a loop invariant. C++ implementations don't solve this problem for you, so you have to do it yourself.
You might also need const int *data = input.data(); to hoist loads of the data pointer from the std::vector<int> "control block" as well. It's unfortunate that optimizing can require very non-idiomatic source changes.
Rust is a much more modern language, designed after compiler developers learned what was possible in practice for compilers. It really shows in other ways, too, including portably exposing some of the cool stuff CPUs can do via i32.count_ones, rotate, bit-scan, etc. It's really dumb that ISO C++ didn't expose any of these portably, except std::bitset::count(), until C++20 added the <bit> header (std::popcount, std::rotl, std::countl_zero, etc.).

Efficiency of passing a struct to a function without instantiating a local variable

I recently learned that I can do the following when passing a struct to a function in C++:
(My apologies for not using a more appropriate name for this "feature" in the title, feel free to correct me)
#include <iostream>
typedef struct mystruct{
int data1;
int data2;
} MYSTRUCT;
void myfunction( MYSTRUCT _struct ){
std::cout << _struct.data1 << _struct.data2;
}
int main(){
//This is what I recently learned
myfunction( MYSTRUCT{2,3} );
return 0;
}
This makes me wonder: is this less costly than instantiating a local MYSTRUCT and passing it by value to the function? Or is it just a convenient way to do the same thing, except that the temporary is destroyed right afterwards?
For example adding this line #define KBIG 10000000, is this:
std::vector<MYSTRUCT> myvector1;
for (long long i = 0; i < KBIG; i++) {
myvector1.push_back(MYSTRUCT{ 1,1 });
}
Consistently faster than this:
std::vector<MYSTRUCT> myvector2;
for (long long i = 0; i < KBIG; i++) {
MYSTRUCT localstruct = { 1,1 };
myvector2.push_back(localstruct);
}
I tried testing it, but the results were pretty inconsistent, hovering around 9-12 seconds for each. Sometimes the first one would be faster, other times not. Of course, this could be due to all the background processes at the time I was testing.
Simplifying slightly and compiling to assembler:
extern void emit(int);
typedef struct mystruct{
int data1;
int data2;
} MYSTRUCT;
__attribute__((noinline))
void myfunction( MYSTRUCT _struct ){
emit(_struct.data1);
emit(_struct.data2);
}
int main(){
//This is what I recently learned
myfunction( MYSTRUCT{2,3} );
return 0;
}
with -O2 yields:
myfunction(mystruct):
pushq %rbx
movq %rdi, %rbx
call emit(int)
sarq $32, %rbx
movq %rbx, %rdi
popq %rbx
jmp emit(int)
main:
movabsq $12884901890, %rdi
subq $8, %rsp
call myfunction(mystruct)
xorl %eax, %eax
addq $8, %rsp
ret
What happened?
The compiler realised that the entire structure fits into a register and passed it by value that way.
moral of the story: express intent. Let the compiler worry about details.
If you need a copy, you need a copy. End of story.
If speed is of any concern, take measurements of copying vs. taking const ref (i.e., const MYSTRUCT& _struct). When you do measurements, make sure you do them <1> <2>, then <2> <1> to compensate for cache effect.
Suggestions: avoid using _ as the first character of a parameter name, since some identifiers that start with an underscore are reserved for the implementation; also, do not write the struct name in all capitals.
If you want to speed up your code, I suggest you pass the struct by const reference, as follows:
void myfunction (const MYSTRUCT& _struct)
{
std::cout << _struct.data1 << _struct.data2;
}
For large structs this will be much faster than passing by value, because instead of copying the entire struct it passes just its address.
(OK, there will not be much difference in this case, since your struct contains only 2 integers, but if you have more than 1000 bytes, for example, there will be a notable difference.)
Also, I suggest you use std::vector<T>::emplace_back instead of std::vector<T>::push_back:
std::vector<MYSTRUCT> myvector1;
for (long long i = 0; i < KBIG; i++)
{
myvector1.emplace_back (1, 1);
}
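emplace_back forwards its arguments to the constructor of mystruct, so no temporary is created and then copied. Note that mystruct as declared above is an aggregate with no constructor, so emplace_back(1, 1) only compiles if you add a constructor taking two ints, or if you build as C++20 or later, which allows parenthesized initialization of aggregates.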
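To make the difference visible, here is a sketch with a hypothetical struct, Counted, that has a two-int constructor and counts its copy constructions. For a trivially copyable two-int struct an optimizing compiler will most likely generate equivalent code for both forms, so treat this as an illustration of the semantics rather than a benchmark:

```cpp
#include <vector>

// Hypothetical struct with a two-int constructor and a copy counter.
struct Counted {
    static int& copies() { static int c = 0; return c; }
    int data1;
    int data2;
    Counted(int a, int b) : data1(a), data2(b) {}
    Counted(const Counted& o) : data1(o.data1), data2(o.data2) { ++copies(); }
};

// Returns how many copy constructions happen while adding one element.
inline int copies_with_push_back() {
    std::vector<Counted> v;
    v.reserve(1);            // rule out copies caused by reallocation
    Counted local(1, 1);
    Counted::copies() = 0;
    v.push_back(local);      // copies the named object into the vector
    return Counted::copies();
}

inline int copies_with_emplace_back() {
    std::vector<Counted> v;
    v.reserve(1);
    Counted::copies() = 0;
    v.emplace_back(1, 1);    // constructs the element in place
    return Counted::copies();
}
```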

CPU overhead for struct?

In C/C++, is there any CPU overhead for accessing struct members in comparison to isolated variables?
For a concrete example, should something like the first code sample below use more CPU cycles than the second one? Would it make any difference if it were a class instead of a struct? (in C++)
1)
struct S {
int a;
int b;
};
struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
2)
int a;
int b;
a = 10;
b = 20;
a++;
b++;
"Don't optimize yet." The compiler will figure out the best case for you. Write what makes sense first, and make it faster later if you need to. For fun, I ran the following in Clang 3.4 (-O3 -S):
void __attribute__((used)) StructTest() {
struct S {
int a;
int b;
};
volatile struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
}
void __attribute__((used)) NoStructTest() {
volatile int a;
volatile int b;
a = 10;
b = 20;
a++;
b++;
}
int main() {
StructTest();
NoStructTest();
}
StructTest and NoStructTest have identical ASM output:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $10, -4(%ebp)
movl $20, -8(%ebp)
incl -4(%ebp)
incl -8(%ebp)
addl $8, %esp
popl %ebp
ret
No. The size of all the types in the struct, and thus the offset to each member from the beginning of the struct, is known at compile-time, so the address used to fetch the values in the struct is every bit as knowable as the addresses of individual variables.
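A short sketch of what "known at compile time" means here, using offsetof. The second static_assert assumes there is no padding between two int members, which holds on mainstream ABIs but is not guaranteed by the standard:

```cpp
#include <cstddef>

struct S {
    int a;
    int b;
};

// Member offsets are compile-time constants, which is why s.b can be
// addressed as "base + constant" with no run-time lookup.
static_assert(offsetof(S, a) == 0, "first member sits at the start");
// Assumption: no padding between two ints (true on common ABIs).
static_assert(offsetof(S, b) == sizeof(int), "b follows a directly");

inline int bump_both() {
    S s;
    s.a = 10;
    s.b = 20;
    s.a++;
    s.b++;
    return s.a + s.b;
}
```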
My understanding is that all of the values in a struct are adjacent in memory, and are better able to take advantage of the memory caching than the variables.
The variables are probably adjacent in memory too, but they're not guaranteed to be adjacent like the struct.
That being said, cpu performance should not be a consideration when deciding whether to use or not use a struct in the first place.
The real answer is: it depends entirely on your CPU architecture and your compiler. The best way is to compile and look at the assembly code.
Now for an x86 machine, I'm pretty sure there isn't. The offset is computed at compile time, and there is an addressing mode that takes a constant offset.
If you ask the compiler to optimize (e.g. compile with gcc -O2 or g++ -O2) then there is not much overhead (probably too small to be measurable, or perhaps a few percent).
However, if you use only local variables, the optimizing compiler might even not allocate slots for them in the local call frame.
Compile with gcc -O2 -fverbose-asm -S and look into the generated assembly code.
Using a class won't make any difference (of course, some class-es have costly constructors & destructors).
Such a code could be useful in generated C or C++ code (like MELT does); local such struct-s or class-es contain the local call frame (as seen by the MELT language, see e.g. its gcc/melt/warmelt-genobj+01.cc generated file). I don't claim it is as efficient as real C++ local variables, but it gets optimized enough.

Performance hit of vtable lookup in C++

I'm evaluating whether to rewrite a piece of real-time software from C/assembly language to C++/assembly language (for reasons not relevant to the question, parts of the code absolutely have to be written in assembly).
An interrupt comes with a 3 kHz frequency, and for each interrupt around 200 different things are to be done in a sequence. The processor runs with 300 MHz, giving us 100,000 cycles to do the job. This has been solved in C with an array of function pointers:
// Each function does a different thing, all take one parameter being a pointer
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);
// Array of pointers to structs containing each function's parameters.
void *paramlist[200];
void realtime(void)
{
int i;
for (i = 0; i < 200; i++)
(*todolist[i])(paramlist[i]);
}
Speed is important. The above 200 iterations are done 3,000 times per second, so practically we do 600,000 iterations per second. The above for loop compiles to five cycles per iteration, yielding a total cost of 3,000,000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to four instructions, however I fear we might get some extra delay due to memory accesses close to each other, etc. In short, I believe those five cycles are pretty optimal.
Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need and use, and have in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:
class Base
{
public:
    virtual void Execute();
    int something_all_things_need;
};
class Derived1 : public Base
{
public:
    void Execute() { /* Do something */ }
    int own_parameter;
    // Other own parameters
};
class Derived2 : public Base { /* Etc. */ };
Base *todolist[200];
void realtime(void)
{
for (int i = 0; i < 200; i++)
todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}
My problem is the vtable lookup. I cannot do 600,000 lookups per second; this would account for more than 4% of wasted CPU load. Moreover the todolist never changes during run-time, it is only set up once at start-up, so the effort of looking up what function to call is truly wasted. Upon asking myself the question "what is the most optimal end result possible", I look at the assembler code given by the C solution, and refind an array of function pointers...
What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.
Update (including correction of where the loop starts):
The processor is an ADSP-214xx, and the compiler is VisualDSP++ 5.0. When enabling #pragma optimize_for_speed, the C loop is 9 cycles. Assembly-optimizing it in my mind yields 4 cycles, however I didn't test it so it's not guaranteed. The C++ loop is 14 cycles. I'm aware that the compiler could do a better job; however, I did not want to dismiss this as a compiler issue. Getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here is the resulting assembly:
C:
i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;
Here's the actual loop:
r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);
C++ :
i5=0xb279a;
r15=0xc8;
Here's the actual loop:
i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);
In the meantime, I think I have found sort of an answer. The lowest number of cycles is achieved by doing the very least possible. I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers.
Naturally, the limit is what can be done in assembly, and that has now been explored. Having this in mind, I now understand that even though it comes natural to one to introduce a base class, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...
What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.
I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, having to do with the fact that the vtable involves an extra indirect.
C code (based on OP):
void (*todolist[200])(void *parameters);
void *paramlist[200];
void realtime(void)
{
int i;
for (i = 0; i < 200; i++)
(*todolist[i])(paramlist[i]);
}
C++ code:
class Base {
public:
Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
virtual void operator()() = 0;
protected:
void* unsafe_pointer_;
};
Base* todolist[200];
void realtime() {
for (int i = 0; i < 200; ++i)
(*todolist[i])();
}
Both compiled with gcc 4.8, -O3:
realtime: |_Z8realtimev:
.LFB0: |.LFB3:
.cfi_startproc | .cfi_startproc
pushq %rbx | pushq %rbx
.cfi_def_cfa_offset 16 | .cfi_def_cfa_offset 16
.cfi_offset 3, -16 | .cfi_offset 3, -16
xorl %ebx, %ebx | movl $todolist, %ebx
.p2align 4,,10 | .p2align 4,,10
.p2align 3 | .p2align 3
.L3: |.L3:
movq paramlist(%rbx), %rdi | movq (%rbx), %rdi
call *todolist(%rbx) | addq $8, %rbx
addq $8, %rbx | movq (%rdi), %rax
| call *(%rax)
cmpq $1600, %rbx | cmpq $todolist+1600, %rbx
jne .L3 | jne .L3
popq %rbx | popq %rbx
.cfi_def_cfa_offset 8 | .cfi_def_cfa_offset 8
ret | ret
In the C++ code, the first movq gets the address of the vtable, and the call then indexes through that. So that's one instruction overhead.
According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).
# Initialization.
# i3=todolist; i5=paramlist | # i5=todolist holds paramlist
i3=0xb27ba; | # No paramlist in C++
i5=0xb28e6; | i5=0xb279a;
# r15=count
r15=0xc8; | r15=0xc8;
# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++; | i5=modify(i5,m6); # i4 = *todolist++
| i4=dm(m7,i5); # ..
# Note 2:
| r2=i4; # r2 = obj
| i4=dm(m6,i4); # vtable = *(obj + 1)
| r1=dm(0x3,i4); # r1 = vtable[3]
| r4=r2+r1; # param = obj + r1
i12=dm(i3,m6); # i12 = *todolist++; | i12=dm(0x5,i4); # i12 = vtable[5]
# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6; | r2=i6;
i6=i7; | i6=i7;
jump (m13,i12) (db); | jump (m13,i12) (db);
dm(i7,m7)=r2; | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de; | dm(i7,m7)=0x1279e2;
# if (count--) loop
r15=r15-1; | r15=r15-1;
if ne jump (pc, 0xfffffff2); | if ne jump (pc, 0xffffffe7);
Notes:
Note 1: In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue in Note 2.
Note 2: The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in Note 1. The vtable lookup itself occupies one instruction.
So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?
The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.
There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)
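To see why an adjustment can be needed at all, here is a minimal sketch with multiple inheritance. Left, Right and Both are hypothetical types, and the exact object layout is implementation-defined; on common ABIs the second base sits at a nonzero offset inside the derived object:

```cpp
struct Left  { virtual ~Left() {}  int l = 1; };
struct Right { virtual ~Right() {} virtual int id() { return 0; } int r = 2; };

// With multiple inheritance, the Right subobject usually does not sit at
// the start of Both, so converting Both* to Right* moves the pointer.
struct Both : Left, Right {
    int id() override { return r + 40; }
};

// Returns true when the Right subobject sits at a different address.
inline bool right_base_is_offset(Both& b) {
    Right* as_right = &b;  // implicit conversion applies the offset
    return static_cast<void*>(as_right) != static_cast<void*>(&b);
}

// Calling through Right* still reaches Both::id with a correct 'this':
// the adjustment is undone either via the vtable entry or via a thunk.
inline int call_through_right(Both& b) {
    Right* as_right = &b;
    return as_right->id();
}
```

With single inheritance, as in the OP's Base/Derived1 design, no such offset exists, which is why the adjustment code in the DSP output is pure overhead there.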
As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)
(Original test, before update):
I tried the following simple case (no multiple inheritance, which can slow things down):
class Base {
public:
Base(int val) : val_(val) {}
virtual int binary(int a, int b) = 0;
virtual int unary(int a) = 0;
virtual int nullary() = 0;
protected:
int val_;
};
int binary(Base* begin, Base* end, int a, int b) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->binary(a, b); }
return accum;
}
int unary(Base* begin, Base* end, int a) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->unary(a); }
return accum;
}
int nullary(Base* begin, Base* end) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->nullary(); }
return accum;
}
And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:
.L9:
movq (%rbx), %rax
movq %rbx, %rdi
addq $16, %rbx
movl %r13d, %esi
call *8(%rax)
addl %eax, %ebp
cmpq %rbx, %r12
jne .L9
As has already been mentioned, you can use templates to do away with the dynamic dispatch. Here is an example that does this:
template <typename FirstCb, typename ... RestCb>
struct InterruptHandler {
    void execute() {
        // I construct temporary objects here since I could not figure out how you
        // construct your objects. You can change these signatures to allow for
        // passing arbitrary params to these handlers.
        FirstCb().execute();
        InterruptHandler<RestCb...>().execute();
    }
};
// Terminates the recursion when only one handler is left.
template <typename LastCb>
struct InterruptHandler<LastCb> {
    void execute() { LastCb().execute(); }
};
InterruptHandler</* Base, Derived1, and so on */> handler;
void realtime(void) {
handler.execute();
}
This should completely eliminate the vtable lookups while providing more opportunities for compiler optimization since the code inside execute can be inlined.
Note however that you will need to change some parts depending on how you initialize your handlers. The basic framework should remain the same.
Also, this requires that you have a C++11 compliant compiler.
I suggest using static methods in your derived classes and placing these functions into your array. This would eliminate the overhead of the v-table search. This is closest to your C language implementation.
You may end up sacrificing the polymorphism for speed.
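A rough sketch of that idea, keeping the two-array layout of the C solution. Blink and Filter are hypothetical task types, and the array size of 2 is only for illustration:

```cpp
// Hypothetical task types; each keeps its own parameter struct and exposes
// a static function with the common C-style signature.
struct Blink {
    struct Params { int count; };
    static void run(void* p) { static_cast<Params*>(p)->count += 1; }
};

struct Filter {
    struct Params { int acc; };
    static void run(void* p) { static_cast<Params*>(p)->acc += 10; }
};

typedef void (*Task)(void*);

Blink::Params  blink_params  = {0};
Filter::Params filter_params = {0};

// Same layout as the C solution: one array of functions, one of parameters.
Task  todolist[2]  = { &Blink::run, &Filter::run };
void* paramlist[2] = { &blink_params, &filter_params };

inline void realtime() {
    for (int i = 0; i < 2; ++i) todolist[i](paramlist[i]);
}
```

The classes here only group each function with its parameter type; the dispatch itself is a plain indirect call with no vtable involved.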
Is the inheritance necessary?
Just because you switch to C++ doesn't mean you have to switch to Object Oriented.
Also, have you tried unrolling your loop in the ISR?
For example, perform 2 or more execution calls before returning back to the top of the loop.
Also, can you move any of the functionality out of the ISR?
Can any part of the functionality be performed by the "background loop" instead of the ISR? This would reduce the time in your ISR.
You can hide the void* type erasure and type recovery inside templates. The result would (hopefully) be the same array of function pointers. This would help with the casting and stays compatible with your code:
#include <iostream>
template<class ParamType,class F>
void fun(void* param) {
F f;
f(*static_cast<ParamType*>(param));
}
struct my_function {
void operator()(int& i) {
std::cout << "got it " << i << std::endl;
}
};
int main() {
void (*func)(void*) = fun<int, my_function>;
int j=4;
func(&j);
return 0;
}
In this case you can create new functions as function objects with more type safety. The "normal" OOP approach with virtual functions doesn't help here.
In a C++11 environment you could create the array at compile time with the help of variadic templates (but with a complicated syntax).
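One possible shape for the variadic version, reusing the fun trampoline from the code above. TaskTable, AddOne and Double are hypothetical names, and for brevity every task here receives the same parameter pointer, which the real todolist would not do:

```cpp
#include <cstddef>

// Type-erasing trampoline, as in the answer above: recovers the parameter
// type and invokes a default-constructed functor F.
template <class ParamType, class F>
void fun(void* param) {
    F f;
    f(*static_cast<ParamType*>(param));
}

// Hypothetical functors standing in for the real tasks.
struct AddOne { void operator()(int& i) const { i += 1; } };
struct Double { void operator()(int& i) const { i *= 2; } };

// Expands the parameter pack into an array of plain function pointers.
// Requires at least one functor (a zero-length array is not allowed).
template <class ParamType, class... Fs>
struct TaskTable {
    static void run_all(void* param) {
        typedef void (*Task)(void*);
        static const Task table[] = { &fun<ParamType, Fs>... };
        for (std::size_t i = 0; i < sizeof...(Fs); ++i) table[i](param);
    }
};
```

A call such as TaskTable<int, AddOne, Double>::run_all(&x) then runs the tasks in the order they are listed.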
This is unrelated to your question, but if you are that keen on performance you could use templates to do a loop unroll for the todolist:
#include <iostream>
void (*todo[3])(void *);
void *param[3];
void f1(void*) {std::cout<<"1" << std::endl;}
void f2(void*) {std::cout<<"2" << std::endl;}
void f3(void*) {std::cout<<"3" << std::endl;}
// Obj<N>::apply() calls entry N-1, then recurses down to Obj<0>,
// so the entries run in reverse order (todo[2] first).
template<int N>
struct Obj {
    static void apply()
    {
        todo[N-1](param[N-1]);
        Obj<N-1>::apply();
    }
};
// Terminates the recursion.
template<> struct Obj<0> { static void apply() {} };
int main() {
    todo[0] = f1;
    todo[1] = f2;
    todo[2] = f3;
    Obj<sizeof todo / sizeof *todo>::apply();
}
Find out where your compiler puts the vtable and access it directly to get the function pointers and store them for usage. That way you will have pretty much the same approach like in C with an array of function pointers.

Will the compiler unroll this loop?

I am creating a multi-dimensional vector (mathematical vector) where I allow basic mathematical operations +,-,/,*,=. The template takes in two parameters, one is the type (int, float etc.) while the other is the size of the vector. Currently I am applying the operations via a for loop. Now considering the size is known at compile time, will the compiler unroll the loop? If not, is there a way to unroll it with no (or minimal) performance penalty?
template <typename T, u32 size>
class Vector
{
public:
// Various functions for mathematical operations.
// The functions take in a Vector<T, size>.
// Example:
void add(const Vector<T, size>& vec)
{
for (u32 i = 0; i < size; ++i)
{
values[i] += vec[i];
}
}
private:
T values[size];
};
Before somebody comments "Profile, then optimize", please note that this is the basis for my 3D graphics engine and it must be fast. Second, I want to know for the sake of educating myself.
You can do the following trick with disassembly to see how the particular code is compiled.
Vector<int, 16> a, b;
Vector<int, 65536> c, d;
asm("xxx"); // marker
a.add(b);
asm("yyy"); // marker
c.add(d);
asm("zzz"); // marker
Now compile
gcc -O3 1.cc -S -o 1.s
And see the disasm
xxx
# 0 "" 2
#NO_APP
movdqa 524248(%rsp), %xmm0
leaq 524248(%rsp), %rsi
paddd 524184(%rsp), %xmm0
movdqa %xmm0, 524248(%rsp)
movdqa 524264(%rsp), %xmm0
paddd 524200(%rsp), %xmm0
movdqa %xmm0, 524264(%rsp)
movdqa 524280(%rsp), %xmm0
paddd 524216(%rsp), %xmm0
movdqa %xmm0, 524280(%rsp)
movdqa 524296(%rsp), %xmm0
paddd 524232(%rsp), %xmm0
movdqa %xmm0, 524296(%rsp)
#APP
# 36 "1.cc" 1
yyy
# 0 "" 2
#NO_APP
leaq 262040(%rsp), %rdx
leaq -104(%rsp), %rcx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movdqa (%rcx,%rax), %xmm0
paddd (%rdx,%rax), %xmm0
movdqa %xmm0, (%rdx,%rax)
addq $16, %rax
cmpq $262144, %rax
jne .L2
#APP
# 38 "1.cc" 1
zzz
As you see, the first loop was small enough to get fully unrolled. The second one stayed a loop. Both were also vectorized (paddd on xmm registers).
First: Modern CPUs are pretty smart about predicting branches, so unrolling the loop might not help (and could even hurt).
Second: Yes, modern compilers know how to unroll a loop like this, if it is a good idea for your target CPU.
Third: Modern compilers can even auto-vectorize the loop, which is even better than unrolling.
Bottom line: Do not think you are smarter than your compiler unless you know a lot about CPU architecture. Write your code in a simple, straightforward way, and do not worry about micro-optimizations until your profiler tells you to.
The loop can be unrolled using recursive template instantiation. This may or may not be faster on your C++ implementation.
I adjusted your example slightly, so that it would compile.
typedef unsigned u32; // or something similar
template <typename T, u32 size>
class Vector
{
// need to use an inner class, because member templates of an
// unspecialized template cannot be explicitly specialized.
template<typename Vec, u32 index>
struct Inner
{
    // a is modified, so it must not be const
    static void add(Vec& a, const Vec& b)
    {
        a.values[index] += b.values[index];
        // triggers recursive instantiation of Inner
        Inner<Vec, index-1>::add(a,b);
    }
};
// this specialization terminates the recursion
template<typename Vec>
struct Inner<Vec, 0>
{
    static void add(Vec& a, const Vec& b)
    {
        a.values[0] += b.values[0];
    }
};
public:
// PS! this function should probably take a
// _const_ Vector, because the argument is not modified
// Various functions for mathematical operations.
// The functions take in a Vector<T, size>.
// Example:
void add(Vector<T, size>& vec)
{
Inner<Vector, size-1>::add(*this, vec);
}
T values[size];
};
The only way to figure this out is to try it on your own compiler with your own optimization parameters. Make one test file with your "does it unroll" code, test.cpp:
#include "myclass.hpp"
void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
a.add( b );
}
then a reference code snippet reference.cpp:
#include "myclass.hpp"
void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
a[0] += b[0];
a[1] += b[1];
a[2] += b[2];
}
and now use GCC to compile them and spit out only the assembly:
for x in *.cpp; do g++ -c "$x" -Wall -Wextra -O2 -S -o "out/$x.s"; done
In my experience, GCC will unroll loops of 3 or fewer iterations by default when the trip count is known at compile time; using -funroll-loops will cause it to unroll even more.
First of all, it is not at all certain that unrolling the loop would be beneficial.
The only possible answer to your question is "it depends" (on the compiler flags, on the value of size, etc).
If you really want to know, ask your compiler: compile into assembly code with typical values of size and with the optimization flags you'd use for real, and examine the result.
Many compilers will unroll this loop, no idea if "the compiler" you are referring to will. There isn't just one compiler in the world.
If you want to guarantee that it's unrolled, then TMP (with inlining) can do that. (This is actually one of the more trivial applications of TMP, often used as an example of metaprogramming).