Performance difference between POD and non-POD classes

Performance difference between POD and non-POD classes - c++

I'm struggling to understand why my compilers (g++ 8.1.0 and clang++ 6.0.0) treat POD (plain-old-data) and non-POD code differently.
Test code:
#include <iostream>
struct slong {
int i;
~slong() { i = 0; }
};
int get1(slong x) { return 1+x.i; }
int main() {
std::cerr << "is_pod(slong) = " << std::is_pod<slong>::value << std::endl;
}
defines a class slong with a destructor (hence not POD) and the compiler, with -Ofast, will produce for get1
movl (%rdi), %eax
incl %eax
but when I comment out the destructor (so slong becomes POD) I get
leal 1(%rdi), %eax
Of course the performance issue is minor; still I'd like to understand. In other (more complicated) cases I also noticed more significant code differences.

Note that movl accesses memory, while leal doesn't.
When passing a struct to a function by value, ABI can stuff it into a register (rdi) if it's POD.
If the struct is not POD, ABI must pass it on stack (presumably because the code may need its address to call the destructor, access the vtable and do other complicated stuff). So accessing its member requires indirection.

Related

Efficiency of passing a struct to a function without instantiating a local variable

I recently learned that I can do the following with passing a a struct to a function in C++:
(My apologies for not using a more appropriate name for this "feature" in the title, feel free to correct me)
#include <iostream>
typedef struct mystruct{
int data1;
int data2;
} MYSTRUCT;
void myfunction( MYSTRUCT _struct ){
std::cout << _struct.data1 << _struct.data2;
}
int main(){
//This is what I recently learned
myfunction( MYSTRUCT{2,3} );
return 0;
}
This makes me wonder is this less costly than instantiating a local MYSTRUCT
and passing it by value to the function? Or is it just a convenient way to do the same only that the temporary variable is eliminated right afterwards?
For example adding this line #define KBIG 10000000, is this:
std::vector<MYSTRUCT> myvector1;
for (long long i = 0; i < KBIG; i++) {
myvector1.push_back(MYSTRUCT{ 1,1 });
}
Consistently faster than this:
std::vector<MYSTRUCT> myvector2;
for (long long i = 0; i < KBIG; i++) {
MYSTRUCT localstruct = { 1,1 };
myvector2.push_back(localstruct);
}
I tried testing it, but the results were pretty inconsistent, hovering around 9-12 seconds for each. Sometimes the first one would be faster, other times not. Of course, this could be due to all the background processes at the time I was testing.

Simplifying slightly and compiling to assembler:
extern void emit(int);
typedef struct mystruct{
int data1;
int data2;
} MYSTRUCT;
__attribute__((noinline))
void myfunction( MYSTRUCT _struct ){
emit(_struct.data1);
emit(_struct.data2);
}
int main(){
//This is what I recently learned
myfunction( MYSTRUCT{2,3} );
return 0;
}
with -O2 yields:
myfunction(mystruct):
pushq %rbx
movq %rdi, %rbx
call emit(int)
sarq $32, %rbx
movq %rbx, %rdi
popq %rbx
jmp emit(int)
main:
movabsq $12884901890, %rdi
subq $8, %rsp
call myfunction(mystruct)
xorl %eax, %eax
addq $8, %rsp
ret
What happened?
The compiler realised that the entire structure fits into a register and passed it by value that way.
moral of the story: express intent. Let the compiler worry about details.
If you need a copy, you need a copy. End of story.

If speed is of any concern, take measurements of copying vs. taking const ref (i.e., const MYSTRUCT& _struct). When you do measurements, make sure you do them <1> <2>, then <2> <1> to compensate for cache effect.
Suggestions: avoid using _ as the first char of parameter, as some reserved words start with it; also, do not capitalize struct.

If you want to speed up your code, I suggest you to pass the struct via const reference, as follow:
void myfunction (const MYSTRUCT& _struct)
{
std::cout << _struct.data1 << _struct.data2;
}
This will be much more faster than passing by value because instead of copying an entire struct, it will pass just its address.
(Ok, there will not be so much difference in this case, since your struct contains only 2 integers, but it you have >1000 bytes (for example) then there will be a notable difference)
Also, i suggest you to use to use std::vector<T>::emplace_back instead std::vector<T>::push_back
std::vector<MYSTRUCT> myvector1;
for (long long i = 0; i < KBIG; i++)
{
myvector1.emplace_back (1, 1);
}
emplace_back forwards its arguments to the constructor of mystruct, so the compiler will not create an useless copy.

Volatile not working as expected

Consider this code:
struct A{
volatile int x;
A() : x(12){
}
};
A foo(){
A ret;
//Do stuff
return ret;
}
int main()
{
A a;
a.x = 13;
a = foo();
}
Using g++ -std=c++14 -pedantic -O3 I get this assembly:
foo():
movl $12, %eax
ret
main:
xorl %eax, %eax
ret
According to my estimation the variable x should be written to at least three times (possibly four), yet it not even written once (the function foo isn't even called!)
Even worse when you add the inline keyword to foo this is the result:
main:
xorl %eax, %eax
ret
I thought that volatile means that every single read or write must happen even if the compiler can not see the point of the read/write.
What is going on here?
Update:
Putting the declaration of A a; outside main like this:
A a;
int main()
{
a.x = 13;
a = foo();
}
Generates this code:
foo():
movl $12, %eax
ret
main:
movl $13, a(%rip)
xorl %eax, %eax
movl $12, a(%rip)
ret
movl $12, a(%rip)
ret
a:
.zero 4
Which is closer to what you would expect....I am even more confused then ever

Visual C++ 2015 does not optimize away the assignments:
A a;
mov dword ptr [rsp+8],0Ch <-- write 1
a.x = 13;
mov dword ptr [a],0Dh <-- write2
a = foo();
mov dword ptr [a],0Ch <-- write3
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
}
xor eax,eax
ret
The same happens both with /O2 (Maximize speed) and /Ox (Full optimization).
The volatile writes are kept also by gcc 3.4.4 using both -O2 and -O3
_main:
pushl %ebp
movl $16, %eax
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
call __alloca
call ___main
movl $12, -4(%ebp) <-- write1
xorl %eax, %eax
movl $13, -4(%ebp) <-- write2
movl $12, -8(%ebp) <-- write3
leave
ret
Using both of these compilers, if I remove the volatile keyword, main() becomes essentially empty.
I'd say you have a case where the compiler over-agressively (and incorrectly IMHO) decides that since 'a' is not used, operations on it arent' necessary and overlooks the volatile member. Making 'a' itself volatile could get you what you want, but as I don't have a compiler that reproduces this, I can't say for sure.
Last (while this is admittedly Microsoft specific), https://msdn.microsoft.com/en-us/library/12a04hfd.aspx says:
If a struct member is marked as volatile, then volatile is propagated to the whole structure.
Which also points towards the behavior you are seeing being a compiler problem.
Last, if you make 'a' a global variable, it is somewhat understandable that the compiler is less eager to deem it unused and drop it. Global variables are extern by default, so it is not possible to say that a global 'a' is unused just by looking at the main function. Some other compilation unit (.cpp file) might be using it.

GCC's page on Volatile access gives some insight into how it works:
The standard encourages compilers to refrain from optimizations concerning accesses to volatile objects, but leaves it implementation defined as to what constitutes a volatile access. The minimum requirement is that at a sequence point all previous accesses to volatile objects have stabilized and no subsequent accesses have occurred. Thus an implementation is free to reorder and combine volatile accesses that occur between sequence points, but cannot do so for accesses across a sequence point. The use of volatile does not allow you to violate the restriction on updating objects multiple times between two sequence points.
In C standardese:
§5.1.2.3
2 Accessing a volatile object, modifying an object, modifying a file,
or calling a function that does any of those operations are all side
effects, 11) which are changes in the state of the
execution environment. Evaluation of an expression may produce side
effects. At certain specified points in the execution sequence called
sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have
taken place. (A summary of the sequence points is given in annex C.)
3 In the abstract machine, all expressions are evaluated as specified
by the semantics. An actual implementation need not evaluate part of
an expression if it can deduce that its value is not used and that no
needed side effects are produced (including any caused by calling a
function or accessing a volatile object).
[...]
5 The least requirements on a conforming implementation are:
At sequence points, volatile objects are stable in the sense that previous accesses are complete and subsequent accesses have not yet
occurred. [...]
I chose the C standard because the language is simpler but the rules are essentially the same in C++. See the "as-if" rule.
Now on my machine, -O1 doesn't optimize away the call to foo(), so let's use -fdump-tree-optimized to see the difference:
-O1
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
a = foo ();
a ={v} {CLOBBER};
return 0;
}
And -O3:
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A ret;
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
ret.x ={v} 12;
ret ={v} {CLOBBER};
a ={v} {CLOBBER};
return 0;
}
gdb reveals in both cases that a is ultimately optimized out, but we're worried about foo(). The dumps show us that GCC reordered the accesses so that foo() is not even necessary and subsequently all of the code in main() is optimized out. Is this really true? Let's see the assembly output for -O1:
foo():
mov eax, 12
ret
main:
call foo()
mov eax, 0
ret
This essentially confirms what I said above. Everything is optimized out: the only difference is whether or not the call to foo() is as well.

CPU overhead for struct?

In C/C++, is there any CPU overhead for acessing struct members in comparison to isolated variables?
For a concrete example, should something like the first code sample below use more CPU cycles than the second one? Would it make any difference if it were a class instead of a struct? (in C++)
1)
struct S {
int a;
int b;
};
struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
2)
int a;
int b;
a = 10;
b = 20;
a++;
b++;

"Don't optimize yet." The compiler will figure out the best case for you. Write what makes sense first, and make it faster later if you need to. For fun, I ran the following in Clang 3.4 (-O3 -S):
void __attribute__((used)) StructTest() {
struct S {
int a;
int b;
};
volatile struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
}
void __attribute__((used)) NoStructTest() {
volatile int a;
volatile int b;
a = 10;
b = 20;
a++;
b++;
}
int main() {
StructTest();
NoStructTest();
}
StructTest and NoStructTest have identical ASM output:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $10, -4(%ebp)
movl $20, -8(%ebp)
incl -4(%ebp)
incl -8(%ebp)
addl $8, %esp
popl %ebp
ret

No. The size of all the types in the struct, and thus the offset to each member from the beginning of the struct, is known at compile-time, so the address used to fetch the values in the struct is every bit as knowable as the addresses of individual variables.

My understanding is that all of the values in a struct are adjacent in memory, and are better able to take advantage of the memory caching than the variables.
The variables are probably adjacent in memory too, but they're not guaranteed to be adjacent like the struct.
That being said, cpu performance should not be a consideration when deciding whether to use or not use a struct in the first place.

The real answer is: It completely depend on you CPU architecture and your compiler. The best way is to compile and look at the assembly code.
Now for x86 machine, I'm pretty sure there isn't. The offset is computed as compile time and there is an adressing mode with some offset.

If you ask the compiler to optimize (e.g. compile with gcc -O2 or g++ -O2) then there are no much overhead (probably too small to be measurable, or perhaps a few percents).
However, if you use only local variables, the optimizing compiler might even not allocate slots for them in the local call frame.
Compile with gcc -O2 -fverbose-asm -S and look into the generated assembly code.
Using a class won't make any difference (of course, some class-es have costly constructors & destructors).
Such a code could be useful in generated C or C++ code (like MELT does); local such struct-s or class-es contain the local call frame (as seen by the MELT language, see e.g. its gcc/melt/warmelt-genobj+01.cc generated file). I don't claim it is as efficient as real C++ local variables, but it gets optimized enough.

Most efficient standard-compliant way of reinterpreting int as float

Assume I have guarantees that float is IEEE 754 binary32. Given a bit pattern that corresponds to a valid float, stored in std::uint32_t, how does one reinterpret it as a float in a most efficient standard compliant way?
float reinterpret_as_float(std::uint32_t ui) {
return /* apply sorcery to ui */;
}
I've got a few ways that I know/suspect/assume have some issues:
Via reinterpret_cast,
float reinterpret_as_float(std::uint32_t ui) {
return reinterpret_cast<float&>(ui);
}
or equivalently
float reinterpret_as_float(std::uint32_t ui) {
return *reinterpret_cast<float*>(&ui);
}
which suffers from aliasing issues.
Via union,
float reinterpret_as_float(std::uint32_t ui) {
union {
std::uint32_t ui;
float f;
} u = {ui};
return u.f;
}
which is not actually legal, as it is only allowed to read from most recently written to member. Yet, it seems some compilers (gcc) allow this.
Via std::memcpy,
float reinterpret_as_float(std::uint32_t ui) {
float f;
std::memcpy(&f, &ui, 4);
return f;
}
which AFAIK is legal, but a function call to copy single word seems wasteful, though it might get optimized away.
Via reinterpret_casting to char* and copying,
float reinterpret_as_float(std::uint32_t ui) {
char* uip = reinterpret_cast<char*>(&ui);
float f;
char* fp = reinterpret_cast<char*>(&f);
for (int i = 0; i < 4; ++i) {
fp[i] = uip[i];
}
return f;
}
which AFAIK is also legal, as char pointers are exempt from aliasing issues and manual byte copying loop saves a possible function call. The loop will most definitely be unrolled, yet 4 possibly separate one-byte loads/stores are worrisome, I have no idea whether this is optimizable to single four byte load/store.
The 4 is the best I've been able to come up with.
Am I correct so far? Is there a better way to do this, particulary one that will guarantee single load/store?

Afaik, there are only two approaches that are compliant with strict aliasing rules: memcpy() and cast to char* with copying. All others read a float from memory that belongs to an uint32_t, and the compiler is allowed to perform the read before the write to that memory location. It might even optimize away the write altogether as it can prove that the stored value will never be used according to strict aliasing rules, resulting in a garbage return value.
It really depends on the compiler/optimizes whether memcpy() or char* copy is faster. In both cases, an intelligent compiler might be able to figure out that it can just load and copy an uint32_t, but I would not trust any compiler to do so before I have seen it in the resulting assembler code.
Edit:
After some testing with gcc 4.8.1, I can say that the memcpy() approach is the best for this particulare compiler, see below for details.
Compiling
#include <stdint.h>
float foo(uint32_t a) {
float b;
char* aPointer = (char*)&a, *bPointer = (char*)&b;
for( int i = sizeof(a); i--; ) bPointer[i] = aPointer[i];
return b;
}
with gcc -S -std=gnu11 -O3 foo.c yields this assemble code:
movl %edi, %ecx
movl %edi, %edx
movl %edi, %eax
shrl $24, %ecx
shrl $16, %edx
shrw $8, %ax
movb %cl, -1(%rsp)
movb %dl, -2(%rsp)
movb %al, -3(%rsp)
movb %dil, -4(%rsp)
movss -4(%rsp), %xmm0
ret
This is not optimal.
Doing the same with
#include <stdint.h>
#include <string.h>
float foo(uint32_t a) {
float b;
char* aPointer = (char*)&a, *bPointer = (char*)&b;
memcpy(bPointer, aPointer, sizeof(a));
return b;
}
yields (with all optimization levels except -O0):
movl %edi, -4(%rsp)
movss -4(%rsp), %xmm0
ret
This is optimal.

If the bitpattern in the integer variable is the same as a valid float value, then union is probably the best and most compliant way to go. And it's actually legal if you read the specification (don't remember the section at the moment).

memcpy is always safe but does involve a copy
casting may lead to problems
union - seems to be allowed in C99 and C11, not sure about C++
Take a look at:
What is the strict aliasing rule?
and
Is type-punning through a union unspecified in C99, and has it become specified in C11?

float reinterpret_as_float(std::uint32_t ui) {
return *((float *)&ui);
}
As plain function, its code is translated into assembly as this (Pelles C for Windows):
fld [esp+4]
ret
If defined as inline function, then a code like this (n being unsigned, x being float):
x = reinterpret_as_float (n);
Is translated to assembler as this:
fld [ebp-4] ;RHS of asignment. Read n as float
fstp dword ptr [ebp-8] ;LHS of asignment

Performance hit of vtable lookup in C++

I'm evaluating to rewrite a piece of real-time software from C/assembly language to C++/assembly language (for reasons not relevant to the question parts of the code are absolutely necessary to do in assembly).
An interrupt comes with a 3 kHz frequency, and for each interrupt around 200 different things are to be done in a sequence. The processor runs with 300 MHz, giving us 100,000 cycles to do the job. This has been solved in C with an array of function pointers:
// Each function does a different thing, all take one parameter being a pointer
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);
// Array of pointers to structs containing each function's parameters.
void *paramlist[200];
void realtime(void)
{
int i;
for (i = 0; i < 200; i++)
(*todolist[i])(paramlist[i]);
}
Speed is important. The above 200 iterations are done 3,000 times per second, so practically we do 600,000 iterations per second. The above for loop compiles to five cycles per iteration, yielding a total cost of 3,000,000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to four instructions, however I fear we might get some extra delay due to memory accesses close to each other, etc. In short, I believe those five cycles are pretty optimal.
Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need and use, and have in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:
class Base
{
virtual void Execute();
int something_all_things_need;
}
class Derived1 : Base
{
void Execute() { /* Do something */ }
int own_parameter;
// Other own parameters
}
class Derived2 : Base { /* Etc. */ }
Base *todolist[200];
void realtime(void)
{
for (int i = 0; i < 200; i++)
todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}
My problem is the vtable lookup. I cannot do 600,000 lookups per second; this would account for more than 4% of wasted CPU load. Moreover the todolist never changes during run-time, it is only set up once at start-up, so the effort of looking up what function to call is truly wasted. Upon asking myself the question "what is the most optimal end result possible", I look at the assembler code given by the C solution, and refind an array of function pointers...
What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.
Update (including correction of where the loop starts):
The processor is an ADSP-214xx, and the compiler is VisualDSP++ 5.0. When enabling #pragma optimize_for_speed, the C loop is 9 cycles. Assembly-optimizing it in my mind yields 4 cycles, however I didn't test it so it's not guaranteed. The C++ loop is 14 cycles. I'm aware of the compiler could do a better job, however I did not want to dismiss this as a compiler issue - getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here the resulting assembly:
C:
i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;
Here's the actual loop:
r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);
C++ :
i5=0xb279a;
r15=0xc8;
Here's the actual loop:
i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);
In the meanwhile, I think I have found sort of an answer. The lowest amount of cycles is achieved by doing the very least possible. I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers.
Naturally, the limit is what can be done in assembly, and that has now been explored. Having this in mind, I now understand that even though it comes natural to one to introduce a base class, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...

What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.
I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, having to do with the fact that the vtable involves an extra indirect.
C code (based on OP):
void (*todolist[200])(void *parameters);
void *paramlist[200];
void realtime(void)
{
int i;
for (i = 0; i < 200; i++)
(*todolist[i])(paramlist[i]);
}
C++ code:
class Base {
public:
Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
virtual void operator()() = 0;
protected:
void* unsafe_pointer_;
};
Base* todolist[200];
void realtime() {
for (int i = 0; i < 200; ++i)
(*todolist[i])();
}
Both compiled with gcc 4.8, -O3:
realtime: |_Z8realtimev:
.LFB0: |.LFB3:
.cfi_startproc | .cfi_startproc
pushq %rbx | pushq %rbx
.cfi_def_cfa_offset 16 | .cfi_def_cfa_offset 16
.cfi_offset 3, -16 | .cfi_offset 3, -16
xorl %ebx, %ebx | movl $todolist, %ebx
.p2align 4,,10 | .p2align 4,,10
.p2align 3 | .p2align 3
.L3: |.L3:
movq paramlist(%rbx), %rdi | movq (%rbx), %rdi
call *todolist(%rbx) | addq $8, %rbx
addq $8, %rbx | movq (%rdi), %rax
| call *(%rax)
cmpq $1600, %rbx | cmpq $todolist+1600, %rbx
jne .L3 | jne .L3
popq %rbx | popq %rbx
.cfi_def_cfa_offset 8 | .cfi_def_cfa_offset 8
ret | ret
In the C++ code, the first movq gets the address of the vtable, and the call then indexes through that. So that's one instruction overhead.
According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).
# Initialization.
# i3=todolist; i5=paramlist | # i5=todolist holds paramlist
i3=0xb27ba; | # No paramlist in C++
i5=0xb28e6; | i5=0xb279a;
# r15=count
r15=0xc8; | r15=0xc8;
# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++; | i5=modify(i5,m6); # i4 = *todolist++
| i4=dm(m7,i5); # ..
# Note 2:
| r2=i4; # r2 = obj
| i4=dm(m6,i4); # vtable = *(obj + 1)
| r1=dm(0x3,i4); # r1 = vtable[3]
| r4=r2+r1; # param = obj + r1
i12=dm(i3,m6); # i12 = *todolist++; | i12=dm(0x5,i4); # i12 = vtable[5]
# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6; | r2=i6;
i6=i7; | i6=i7;
jump (m13,i12) (db); | jump (m13,i12) (db);
dm(i7,m7)=r2; | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de; | dm(i7,m7)=0x1279e2;
# if (count--) loop
r15=r15-1; | r15=r15-1;
if ne jump (pc, 0xfffffff2); | if ne jump (pc, 0xffffffe7);
Notes:
In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue below.
The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in step 1. The vtable lookup itself occupies one instruction.
So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?
The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.
There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)
As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)
(Original test, before update):
I tried the following simple case (no multiple inheritance, which can slow things down):
class Base {
public:
Base(int val) : val_(val) {}
virtual int binary(int a, int b) = 0;
virtual int unary(int a) = 0;
virtual int nullary() = 0;
protected:
int val_;
};
int binary(Base* begin, Base* end, int a, int b) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->binary(a, b); }
return accum;
}
int unary(Base* begin, Base* end, int a) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->unary(a); }
return accum;
}
int nullary(Base* begin, Base* end) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->nullary(); }
return accum;
}
And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:
.L9:
movq (%rbx), %rax
movq %rbx, %rdi
addq $16, %rbx
movl %r13d, %esi
call *8(%rax)
addl %eax, %ebp
cmpq %rbx, %r12
jne .L9

As has already been mentioned, you can use templates to do away with the dynamic dispatch. Here is an example that does this:
template <typename FirstCb, typename ... RestCb>
struct InterruptHandler {
void execute() {
// I construct temporary objects here since I could not figure out how you
// construct your objects. You can change these signatures to allow for
// passing arbitrary params to these handlers.
FirstCb().execute();
InterruptHandler<RestCb...>().execute();
}
}
InterruptHandler</* Base, Derived1, and so on */> handler;
void realtime(void) {
handler.execute();
}
This should completely eliminate the vtable lookups while providing more opportunities for compiler optimization since the code inside execute can be inlined.
Note however that you will need to change some parts depending on how you initialize your handlers. The basic framework should remain the same.
Also, this requires that you have a C++11 compliant compiler.

I suggest using static methods in your derived classes and placing these functions into your array. This would eliminate the overhead of the v-table search. This is closest to your C language implementation.
You may end up sacrificing the polymorphism for speed.
Is the inheritance necessary?
Just because you switch to C++ doesn't mean you have to switch to Object Oriented.
Also, have you tried unrolling your loop in the ISR?
For example, perform 2 or more execution calls before returning back to the top of the loop.
Also, can you move any of the functionality out of the ISR?
Can any part of the functionality be performed by the "background loop" instead of the ISR? This would reduce the time in your ISR.

You can hide the void* type erasure and type recovery inside templates. The result would (hopefully) be the same array to function pointers. This yould help with casting and compatible to your code:
#include <iostream>
template<class ParamType,class F>
void fun(void* param) {
F f;
f(*static_cast<ParamType*>(param));
}
struct my_function {
void operator()(int& i) {
std::cout << "got it " << i << std::endl;
}
};
int main() {
void (*func)(void*) = fun<int, my_function>;
int j=4;
func(&j);
return 0;
}
In this case you can create new functions as a function object with more type safty. The "normal" OOP aproach with virtual functions doesn't help here.
In case of A C++11 environment you could create the array with help of variadic templates at compile time (but with an complicated syntax).

This is unrelated to your question, but if you are that keen on performance you could use templates to do a loop unroll for the todolist:
void (*todo[3])(void *);
void *param[3];
void f1(void*) {std::cout<<"1" << std::endl;}
void f2(void*) {std::cout<<"2" << std::endl;}
void f3(void*) {std::cout<<"3" << std::endl;}
template<int N>
struct Obj {
static void apply()
{
todo[N-1](param[N-1]);
Obj<N-1>::apply();
}
};
template<> struct Obj<0> { static void apply() {} };
todo[0] = f1;
todo[1] = f2;
todo[2] = f3;
Obj<sizeof todo / sizeof *todo>::apply();

Find out where your compiler puts the vtable and access it directly to get the function pointers and store them for usage. That way you will have pretty much the same approach like in C with an array of function pointers.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Performance difference between POD and non-POD classes - c++

Related

Efficiency of passing a struct to a function without instantiating a local variable

Volatile not working as expected

CPU overhead for struct?

Most efficient standard-compliant way of reinterpreting int as float

Performance hit of vtable lookup in C++

Categories

Resources