Performance hit of vtable lookup in C++

I'm evaluating whether to rewrite a piece of real-time software from C/assembly language to C++/assembly language (for reasons not relevant to the question, parts of the code absolutely have to be written in assembly).
An interrupt arrives at a frequency of 3 kHz, and for each interrupt around 200 different things are to be done in sequence. The processor runs at 300 MHz, giving us 100,000 cycles to do the job. This has been solved in C with an array of function pointers:
// Each function does a different thing, all take one parameter being a pointer
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);
// Array of pointers to structs containing each function's parameters.
void *paramlist[200];
void realtime(void)
{
int i;
for (i = 0; i < 200; i++)
(*todolist[i])(paramlist[i]);
}
Speed is important. The above 200 iterations are done 3,000 times per second, so practically we do 600,000 iterations per second. The above for loop compiles to five cycles per iteration, yielding a total cost of 3,000,000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to four instructions, however I fear we might get some extra delay due to memory accesses close to each other, etc. In short, I believe those five cycles are pretty optimal.
Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need and use, and have in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:
class Base
{
public:
    virtual void Execute();
    int something_all_things_need;
};
class Derived1 : public Base
{
public:
    void Execute() { /* Do something */ }
    int own_parameter;
    // Other own parameters
};
class Derived2 : public Base { /* Etc. */ };
Base *todolist[200];
void realtime(void)
{
    for (int i = 0; i < 200; i++)
        todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}
My problem is the vtable lookup. I cannot do 600,000 lookups per second; this would account for more than 4% of wasted CPU load. Moreover, the todolist never changes during run-time; it is only set up once at start-up, so the effort of looking up which function to call is truly wasted. Upon asking myself "what is the best possible end result?", I look at the assembler code given by the C solution, and once again find an array of function pointers...
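The cycle budget quoted so far can be checked with a few lines of compile-time arithmetic. This is only a sketch of the figures given in the question (300 MHz clock, 3 kHz interrupt, 200 calls per interrupt, 5 versus 20 cycles per call); all names below are made up for illustration.

```cpp
// Hypothetical names; the figures are the ones quoted in the question.
constexpr long long kClockHz     = 300000000; // 300 MHz processor clock
constexpr long long kInterruptHz = 3000;      // 3 kHz interrupt rate
constexpr long long kCallsPerIrq = 200;       // entries in todolist

// Cycles available between two interrupts.
constexpr long long kCyclesPerIrq = kClockHz / kInterruptHz;

// Dispatch iterations per second.
constexpr long long kCallsPerSecond = kInterruptHz * kCallsPerIrq;

// CPU load, in percent, spent purely on dispatch for a given per-call cost.
constexpr double dispatch_load_percent(long long cycles_per_call) {
    return 100.0 * static_cast<double>(kCallsPerSecond * cycles_per_call)
                 / static_cast<double>(kClockHz);
}

static_assert(kCyclesPerIrq == 100000, "100,000 cycles per interrupt");
static_assert(kCallsPerSecond == 600000, "600,000 dispatches per second");
```

With these numbers, `dispatch_load_percent(5)` gives the 1% mentioned for the C loop and `dispatch_load_percent(20)` gives the 4% feared for the virtual call.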
What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.
Update (including correction of where the loop starts):
The processor is an ADSP-214xx, and the compiler is VisualDSP++ 5.0. When enabling #pragma optimize_for_speed, the C loop is 9 cycles. Assembly-optimizing it in my mind yields 4 cycles; however, I didn't test it, so it's not guaranteed. The C++ loop is 14 cycles. I'm aware that the compiler could do a better job, but I did not want to dismiss this as a compiler issue: getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here is the resulting assembly:
C:
i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;
Here's the actual loop:
r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);
C++ :
i5=0xb279a;
r15=0xc8;
Here's the actual loop:
i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);
In the meantime, I think I have found an answer of sorts. The lowest number of cycles is achieved by doing the very least possible. I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers.
Naturally, the limit is what can be done in assembly, and that has now been explored. Having this in mind, I now understand that even though introducing a base class comes naturally, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...

What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.
I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, having to do with the fact that the vtable involves an extra indirect.
C code (based on OP):
void (*todolist[200])(void *parameters);
void *paramlist[200];
void realtime(void)
{
int i;
for (i = 0; i < 200; i++)
(*todolist[i])(paramlist[i]);
}
C++ code:
class Base {
public:
Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
virtual void operator()() = 0;
protected:
void* unsafe_pointer_;
};
Base* todolist[200];
void realtime() {
for (int i = 0; i < 200; ++i)
(*todolist[i])();
}
Both compiled with gcc 4.8, -O3:
realtime: |_Z8realtimev:
.LFB0: |.LFB3:
.cfi_startproc | .cfi_startproc
pushq %rbx | pushq %rbx
.cfi_def_cfa_offset 16 | .cfi_def_cfa_offset 16
.cfi_offset 3, -16 | .cfi_offset 3, -16
xorl %ebx, %ebx | movl $todolist, %ebx
.p2align 4,,10 | .p2align 4,,10
.p2align 3 | .p2align 3
.L3: |.L3:
movq paramlist(%rbx), %rdi | movq (%rbx), %rdi
call *todolist(%rbx) | addq $8, %rbx
addq $8, %rbx | movq (%rdi), %rax
| call *(%rax)
cmpq $1600, %rbx | cmpq $todolist+1600, %rbx
jne .L3 | jne .L3
popq %rbx | popq %rbx
.cfi_def_cfa_offset 8 | .cfi_def_cfa_offset 8
ret | ret
In the C++ code, the first movq loads the object pointer (this), the second movq loads the address of the vtable from the object, and the call then goes indirectly through the vtable. So that's one instruction of overhead.
According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).
# Initialization.
# i3=todolist; i5=paramlist | # i5=todolist holds paramlist
i3=0xb27ba; | # No paramlist in C++
i5=0xb28e6; | i5=0xb279a;
# r15=count
r15=0xc8; | r15=0xc8;
# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++; | i5=modify(i5,m6); # i4 = *todolist++
| i4=dm(m7,i5); # ..
# Note 2:
| r2=i4; # r2 = obj
| i4=dm(m6,i4); # vtable = *(obj + 1)
| r1=dm(0x3,i4); # r1 = vtable[3]
| r4=r2+r1; # param = obj + r1
i12=dm(i3,m6); # i12 = *todolist++; | i12=dm(0x5,i4); # i12 = vtable[5]
# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6; | r2=i6;
i6=i7; | i6=i7;
jump (m13,i12) (db); | jump (m13,i12) (db);
dm(i7,m7)=r2; | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de; | dm(i7,m7)=0x1279e2;
# if (count--) loop
r15=r15-1; | r15=r15-1;
if ne jump (pc, 0xfffffff2); | if ne jump (pc, 0xffffffe7);
Notes:
1. In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue below.
2. The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in step 1. The vtable lookup itself occupies one instruction.
So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?
The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.
There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)
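The pointer adjustment described above can be observed directly with multiple inheritance. The following is only an illustrative sketch: the types A, B and C and both helper functions are invented here, and the non-zero offset is what common ABIs (such as the one gcc uses on x86-64) produce, not something the C++ standard guarantees.

```cpp
#include <cstddef>

struct A { virtual ~A() {} int a = 0; };
struct B {
    virtual ~B() {}
    virtual const void* self() const { return this; }
    int b = 0;
};
// C has two polymorphic bases, so at most one of them can share C's address.
struct C : A, B {
    const void* self() const override { return this; } // 'this' is a C* here
    int c = 0;
};

// Byte distance between the B subobject and the start of the complete object.
inline std::ptrdiff_t b_offset_in_c(const C& obj) {
    const B* as_b = &obj; // derived-to-base conversion; may change the address
    return reinterpret_cast<const char*>(as_b)
         - reinterpret_cast<const char*>(&obj);
}

// Calls the virtual function through a B*. The override must still see the
// whole C object, so the call has to undo the adjustment made by the cast.
inline bool call_sees_whole_object(const C& obj) {
    const B* as_b = &obj;
    return as_b->self() == static_cast<const void*>(&obj);
}
```

Whether the "undo" step is done with an offset stored in the vtable or with a thunk is exactly the implementation choice discussed above; the observable behaviour is the same either way.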
As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)
(Original test, before update):
I tried the following simple case (no multiple inheritance, which can slow things down):
class Base {
public:
Base(int val) : val_(val) {}
virtual int binary(int a, int b) = 0;
virtual int unary(int a) = 0;
virtual int nullary() = 0;
protected:
int val_;
};
int binary(Base* begin, Base* end, int a, int b) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->binary(a, b); }
return accum;
}
int unary(Base* begin, Base* end, int a) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->unary(a); }
return accum;
}
int nullary(Base* begin, Base* end) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->nullary(); }
return accum;
}
And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:
.L9:
movq (%rbx), %rax
movq %rbx, %rdi
addq $16, %rbx
movl %r13d, %esi
call *8(%rax)
addl %eax, %ebp
cmpq %rbx, %r12
jne .L9

As has already been mentioned, you can use templates to do away with the dynamic dispatch. Here is an example that does this:
template <typename... Cbs>
struct InterruptHandler;
// Base case: no handlers left, so the recursion stops here.
template <>
struct InterruptHandler<> {
    void execute() {}
};
template <typename FirstCb, typename... RestCb>
struct InterruptHandler<FirstCb, RestCb...> {
    void execute() {
        // I construct temporary objects here since I could not figure out how you
        // construct your objects. You can change these signatures to allow for
        // passing arbitrary params to these handlers.
        FirstCb().execute();
        InterruptHandler<RestCb...>().execute();
    }
};
InterruptHandler</* Base, Derived1, and so on */> handler;
void realtime(void) {
    handler.execute();
}
This should completely eliminate the vtable lookups while providing more opportunities for compiler optimization since the code inside execute can be inlined.
Note however that you will need to change some parts depending on how you initialize your handlers. The basic framework should remain the same.
Also, this requires that you have a C++11 compliant compiler.

I suggest using static methods in your derived classes and placing these functions into your array. This would eliminate the overhead of the v-table search. This is closest to your C language implementation.
You may end up sacrificing the polymorphism for speed.
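The static-method suggestion above could look roughly like the following. This is a minimal sketch, not the asker's real code: Common, Task1, Task2, setup and kTaskCount are hypothetical names, and the shared parameters are held by composition instead of inheritance.

```cpp
// Parameters that every task needs (stands in for the common base class).
struct Common { int something_all_things_need; };

struct Task1 {
    Common common;
    int own_parameter;
    // A static member function has no hidden 'this' and no vtable entry,
    // so its address is an ordinary function pointer.
    static void Execute(void* p) {
        Task1* self = static_cast<Task1*>(p);
        self->own_parameter += self->common.something_all_things_need;
    }
};

struct Task2 {
    Common common;
    int counter;
    static void Execute(void* p) { static_cast<Task2*>(p)->counter++; }
};

const int kTaskCount = 2;
void (*todolist[kTaskCount])(void*);
void* paramlist[kTaskCount];

// Done once at start-up, as in the C version.
void setup(Task1& t1, Task2& t2) {
    todolist[0] = &Task1::Execute; paramlist[0] = &t1;
    todolist[1] = &Task2::Execute; paramlist[1] = &t2;
}

// Same dispatch loop as the C version: one data load, one indirect call.
void realtime() {
    for (int i = 0; i < kTaskCount; i++)
        todolist[i](paramlist[i]);
}
```

The cast inside each Execute is the price paid for dropping the virtual call: the compiler no longer checks that the right object is paired with the right function.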
Is the inheritance necessary?
Just because you switch to C++ doesn't mean you have to switch to Object Oriented.
Also, have you tried unrolling your loop in the ISR?
For example, perform 2 or more execution calls before returning back to the top of the loop.
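A hand-unrolled version of the dispatch loop might look like this. It is a sketch under the assumption that the number of entries is even; kCount, realtime_unrolled and the example task bump are names invented here.

```cpp
const int kCount = 200;
void (*todolist[kCount])(void*);
void* paramlist[kCount];

static_assert(kCount % 2 == 0, "the unrolled loop assumes an even entry count");

// Two calls per trip through the loop, so the counter update and the
// conditional branch are executed half as often.
void realtime_unrolled() {
    for (int i = 0; i < kCount; i += 2) {
        todolist[i](paramlist[i]);
        todolist[i + 1](paramlist[i + 1]);
    }
}

// Example task used only to exercise the loop: increments the int it is given.
void bump(void* p) { ++*static_cast<int*>(p); }
```

Whether this helps depends on the target; an optimizing compiler may already unroll the loop on its own, so the generated assembly is worth checking.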
Also, can you move any of the functionality out of the ISR?
Can any part of the functionality be performed by the "background loop" instead of the ISR? This would reduce the time in your ISR.

You can hide the void* type erasure and type recovery inside templates. The result would (hopefully) be the same array of function pointers. This would help with casting and stay compatible with your code:
#include <iostream>
template<class ParamType,class F>
void fun(void* param) {
F f;
f(*static_cast<ParamType*>(param));
}
struct my_function {
void operator()(int& i) {
std::cout << "got it " << i << std::endl;
}
};
int main() {
void (*func)(void*) = fun<int, my_function>;
int j=4;
func(&j);
return 0;
}
In this case you can create new functions as function objects with more type safety. The "normal" OOP approach with virtual functions doesn't help here.
In a C++11 environment you could create the array at compile time with the help of variadic templates (but with a complicated syntax).
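One way the compile-time table could be built with C++11 variadic templates is sketched below. TaskFn, trampoline, TodoTable and the two example task types are hypothetical names chosen for this illustration; the answer above does not spell out its own version.

```cpp
#include <cstddef>

typedef void (*TaskFn)(void*);

// One plain function per task type; it recovers the type and calls execute().
template <class T>
void trampoline(void* p) { static_cast<T*>(p)->execute(); }

// Holds one function pointer per type in the list, in the order given.
template <class... Ts>
struct TodoTable {
    static const std::size_t size = sizeof...(Ts);
    static const TaskFn entries[sizeof...(Ts)];
};

// The pack expansion produces { &trampoline<T0>, &trampoline<T1>, ... }.
template <class... Ts>
const TaskFn TodoTable<Ts...>::entries[sizeof...(Ts)] = { &trampoline<Ts>... };

// Example task types, assumed for the sketch.
struct Blink  { int toggles; void execute() { ++toggles; } };
struct Sample { int value;   void execute() { value *= 2; } };

typedef TodoTable<Blink, Sample> MyTodo;
```

The table contains only ordinary function pointers, so the dispatch loop stays the same as in the C version while the casts live in one place.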

This is unrelated to your question, but if you are that keen on performance you could use templates to do a loop unroll for the todolist:
#include <iostream>
void (*todo[3])(void *);
void *param[3];
void f1(void*) { std::cout << "1" << std::endl; }
void f2(void*) { std::cout << "2" << std::endl; }
void f3(void*) { std::cout << "3" << std::endl; }
template<int N>
struct Obj {
    static void apply()
    {
        todo[N-1](param[N-1]);
        Obj<N-1>::apply();
    }
};
// Terminates the recursion.
template<> struct Obj<0> { static void apply() {} };
int main()
{
    todo[0] = f1;
    todo[1] = f2;
    todo[2] = f3;
    // Expands to todo[2](...), todo[1](...), todo[0](...), i.e. in reverse order.
    Obj<sizeof todo / sizeof *todo>::apply();
}

Find out where your compiler puts the vtable, access it directly to get the function pointers, and store them for later use. That way you will have pretty much the same approach as in C, with an array of function pointers.

Related

Why can't GCC assume that std::vector::size won't change in this loop?

I claimed to a coworker that if (i < input.size() - 1) print(0); would get optimized in this loop so that input.size() is not read in every iteration, but it turns out that this is not the case!
void print(int x) {
std::cout << x << std::endl;
}
void print_list(const std::vector<int>& input) {
int i = 0;
for (size_t i = 0; i < input.size(); i++) {
print(input[i]);
if (i < input.size() - 1) print(0);
}
}
According to the Compiler Explorer with gcc options -O3 -fno-exceptions we are actually reading input.size() each iteration and using lea to perform a subtraction!
movq 0(%rbp), %rdx
movq 8(%rbp), %rax
subq %rdx, %rax
sarq $2, %rax
leaq -1(%rax), %rcx
cmpq %rbx, %rcx
ja .L35
addq $1, %rbx
Interestingly, in Rust this optimization does occur. It looks like i gets replaced with a variable j that is decremented each iteration, and the test i < input.size() - 1 is replaced with something like j > 0.
fn print(x: i32) {
println!("{}", x);
}
pub fn print_list(xs: &Vec<i32>) {
for (i, x) in xs.iter().enumerate() {
print(*x);
if i < xs.len() - 1 {
print(0);
}
}
}
In the Compiler Explorer the relevant assembly looks like this:
cmpq %r12, %rbx
jae .LBB0_4
I checked and I am pretty sure r12 is xs.len() - 1 and rbx is the counter. Earlier there is an add for rbx and a mov outside of the loop into r12.
Why is this? It seems like if GCC is able to inline the size() and operator[] as it did, it should be able to know that size() does not change. But maybe GCC's optimizer judges that it is not worth pulling it out into a variable? Or maybe there is some other possible side effect that would make this unsafe--does anyone know?

The non-inline function call to cout.operator<<(int) is a black box for the optimizer (because the library is just written in C++ and all the optimizer sees is a prototype; see discussion in comments). It has to assume any memory that could possibly be pointed to by a global var has been modified.
(Or the std::endl call. BTW, why force a flush of cout at that point instead of just printing a '\n'?)
E.g. for all it knows, std::vector<int> &input is a reference to a global variable, and one of those function calls modifies that global var. (Or there's a global vector<int> *ptr somewhere, or there's a function that returns a pointer to a static vector<int> in some other compilation unit, or some other way that a function could get a reference to this vector without being passed a reference to it by us.)
If you had a local variable whose address had never been taken, the compiler could assume that non-inline function calls couldn't mutate it. Because there'd be no way for any global variable to hold a pointer to this object. (This is called Escape Analysis). That's why the compiler can keep size_t i in a register across function calls. (int i can just get optimized away because it's shadowed by size_t i and not used otherwise).
It could do the same with a local vector (i.e. for the base, end_size and end_capacity pointers.)
ISO C99 has a solution for this problem: int *restrict foo. Many C++ compilers support int *__restrict foo to promise that memory pointed to by foo is only accessed via that pointer. This is most commonly useful in functions that take two arrays, where you want to promise the compiler that they don't overlap, so it can auto-vectorize without generating code to check for overlap and run a fallback loop.
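A small sketch of the two-array case follows. Note that __restrict is a compiler extension (accepted by GCC, Clang and MSVC), not ISO C++, and scale_add is a name invented for this example.

```cpp
// The __restrict qualifiers promise that dst and src never overlap, so the
// compiler may keep values in registers and vectorize without an overlap check.
// Calling this with overlapping arrays breaks the promise and is undefined.
void scale_add(float* __restrict dst, const float* __restrict src,
               float factor, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] += factor * src[i];
}
```

The qualifier changes nothing about the result for valid calls; it only removes an assumption the optimizer would otherwise have to make.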
The OP comments:
In Rust a non-mutable reference is a global guarantee that no one else is mutating the value you have a reference to (equivalent to C++ restrict)
That explains why Rust can make this optimization but C++ can't.
Optimizing your C++
Obviously you should use auto size = input.size(); once at the top of your function so the compiler knows it's a loop invariant. C++ implementations don't solve this problem for you, so you have to do it yourself.
You might also need const int *data = input.data(); to hoist loads of the data pointer from the std::vector<int> "control block" as well. It's unfortunate that optimizing can require very non-idiomatic source changes.
Rust is a much more modern language, designed after compiler developers learned what was possible in practice for compilers. It really shows in other ways, too, including portably exposing some of the cool stuff CPUs can do via i32.count_ones, rotate, bit-scan, etc. It's really dumb that ISO C++ didn't expose any of these portably until C++20's <bit> header (std::popcount, std::rotl, std::countl_zero, etc.); before that, the only option was std::bitset::count().

Cost of dereferencing a variable on the stack, or dereferenced recently?

With a modern compiler, is it as expensive to dereference a pointer a second time, when the data it points to was dereferenced recently?
int * ptr = new int();
... lots of stuff...
*ptr = 1; // may need to load the memory into the cpu
*ptr = 2; // accessed again, can I assume this will usually be loaded and cost nothing extra?
What if the pointer addresses a variable on the stack, can I assume reading/writing through a pointer to a stack variable costs the same as reading/writing directly to the variable?
int var;
int * ptr = &var;
*ptr = 0; // will this cost the same as if I just said var = 0; ?
And finally, does this extend to more complicated things, such as manipulating a base object on the stack through an interface?
Base baseObject;
Derived * derivedObject = &baseObject;
derivedObject->value = 42; // will this have the same cost as if I just--
derivedObject->doSomething() // --manipulated baseObject directly?
Edit: I'm asking this to gain a deeper understanding; this is less a problem to be solved than it is a request for insight. Please don't worry about "premature-optimization" or other practical concerns, just give me all the rope you can :)

This question contains a number of ambiguities.
A simple rule of thumb is that dereferencing something will always have the same cost, except when it doesn't.
There are a number of factors in the cost of a dereference: whether the destination is in cache, whether it is paged in, and what code the compiler generated.
For the code snippet
Obj* p = new Obj;
// <elided> //
p->something = 1;
looking at this source code, we can't tell whether the executable will have p's value loaded, whether *p is in cache, or whether *p has even been accessed.
Obj* p = new Obj;
p->something = 1;
We still can't be sure whether *p is paged/cached, but most modern compilers/optimizers will not emit code that retrieves p and stores it and then fetches it again.
In practice on modern hardware, you really shouldn't be concerned with it, and if you are, start by looking at the assembly.
I'll use two ends of the spectrum:
struct Obj { int something; int other; };
Obj* f() {
Obj* p = new Obj;
p->something = 1;
p->other = 2;
return p;
}
extern void fn2(Obj**);
Obj* h() {
Obj* p = new Obj;
fn2(&p);
p->something = 1;
fn2(&p);
p->other = 2;
return p;
}
This produces
f():
subq $8, %rsp
movl $8, %edi
call operator new(unsigned long)
movl $1, (%rax)
movl $2, 4(%rax)
addq $8, %rsp
ret
and
h():
subq $24, %rsp
movl $8, %edi
call operator new(unsigned long)
leaq 8(%rsp), %rdi
movq %rax, 8(%rsp)
call fn2(Obj**)
movq 8(%rsp), %rax
leaq 8(%rsp), %rdi
movl $1, (%rax)
call fn2(Obj**)
movq 8(%rsp), %rax
movl $2, 4(%rax)
addq $24, %rsp
ret
Here the compiler has to preserve and restore the pointer to dereference it after the call, but that's a bit unfair because the pointer could be modified by the called function.
Obj* h() {
Obj* p = new Obj;
fn2(nullptr);
p->something = 1;
fn2(nullptr);
p->other = 2;
return p;
}
produces
h():
pushq %rbx
movl $8, %edi
call operator new(unsigned long)
xorl %edi, %edi
movq %rax, %rbx
call fn2(Obj**)
xorl %edi, %edi
movl $1, (%rbx)
call fn2(Obj**)
movq %rbx, %rax
movl $2, 4(%rbx)
popq %rbx
ret
we're still seeing some register shenanigans, but it's hardly expensive.
As for your questions about pointers to the stack, a good optimizer will be able to eliminate those, but again you have to consult the assembly generated by your chosen compiler for your particular platform.
struct Obj { int something; int other; };
void fn(Obj*);
void f()
{
Obj o;
Obj* p = &o;
p->something = 1;
p->other = 1;
fn(p);
}
produces the following where p has basically been eliminated.
f():
subq $24, %rsp
movq %rsp, %rdi
movl $1, (%rsp)
movl $1, 4(%rsp)
call fn(Obj*)
addq $24, %rsp
ret
Of course, if we passed &p to something, the compiler wouldn't be able to elide it entirely, but it still might be smart enough to avoid using it when it didn't absolutely have to.

Trust the compiler.
You can be pretty sure the compiler will generate code that does the least amount of work possible, taking into account the peculiarities of the CPU architecture and anything else it knows about.

With a modern compiler, is it as expensive to dereference a pointer a
second time, when the data it points to was dereferenced recently?
With full optimization, the compiler might be able to rearrange the code (depending on the code), or perhaps stash the pointer into a register, or perhaps stash the value in a register ... so maybe.
One might look at the generated assembly code to confirm.
I consider this premature optimization.
Also, if you are thinking of cache, then possibly (but not guaranteed), when close together in time and when both mem addresses are in the same cache block, the two dereferences through the pointer will each access the cache mem without a cache miss.
Any writes will be placed in cache, and delivered to memory when the cache hw gets around to it or a cache miss causes the flush to memory.
What if the pointer addresses a variable on the stack, can I assume
reading/writing through a pointer to a stack variable costs the same
as reading/writing directly to the variable?
I doubt that you could or should assume anything. You might inspect the generated assembly to see what your compiler did with this code on this target architecture and with this compiler version, build options, etc. There are hundreds, if not thousands, of variables which might affect the code generation.
Note that data cache works for stack accesses, also.
Again, I consider this premature optimization.
And finally, does this extend to more complicated things, such as
manipulating a base object on the stack through an interface?
In general, the compiler does a good job. So, in that sense, possibly. Not guaranteed.
I think using move semantics (a C++ feature) is valuable, but that is maybe not relatable to your questions here.
Hardware cache is probably more important than any number of cycles you may wish to count manually (or simulate). I was impressed by how much the data cache (for automatic variables and dynamic variables) improved the performance on an embedded system. But the code cache was impressive too. I would not want to do without either of them.
By premature optimization I mean that
a) humans are notoriously unable to understand (or 'guess') where the hot spots are in their programs, i.e. the idea that 20% of the code consumes 80% of the cycles. That is why there are tools to help pinpoint them.
b) I have always heard that the better algorithms outperform other choices, and I would say that is usually true. The better algorithms are what you should be learning. From your SO rep, you probably know more than I do.
However, I feel readability is the more appropriate criterion for evaluation. The feedback from former colleagues I like to hear is, "...we still use your code." because that statement suggests that it works, has not given them too much trouble, is fast enough, and (most important) is readable.
c) counting cycles in any code can only be attempted with simulation. I've done it for an embedded military processor.
To simulate cache actions, you actually have to have usable code to evaluate, have to understand the interactions between processor and cache, and have to know cache block sizes of both data and instruction cache.
Choosing the faster version (as a criterion) is ... well, if the first version meets the requirements, your customer (boss, team lead, etc.) will probably be unwilling to wait for / pay for faster code.
I've been on one big project to fix 'a system' that was deemed too slow (predecessor choices, not mine) ... management chose the safe to understand and cost estimate path: re-design the processor card with about 10x more processing cycles and 32x more ram. The software team refactored the code, and added the newest features as best we could. From overloaded, the new 'system' was running at 1/3 duty cycle. Big improvement that was visible in every command response time.

For examples 1 and 2, it depends on the context. If you only perform stores and never use the stored values, the compiler will simply drop them. The last example is a compilation error: you cannot point to a Base object with a Derived*.
Check for yourself:
https://godbolt.org/g/pMiOfj
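The direction of the conversion can also be checked without running anything. The sketch below uses placeholder definitions of Base and Derived (the question does not show its own) and assumes that Derived inherits from Base, which is the usual arrangement.

```cpp
#include <type_traits>

// Placeholder types, assumed for this illustration.
struct Base { int value = 0; };
struct Derived : Base { int extra = 0; };

// A Derived* converts to a Base* implicitly...
static_assert(std::is_convertible<Derived*, Base*>::value,
              "derived-to-base pointer conversion is implicit");
// ...but a Base* does not convert to a Derived* without an explicit cast.
static_assert(!std::is_convertible<Base*, Derived*>::value,
              "base-to-derived needs an explicit cast");

// Writing through the base-class pointer modifies the same stack object.
inline int write_through_base() {
    Derived object;
    Base* p = &object;
    p->value = 42;
    return object.value;
}
```

With optimization enabled, a compiler that can see the whole of write_through_base is free to reduce it to returning the constant.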
For the case where Base extends Derived (from the example in the comment):
Again, the answer depends on how much information the compiler has. If it can see that it is just an object on the stack, it will optimize, and both of them will be equivalent. https://godbolt.org/g/jIKQJv
Note: You don't have to worry about such details; the compiler knows better how to optimize. Use the notation that is more readable. This is premature optimization.

CPU overhead for struct?

In C/C++, is there any CPU overhead for accessing struct members in comparison to isolated variables?
For a concrete example, should something like the first code sample below use more CPU cycles than the second one? Would it make any difference if it were a class instead of a struct? (in C++)
1)
struct S {
int a;
int b;
};
struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
2)
int a;
int b;
a = 10;
b = 20;
a++;
b++;

"Don't optimize yet." The compiler will figure out the best case for you. Write what makes sense first, and make it faster later if you need to. For fun, I ran the following in Clang 3.4 (-O3 -S):
void __attribute__((used)) StructTest() {
struct S {
int a;
int b;
};
volatile struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
}
void __attribute__((used)) NoStructTest() {
volatile int a;
volatile int b;
a = 10;
b = 20;
a++;
b++;
}
int main() {
StructTest();
NoStructTest();
}
StructTest and NoStructTest have identical ASM output:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $10, -4(%ebp)
movl $20, -8(%ebp)
incl -4(%ebp)
incl -8(%ebp)
addl $8, %esp
popl %ebp
ret

No. The size of all the types in the struct, and thus the offset to each member from the beginning of the struct, is known at compile-time, so the address used to fetch the values in the struct is every bit as knowable as the addresses of individual variables.
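That the member offsets are compile-time constants can be shown with offsetof. This is a small sketch using the struct from the question; kOffsetA, kOffsetB and sum_members are names added here.

```cpp
#include <cstddef>

struct S {
    int a;
    int b;
};

// offsetof yields constant expressions, so the offsets are usable at compile time.
constexpr std::size_t kOffsetA = offsetof(S, a);
constexpr std::size_t kOffsetB = offsetof(S, b);

static_assert(kOffsetA == 0, "the first member starts at the struct's address");
static_assert(kOffsetB >= sizeof(int), "b is laid out after a");

// The same operations as in the question, done on struct members.
inline int sum_members() {
    S s;
    s.a = 10;
    s.b = 20;
    s.a++;
    s.b++;
    return s.a + s.b;
}
```

Because the offsets are fixed, an access such as s.b compiles to a base address plus a constant, the same addressing form used for a stand-alone local variable.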

My understanding is that all of the values in a struct are adjacent in memory, and are better able to take advantage of the memory caching than the variables.
The variables are probably adjacent in memory too, but they're not guaranteed to be adjacent like the struct.
That being said, cpu performance should not be a consideration when deciding whether to use or not use a struct in the first place.

The real answer is: it completely depends on your CPU architecture and your compiler. The best way is to compile and look at the assembly code.
Now for an x86 machine, I'm pretty sure there isn't any. The offset is computed at compile time, and there is an addressing mode that takes an offset.

If you ask the compiler to optimize (e.g. compile with gcc -O2 or g++ -O2) then there is not much overhead (probably too small to be measurable, or perhaps a few percent).
However, if you use only local variables, the optimizing compiler might not even allocate slots for them in the local call frame.
Compile with gcc -O2 -fverbose-asm -S and look into the generated assembly code.
Using a class won't make any difference (of course, some class-es have costly constructors & destructors).
Such code could be useful in generated C or C++ code (like MELT does); such local struct-s or class-es contain the local call frame (as seen by the MELT language, see e.g. its gcc/melt/warmelt-genobj+01.cc generated file). I don't claim it is as efficient as real C++ local variables, but it gets optimized enough.

Template version and non-template version of the same function

Consider the following simple function
void foo_rt(int n) {
for(int i=0; i<n; ++i) {
// ... do something relatively cheap ...
}
}
If I know the parameter n at compiletime, I can write a template version of the same function:
template<int n>
void foo_ct() {
for(int i=0; i<n; ++i) {
// ... do something relatively cheap ...
}
}
This allows the compiler to do things like loop unrolling, which increases speed.
But assume now that I sometimes know n at compiletime and sometimes only at runtime. How can I implement this without maintaining two versions of the function? I was thinking something along the lines:
inline void foo(int n) {
for(int i=0; i<n; ++i) {
// ... do something relatively cheap ...
}
}
// Runtime version
void foo_rt(int n) { foo(n); }
// Compiletime version
template<int n>
void foo_ct() { foo(n); }
But I am not sure if all compilers are smart enough to deal with this. Is there a better way?
EDIT:
Clearly, one solution that will work is to use macros, but this I really want to avoid:
#define foo_body \
{ \
for(int i=0; i<n; ++i) { \
// ... do something relatively cheap ... \
} \
}
// Runtime version
void foo_rt(int n) foo_body
// Compiletime version
template<int n>
void foo_ct() foo_body

I've done this before, using an integral_variable type and std::integral_constant. This looks like a lot of code, but if you look again, it's actually only a series of four very simple pieces, one of which is merely demo code.
#include <type_traits>
//type for acting like integral_constant but with a variable
template<class underlying>
struct integral_variable {
const underlying value;
integral_variable(underlying v) :value(v) {}
};
//generic function
template<class value>
void foo(value n) {
for(int i=0; i<n.value; ++i) {
// ... do something relatively cheap ...
}
}
//optional: specialize so callers don't have to do casts
void foo_rt(int n) { return foo(integral_variable<int>(n)); }
template<int n>
void foo_ct() { return foo(std::integral_constant<unsigned, n>()); }
//notice it even handles different underlying types. Doesn't care.
//usage is simple
int main() {
foo_rt(3);
foo_ct<17>();
}

Much as I admire the DRY principle, I don't think there's a way around writing it twice.
Even though the code is the same, these are two very different operations -- working with a known value versus working with an unknown value.
You want to put the known one on a fast track to optimization that the unknown one may not qualify for.
What I would do is factor out all the code that does not depend on n into another function (which hopefully is the entire body of your for loop), and then have both your templated and non-templated versions call that within their loops. That way, the only thing you're repeating is the structure of the for loop, which I wouldn't consider a big deal.
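A minimal sketch of that suggestion, under some assumptions: the helper name body and the accumulator parameter are hypothetical, and the functions return a value only so the result can be checked. Only the for loop is written twice; everything that does not depend on n lives in one place.

```cpp
// Hypothetical loop body: all work that does not depend on n goes here.
inline void body(int i, long& acc) {
    acc += i;  // stand-in for "something relatively cheap"
}

// Runtime version: n is only known when the call happens.
inline long foo_rt(int n) {
    long acc = 0;
    for (int i = 0; i < n; ++i) {
        body(i, acc);
    }
    return acc;
}

// Compile-time version: n is a template parameter, so the trip count
// is a constant the optimizer may use for unrolling.
template <int n>
long foo_ct() {
    long acc = 0;
    for (int i = 0; i < n; ++i) {
        body(i, acc);
    }
    return acc;
}
```

Calling foo_rt(5) and foo_ct<5>() runs the same body five times; whether the compile-time version is actually unrolled still depends on the compiler and its optimization settings.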

If a value is known at compile time, routing it through a template as a template parameter doesn't make it any more known at compile time. I think it's very unlikely that there are any compilers out there that will inline and optimize a function simply because the variable is a template parameter rather than some other kind of compile-time constant.
Depending on your compiler you may not even need two versions of the function. An optimizing compiler may well just be able to optimize a function called with constant expression parameters. For example:
extern volatile int *I;
void foo(int n) {
for (int i=0;i<n;++i)
*I = i;
}
int main(int argc,char *[]) {
foo(4);
foo(argc);
}
My compiler turns this into an inlined, unrolled loop from 0 to 3, followed by an inlined loop on argc:
main: # #main
# BB#0: # %entry
movq I(%rip), %rax
movl $0, (%rax)
movl $1, (%rax)
movl $2, (%rax)
movl $3, (%rax)
testl %ecx, %ecx
jle .LBB1_3
# BB#1: # %for.body.lr.ph.i
xorl %eax, %eax
movq I(%rip), %rdx
.align 16, 0x90
.LBB1_2: # %for.body.i4
# =>This Inner Loop Header: Depth=1
movl %eax, (%rdx)
incl %eax
cmpl %eax, %ecx
jne .LBB1_2
.LBB1_3: # %_Z3fooi.exit5
xorl %eax, %eax
ret
To get such optimizations you either need to ensure the definition is available to all translation units (e.g., by defining the function as inline in a header file), or have a compiler that does link-time optimization.
If you use this and you're really depending on some things being computed at compile time then you should have automated tests to verify it gets done.
C++11 provides constexpr, which lets you write a function that can be evaluated at compile time when given constant-expression arguments, and lets you require that a value be computed at compile time. There are restrictions on what can go in a constexpr function, which may make it difficult to implement your function as a constexpr, but the allowed language is apparently Turing complete. One issue is that, while the restrictions guarantee that the computation can be done at compile time if given constexpr parameters, those restrictions may result in an inefficient implementation for when the parameters are not constexpr.
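As an illustration of the constexpr point, here is a small sketch written in the restricted C++11 style (a single return statement, so the loop becomes recursion). The function names sum_below and sum_below_runtime are made up for this example. The recursive form also hints at the efficiency concern mentioned above: it is not how you would normally write the runtime version.

```cpp
// C++11-style constexpr function: the body is one return statement,
// so iteration has to be expressed as recursion.
constexpr int sum_below(int n) {
    return n <= 0 ? 0 : (n - 1) + sum_below(n - 1);
}

// A static_assert operand must be a constant expression, so this call
// is evaluated by the compiler, not at run time.
static_assert(sum_below(5) == 10, "sum_below(5) is computed at compile time");

// The same function also accepts a value that is only known at run time;
// in that case it is an ordinary (recursive) function call.
inline int sum_below_runtime(int n) {
    return sum_below(n);
}
```

If the static_assert compiles, the compile-time path has been exercised, which is one simple way to automate the kind of check suggested above.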

Why not something like:
template<typename getter>
void f(getter g)
{
for (int i =0; i < g.get(); i++) { blah(); }
}
struct getter1 { constexpr int get() const { return 1; } };
struct getterN {
explicit getterN(int n) : n_(n) {}
int get() const { return n_; } // value only known at run time
int n_;
};
// at the call site:
f<getter1>(getter1());
f<getterN>(getterN(100));

I would point out that if you have:
// Runtime version
void foo_rt(int n){ foo(n);}
... and this works for you, then you do in fact know the type at compile time. At least, you know a type that it is covariant with, and that's all you need to know. You can just use the templated version. If need be, you can specify the type at the call site, like this:
foo_rt<int>()

Is there any overhead to declaring a variable within a loop? (C++) [duplicate]

I am just wondering if there would be any loss of speed or efficiency if you did something like this:
int i = 0;
while(i < 100)
{
int var = 4;
i++;
}
which declares int var one hundred times. It seems to me like there would be, but I'm not sure. Would it be more practical/faster to do this instead:
int i = 0;
int var;
while(i < 100)
{
var = 4;
i++;
}
or are they the same, speedwise and efficiency-wise?

Stack space for local variables is usually allocated in function scope. So no stack pointer adjustment happens inside the loop, just assigning 4 to var. Therefore these two snippets have the same overhead.

For primitive types and POD types, it makes no difference. The compiler will allocate the stack space for the variable at the beginning of the function and deallocate it when the function returns in both cases.
For non-POD class types that have non-trivial constructors, it WILL make a difference -- in that case, putting the variable outside the loop will only call the constructor and destructor once and the assignment operator each iteration, whereas putting it inside the loop will call the constructor and destructor for every iteration of the loop. Depending on what the class' constructor, destructor, and assignment operator do, this may or may not be desirable.
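The difference for class types can be made visible with a small counting sketch. Tracked, Counts, declared_inside and declared_outside are hypothetical names used only for this illustration; the class does nothing except count how often its constructor, destructor and assignment operator run.

```cpp
// Hypothetical class that counts calls to its special member functions.
struct Tracked {
    static int constructions;
    static int destructions;
    static int assignments;
    int value;

    explicit Tracked(int v) : value(v) { ++constructions; }
    ~Tracked() { ++destructions; }
    Tracked& operator=(int v) { value = v; ++assignments; return *this; }
};
int Tracked::constructions = 0;
int Tracked::destructions = 0;
int Tracked::assignments = 0;

struct Counts { int constructions; int destructions; int assignments; };

inline void reset_counts() {
    Tracked::constructions = Tracked::destructions = Tracked::assignments = 0;
}

// Variable declared inside the loop: constructed and destroyed every iteration.
inline Counts declared_inside(int iterations) {
    reset_counts();
    for (int i = 0; i < iterations; ++i) {
        Tracked t(4);
        (void)t;  // silence unused-variable warnings
    }
    return Counts{Tracked::constructions, Tracked::destructions, Tracked::assignments};
}

// Variable declared outside the loop: constructed once, assigned every iteration.
inline Counts declared_outside(int iterations) {
    reset_counts();
    {
        Tracked t(0);
        for (int i = 0; i < iterations; ++i) {
            t = 4;
        }
    }  // t is destroyed here, once
    return Counts{Tracked::constructions, Tracked::destructions, Tracked::assignments};
}
```

For 100 iterations, declared_inside reports 100 constructions and 100 destructions with no assignments, while declared_outside reports one construction, one destruction and 100 assignments. Which of the two is cheaper depends entirely on what those operations do in the real class.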

They are the same, and you can verify this by looking at what the compiler (gcc 4.0) does with your simple examples, even without optimisation set to high:
1.c:
int main() { int var; int i = 0; while (i < 100) { var = 4; } }
gcc -S 1.c
1.s:
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl $0, -16(%ebp)
jmp L2
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
leave
ret
2.c
int main() { int i = 0; while (i < 100) { int var = 4; } }
gcc -S 2.c
2.s:
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl $0, -16(%ebp)
jmp L2
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
leave
ret
From these, you can see three things. Firstly, the code is the same in both.
Secondly, the storage for var is allocated outside the loop:
subl $24, %esp
And finally the only thing in the loop is the assignment and condition check:
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
Which is about as efficient as you can be without removing the loop entirely.

These days it is better to declare it inside the loop unless it is a constant as the compiler will be able to better optimize the code (reducing variable scope).
EDIT: This answer is mostly obsolete now. With the rise of post-classical compilers, the cases where the compiler can't figure it out are getting rare. I can still construct them but most people would classify the construction as bad code.

Most modern compilers will optimize this for you. That being said, I would use your first example, as I find it more readable.

For a built-in type there will likely be no difference between the two styles (probably right down to the generated code).
However, if the variable is a class with a non-trivial constructor/destructor there could well be a major difference in runtime cost. I'd generally scope the variable to inside the loop (to keep the scope as small as possible), but if that turns out to have a perf impact I'd look at moving the class variable outside the loop's scope. However, doing that needs some additional analysis, as the semantics of the code path may change, so this can only be done if the semantics permit it.
An RAII class might need this behavior. For example, a class that manages file access lifetime might need to be created and destroyed on each loop iteration to manage the file access properly.
Suppose you have a LockMgr class that acquires a critical section when it's constructed and releases it when destroyed:
while (i< 100) {
LockMgr lock( myCriticalSection); // acquires a critical section at start of
// each loop iteration
// do stuff...
} // critical section is released at end of each loop iteration
is quite different from:
LockMgr lock( myCriticalSection);
while (i< 100) {
// do stuff...
}
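To make the semantic difference concrete, here is a sketch with a stand-in critical section that only counts how often it is entered and left. CriticalSection, lock_per_iteration and lock_once are hypothetical names, and this LockMgr is an illustrative RAII guard, not a real library class.

```cpp
// Hypothetical stand-in for a critical section: it only counts calls.
struct CriticalSection {
    int acquisitions = 0;
    int releases = 0;
    void enter() { ++acquisitions; }
    void leave() { ++releases; }
};

// Illustrative RAII guard: acquires in the constructor, releases in the destructor.
class LockMgr {
public:
    explicit LockMgr(CriticalSection& cs) : cs_(cs) { cs_.enter(); }
    ~LockMgr() { cs_.leave(); }
    LockMgr(const LockMgr&) = delete;
    LockMgr& operator=(const LockMgr&) = delete;
private:
    CriticalSection& cs_;
};

// Guard declared inside the loop: acquired and released on every iteration.
inline int lock_per_iteration(int iterations) {
    CriticalSection cs;
    for (int i = 0; i < iterations; ++i) {
        LockMgr lock(cs);
        // do stuff...
    }  // released here, at the end of each iteration
    return cs.acquisitions;
}

// Guard declared outside the loop: held for the whole loop.
inline int lock_once(int iterations) {
    CriticalSection cs;
    {
        LockMgr lock(cs);
        for (int i = 0; i < iterations; ++i) {
            // do stuff...
        }
    }  // released here, once
    return cs.acquisitions;
}
```

With 100 iterations the first function acquires the section 100 times and the second only once. In real code that means other threads get a chance to enter between iterations in the first form, but are locked out for the entire loop in the second.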

Both loops have the same efficiency. They will both take an infinite amount of time :) It may be a good idea to increment i inside the loops.

I once ran some performance tests, and to my surprise, found that case 1 was actually faster! I suppose this may be because declaring the variable inside the loop reduces its scope, so it gets freed earlier. However, that was a long time ago, on a very old compiler. I'm sure modern compilers do a better job of optimizing away the differences, but it still doesn't hurt to keep your variable scope as short as possible.

#include <stdio.h>
int main()
{
for(int i = 0; i < 10; i++)
{
int test;
if(i == 0)
test = 100;
printf("%d\n", test);
}
}
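With common compilers the code above typically prints 100 ten times, which suggests that the local variable inside the loop is allocated only once per function call. Strictly speaking, reading test when i != 0 is undefined behavior because it is uninitialized in those iterations, so this output is not guaranteed.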

The only way to be sure is to time them. But the difference, if there is one, will be microscopic, so you will need a mighty big timing loop.
More to the point, the first one is better style because it initializes the variable var, while the other one leaves it uninitialized. This, and the guideline that one should define variables as near to their point of use as possible, mean that the first form should normally be preferred.

With only two variables, the compiler will likely assign a register to each. These registers are there anyway, so this doesn't take time. There are two register writes and one register read in either case.

I think that most answers are missing a major point to consider, which is: "Is it clear?" Judging by all the discussion, obviously it is not.
I'd suggest that in most loop code the efficiency is pretty much a non-issue (unless you are calculating for a Mars lander), so really the only question is what looks more sensible, readable and maintainable. In this case I'd recommend declaring the variable up front, outside the loop; this simply makes it clearer. Then people like you and me would not even bother to waste time checking online to see whether it's valid or not.

That's not true.
There is overhead; however, it's negligible.
Even though they will probably end up at the same place on the stack, it still assigns it.
It assigns a memory location on the stack for that int and then frees it at the closing }. Not "free" in the heap sense, but in the sense that it moves sp (the stack pointer) by 1.
And in your case, considering it has only one local variable, it will simply equate fp (the frame pointer) and sp.
The short answer would be: don't care, either way works almost the same.
But try reading more on how the stack is organized. My undergrad school had pretty good lectures on that.
If you want to read more, check here:
http://www.cs.utk.edu/~plank/plank/classes/cs360/360/notes/Assembler1/lecture.html
