CPU overhead for struct? - c++

In C/C++, is there any CPU overhead for accessing struct members compared to isolated variables?
For a concrete example, should something like the first code sample below use more CPU cycles than the second one? Would it make any difference if it were a class instead of a struct? (in C++)
1)
struct S {
    int a;
    int b;
};
struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
2)
int a;
int b;
a = 10;
b = 20;
a++;
b++;

"Don't optimize yet." The compiler will figure out the best case for you. Write what makes sense first, and make it faster later if you need to. For fun, I ran the following in Clang 3.4 (-O3 -S):
void __attribute__((used)) StructTest() {
    struct S {
        int a;
        int b;
    };
    volatile struct S s;
    s.a = 10;
    s.b = 20;
    s.a++;
    s.b++;
}

void __attribute__((used)) NoStructTest() {
    volatile int a;
    volatile int b;
    a = 10;
    b = 20;
    a++;
    b++;
}

int main() {
    StructTest();
    NoStructTest();
}
StructTest and NoStructTest have identical ASM output:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $10, -4(%ebp)
movl $20, -8(%ebp)
incl -4(%ebp)
incl -8(%ebp)
addl $8, %esp
popl %ebp
ret

No. The sizes of all the types in the struct, and thus the offset to each member from the beginning of the struct, are known at compile time, so the address used to fetch a value from the struct is every bit as knowable as the address of an individual variable.
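To make that concrete, here is a small sketch (mine, not part of the original answer) showing that member offsets are compile-time constants:

#include <cstddef>

struct S {
    int a;
    int b;
};

// offsetof yields a compile-time constant, so the address of s.b is just
// "base of s plus a known constant" - the same kind of address computation
// as for a plain variable in the stack frame.
static_assert(offsetof(S, a) == 0, "a sits at the start of S");
static_assert(offsetof(S, b) == offsetof(S, a) + sizeof(int),
              "b follows a here (no padding between two ints on common ABIs)");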

My understanding is that all of the values in a struct are adjacent in memory, so they can take better advantage of memory caching than separate variables can.
Separate variables are probably adjacent in memory too, but they're not guaranteed to be adjacent the way struct members are.
That being said, CPU performance should not be a consideration when deciding whether or not to use a struct in the first place.
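As a rough illustration of the caching point (my sketch, not the answerer's):

struct S {
    int a;
    int b;
};

// Two ints occupy 8 bytes together, while a typical cache line is 64 bytes,
// so loading s.a almost certainly brings s.b into the cache as well. Two
// separate variables enjoy no such adjacency guarantee.
static_assert(sizeof(S) <= 64, "S fits within a single typical cache line");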

The real answer is: it completely depends on your CPU architecture and your compiler. The best way to find out is to compile and look at the assembly code.
For an x86 machine, I'm pretty sure there is no overhead. The offset is computed at compile time, and there is an addressing mode that takes a constant offset.

If you ask the compiler to optimize (e.g. compile with gcc -O2 or g++ -O2), then there is not much overhead (probably too small to be measurable, or at most a few percent).
However, if you use only local variables, the optimizing compiler might not even allocate slots for them in the local call frame.
Compile with gcc -O2 -fverbose-asm -S and look at the generated assembly code.
Using a class won't make any difference (though of course some class-es have costly constructors and destructors).
Such code can be useful in generated C or C++ code (as MELT does); such local struct-s or class-es contain the local call frame (as seen by the MELT language, see e.g. its gcc/melt/warmelt-genobj+01.cc generated file). I don't claim it is as efficient as real C++ local variables, but it gets optimized well enough.

Related

Performance difference between POD and non-POD classes

I'm struggling to understand why my compilers (g++ 8.1.0 and clang++ 6.0.0) treat POD (plain-old-data) and non-POD code differently.
Test code:
#include <iostream>
#include <type_traits> // needed for std::is_pod

struct slong {
    int i;
    ~slong() { i = 0; }
};

int get1(slong x) { return 1 + x.i; }

int main() {
    std::cerr << "is_pod(slong) = " << std::is_pod<slong>::value << std::endl;
}
defines a class slong with a destructor (hence not POD), and the compiler, with -Ofast, will produce the following for get1:
movl (%rdi), %eax
incl %eax
but when I comment out the destructor (so slong becomes POD) I get
leal 1(%rdi), %eax
Of course the performance issue is minor; still I'd like to understand. In other (more complicated) cases I also noticed more significant code differences.
Note that movl accesses memory, while leal doesn't.
When passing a struct to a function by value, the ABI can stuff it into a register (here rdi) if it's POD.
If the struct is not POD, the ABI must pass it on the stack (presumably because the code may need its address to call the destructor, access the vtable, and do other complicated stuff), so accessing its member requires an indirection.
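A minimal sketch of that ABI difference (hypothetical names; System V x86-64 assumed - check your own target's ABI):

struct pod    { int i; };               // trivially destructible
struct nonpod { int i; ~nonpod() {} };  // user-provided destructor

// pod is passed in a register, so x.i already sits in %edi and the compiler
// can emit leal 1(%rdi), %eax directly. nonpod is passed in memory with its
// address in %rdi, so x.i needs a load (movl (%rdi), %eax) first.
int get_pod(pod x)       { return 1 + x.i; }
int get_nonpod(nonpod x) { return 1 + x.i; }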

Why is a volatile local variable optimised differently from a volatile argument, and why does the optimiser generate a no-op loop from the latter?

Background
This was inspired by this question/answer and ensuing discussion in the comments: Is the definition of “volatile” this volatile, or is GCC having some standard compliancy problems?. Based on others' and my interpretation of what should be happening, as discussed in the comments, I've submitted it to GCC Bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71793 Other relevant responses are still welcome.
Also, that thread has since given rise to this question: Does accessing a declared non-volatile object through a volatile reference/pointer confer volatile rules upon said accesses?
Intro
I know volatile isn't what most people think it is and is an implementation-defined nest of vipers. And I certainly don't want to use the below constructs in any real code. That said, I'm totally baffled by what's going on in these examples, so I'd really appreciate any elucidation.
My guess is this is due to either highly nuanced interpretation of the Standard or (more likely?) just corner-cases for the optimiser used. Either way, while more academic than practical, I hope this is deemed valuable to analyse, especially given how typically misunderstood volatile is. Some more data points - or perhaps more likely, points against it - must be good.
Input
Given this code:
#include <cstddef>
void f(void *const p, std::size_t n)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    volatile unsigned char const x = 42;
    // N.B. Yeah, const is weird, but it doesn't change anything
    while (n--) {
        *y++ = x;
    }
}

void g(void *const p, std::size_t n, volatile unsigned char const x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = x;
    }
}

void h(void *const p, std::size_t n, volatile unsigned char const &x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = x;
    }
}
int main(int, char **)
{
    int y[1000];
    f(&y, sizeof y);
    volatile unsigned char const x{99};
    g(&y, sizeof y, x);
    h(&y, sizeof y, x);
}
Output
g++ from gcc (Debian 4.9.2-10) 4.9.2 (Debian stable a.k.a. Jessie) with the command line g++ -std=c++14 -O3 -S test.cpp produces the below ASM for main(). Version Debian 5.4.0-6 (current unstable) produces equivalent code, but I just happened to run the older one first, so here it is:
main:
.LFB3:
.cfi_startproc
# f()
movb $42, -1(%rsp)
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L21:
subq $1, %rax
movzbl -1(%rsp), %edx
jne .L21
# x = 99
movb $99, -2(%rsp)
movzbl -2(%rsp), %eax
# g()
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L22:
subq $1, %rax
jne .L22
# h()
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L23:
subq $1, %rax
movzbl -2(%rsp), %edx
jne .L23
# return 0;
xorl %eax, %eax
ret
.cfi_endproc
Analysis
All 3 functions are inlined, and both functions that create volatile local variables allocate them on the stack for fairly obvious reasons. But that's about the only thing they share...
f() reads from x on each iteration, presumably due to its volatile - but just dumps the result into edx, presumably because the destination y isn't declared volatile and is never read, meaning changes to it can be nixed under the as-if rule. OK, makes sense.
Well, I mean... kinda. Like, not really, because volatile is really for hardware registers, and clearly a local value can't be one of those - and can't otherwise be modified in a volatile way unless its address is passed out... which it's not. Look, there's just not a lot of sense to be had out of volatile local values. But C++ lets us declare them and tries to do something with them. And so, confused as always, we stumble onwards.
g(): What. By moving the volatile source into a pass-by-value parameter, which is still just another local variable, GCC somehow decides it's not volatile, or less volatile, and so it doesn't need to read it every iteration... but it still carries out the loop, despite its body now doing nothing.
h(): By taking the passed volatile as pass-by-reference, the same effective behaviour as f() is restored, so the loop does volatile reads.
This case alone actually makes practical sense to me, for reasons outlined above against f(). To elaborate: Imagine x refers to a hardware register, of which every read has side-effects. You wouldn't want to skip any of those.
Adding #define volatile /**/ leads to main() being a no-op, as you'd expect. So, when present, even on a local variable volatile does do something... I just have no idea what in the case of g(). What on Earth is going on there?
Questions
Why does a local value declared in-body produce different results from a by-value parameter, with only the latter's reads being optimised away? Both are declared volatile. Neither has its address passed out - and neither has a static address, ruling out any inline-ASM POKEry - so they can never be modified outwith the function. The compiler can see that each is constant, need never be re-read, and volatile just ain't true -
so (A) is either allowed to be elided under such constraints (acting as-if they weren't declared volatile)? -
and (B) why does only one get elided? Are some volatile local variables more volatile than others?
Setting aside that inconsistency for just a moment: After the read was optimised away, why does the compiler still generate the loop? It does nothing! Why doesn't the optimiser elide it as-if no loop was coded?
Is this a weird corner case due to order of optimising analyses or such? As the code is a daft thought-experiment, I wouldn't chastise GCC for this, but it'd be good to know for sure. (Or is g() the manual timing loop people have dreamt of all these years?) If we conclude there's no Standard bearing on any of this, I'll move it to their Bugzilla just for their information.
And of course, the more important question from a practical perspective, though I don't want that to overshadow the potential for compiler geekery... Which, if any of these, are well-defined/correct according to the Standard?
For f: GCC eliminates the non-volatile stores (but not the loads, which can have side-effects if the source location is a memory mapped hardware register). There is really nothing surprising here.
For g: Because of the x86_64 ABI, the parameter x of g is allocated in a register (rdx) and does not have a location in memory. Reading a general-purpose register does not have any observable side effects, so the dead read gets eliminated.

Performance hit of vtable lookup in C++

I'm evaluating whether to rewrite a piece of real-time software from C/assembly language to C++/assembly language (for reasons not relevant to the question, parts of the code absolutely have to be done in assembly).
An interrupt comes with a 3 kHz frequency, and for each interrupt around 200 different things are to be done in a sequence. The processor runs with 300 MHz, giving us 100,000 cycles to do the job. This has been solved in C with an array of function pointers:
// Each function does a different thing, all take one parameter being a pointer
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);
// Array of pointers to structs containing each function's parameters.
void *paramlist[200];
void realtime(void)
{
    int i;
    for (i = 0; i < 200; i++)
        (*todolist[i])(paramlist[i]);
}
Speed is important. The above 200 iterations are done 3,000 times per second, so practically we do 600,000 iterations per second. The above for loop compiles to five cycles per iteration, yielding a total cost of 3,000,000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to four instructions, however I fear we might get some extra delay due to memory accesses close to each other, etc. In short, I believe those five cycles are pretty optimal.
Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need and use, and have in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:
class Base
{
public:
    virtual void Execute();
    int something_all_things_need;
};

class Derived1 : public Base
{
public:
    void Execute() { /* Do something */ }
    int own_parameter;
    // Other own parameters
};

class Derived2 : public Base { /* Etc. */ };

Base *todolist[200];

void realtime(void)
{
    for (int i = 0; i < 200; i++)
        todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}
My problem is the vtable lookup. I cannot do 600,000 lookups per second; this would account for more than 4% of wasted CPU load. Moreover, the todolist never changes during run-time; it is only set up once at start-up, so the effort of looking up which function to call is truly wasted. Upon asking myself the question "what is the most optimal end result possible", I look at the assembler code given by the C solution, and find once again an array of function pointers...
What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.
Update (including correction of where the loop starts):
The processor is an ADSP-214xx, and the compiler is VisualDSP++ 5.0. When enabling #pragma optimize_for_speed, the C loop takes 9 cycles. Assembly-optimizing it in my mind yields 4 cycles, but I didn't test it, so that's not guaranteed. The C++ loop takes 14 cycles. I'm aware the compiler could do a better job; however, I did not want to dismiss this as a compiler issue - getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here is the resulting assembly:
C:
i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;
Here's the actual loop:
r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);
C++ :
i5=0xb279a;
r15=0xc8;
Here's the actual loop:
i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);
In the meanwhile, I think I have found sort of an answer. The lowest amount of cycles is achieved by doing the very least possible. I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers.
Naturally, the limit is what can be done in assembly, and that has now been explored. Having this in mind, I now understand that even though it comes naturally to introduce a base class, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...
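Expressed in C++, that conclusion might look like the sketch below (mine, not from the thread; the field names are invented), with each function pointer packed next to its data pointer:

struct Todo {
    void (*fn)(void *params);
    void *params;
};

Todo todolist[200];

void realtime(void)
{
    // Exactly the C dispatch loop, just fetching both pointers from one
    // contiguous array entry.
    for (int i = 0; i < 200; i++)
        todolist[i].fn(todolist[i].params);
}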
What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.
I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, having to do with the fact that the vtable involves an extra indirection.
C code (based on OP):
void (*todolist[200])(void *parameters);
void *paramlist[200];
void realtime(void)
{
    int i;
    for (i = 0; i < 200; i++)
        (*todolist[i])(paramlist[i]);
}
C++ code:
class Base {
public:
    Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
    virtual void operator()() = 0;
protected:
    void* unsafe_pointer_;
};

Base* todolist[200];

void realtime() {
    for (int i = 0; i < 200; ++i)
        (*todolist[i])();
}
Both compiled with gcc 4.8, -O3:
realtime: |_Z8realtimev:
.LFB0: |.LFB3:
.cfi_startproc | .cfi_startproc
pushq %rbx | pushq %rbx
.cfi_def_cfa_offset 16 | .cfi_def_cfa_offset 16
.cfi_offset 3, -16 | .cfi_offset 3, -16
xorl %ebx, %ebx | movl $todolist, %ebx
.p2align 4,,10 | .p2align 4,,10
.p2align 3 | .p2align 3
.L3: |.L3:
movq paramlist(%rbx), %rdi | movq (%rbx), %rdi
call *todolist(%rbx) | addq $8, %rbx
addq $8, %rbx | movq (%rdi), %rax
| call *(%rax)
cmpq $1600, %rbx | cmpq $todolist+1600, %rbx
jne .L3 | jne .L3
popq %rbx | popq %rbx
.cfi_def_cfa_offset 8 | .cfi_def_cfa_offset 8
ret | ret
In the C++ code, the first movq gets the address of the vtable, and the call then indexes through that. So that's one instruction overhead.
According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).
# Initialization.
# i3=todolist; i5=paramlist | # i5=todolist holds paramlist
i3=0xb27ba; | # No paramlist in C++
i5=0xb28e6; | i5=0xb279a;
# r15=count
r15=0xc8; | r15=0xc8;
# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++; | i5=modify(i5,m6); # i4 = *todolist++
| i4=dm(m7,i5); # ..
# Note 2:
| r2=i4; # r2 = obj
| i4=dm(m6,i4); # vtable = *(obj + 1)
| r1=dm(0x3,i4); # r1 = vtable[3]
| r4=r2+r1; # param = obj + r1
i12=dm(i3,m6); # i12 = *todolist++; | i12=dm(0x5,i4); # i12 = vtable[5]
# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6; | r2=i6;
i6=i7; | i6=i7;
jump (m13,i12) (db); | jump (m13,i12) (db);
dm(i7,m7)=r2; | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de; | dm(i7,m7)=0x1279e2;
# if (count--) loop
r15=r15-1; | r15=r15-1;
if ne jump (pc, 0xfffffff2); | if ne jump (pc, 0xffffffe7);
Notes:
In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue below.
The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in step 1. The vtable lookup itself occupies one instruction.
So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?
The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.
There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)
As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)
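For what it's worth, here is a conceptual sketch (mine, not the answerer's) of the case where a this-adjustment is actually needed - multiple inheritance - which is the problem both the vtable-offset and thunk schemes solve:

struct Base1 { virtual void f() {} int x; };
struct Base2 { virtual void g() {} int y; };

struct Derived : Base1, Base2 {
    void g() override {} // overrides Base2::g
};

// A Base2* into a Derived object points into the middle of the object (past
// the Base1 subobject). Calling d2->g() must enter Derived::g with `this`
// pointing at the whole Derived, so the pointer has to be adjusted back by
// the Base1 offset - either via an offset stored in the vtable (the DSP
// scheme above) or via a thunk that fixes `this` and jumps to Derived::g
// (the -fvtable-thunks scheme).
void call(Base2 *d2) { d2->g(); }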
(Original test, before update):
I tried the following simple case (no multiple inheritance, which can slow things down):
class Base {
public:
    Base(int val) : val_(val) {}
    virtual int binary(int a, int b) = 0;
    virtual int unary(int a) = 0;
    virtual int nullary() = 0;
protected:
    int val_;
};

int binary(Base* begin, Base* end, int a, int b) {
    int accum = 0;
    for (; begin != end; ++begin) { accum += begin->binary(a, b); }
    return accum;
}

int unary(Base* begin, Base* end, int a) {
    int accum = 0;
    for (; begin != end; ++begin) { accum += begin->unary(a); }
    return accum;
}

int nullary(Base* begin, Base* end) {
    int accum = 0;
    for (; begin != end; ++begin) { accum += begin->nullary(); }
    return accum;
}
And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:
.L9:
movq (%rbx), %rax
movq %rbx, %rdi
addq $16, %rbx
movl %r13d, %esi
call *8(%rax)
addl %eax, %ebp
cmpq %rbx, %r12
jne .L9
As has already been mentioned, you can use templates to do away with the dynamic dispatch. Here is an example that does this:
template <typename FirstCb, typename... RestCb>
struct InterruptHandler {
    void execute() {
        // I construct temporary objects here since I could not figure out how you
        // construct your objects. You can change these signatures to allow for
        // passing arbitrary params to these handlers.
        FirstCb().execute();
        InterruptHandler<RestCb...>().execute();
    }
};

// Specialization for the last handler, to end the recursion.
template <typename LastCb>
struct InterruptHandler<LastCb> {
    void execute() { LastCb().execute(); }
};

InterruptHandler</* Base, Derived1, and so on */> handler;

void realtime(void) {
    handler.execute();
}
This should completely eliminate the vtable lookups while providing more opportunities for compiler optimization since the code inside execute can be inlined.
Note however that you will need to change some parts depending on how you initialize your handlers. The basic framework should remain the same.
Also, this requires that you have a C++11 compliant compiler.
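A hypothetical usage sketch (the handler names are invented here), assuming each handler type exposes a void execute() member:

struct BlinkLed { void execute() { /* toggle a GPIO pin */ } };
struct ReadAdc  { void execute() { /* sample the converter */ } };

InterruptHandler<BlinkLed, ReadAdc> handler;

void realtime(void) {
    handler.execute(); // both calls are resolved at compile time and can be
                       // inlined - no vtable lookups inside the loop
}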
I suggest using static methods in your derived classes and placing these functions into your array. This would eliminate the overhead of the v-table search. This is closest to your C language implementation.
You may end up sacrificing the polymorphism for speed.
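Sketching that suggestion (the task names are hypothetical): static member functions are ordinary functions, so they fit straight into the existing array of function pointers:

struct Task1 { static void run(void *params) { /* do thing 1 */ } };
struct Task2 { static void run(void *params) { /* do thing 2 */ } };

// Same shape as the original C array - no objects, no vtables involved.
void (*todolist[])(void *) = { &Task1::run, &Task2::run };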
Is the inheritance necessary?
Just because you switch to C++ doesn't mean you have to switch to Object Oriented.
Also, have you tried unrolling your loop in the ISR?
For example, perform 2 or more execution calls before returning back to the top of the loop.
Also, can you move any of the functionality out of the ISR?
Can any part of the functionality be performed by the "background loop" instead of the ISR? This would reduce the time in your ISR.
You can hide the void* type erasure and type recovery inside templates. The result would (hopefully) be the same array of function pointers. This would help with the casting and stay compatible with your code:
#include <iostream>

template<class ParamType, class F>
void fun(void* param) {
    F f;
    f(*static_cast<ParamType*>(param));
}

struct my_function {
    void operator()(int& i) {
        std::cout << "got it " << i << std::endl;
    }
};

int main() {
    void (*func)(void*) = fun<int, my_function>;
    int j = 4;
    func(&j);
    return 0;
}
In this case you can create new functions as function objects with more type safety. The "normal" OOP approach with virtual functions doesn't help here.
In a C++11 environment you could create the array with the help of variadic templates at compile time (albeit with a somewhat complicated syntax) - see the sketch below.
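Here is one possible shape of that compile-time construction (a sketch of mine, building on the fun<> wrapper above):

typedef void (*Fn)(void *);

// Each Handler pairs a parameter type with a function object type.
template<class P, class F>
struct Handler {
    typedef P param_type;
    typedef F func_type;
};

// Expand a list of Handlers into a static array of type-erased wrappers.
template<class... Hs>
struct Table {
    static const Fn entries[sizeof...(Hs)];
};

template<class... Hs>
const Fn Table<Hs...>::entries[sizeof...(Hs)] = {
    &fun<typename Hs::param_type, typename Hs::func_type>...
};

// Usage (some_int being any int lvalue):
// Table<Handler<int, my_function>>::entries[0](&some_int);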
This is unrelated to your question, but if you are that keen on performance you could use templates to do a loop unroll for the todolist:
#include <iostream>

void (*todo[3])(void *);
void *param[3];

void f1(void*) { std::cout << "1" << std::endl; }
void f2(void*) { std::cout << "2" << std::endl; }
void f3(void*) { std::cout << "3" << std::endl; }

template<int N>
struct Obj {
    static void apply()
    {
        todo[N-1](param[N-1]);   // note: unrolls in reverse order (todo[2] first)
        Obj<N-1>::apply();
    }
};

template<> struct Obj<0> { static void apply() {} };

int main() {
    todo[0] = f1;
    todo[1] = f2;
    todo[2] = f3;
    Obj<sizeof todo / sizeof *todo>::apply();
}
Find out where your compiler puts the vtable, access it directly to get the function pointers, and store them for use. That way you will have pretty much the same approach as in C, with an array of function pointers.

C++: Structs slower to access than basic variables?

I found some code that had "optimization" like this:
void somefunc(SomeStruct param){
    float x = param.x; // param.x and x are both floats. supposedly this makes it faster access
    float y = param.y;
    float z = param.z;
}
And the comments said that it will make variable access faster, but I've always thought struct member access is just as fast as access to a plain variable. Could someone clear this up for me?
The usual rules for optimization (Michael A. Jackson) apply:
1. Don't do it.
2. (For experts only:) Don't do it yet.
That being said, let's assume it's the innermost loop that takes 80% of the time of a performance-critical application. Even then, I doubt you will ever see any difference. Let's use this piece of code for instance:
struct Xyz {
    float x, y, z;
};

float f(Xyz param){
    return param.x + param.y + param.z;
}

float g(Xyz param){
    float x = param.x;
    float y = param.y;
    float z = param.z;
    return x + y + z;
}
Running it through LLVM shows: Only with no optimizations do the two act as expected (g copies the struct members into locals, then sums those; f sums the values fetched from param directly). With standard optimization levels, both result in identical code (extracting the values once, then summing them).
For short code, this "optimization" is actually harmful, as it copies the floats needlessly. For longer code using the members in several places, it might help a teensy bit if you actively tell your compiler to be stupid. A quick test with 65 (instead of 2) additions of the members/locals confirms this: With no optimizations, f repeatedly loads the struct members while g reuses the already extracted locals. The optimized versions are again identical and both extract the members only once. (Surprisingly, there's no strength reduction turning the additions into multiplications even with LTO enabled, but that just indicates the LLVM version used isn't optimizing too aggressively anyway - so it should work just as well in other compilers.)
So, the bottom line is: Unless you know your code will have to be compiled by a compiler that's so outrageously stupid and/or ancient that it won't optimize anything, you now have proof that the compiler will make both ways equivalent and can thus do away with this crime against readability and brevity committed in the name of performance. (Repeat the experiment for your particular compiler if necessary.)
Rule of thumb: it's not slow, unless profiler says it is. Let the compiler worry about micro-optimisations (they're pretty smart about them; after all, they've been doing it for years) and focus on the bigger picture.
I'm no compiler guru, so take this with a grain of salt. I'm guessing that the original author of the code is assuming that by copying the values from the struct into local variables, the compiler has "placed" those variables into floating point registers which are available on some platforms (e.g., x86). If there aren't enough registers to go around, they'd be put in the stack.
That being said, unless this code was in the middle of an intensive computation/loop, I'd strive for clarity rather than speed. It's pretty rare that anyone is going to notice a few instructions difference in timing.
You'd have to look at the compiled code on a particular implementation to be sure, but there's no reason in principle why your preferred code (using the struct members) should necessarily be any slower than the code you've shown (copying into variables and then using the variables).
someFunc takes a struct by value, so it has its own local copy of that struct. The compiler is perfectly at liberty to apply exactly the same optimizations to the struct members, as it would apply to the float variables. They're both automatic variables, and in both cases the "as-if" rule allows them to be stored in register(s) rather than in memory provided that the function produces the correct observable behavior.
This is unless of course you take a pointer to the struct and use it, in which case the values need to be written in memory somewhere, in the correct order, pointed to by the pointer. This starts to limit optimization, and other limits are introduced by the fact that if you pass around a pointer to an automatic variable, the compiler can no longer assume that the variable name is the only reference to that memory and hence the only way its contents can be modified. Having multiple references to the same object is called "aliasing", and does sometimes block optimizations that could be made if the object was somehow known not to be aliased.
Then again, if this is an issue, and the rest of the code in the function somehow does use a pointer to the struct, then of course you could be on dodgy ground copying the values into variables from the POV of correctness. So the claimed optimization is not quite so straightforward as it looks in that case.
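To illustrate the aliasing point with a small sketch (the pointer-based signature is hypothetical; SomeStruct's members are assumed to be floats as in the question):

struct SomeStruct { float x, y, z; }; // assumed layout

float somefunc(SomeStruct *param, float *out) {
    float x = param->x;  // capture the value before the store below
    *out = 0.0f;         // without restrict, *out might alias param->x,
                         // so param->x would have to be re-read afterwards
    return x + param->y; // the local x safely holds the pre-store value
}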
Now, there may be particular compilers (or particular optimization levels) which fail to apply to structs all the optimizations that they're permitted to apply, but do apply equivalent optimizations to float variables. If so then the comment would be right, and that's why you have to check to be sure. For example, maybe compare the emitted code for this:
float somefunc(SomeStruct param){
    float x = param.x; // param.x and x are both floats. supposedly this makes it faster access
    float y = param.y;
    float z = param.z;
    for (int i = 0; i < 10; ++i) {
        x += (y + i) * z;
    }
    return x;
}
with this:
float somefunc(SomeStruct param){
    for (int i = 0; i < 10; ++i) {
        param.x += (param.y + i) * param.z;
    }
    return param.x;
}
There may also be optimization levels where the extra variables make the code worse. I'm not sure I put much trust in code comments that say "supposedly this makes it faster access", sounds like the author doesn't really have a clear idea why it matters. "Apparently it makes it faster access - I don't know why but the tests to confirm this and to demonstrate that it makes a noticeable difference in the context of our program, are in source control in the following location" is a lot more like it ;-)
In an unoptimised code:
function parameters (which are not passed by reference) are on the stack
local variables are also on the stack
Unoptimised access to local variables and function parameters in assembly language looks more or less like this:
movl compile-time-constant(%ebp), %eax
where %ebp is the frame pointer (sort of a 'this' pointer for the function).
It makes no difference if you access a parameter or a local variable.
The fact that you are accessing an element of a struct makes absolutely no difference from the assembly/machine point of view. Structs are a construct in C to make the programmer's life easier.
So, ultimately, my answer is: No, there is absolutely no benefit in doing that.
There are good and valid reasons to do that kind of optimization when pointers are used, because consuming all inputs first frees the compiler from possible aliasing issues which prevent it from producing optimal code (there's restrict nowadays too, though).
For non-pointer types, there is in theory an overhead, because every member is accessed via the struct's base address. This may in theory be noticeable within an inner loop, and will be a diminutive overhead otherwise. In practice, however, a modern compiler will almost always (unless there is a complex inheritance hierarchy) produce the exact same binary code.
I had asked myself the exact same question as you did about two years ago and did a very extensive test case using gcc 4.4. My findings were that unless you really try to throw sticks between the compiler's legs on purpose, there is absolutely no difference in the generated code.
The real answer is given by Piotr. This one is just for fun.
I have tested it. This code:
float somefunc(SomeStruct param, float &sum){
    float x = param.x;
    float y = param.y;
    float z = param.z;
    float xyz = x * y * z;
    sum = x + y + z;
    return xyz;
}
And this code:
float somefunc(SomeStruct param, float &sum){
    float xyz = param.x * param.y * param.z;
    sum = param.x + param.y + param.z;
    return xyz;
}
Generate identical assembly code when compiled with g++ -O2. They do generate different code with optimization turned off, though. Here is the difference:
< movl -32(%rbp), %eax
< movl %eax, -4(%rbp)
< movl -28(%rbp), %eax
< movl %eax, -8(%rbp)
< movl -24(%rbp), %eax
< movl %eax, -12(%rbp)
< movss -4(%rbp), %xmm0
< mulss -8(%rbp), %xmm0
< mulss -12(%rbp), %xmm0
< movss %xmm0, -16(%rbp)
< movss -4(%rbp), %xmm0
< addss -8(%rbp), %xmm0
< addss -12(%rbp), %xmm0
---
> movss -32(%rbp), %xmm1
> movss -28(%rbp), %xmm0
> mulss %xmm1, %xmm0
> movss -24(%rbp), %xmm1
> mulss %xmm1, %xmm0
> movss %xmm0, -4(%rbp)
> movss -32(%rbp), %xmm1
> movss -28(%rbp), %xmm0
> addss %xmm1, %xmm0
> movss -24(%rbp), %xmm1
> addss %xmm1, %xmm0
The lines marked < correspond to the version with "optimization" variables. It seems to me that the "optimized" version is even slower than the one with no extra variables. This is to be expected, though, as x, y and z are allocated on the stack, exactly like the param. What's the point of allocating more stack variables to duplicate existing ones?
If the one who did that "optimization" knew the language better, he would probably have declared those variables as register, but even that leaves the "optimized" version slightly slower and longer, at least on G++/x86-64.
The compiler may generate faster code to copy float-to-float.
But when x is used, it will be converted to the FPU's internal representation anyway.
When you operate on a "simple" variable (not part of a struct/class), the system only has to go to that place and fetch the data it wants.
But when you refer to a variable inside a struct or class, like A.B, the system needs to calculate where B is inside the area called A (because there may be other variables declared before it), and that calculation takes a bit more than the plainer access described above.

Is there any overhead to declaring a variable within a loop? (C++) [duplicate]

This question already has answers here:
Declaring variables inside loops, good practice or bad practice?
I am just wondering if there would be any loss of speed or efficiency if you did something like this:
int i = 0;
while(i < 100)
{
    int var = 4;
    i++;
}
which declares int var one hundred times. It seems to me like there would be, but I'm not sure. Would it be more practical/faster to do this instead:
int i = 0;
int var;
while(i < 100)
{
    var = 4;
    i++;
}
or are they the same, speedwise and efficiency-wise?
Stack space for local variables is usually allocated in function scope. So no stack pointer adjustment happens inside the loop, just assigning 4 to var. Therefore these two snippets have the same overhead.
For primitive types and POD types, it makes no difference. The compiler will allocate the stack space for the variable at the beginning of the function and deallocate it when the function returns in both cases.
For non-POD class types that have non-trivial constructors, it WILL make a difference -- in that case, putting the variable outside the loop will only call the constructor and destructor once and the assignment operator each iteration, whereas putting it inside the loop will call the constructor and destructor for every iteration of the loop. Depending on what the class' constructor, destructor, and assignment operator do, this may or may not be desirable.
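A small sketch of that non-POD case (Noisy is a made-up class): where you declare the variable changes how many times the constructor and destructor run.

#include <cstdio>

struct Noisy {
    Noisy()  { std::puts("ctor"); }
    ~Noisy() { std::puts("dtor"); }
};

int main() {
    for (int i = 0; i < 3; ++i) {
        Noisy n;  // constructed and destroyed on every iteration;
    }             // hoisted outside the loop, ctor/dtor would run only once
}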
They are both the same, and here's how you can find out, by looking at what the compiler does (even without optimisation set to high):
Look at what the compiler (gcc 4.0) does to your simple examples:
1.c:
main(){ int i = 0; int var; while(i < 100) { var = 4; } }
gcc -S 1.c
1.s:
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl $0, -16(%ebp)
jmp L2
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
leave
ret
2.c
main() { int i = 0; while(i < 100) { int var = 4; } }
gcc -S 2.c
2.s:
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl $0, -16(%ebp)
jmp L2
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
leave
ret
From these, you can see two things: firstly, the code is the same in both.
Secondly, the storage for var is allocated outside the loop:
subl $24, %esp
And finally the only thing in the loop is the assignment and condition check:
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
Which is about as efficient as you can be without removing the loop entirely.
These days it is better to declare it inside the loop unless it is a constant, as the compiler will be able to better optimize the code (reducing variable scope).
EDIT: This answer is mostly obsolete now. With the rise of post-classical compilers, the cases where the compiler can't figure it out are getting rare. I can still construct them but most people would classify the construction as bad code.
Most modern compilers will optimize this for you. That being said I would use your first example as I find it more readable.
For a built-in type there will likely be no difference between the 2 styles (probably right down to the generated code).
However, if the variable is a class with a non-trivial constructor/destructor there could well be a major difference in runtime cost. I'd generally scope the variable to inside the loop (to keep the scope as small as possible), but if that turns out to have a perf impact I'd look at moving the class variable outside the loop's scope. However, doing that needs some additional analysis, as the semantics of the code path may change, so this can only be done if the semantics permit it.
An RAII class might need this behavior. For example, a class that manages file access lifetime might need to be created and destroyed on each loop iteration to manage the file access properly.
Suppose you have a LockMgr class that acquires a critical section when it's constructed and releases it when destroyed:
while (i < 100) {
    LockMgr lock( myCriticalSection); // acquires a critical section at start of
                                      // each loop iteration
    // do stuff...
} // critical section is released at end of each loop iteration
is quite different from:
LockMgr lock( myCriticalSection);
while (i < 100) {
    // do stuff...
}
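For completeness, a minimal LockMgr sketch (assumed here, not part of the original answer), using std::mutex as the critical section - in modern C++ this is essentially what std::lock_guard does:

#include <mutex>

class LockMgr {
public:
    explicit LockMgr(std::mutex &m) : m_(m) { m_.lock(); } // acquire on construction
    ~LockMgr() { m_.unlock(); }                            // release on destruction
private:
    std::mutex &m_;
};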
Both loops have the same efficiency. They will both take an infinite amount of time :) It may be a good idea to increment i inside the loops.
I once ran some performance tests, and to my surprise, found that case 1 was actually faster! I suppose this was because declaring the variable inside the loop reduces its scope, so it gets freed earlier. However, that was a long time ago, on a very old compiler. I'm sure modern compilers do a better job of optimizing away the differences, but it still doesn't hurt to keep your variable scope as short as possible.
#include <stdio.h>

int main()
{
    for(int i = 0; i < 10; i++)
    {
        int test;           // declared every iteration, but never re-initialized
        if(i == 0)
            test = 100;     // assigned only on the first pass
        printf("%d\n", test);
    }
}
The code above prints 100 ten times, which shows that the local variable's stack slot is allocated once per function call and reused on every iteration, not re-allocated each time through the loop. (Reading test uninitialized is technically undefined behaviour, so treat this as a demonstration rather than a guarantee.)
The only way to be sure is to time them. But the difference, if there is one, will be microscopic, so you will need a mighty big timing loop.
More to the point, the first one is better style because it initializes the variable var, while the other one leaves it uninitialized. This and the guideline that one should define variables as near to their point of use as possible, means that the first form should normally be preferred.
With only two variables, the compiler will likely assign a register to each. Those registers are there anyway, so this takes no extra time. Either way, there are two register writes and one register read per iteration.
I think that most answers are missing a major point to consider, which is: "Is it clear?" And obviously, given all the discussion, the fact is: no, it is not.
I'd suggest that in most loop code the efficiency is pretty much a non-issue (unless you're calculating for a Mars lander), so really the only question is what looks more sensible, readable & maintainable - in this case I'd recommend declaring the variable up front and outside the loop - this simply makes it clearer. Then people like you & I would not even bother to waste time checking online to see if it's valid or not.
That's not true - there is overhead, though it's negligible.
Even though the variable will probably end up at the same place on the stack, it still gets assigned: the compiler reserves a memory location on the stack for that int and releases it at the end of the } - not "free" in the heap sense, but in the sense that the stack pointer moves. And in your case, since there is only one local variable, the compiler will simply equate the frame pointer and the stack pointer.
The short answer: don't care - either way works almost the same.
But try reading more on how the stack is organized. My undergrad school had pretty good lectures on that. If you want to read more, check here:
http://www.cs.utk.edu/~plank/plank/classes/cs360/360/notes/Assembler1/lecture.html