C++: Structs slower to access than basic variables? - c++

I found some code that had "optimization" like this:
void somefunc(SomeStruct param){
    float x = param.x; // param.x and x are both floats. supposedly this makes it faster access
    float y = param.y;
    float z = param.z;
}
And the comments said that it will make the variable access faster, but I've always thought that accessing a struct element is as fast as accessing a plain variable.
Could someone clear this up for me?

The usual rules for optimization (Michael A. Jackson) apply:
1. Don't do it.
2. (For experts only:) Don't do it yet.
That being said, let's assume it's the innermost loop that takes 80% of the time of a performance-critical application. Even then, I doubt you will ever see any difference. Let's use this piece of code for instance:
struct Xyz {
    float x, y, z;
};

float f(Xyz param){
    return param.x + param.y + param.z;
}

float g(Xyz param){
    float x = param.x;
    float y = param.y;
    float z = param.z;
    return x + y + z;
}
Running it through LLVM shows: only with no optimizations do the two act as expected (g copies the struct members into locals, then proceeds to sum those; f sums the values fetched from param directly). With standard optimization levels, both result in identical code (extracting the values once, then summing them).
For short code, this "optimization" is actually harmful, as it copies the floats needlessly. For longer code using the members in several places, it might help a teensy bit if you actively tell your compiler to be stupid. A quick test with 65 (instead of 2) additions of the members/locals confirms this: with no optimizations, f repeatedly loads the struct members while g reuses the already extracted locals. The optimized versions are again identical and both extract the members only once. (Surprisingly, there's no strength reduction turning the additions into multiplications even with LTO enabled, but that just indicates the LLVM version used isn't optimizing too aggressively anyway - so it should work just as well in other compilers.)
So, the bottom line is: unless you know your code will have to be compiled by a compiler that's so outrageously stupid and/or ancient that it won't optimize anything, you now have proof that the compiler will make both ways equivalent, and can thus do away with this crime against readability and brevity committed in the name of performance. (Repeat the experiment for your particular compiler if necessary.)

Rule of thumb: it's not slow, unless profiler says it is. Let the compiler worry about micro-optimisations (they're pretty smart about them; after all, they've been doing it for years) and focus on the bigger picture.

I'm no compiler guru, so take this with a grain of salt. I'm guessing that the original author of the code assumed that by copying the values from the struct into local variables, the compiler would "place" those variables into floating point registers, which are available on some platforms (e.g., x86). If there aren't enough registers to go around, they'd be put on the stack.
That being said, unless this code was in the middle of an intensive computation/loop, I'd strive for clarity rather than speed. It's pretty rare that anyone is going to notice a few instructions difference in timing.

You'd have to look at the compiled code on a particular implementation to be sure, but there's no reason in principle why your preferred code (using the struct members) should necessarily be any slower than the code you've shown (copying into variables and then using the variables).
somefunc takes a struct by value, so it has its own local copy of that struct. The compiler is perfectly at liberty to apply exactly the same optimizations to the struct members as it would apply to the float variables. They're both automatic variables, and in both cases the "as-if" rule allows them to be stored in register(s) rather than in memory, provided that the function produces the correct observable behavior.
This is unless of course you take a pointer to the struct and use it, in which case the values need to be written in memory somewhere, in the correct order, pointed to by the pointer. This starts to limit optimization, and other limits are introduced by the fact that if you pass around a pointer to an automatic variable, the compiler can no longer assume that the variable name is the only reference to that memory and hence the only way its contents can be modified. Having multiple references to the same object is called "aliasing", and does sometimes block optimizations that could be made if the object was somehow known not to be aliased.
Then again, if this is an issue, and the rest of the code in the function somehow does use a pointer to the struct, then of course you could be on dodgy ground copying the values into variables from the POV of correctness. So the claimed optimization is not quite so straightforward as it looks in that case.
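For illustration, here's a minimal sketch of that correctness point; observe() is a hypothetical opaque function (defined in another translation unit), not anything from the original code:
struct SomeStruct { float x, y, z; };

void observe(SomeStruct*); // opaque to the optimizer

float somefunc(SomeStruct param){
    observe(&param);             // param's address escapes here
    float a = param.x + param.y;
    observe(&param);             // may have modified param...
    return a + param.x;          // ...so param.x must be reloaded here;
                                 // pre-copying x into a local up front
                                 // would change the result
}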
Now, there may be particular compilers (or particular optimization levels) which fail to apply to structs all the optimizations that they're permitted to apply, but do apply equivalent optimizations to float variables. If so then the comment would be right, and that's why you have to check to be sure. For example, maybe compare the emitted code for this:
float somefunc(SomeStruct param){
    float x = param.x; // param.x and x are both floats. supposedly this makes it faster access
    float y = param.y;
    float z = param.z;
    for (int i = 0; i < 10; ++i) {
        x += (y + i) * z;
    }
    return x;
}
with this:
float somefunc(SomeStruct param){
    for (int i = 0; i < 10; ++i) {
        param.x += (param.y + i) * param.z;
    }
    return param.x;
}
There may also be optimization levels where the extra variables make the code worse. I'm not sure I put much trust in code comments that say "supposedly this makes it faster access" - it sounds like the author doesn't really have a clear idea why it matters. "Apparently it makes access faster - I don't know why, but the tests confirming this, and demonstrating that it makes a noticeable difference in the context of our program, are in source control in the following location" is a lot more like it ;-)

In unoptimised code:
- function parameters (which are not passed by reference) are on the stack
- local variables are also on the stack
Unoptimised access to local variables and function parameters in assembly language looks more or less like this:
movl compile-time-constant(%ebp), %eax
where %ebp is the frame pointer (a sort of 'this' pointer for the function).
It makes no difference whether you access a parameter or a local variable.
The fact that you are accessing an element of a struct makes absolutely no difference from the assembly/machine point of view. Structs are constructs provided by C to make the programmer's life easier.
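A tiny sketch of that claim (calling conventions and exact offsets vary by ABI and compiler, so treat the comments as approximations of unoptimised 32-bit output):
struct P { int a; int b; };

int plain(int b) { return b; }   // roughly: movl offset(%ebp), %eax
int member(P p)  { return p.b; } // same shape: the +4 for .b is folded
                                 // into the compile-time constant, so
                                 // it is still a single load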
So, ultimately, my answer is: no, there is absolutely no benefit in doing that.

There are good and valid reasons to do that kind of optimization when pointers are used, because consuming all inputs first frees the compiler from possible aliasing issues which prevent it from producing optimal code (there's restrict nowadays too, though).
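A sketch of what "consuming all inputs first" buys you, assuming out and in may point into the same array (the names are made up for the example):
void sums(float* out, const float* in) {
    // Without these copies, in[1] and in[2] would have to be reloaded
    // after each store through out, since out might alias in.
    float a = in[0], b = in[1], c = in[2]; // consume all inputs first
    out[0] = a + b;
    out[1] = b + c;
    out[2] = a + c;
}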
For non-pointer types, there is in theory an overhead, because every member is accessed via the struct's base address. This may in theory be noticeable within an inner loop, and will in theory be a diminutive overhead otherwise. In practice, however, a modern compiler will almost always (unless there is a complex inheritance hierarchy) produce the exact same binary code.
I had asked myself the exact same question as you did about two years ago and did a very extensive test case using gcc 4.4. My findings were that unless you really try to throw sticks between the compiler's legs on purpose, there is absolutely no difference in the generated code.

The real answer is given by Piotr. This one is just for fun.
I have tested it. This code:
float somefunc(SomeStruct param, float &sum){
    float x = param.x;
    float y = param.y;
    float z = param.z;
    float xyz = x * y * z;
    sum = x + y + z;
    return xyz;
}
And this code:
float somefunc(SomeStruct param, float &sum){
    float xyz = param.x * param.y * param.z;
    sum = param.x + param.y + param.z;
    return xyz;
}
Generate identical assembly code when compiled with g++ -O2. They do generate different code with optimization turned off, though. Here is the difference:
< movl -32(%rbp), %eax
< movl %eax, -4(%rbp)
< movl -28(%rbp), %eax
< movl %eax, -8(%rbp)
< movl -24(%rbp), %eax
< movl %eax, -12(%rbp)
< movss -4(%rbp), %xmm0
< mulss -8(%rbp), %xmm0
< mulss -12(%rbp), %xmm0
< movss %xmm0, -16(%rbp)
< movss -4(%rbp), %xmm0
< addss -8(%rbp), %xmm0
< addss -12(%rbp), %xmm0
---
> movss -32(%rbp), %xmm1
> movss -28(%rbp), %xmm0
> mulss %xmm1, %xmm0
> movss -24(%rbp), %xmm1
> mulss %xmm1, %xmm0
> movss %xmm0, -4(%rbp)
> movss -32(%rbp), %xmm1
> movss -28(%rbp), %xmm0
> addss %xmm1, %xmm0
> movss -24(%rbp), %xmm1
> addss %xmm1, %xmm0
The lines marked < correspond to the version with "optimization" variables. It seems to me that the "optimized" version is even slower than the one with no extra variables. This is to be expected, though, as x, y and z are allocated on the stack, exactly like the param. What's the point of allocating more stack variables to duplicate existing ones?
If the one who did that "optimization" knew the language better, he would probably have declared those variables as register, but even that leaves the "optimized" version slightly slower and longer, at least on G++/x86-64.

The compiler may generate faster code for a float-to-float copy.
But when x is used, it will be converted to the FPU's internal representation anyway.

When you specify a "simple" variable (not a struct/class) to be operated upon, the system only has to go to that place and fetch the data it wants.
But when you refer to a variable inside a struct or class, like A.B, the system needs to calculate where B is inside that area called A (because there may be other variables declared before it), and that calculation takes a bit more than the the more plain access described above.

Related

Why can't GCC assume that std::vector::size won't change in this loop?

I claimed to a coworker that if (i < input.size() - 1) print(0); would get optimized in this loop so that input.size() is not read in every iteration, but it turns out that this is not the case!
void print(int x) {
    std::cout << x << std::endl;
}

void print_list(const std::vector<int>& input) {
    int i = 0;
    for (size_t i = 0; i < input.size(); i++) {
        print(input[i]);
        if (i < input.size() - 1) print(0);
    }
}
According to the Compiler Explorer with gcc options -O3 -fno-exceptions we are actually reading input.size() each iteration and using lea to perform a subtraction!
movq 0(%rbp), %rdx
movq 8(%rbp), %rax
subq %rdx, %rax
sarq $2, %rax
leaq -1(%rax), %rcx
cmpq %rbx, %rcx
ja .L35
addq $1, %rbx
Interestingly, in Rust this optimization does occur. It looks like i gets replaced with a variable j that is decremented each iteration, and the test i < input.size() - 1 is replaced with something like j > 0.
fn print(x: i32) {
    println!("{}", x);
}

pub fn print_list(xs: &Vec<i32>) {
    for (i, x) in xs.iter().enumerate() {
        print(*x);
        if i < xs.len() - 1 {
            print(0);
        }
    }
}
In the Compiler Explorer the relevant assembly looks like this:
cmpq %r12, %rbx
jae .LBB0_4
I checked and I am pretty sure r12 is xs.len() - 1 and rbx is the counter. Earlier there is an add for rbx and a mov outside of the loop into r12.
Why is this? It seems like if GCC is able to inline the size() and operator[] as it did, it should be able to know that size() does not change. But maybe GCC's optimizer judges that it is not worth pulling it out into a variable? Or maybe there is some other possible side effect that would make this unsafe--does anyone know?
The non-inline function call to cout.operator<<(int) is a black box for the optimizer (because the library is just written in C++ and all the optimizer sees is a prototype; see discussion in comments). It has to assume any memory that could possibly be pointed to by a global var has been modified.
(Or the std::endl call. BTW, why force a flush of cout at that point instead of just printing a '\n'?)
e.g. for all it knows, std::vector<int> &input is a reference to a global variable, and one of those function calls modifies that global var. (Or there's a global vector<int> *ptr somewhere, or there's a function that returns a pointer to a static vector<int> in some other compilation unit, or some other way that a function could get a reference to this vector without being passed a reference to it by us.)
If you had a local variable whose address had never been taken, the compiler could assume that non-inline function calls couldn't mutate it. Because there'd be no way for any global variable to hold a pointer to this object. (This is called Escape Analysis). That's why the compiler can keep size_t i in a register across function calls. (int i can just get optimized away because it's shadowed by size_t i and not used otherwise).
It could do the same with a local vector (i.e. for the base, end_size and end_capacity pointers.)
ISO C99 has a solution for this problem: int *restrict foo. Many C++ compilers support int *__restrict foo to promise that memory pointed to by foo is only accessed via that pointer. Most commonly useful in functions that take 2 arrays, and you want to promise the compiler they don't overlap. So it can auto-vectorize without generating code to check for that and run a fallback loop.
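For example, a sketch of the two-array case (__restrict is the common compiler-extension spelling in C++; it is not ISO C++):
#include <cstddef>

// With __restrict the compiler may auto-vectorize without emitting a
// runtime overlap check and a scalar fallback loop.
void add_arrays(int* __restrict dst, const int* __restrict src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}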
The OP comments:
In Rust a non-mutable reference is a global guarantee that no one else is mutating the value you have a reference to (equivalent to C++ restrict)
That explains why Rust can make this optimization but C++ can't.
Optimizing your C++
Obviously you should use auto size = input.size(); once at the top of your function so the compiler knows it's a loop invariant. C++ implementations don't solve this problem for you, so you have to do it yourself.
You might also need const int *data = input.data(); to hoist loads of the data pointer from the std::vector<int> "control block" as well. It's unfortunate that optimizing can require very non-idiomatic source changes.
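Putting both suggestions together, a sketch of the hand-hoisted version (also printing '\n' instead of std::endl, per the aside above; behavior matches the original, assuming nothing else mutates the vector):
#include <cstddef>
#include <iostream>
#include <vector>

void print(int x) { std::cout << x << '\n'; }

void print_list(const std::vector<int>& input) {
    const std::size_t size = input.size(); // loop invariant, read once
    const int* data = input.data();        // hoisted data pointer
    for (std::size_t i = 0; i < size; i++) {
        print(data[i]);
        if (i < size - 1) print(0);
    }
}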
Rust is a much more modern language, designed after compiler developers learned what was possible in practice for compilers. It really shows in other ways, too, including portably exposing some of the cool stuff CPUs can do via i32.count_ones, rotate, bit-scan, etc. It's really dumb that ISO C++ still doesn't expose any of these portably, except std::bitset::count().

Why is a volatile local variable optimised differently from a volatile argument, and why does the optimiser generate a no-op loop from the latter?

Background
This was inspired by this question/answer and the ensuing discussion in the comments: Is the definition of “volatile” this volatile, or is GCC having some standard compliancy problems?. Based on others' and my interpretation of what should be happening, as discussed in the comments, I've submitted it to GCC Bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71793 Other relevant responses are still welcome.
Also, that thread has since given rise to this question: Does accessing a declared non-volatile object through a volatile reference/pointer confer volatile rules upon said accesses?
Intro
I know volatile isn't what most people think it is and is an implementation-defined nest of vipers. And I certainly don't want to use the below constructs in any real code. That said, I'm totally baffled by what's going on in these examples, so I'd really appreciate any elucidation.
My guess is this is due to either highly nuanced interpretation of the Standard or (more likely?) just corner-cases for the optimiser used. Either way, while more academic than practical, I hope this is deemed valuable to analyse, especially given how typically misunderstood volatile is. Some more data points - or perhaps more likely, points against it - must be good.
Input
Given this code:
#include <cstddef>

void f(void *const p, std::size_t n)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    volatile unsigned char const x = 42;
    // N.B. Yeah, const is weird, but it doesn't change anything
    while (n--) {
        *y++ = x;
    }
}

void g(void *const p, std::size_t n, volatile unsigned char const x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = x;
    }
}

void h(void *const p, std::size_t n, volatile unsigned char const &x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = x;
    }
}

int main(int, char **)
{
    int y[1000];
    f(&y, sizeof y);
    volatile unsigned char const x{99};
    g(&y, sizeof y, x);
    h(&y, sizeof y, x);
}
Output
g++ from gcc (Debian 4.9.2-10) 4.9.2 (Debian stable a.k.a. Jessie) with the command line g++ -std=c++14 -O3 -S test.cpp produces the below ASM for main(). Version Debian 5.4.0-6 (current unstable) produces equivalent code, but I just happened to run the older one first, so here it is:
main:
.LFB3:
.cfi_startproc
# f()
movb $42, -1(%rsp)
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L21:
subq $1, %rax
movzbl -1(%rsp), %edx
jne .L21
# x = 99
movb $99, -2(%rsp)
movzbl -2(%rsp), %eax
# g()
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L22:
subq $1, %rax
jne .L22
# h()
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L23:
subq $1, %rax
movzbl -2(%rsp), %edx
jne .L23
# return 0;
xorl %eax, %eax
ret
.cfi_endproc
Analysis
All 3 functions are inlined, and both of the ones that allocate volatile local variables do so on the stack, for fairly obvious reasons. But that's about the only thing they share...
f() makes sure to read from x on each iteration, presumably due to its volatile - but just dumps the result into edx, presumably because the destination y isn't declared volatile and is never read, meaning changes to it can be nixed under the as-if rule. OK, makes sense.
Well, I mean... kinda. Like, not really, because volatile is really for hardware registers, and clearly a local value can't be one of those - and can't otherwise be modified in a volatile way unless its address is passed out... which it's not. Look, there's just not a lot of sense to be had out of volatile local values. But C++ lets us declare them and tries to do something with them. And so, confused as always, we stumble onwards.
g(): What. By moving the volatile source into a pass-by-value parameter, which is still just another local variable, GCC somehow decides it's not volatile, or is less so, and that it doesn't need to read it every iteration... but it still carries out the loop, despite its body now doing nothing.
h(): By taking the passed volatile as pass-by-reference, the same effective behaviour as f() is restored, so the loop does volatile reads.
This case alone actually makes practical sense to me, for reasons outlined above against f(). To elaborate: Imagine x refers to a hardware register, of which every read has side-effects. You wouldn't want to skip any of those.
Adding #define volatile /**/ leads to main() being a no-op, as you'd expect. So, when present, even on a local variable volatile does do something... I just have no idea what in the case of g(). What on Earth is going on there?
Questions
Why does a local value declared in-body produce different results from a by-value parameter, with the former letting reads be optimised away? Both are declared volatile. Neither has its address passed out - and neither has a static address, ruling out any inline-ASM POKEry - so they can never be modified from outside the function. The compiler can see that each is constant, need never be re-read, and volatile just ain't true -
so (A) is either allowed to be elided under such constraints (acting as if they weren't declared volatile)? -
and (B) why does only one get elided? Are some volatile local variables more volatile than others?
Setting aside that inconsistency for just a moment: After the read was optimised away, why does the compiler still generate the loop? It does nothing! Why doesn't the optimiser elide it as-if no loop was coded?
Is this a weird corner case due to order of optimising analyses or such? As the code is a daft thought-experiment, I wouldn't chastise GCC for this, but it'd be good to know for sure. (Or is g() the manual timing loop people have dreamt of all these years?) If we conclude there's no Standard bearing on any of this, I'll move it to their Bugzilla just for their information.
And of course, the more important question from a practical perspective, though I don't want that to overshadow the potential for compiler geekery... Which, if any of these, are well-defined/correct according to the Standard?
For f: GCC eliminates the non-volatile stores (but not the loads, which can have side-effects if the source location is a memory mapped hardware register). There is really nothing surprising here.
For g: Because of the x86_64 ABI, the parameter x of g is allocated in a register (i.e. rdx) and does not have a location in memory. Reading a general-purpose register does not have any observable side effects, so the dead read gets eliminated.

CPU overhead for struct?

In C/C++, is there any CPU overhead for accessing struct members in comparison to isolated variables?
For a concrete example, should something like the first code sample below use more CPU cycles than the second one? Would it make any difference if it were a class instead of a struct? (in C++)
1)
struct S {
    int a;
    int b;
};

struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
2)
int a;
int b;
a = 10;
b = 20;
a++;
b++;
"Don't optimize yet." The compiler will figure out the best case for you. Write what makes sense first, and make it faster later if you need to. For fun, I ran the following in Clang 3.4 (-O3 -S):
void __attribute__((used)) StructTest() {
    struct S {
        int a;
        int b;
    };
    volatile struct S s;
    s.a = 10;
    s.b = 20;
    s.a++;
    s.b++;
}

void __attribute__((used)) NoStructTest() {
    volatile int a;
    volatile int b;
    a = 10;
    b = 20;
    a++;
    b++;
}

int main() {
    StructTest();
    NoStructTest();
}
StructTest and NoStructTest have identical ASM output:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $10, -4(%ebp)
movl $20, -8(%ebp)
incl -4(%ebp)
incl -8(%ebp)
addl $8, %esp
popl %ebp
ret
No. The size of all the types in the struct, and thus the offset to each member from the beginning of the struct, is known at compile-time, so the address used to fetch the values in the struct is every bit as knowable as the addresses of individual variables.
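A small sketch of that point: the offsets are compile-time constants, which you can even assert at compile time (assuming the usual layout with no padding between two adjacent ints):
#include <cstddef>

struct S { int a; int b; };

static_assert(offsetof(S, a) == 0, "a sits at the start of S");
static_assert(offsetof(S, b) == sizeof(int), "b follows a directly");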
My understanding is that all of the values in a struct are adjacent in memory, and so are better able to take advantage of memory caching than separate variables.
The variables are probably adjacent in memory too, but they're not guaranteed to be adjacent the way struct members are.
That being said, CPU performance should not be a consideration when deciding whether to use a struct in the first place.
The real answer is: it completely depends on your CPU architecture and your compiler. The best way is to compile and look at the assembly code.
Now, for an x86 machine, I'm pretty sure there isn't. The offset is computed at compile time, and there is an addressing mode that takes such an offset.
If you ask the compiler to optimize (e.g. compile with gcc -O2 or g++ -O2) then there is not much overhead (probably too small to be measurable, or perhaps a few percent).
However, if you use only local variables, the optimizing compiler might not even allocate slots for them in the local call frame.
Compile with gcc -O2 -fverbose-asm -S and look into the generated assembly code.
Using a class won't make any difference (of course, some classes have costly constructors and destructors).
Such code can be useful in generated C or C++ code (like MELT does); such local structs or classes contain the local call frame (as seen by the MELT language, see e.g. its gcc/melt/warmelt-genobj+01.cc generated file). I don't claim it is as efficient as real C++ local variables, but it gets optimized well enough.

Is extra time required for float vs int comparisons?

If you have a floating point number double num_float = 5.0; and the following two conditionals:
if(num_float > 3)
{
    //...
}

if(num_float > 3.0)
{
    //...
}
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
Obviously I'm assuming the time delay would be negligible, but compounded in a while(1) loop I suppose that over the long run a decent chunk of time could be lost (if it really is slower).
Because of the "as-if" rule, the compiler is allowed to do the conversion of the literal to a floating point value at compile time. A good compiler will do so if that results in better code.
In order to answer your question definitively for your compiler and your target platform(s), you'd need to check what the compiler emits, and how it performs. However, I'd be surprised if any mainstream compiler did not turn either of the two if statements into the most efficient code possible.
If the value is a constant, then there shouldn't be any difference, since the compiler will convert the constant to float as part of the compilation [unless the compiler decides to use a "compare float with integer" instruction].
If the value is an integer VARIABLE, then there will be an extra instruction to convert the integer value to a floating point [again, unless the compiler can use a "compare float with integer" instruction].
How much, if any, time that adds to the whole process depends HIGHLY on what processor, how the floating point instructions work, etc, etc.
As with anything where performance really matters, measure the alternatives. Preferably on more than one type of hardware (e.g. both AMD and Intel processors if it's a PC), and then decide which is the better choice. Otherwise, you may find yourself tuning the code to work well on YOUR hardware, but worse on some other hardware. Which isn't a good optimisation - unless the ONLY machine you ever run on is your own.
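If you do want to measure, a rough sketch along these lines works; volatile is used here only to stop the compiler folding the constant comparison away, and microbenchmark numbers like this are approximate at best:
#include <chrono>
#include <cstdio>

int main() {
    volatile double num_float = 5.0; // volatile: force a real comparison each time
    long hits = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000L; ++i)
        if (num_float > 3) ++hits;   // swap in "> 3.0" for the other variant
    auto t1 = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("%ld hits, %lld ns\n", hits, (long long)ns);
}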
Note: This will need to be repeated with your target hardware. The code below just demonstrates nicely what has been said.
with constants:
bool with_int(const double num_float) {
    return num_float > 3;
}

bool with_float(const double num_float) {
    return num_float > 3.0;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
with_float(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
.LC0:
.long 0
.long 1074266112
clang 3.0 (-O3 -march=native):
.LCPI0_0:
.quad 4613937818241073152 # double 3.000000e+00
with_int(double): # #with_int(double)
ucomisd .LCPI0_0(%rip), %xmm0
seta %al
ret
.LCPI1_0:
.quad 4613937818241073152 # double 3.000000e+00
with_float(double): # #with_float(double)
ucomisd .LCPI1_0(%rip), %xmm0
seta %al
ret
Conclusion: No difference if comparing against constants.
with variables:
bool with_int(const double a, const int b) {
    return a > b;
}

bool with_float(const double a, const float b) {
    return a > b;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double, int):
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float):
unpcklps %xmm1, %xmm1
cvtps2pd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
clang 3.0 (-O3 -march=native):
with_int(double, int): # #with_int(double, int)
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float): # #with_float(double, float)
cvtss2sd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
Conclusion: the emitted instructions differ when comparing against variables, as Mats Peterson's answer already explained.
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
A) Just specify it as an integer. Some chips have a special instruction for comparing to an integer at runtime, but that is not important, since the compiler will choose what is best. In some cases it might convert the 3 to 3.0 at compile time, depending on the target architecture; in other cases it will leave it as an int. But because you want 3 specifically, specify 3.
Obviously I'm assuming the time delay would be negligible, but compounded in a while(1) loop I suppose that over the long run a decent chunk of time could be lost (if it really is slower).
A) The compiler will not do anything odd with such code. It will choose the best option, so there should be no time delay. With number constants the compiler is free to do whatever is best, regardless of how it looks, as long as it produces the same result. However, you would not want to compound this type of comparison in a while loop at all. Rather, use an integer loop counter; a floating point loop counter is going to be much slower. If you have to use floats as a loop counter, prefer the single-precision 32-bit type and compare as little as possible.
For example, you could break the problem down into multiple loops.
int x = 0;
float y = 0;
float finc = 0.1;
int total = 1000;
int num_times = total / finc;
num_times -= 2; // safety

// Run the loop in a safe zone using integer compares
while (x < num_times) {
    // Do stuff
    y += finc;
    x++;
}

// Now complete the loop using float compares
while (y < total) {
    y += finc;
}
And that would give a substantial improvement in comparison speed.

Is there any overhead to declaring a variable within a loop? (C++) [duplicate]

This question already has answers here:
Declaring variables inside loops, good practice or bad practice?
(9 answers)
I am just wondering if there would be any loss of speed or efficiency if you did something like this:
int i = 0;
while (i < 100)
{
    int var = 4;
    i++;
}
which declares int var one hundred times. It seems to me like there would be, but I'm not sure. Would it be more practical/faster to do this instead:
int i = 0;
int var;
while (i < 100)
{
    var = 4;
    i++;
}
or are they the same, speed-wise and efficiency-wise?
Stack space for local variables is usually allocated in function scope. So no stack pointer adjustment happens inside the loop, just assigning 4 to var. Therefore these two snippets have the same overhead.
For primitive types and POD types, it makes no difference. The compiler will allocate the stack space for the variable at the beginning of the function and deallocate it when the function returns in both cases.
For non-POD class types that have non-trivial constructors, it WILL make a difference -- in that case, putting the variable outside the loop will only call the constructor and destructor once and the assignment operator each iteration, whereas putting it inside the loop will call the constructor and destructor for every iteration of the loop. Depending on what the class' constructor, destructor, and assignment operator do, this may or may not be desirable.
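To make that visible, here's a small sketch with a deliberately noisy class (a made-up type, just to count the calls):
#include <cstdio>

struct Noisy {
    Noisy()                        { std::puts("ctor"); }
    ~Noisy()                       { std::puts("dtor"); }
    Noisy& operator=(const Noisy&) { std::puts("assign"); return *this; }
};

int main() {
    for (int i = 0; i < 3; ++i) {
        Noisy inside;              // ctor + dtor on every iteration
    }
    Noisy outside, source;         // one ctor each
    for (int i = 0; i < 3; ++i)
        outside = source;          // one assignment per iteration
}                                  // dtors run once, at end of scope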
They are both the same, and here's how you can find out: look at what the compiler does (even without optimisation set high):
Look at what the compiler (gcc 4.0) does to your simple examples:
1.c:
int main() { int i = 0; int var; while (i < 100) { var = 4; } }
gcc -S 1.c
1.s:
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl $0, -16(%ebp)
jmp L2
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
leave
ret
2.c
int main() { int i = 0; while (i < 100) { int var = 4; } }
gcc -S 2.c
2.s:
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl $0, -16(%ebp)
jmp L2
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
leave
ret
From these, you can see two things: firstly, the code is the same in both.
Secondly, the storage for var is allocated outside the loop:
subl $24, %esp
And finally the only thing in the loop is the assignment and condition check:
L3:
movl $4, -12(%ebp)
L2:
cmpl $99, -16(%ebp)
jle L3
Which is about as efficient as you can be without removing the loop entirely.
These days it is better to declare it inside the loop unless it is a constant, as the compiler will be able to better optimize the code (reducing variable scope).
EDIT: This answer is mostly obsolete now. With the rise of post-classical compilers, the cases where the compiler can't figure it out are getting rare. I can still construct them but most people would classify the construction as bad code.
Most modern compilers will optimize this for you. That being said I would use your first example as I find it more readable.
For a built-in type there will likely be no difference between the 2 styles (probably right down to the generated code).
However, if the variable is a class with a non-trivial constructor/destructor, there could well be a major difference in runtime cost. I'd generally scope the variable to inside the loop (to keep the scope as small as possible), but if that turns out to have a perf impact, I'd look at moving the class variable outside the loop's scope. Doing that needs some additional analysis, as the semantics of the code path may change, so this can only be done if the semantics permit it.
An RAII class might need this behavior. For example, a class that manages file access lifetime might need to be created and destroyed on each loop iteration to manage the file access properly.
Suppose you have a LockMgr class that acquires a critical section when it's constructed and releases it when destroyed:
while (i < 100) {
    LockMgr lock( myCriticalSection ); // acquires a critical section at start of
                                       // each loop iteration
    // do stuff...
} // critical section is released at end of each loop iteration
is quite different from:
LockMgr lock( myCriticalSection );
while (i < 100) {
    // do stuff...
}
Both loops have the same efficiency. They will both take an infinite amount of time :) It may be a good idea to increment i inside the loops.
I once ran some performance tests, and to my surprise found that case 1 was actually faster! I suppose this may be because declaring the variable inside the loop reduces its scope, so it gets freed earlier. However, that was a long time ago, on a very old compiler. I'm sure modern compilers do a better job of optimizing away the differences, but it still doesn't hurt to keep your variable scope as short as possible.
#include <stdio.h>

int main()
{
    for (int i = 0; i < 10; i++)
    {
        int test;
        if (i == 0)
            test = 100;
        printf("%d\n", test);
    }
}
The code above always prints 100 ten times, which suggests that the local variable inside the loop is only allocated once per function call.
The only way to be sure is to time them. But the difference, if there is one, will be microscopic, so you will need a mighty big timing loop.
More to the point, the first one is better style because it initializes the variable var, while the other one leaves it uninitialized. This and the guideline that one should define variables as near to their point of use as possible, means that the first form should normally be preferred.
With only two variables, the compiler will likely assign a register to each. These registers are there anyway, so this doesn't take time. There are two register writes and one register read instruction in either case.
I think most answers are missing a major point to consider, which is: "is it clear?", and obviously, given all the discussion, the fact is: no, it is not.
I'd suggest that in most loop code efficiency is pretty much a non-issue (unless you're computing for a Mars lander), so really the only question is what looks more sensible, readable and maintainable - in this case I'd recommend declaring the variable up front and outside the loop; that simply makes it clearer. Then people like you and me would not even have to waste time checking online to see whether it's valid.
That's not true.
There is overhead; however, it's negligible overhead.
Even though both will probably end up at the same place on the stack, the declaration still assigns it: it will reserve a memory location on the stack for that int and then free it at the end of the } - not in the heap-free sense, but in the sense that it moves the sp (stack pointer).
And in your case, considering the function only has one local variable, it will simply equate the fp (frame pointer) and sp.
Short answer: don't care - either way works almost the same.
But try reading more on how the stack is organized; my undergrad school had pretty good lectures on that. If you want to read more, check here:
http://www.cs.utk.edu/~plank/plank/classes/cs360/360/notes/Assembler1/lecture.html