Optimization of access to members in C++

I'm running into an inconsistent optimization behavior with different compilers for the following code:
class tester
{
public:
tester(int* arr_, int sz_)
: arr(arr_), sz(sz_)
{}
int doadd()
{
sm = 0;
for (int n = 0; n < 1000; ++n)
{
for (int i = 0; i < sz; ++i)
{
sm += arr[i];
}
}
return sm;
}
protected:
int* arr;
int sz;
int sm;
};
The doadd method simulates some intensive access to members (ignore the overflow in the addition for this question). Compare it with similar code implemented as a free function:
int arradd(int* arr, int sz)
{
int sm = 0;
for (int n = 0; n < 1000; ++n)
{
for (int i = 0; i < sz; ++i)
{
sm += arr[i];
}
}
return sm;
}
The doadd method runs about 1.5 times slower than the arradd function when compiled in Release mode with Visual C++ 2008. When I modify the doadd method to be as follows (aliasing all members with locals):
int doadd()
{
int mysm = 0;
int* myarr = arr;
int mysz = sz;
for (int n = 0; n < 1000; ++n)
{
for (int i = 0; i < mysz; ++i)
{
mysm += myarr[i];
}
}
sm = mysm;
return sm;
}
Runtimes become roughly the same. Am I right in concluding that this is a missing optimization by the Visual C++ compiler? g++ seems to do it better and run both the member function and the normal function at the same speed when compiling with -O2 or -O3.
The benchmarking is done by invoking the doadd member and the arradd function on some sufficiently large array (a few million integers in size).
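For reference, a minimal harness along those lines (a sketch using modern <chrono>; the original benchmark used VC++ 2008, and the array size here is an assumption; the tester class above is assumed to be in scope):
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> data(5000000, 1); // "a few million" integers
    tester t(data.data(), static_cast<int>(data.size()));

    auto t0 = std::chrono::steady_clock::now();
    int result = t.doadd(); // the member-function version under test
    auto t1 = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("doadd() = %d in %lld ms\n", result, static_cast<long long>(ms));
    return 0;
}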
EDIT: Some fine-grained testing shows that the main culprit is the sm member. Replacing all others by local versions still makes the runtime long, but once I replace sm by mysm the runtime becomes equal to the function version.
Resolution
Disappointed with the answers (sorry guys), I shook off my laziness and dove into the disassembly listings for this code. My answer below summarizes the findings. In short: it has nothing to do with aliasing; it has everything to do with loop unrolling, and with some strange heuristics MSVC applies when deciding which loop to unroll.

It may be an aliasing issue: the compiler cannot know that the instance variable sm will never be pointed at by arr, so it has to treat sm as if it were effectively volatile, and store it back on every iteration. You could make sm a different type to test this hypothesis. Or just use a temporary local sum (which will get cached in a register) and assign it to sm at the end.

MSVC is correct, in that it is the only one that, given the code we've seen, is guaranteed to work correctly. GCC employs optimizations that are probably safe in this specific instance, but that can only be verified by seeing more of the program.
Because sm is not a local variable, MSVC apparently assumes that it might alias arr.
That's a fairly reasonable assumption: because arr is protected, a derived class might set it to point to sm, so arr could alias sm.
GCC sees that it doesn't actually alias arr, and so it doesn't write sm back to memory until after the loop, which is much faster.
It's certainly possible to instantiate the class so that arr points to sm, which MSVC would handle, but GCC wouldn't.
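For illustration, here is a hypothetical derived class that creates exactly that alias (arr and sm are protected, so this compiles):
struct evil_tester : tester
{
    evil_tester() : tester(nullptr, 1)
    {
        arr = &sm; // arr[0] now aliases sm
    }
};
With sz == 1, every iteration of sm += arr[i] then reads the running sum back through arr, so keeping sm in a register would compute a different result.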
Assuming that sz > 1, GCC's optimization is permissible in general.
Because the function loops over arr, treating it as an array of sz elements, calling the function with sz > 1 would yield undefined behavior whether or not arr aliases sm, and so GCC could safely assume that they don't alias. But if sz == 1, or if the compiler can't be sure what sz's value might be, then it runs the risk that sz might be 1, and so arr and sm could alias perfectly legally, and GCC's code would break.
So most likely, GCC simply gets away with it by inlining the whole thing, and seeing that in this case, they don't alias.

I disassembled the code with MSVC to better understand what's going on. Turns out aliasing wasn't a problem at all, and neither was some kind of paranoid thread safety.
Here is the interesting part of the disassembled arradd function:
for (int n = 0; n < 10; ++n)
{
for (int i = 0; i < sz; ++i)
013C101C mov ecx,ebp
013C101E mov ebx,29B9270h
{
sm += arr[i];
013C1023 add eax,dword ptr [ecx-8]
013C1026 add edx,dword ptr [ecx-4]
013C1029 add esi,dword ptr [ecx]
013C102B add edi,dword ptr [ecx+4]
013C102E add ecx,10h
013C1031 sub ebx,1
013C1034 jne arradd+23h (13C1023h)
013C1036 add edi,esi
013C1038 add edi,edx
013C103A add eax,edi
013C103C sub dword ptr [esp+10h],1
013C1041 jne arradd+16h (13C1016h)
013C1043 pop edi
013C1044 pop esi
013C1045 pop ebp
013C1046 pop ebx
ecx points into the array, and we can see that the inner loop is unrolled x4 here: note the four consecutive add instructions, and ecx being advanced by 16 bytes (4 words) at a time inside the loop.
For the unoptimized version of the member function, doadd:
int tester::doadd()
{
sm = 0;
for (int n = 0; n < 10; ++n)
{
for (int i = 0; i < sz; ++i)
{
sm += arr[i];
}
}
return sm;
}
The disassembly is (it's harder to find since the compiler inlined it into main):
int tr_result = tr.doadd();
013C114A xor edi,edi
013C114C lea ecx,[edi+0Ah]
013C114F nop
013C1150 xor eax,eax
013C1152 add edi,dword ptr [esi+eax*4]
013C1155 inc eax
013C1156 cmp eax,0A6E49C0h
013C115B jl main+102h (13C1152h)
013C115D sub ecx,1
013C1160 jne main+100h (13C1150h)
Note 2 things:
The sum is stored in a register, edi. Hence, no aliasing "care" is taken here; the value of sm isn't re-read all the time. edi is initialized just once and then used as a temporary. You don't see it returned, since the compiler optimized that away and used edi directly as the return value of the inlined code.
The loop is not unrolled. Why? No good reason.
Finally, here's an "optimized" version of the member function, with mysm keeping the sum local manually:
int tester::doadd_opt()
{
sm = 0;
int mysm = 0;
for (int n = 0; n < 10; ++n)
{
for (int i = 0; i < sz; ++i)
{
mysm += arr[i];
}
}
sm = mysm;
return sm;
}
The (again, inlined) disassembly is:
int tr_result_opt = tr_opt.doadd_opt();
013C11F6 xor edi,edi
013C11F8 lea ebp,[edi+0Ah]
013C11FB jmp main+1B0h (13C1200h)
013C11FD lea ecx,[ecx]
013C1200 xor ecx,ecx
013C1202 xor edx,edx
013C1204 xor eax,eax
013C1206 add ecx,dword ptr [esi+eax*4]
013C1209 add edx,dword ptr [esi+eax*4+4]
013C120D add eax,2
013C1210 cmp eax,0A6E49BFh
013C1215 jl main+1B6h (13C1206h)
013C1217 cmp eax,0A6E49C0h
013C121C jge main+1D1h (13C1221h)
013C121E add edi,dword ptr [esi+eax*4]
013C1221 add ecx,edx
013C1223 add edi,ecx
013C1225 sub ebp,1
013C1228 jne main+1B0h (13C1200h)
The loop here is unrolled, but just x2.
This explains my speed-difference observations quite well. For a 175e6 array, the function runs in ~1.2 secs, the unoptimized member in ~1.5 secs, and the optimized member in ~1.3 secs. (Note that this may differ for you; on another machine I got closer runtimes for all 3 versions.)
What about gcc? When compiled with it, all 3 versions ran at ~1.5 secs. Suspecting the lack of unrolling I looked at gcc's disassembly and indeed: gcc doesn't unroll any of the versions.
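(For what it's worth, gcc of that era does not unroll loops at -O2/-O3 by default; unrolling has to be requested explicitly, e.g. with the -funroll-loops flag, along the lines of the following; bench.cpp is a placeholder name:)
g++ -O3 -funroll-loops bench.cpp -o bench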

As Paul wrote, it is probably because the sm member is actually updated in "real" memory every time, whereas the local sum in the function can be accumulated in a register (after compiler optimization).

You can get similar issues when passing pointer arguments. If you like getting your hands dirty, you may find the restrict keyword useful in the future:
http://developers.sun.com/solaris/articles/cc_restrict.html
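For example, here's a sketch using the non-standard but widely supported __restrict extension (the standard C99 spelling is restrict; arradd_r and out are made-up names): the qualifiers promise the compiler that the two pointers never alias, so the running sum can stay in a register.
int arradd_r(const int* __restrict arr, int sz, int* __restrict out)
{
    int sm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    *out = sm; // written once, after the loops
    return sm;
}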

This isn't really the same code at all. If you put the sm, arr and sz variables inside the class instead of making them local, the compiler can't (easily) guess that some other class won't inherit from the tester class and access these members, doing something like arr = &sm; doadd();. Hence, access to these variables can't be optimized away the way it can when they are local to a function.
In the end the reason is basically the one Paul pointed out: sm is updated in real memory when it is a class member, but can be kept in a register when it is local to a function. The memory reads done by the adds shouldn't change the resulting time much, as memory must be read anyway to get the values.
In this case, if tester is not exported to another module, is not aliased even indirectly to something exported, and there is no aliasing like the above, the compiler could optimize away the intermediate writes to sm. Some compilers like gcc seem to optimize aggressively enough to detect the above cases (would it also do so if the tester class were exported?). But these are really hard guesses. There are still much simpler optimizations that compilers do not yet perform (like inlining across different compilation units).

The key is probably that doadd effectively looks like this if you make the member accesses explicit through this:
int doadd()
{
this->sm = 0;
for (int n = 0; n < 1000; ++n)
{
for (int i = 0; i < this->sz; ++i)
{
this->sm += this->arr[i];
}
}
return this->sm;
}
Therein lies the problem: all class members are accessed via the this pointer, whereas arradd has all its variables on the stack. To speed it up, you found that moving all the members onto the stack as local variables makes the speed match arradd. This indicates that the this indirection is responsible for the performance loss.
Why might that be? As I understand it, this is usually stored in a register, so I don't think it's ultimately any slower than just accessing the stack (which is an offset into the stack pointer as well). As other answers point out, it's probably the aliasing problem that generates less optimal code: the compiler can't tell whether any of the memory addresses overlap. Updating sm could in theory also change the contents of arr, so it decides to write out the value of sm to main memory every time rather than tracking it in a register. When variables are on the stack, the compiler can assume they're all at different memory addresses. The compiler doesn't see the program as clearly as you do: it can tell what's on the stack (because you declared it like that), but everything else is just arbitrary memory addresses that could be anything, anywhere, overlapping any other pointer.
I'm not surprised the optimisation in your question (using local variables) isn't made: not only would the compiler have to prove that the memory of arr does not overlap anything pointed to by this, but also that not updating the member variables until the end of the function is equivalent to the unoptimised version updating throughout the function. That can be a lot trickier to determine than you imagine, especially if you take concurrency into account.

Related

Signed overflow in C++ and undefined behaviour (UB)

I'm wondering about the use of code like the following
int result = 0;
int factor = 1;
for (...) {
result = ...
factor *= 10;
}
return result;
If the loop is iterated n times, then factor is multiplied by 10 exactly n times. However, factor is only ever used after having been multiplied by 10 a total of n-1 times. If we assume that factor never overflows except possibly on the last iteration of the loop, should such code be acceptable? In this case, the value of factor would provably never be used after the overflow has happened.
I'm having a debate on whether code like this should be accepted. It would be possible to put the multiplication inside an if statement and simply not multiply on the last iteration of the loop, when it could overflow. The downside is that it clutters the code and adds an unnecessary branch that would have to be checked on all the previous loop iterations. I could also run the loop one fewer iteration and replicate the loop body once after the loop; again, this complicates the code.
The actual code in question is used in a tight inner-loop that consumes a large chunk of the total CPU time in a real-time graphics application.
Compilers do assume that a valid C++ program does not contain UB. Consider for example:
if (x == nullptr) {
*x = 3;
} else {
*x = 5;
}
If x == nullptr, then dereferencing it and assigning a value is UB. Hence the only way this could be part of a valid program is if x == nullptr never yields true, so the compiler can assume, under the as-if rule, that the above is equivalent to:
*x = 5;
Now in your code
int result = 0;
int factor = 1;
for (...) { // Loop until factor overflows but not more
result = ...
factor *= 10;
}
return result;
The last multiplication of factor cannot happen in a valid program (signed overflow is undefined). Hence the assignment to result cannot happen either. As there is no way to branch out before the last iteration, the previous iterations cannot happen either. Eventually, the part of the code that is correct (i.e., in which no undefined behaviour ever happens) is:
// nothing :(
The behaviour of int overflow is undefined.
It doesn't matter if you read factor outside the loop body; if it has overflowed by then, then the behaviour of your code on, after, and somewhat paradoxically before the overflow is undefined.
One issue that might arise in keeping this code is that compilers are getting more and more aggressive when it comes to optimisation. In particular, they are developing a habit of assuming that undefined behaviour never happens. Under that assumption, they may remove the for loop altogether.
Can't you use an unsigned type for factor? Although then you'd need to worry about unwanted conversions of int to unsigned in expressions containing both.
It might be insightful to consider real-world optimizers. Loop unrolling is a known technique. The basic idea of loop unrolling is that
for (int i = 0; i != 3; ++i)
foo()
might be better implemented behind the scenes as
foo()
foo()
foo()
This is the easy case, with a fixed bound. But modern compilers can also do this
for variable bounds:
for (int i = 0; i != N; ++i)
foo();
becomes
__RELATIVE_JUMP(3-N)
foo();
foo();
foo();
Obviously this only works if the compiler knows that N<=3. And that's where we get back to the original question:
int result = 0;
int factor = 1;
for (...) {
result = ...
factor *= 10;
}
return result;
Because the compiler knows that signed overflow does not occur, it knows that the loop can execute a maximum of 9 times on 32-bit architectures, since 10^10 > INT_MAX. It can therefore do a 9-iteration loop unroll. But the intended maximum was 10 iterations!
What might happen is that you get a relative jump to an assembly instruction (9-N) with N==10, i.e. an offset of -1, which is the jump instruction itself. Oops. This is a perfectly valid loop optimization for well-defined C++, but the given example turns into a tight infinite loop.
Any signed integer overflow results in undefined behaviour, regardless of whether or not the overflowed value is or might be read.
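A minimal illustration of that point (a sketch; the constant is arbitrary): the overflow below is UB even though the overflowed value is never read.
int f(int x)
{
    int product = x * 1000000; // UB whenever this overflows...
    (void)product;             // ...even though product is never read
    return x;
}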
Maybe in your use case you can lift the first iteration out of the loop, turning this
int result = 0;
int factor = 1;
for (int n = 0; n < 10; ++n) {
result += n + factor;
factor *= 10;
}
// factor "is" 10^10 > INT_MAX, UB
into this
int factor = 1;
int result = 0 + factor; // first iteration
for (int n = 1; n < 10; ++n) {
factor *= 10;
result += n + factor;
}
// factor is 10^9 < INT_MAX
With optimization enabled, the compiler might unroll the second loop above into one conditional jump.
This is UB; in ISO C++ terms the entire behaviour of the program is completely unspecified for an execution that eventually hits UB. The classic example is that, as far as the C++ standard cares, it can make demons fly out of your nose. (I recommend against using an implementation where nasal demons are a real possibility.) See other answers for more details.
Compilers can "cause trouble" at compile time for paths of execution they can see leading to compile-time-visible UB, e.g. assume those basic blocks are never reached.
See also What Every C Programmer Should Know About Undefined Behavior (LLVM blog). As explained there, signed-overflow UB lets compilers prove that for(... i <= n ...) loops are not infinite loops, even for unknown n. It also lets them "promote" int loop counters to pointer width instead of redoing sign-extension. (So the consequence of UB in that case could be accessing outside the low 64k or 4G elements of an array, if you were expecting signed wrapping of i into its value range.)
In some cases compilers will emit an illegal instruction like x86 ud2 for a block that provably causes UB if ever executed. (Note that a function might not ever be called, so compilers can't in general go berserk and break other functions, or even possible paths through a function that don't hit UB. i.e. the machine code it compiles to must still work for all inputs that don't lead to UB.)
Probably the most efficient solution is to manually peel the last iteration so the unneeded factor*=10 can be avoided.
int result = 0;
int factor = 1;
for (... i < n-1) { // stop 1 iteration early
result = ...
factor *= 10;
}
result = ... // another copy of the loop body, using the last factor
// factor *= 10; // and optimize away this dead operation.
return result;
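For concreteness, here is the peeling applied to the n < 10 loop from the earlier example (assuming the body result += n + factor from above):
int peeled()
{
    int result = 0;
    int factor = 1;
    for (int n = 0; n < 9; ++n) { // stop one iteration early
        result += n + factor;
        factor *= 10;
    }
    result += 9 + factor; // last body copy; factor stays at 10^9
    // factor *= 10;      // dead operation, removed
    return result;
}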
Or if the loop body is large, consider simply using an unsigned type for factor. Then you can let the unsigned multiply overflow and it will just do well-defined wrapping to some power of 2 (the number of value bits in the unsigned type).
This is fine even if you use it with signed types, especially if your unsigned->signed conversion never overflows.
Conversion between unsigned and 2's complement signed is free (same bit-pattern for all values); the modulo wrapping for int -> unsigned specified by the C++ standard simplifies to just using the same bit-pattern, unlike for one's complement or sign/magnitude.
And unsigned->signed is similarly trivial, although it is implementation-defined for values larger than INT_MAX. If you aren't using the huge unsigned result from the last iteration, you have nothing to worry about. But if you are, see Is conversion from unsigned to signed undefined?. The value-doesn't-fit case is implementation-defined, which means that an implementation must pick some behaviour; sane ones just truncate (if necessary) the unsigned bit pattern and use it as signed, because that works for in-range values the same way with no extra work. And it's definitely not UB. So big unsigned values can become negative signed integers. e.g. after int x = u; gcc and clang don't optimize away x>=0 as always being true, even without -fwrapv, because they defined the behaviour.
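Here is a runnable sketch of the unsigned-factor variant (the loop body is an assumption, chosen so every value that is actually used stays in range):
#include <cstdio>

int main()
{
    int result = 0;
    unsigned factor = 1; // unsigned: overflow wraps, never UB
    for (int n = 0; n < 10; ++n) {
        result += static_cast<int>(factor); // uses 1, 10, ..., 10^9: all in range
        factor *= 10;                       // wraps modulo 2^32 on the final pass
    }
    std::printf("%d\n", result); // prints 1111111111
    return 0;
}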
If you can tolerate a few additional assembly instructions in the loop, instead of
int factor = 1;
for (int j = 0; j < n; ++j) {
...
factor *= 10;
}
you can write:
int factor = 0;
for (...) {
factor = 10 * factor + !factor;
...
}
to avoid the last multiplication. !factor will not introduce a branch:
xor ebx, ebx
L1:
xor eax, eax
test ebx, ebx
lea edx, [rbx+rbx*4]
sete al
add ebp, 1
lea ebx, [rax+rdx*2]
mov edi, ebx
call consume(int)
cmp r12d, ebp
jne .L1
This code
int factor = 0;
for (...) {
factor = factor ? 10 * factor : 1;
...
}
also results in branchless assembly after optimization:
mov ebx, 1
jmp .L1
.L2:
lea ebx, [rbx+rbx*4]
add ebx, ebx
.L1:
mov edi, ebx
add ebp, 1
call consume(int)
cmp r12d, ebp
jne .L2
(Compiled with GCC 8.3.0 -O3)
You didn't show what's in the parentheses of the for statement, but I'm going to assume it's something like this:
for (int n = 0; n < 10; ++n) {
result = ...
factor *= 10;
}
You can simply move the counter increment and loop termination check into the body:
for (int n = 0; ; ) {
result = ...
if (++n >= 10) break;
factor *= 10;
}
The number of assembly instructions in the loop will remain the same.
Inspired by Andrei Alexandrescu's presentation "Speed Is Found In The Minds of People".
Consider the function:
unsigned mul_mod_65536(unsigned short a, unsigned short b)
{
return (a*b) & 0xFFFFu;
}
According to the published Rationale, the authors of the Standard would have expected that if this function were invoked on (e.g.) a commonplace 32-bit computer with arguments of 0xC000 and 0xC000, promoting the operands of * to signed int would cause the computation to yield -0x70000000, which when converted to unsigned would yield 0x90000000u, the same answer as if unsigned short had promoted to unsigned. Nonetheless, gcc will sometimes optimize that function in ways that would behave nonsensically if an overflow occurs. Any code where some combination of inputs could cause an overflow must be processed with the -fwrapv option, unless it would be acceptable to allow creators of deliberately malformed input to execute arbitrary code of their choosing.
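(That is, compiling with something like the following, where -fwrapv tells gcc/clang to give signed overflow well-defined two's complement wrapping semantics; prog.cpp is a placeholder name:)
g++ -O2 -fwrapv prog.cpp -o prog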
Why not this:
int result = 0;
int factor = 10;
for (...) {
factor *= 10;
result = ...
}
return result;
There are many different faces of Undefined Behavior, and what's acceptable depends on the usage.
tight inner-loop that consumes a large chunk of the total CPU time in a real-time graphics application
That, by itself, is a bit of an unusual thing, but be that as it may... if this is indeed the case, then the UB is most probably within the realm "allowable, acceptable". Graphics programming is notorious for hacks and ugly stuff. As long as it "works" and it doesn't take longer than 16.6ms to produce a frame, usually, nobody cares. But still, be aware of what it means to invoke UB.
First, there is the standard. From that point of view, there's nothing to discuss and no way to justify it: your code is simply invalid. There are no ifs or whens; it just isn't valid code. You might as well say that's a middle finger up from your point of view, and 95-99% of the time you'll be good to go anyway.
Next, there's the hardware side. There are some uncommon, weird architectures where this is a problem. I'm saying "uncommon, weird" because on the one architecture that makes up 80% of all computers (or the two architectures that together make up 95% of all computers), overflow is a "yeah, whatever, don't care" thing at the hardware level. You do get a garbage (though still predictable) result, but no evil things happen.
That is not the case on every architecture, you might very well get a trap on overflow (though seeing how you speak of a graphics application, the chances of being on such an odd architecture are rather small). Is portability an issue? If it is, you may want to abstain.
Last, there is the compiler/optimizer side. One reason why overflow is undefined is that, once upon a time, simply leaving it at that was the easiest way to cope with the hardware. But another reason is that, e.g., x+1 is then guaranteed to always be larger than x, and the compiler/optimizer can exploit this knowledge. Now, for the previously mentioned case, compilers are indeed known to act this way and simply strip out complete blocks (there was a Linux exploit some years ago which was based on the compiler having dead-stripped some validation code for exactly this reason).
For your case, I would seriously doubt that the compiler does some special, odd, optimizations. However, what do you know, what do I know. When in doubt, try it out. If it works, you are good to go.
(And finally, there's of course code audit, you might have to waste your time discussing this with an auditor if you're unlucky.)

Why does adding or removing the const modifier change efficiency by a factor of 4?

Why does adding or removing the const modifier change efficiency by a factor of 4? This code needs about 16 seconds to finish on my PC. But if I make a small change, like declaring mod as const int, or moving the mod declaration into the main body, or changing i to type int, the execution time drops to 4 seconds. (I compiled this code using g++ with default parameters.)
Here is the assembly code for this program; the left part is generated with the non-const int mod, the other with the const int mod declaration.
The big efficiency difference occurs only when I declare i as long long and the operator in the for loop is '%'. Otherwise the performance differs by only about 10%.
// const int mod = 1000000009;
int mod = 1000000009;
int main(){
// int mod = 1000000009;
int rel = 0;
for(long long i=1e9; i<2*(1e9); i++){
rel = i % mod;
}
return 0;
}
Because when you add const, the compiler folds it into a constant value and writes it directly into the assembly code; when you do not add const, the value has to be loaded from memory into a register, so it must be fetched every time it is used.
When loading the value of mod from memory into a register, the generated assembly code is different.
For example, this is what you get when using the Visual Studio 2013 compiler for x64 based processor:
For int mod = 1000000009:
mov eax,dword ptr ds:[xxxxxxxxh] ; xxxxxxxxh = &mod
cdq
push edx
push eax
For const int mod = 1000000009:
push 0
push 3B9ACA09h ; 3B9ACA09h = 1000000009
A const variable may or may not take space on the stack; that's up to the compiler. But in most cases, a const variable's usage will be replaced by its constant value. Consider:
const int size = 100;
int* pModify = (int*)&size;
*pModify = 200;
int Array[size];
When you use *pModify it will render 200, but the size of the array will still be 100 elements (ignore compiler extensions that allow variable-size arrays). That is simply because the compiler has replaced [size] with [100]. When you use size, it will mostly just be 100.
In that loop, % mod is simply replaced with % 1000000009. There is one memory-read (load) instruction fewer, which is why it performs faster.
But it must be noted that compilers act smart, very smart, so you cannot guess which optimization technique they might have applied. They might have removed the loop entirely (since it appears to be a no-op to the compiler).
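To keep such a benchmark honest, you can give the loop an observable effect, e.g. this sketch of the original program returning the result (a clever compiler may still fold it, but at least the loop is no longer a pure no-op):
int mod = 1000000009;

int main() {
    int rel = 0;
    for (long long i = 1000000000LL; i < 2000000000LL; i++) {
        rel = static_cast<int>(i % mod);
    }
    return rel; // using rel prevents trivial dead-code elimination
}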

Are variables inside a loop (while or for) disposed after the loop has completed?

Are variables that are created inside a while or for loop disposed of/deleted from memory after the loop is done executing? Also, is it a bad coding habit to create temporary variables inside a loop?
In this example, does it create 100 X variables and then dispose of them, or are they disposed of on each iteration? Thanks.
example:
int cc =0;
while(cc < 100){
int X = 99; // <-- this variable
cc++;
}
Scope and lifetime are two different things. For variables defined at block scope without static, they're more or less tightly linked, but they're still distinct concepts -- and you can shoot yourself in the foot if you don't keep them straight.
Quoting the snippet from the question:
int cc =0;
while(cc < 100){
int X = 99; // <-- this variable
cc++;
}
The scope of X is the region of program text in which its name is visible. It extends from the point at which it's defined to the end of the enclosing block, which is delimited by the { and } characters. (The fact that the block happens to be part of a while statement is not directly relevant; it's the block itself that defines the scope.)
Inside the block, the name X refers to that int variable. Outside the block, the name X is not visible.
The lifetime of X is the time during program execution when X logically exists. It begins when execution reaches the opening { (before the definition), and ends when execution reaches the closing }. If the block is executed 100 times, then X is created and "destroyed" 100 times, and has 100 disjoint lifetimes.
Although the name X is visible only within its scope, the object called X may be accessed (directly or indirectly) any time within its lifetime. For example, if we pass &X to a function, then that function may read and update the object, even though the function is completely outside its scope.
You can take the address of X, and save that address for use after its lifetime has ended -- but doing so causes undefined behavior. A pointer to an object whose lifetime has ended is indeterminate, and any attempt to dereference it -- or even to refer to the pointer's value -- has undefined behavior.
Nothing in particular actually has to happen when an object reaches the end of its lifetime. The language just requires the object to exist during its lifetime; outside that, all bets are off. The stack space allocated to hold the object might be deallocated (which typically just means moving the stack pointer), or, as an optimization, the stack space might remain allocated and re-used for the next iteration of the loop.
Also, is it a bad coding habit to create temporary variables inside a loop?
Not at all. As long as you don't save the object's address past the end of its lifetime, there's nothing wrong with it. The allocation and deallocation will very often be done on entry to and exit from the function rather than the block, as a performance optimization. Restricting variables to a reasonably tight scope is a very good coding habit. It makes the code easier to understand by restricting unnecessary interactions between different parts of the code. If X is defined inside the body of the loop, nothing outside the loop can affect it (if you haven't done something too tricky), which makes it easier to reason about the code.
UPDATE: If X were of a type with a non-trivial constructor and/or destructor, creating and destruction of X actually has to be performed on entry to and exit from the block (unless the compiler is able to optimize that code away). For an object of type int, that isn't an issue.
Yes, the variable is created and destroyed N times, unless the compiler optimizes it somehow (which it can, I believe). It's not a very big deal when you have just one int though. It becomes more problematic when you have some complex object recreated 99 times inside your loop.
A small practical example:
#include <iostream>
using namespace std;
class TestClass
{
public:
TestClass();
~TestClass();
};
TestClass::TestClass()
{
cout<<"Created."<<endl;
}
TestClass::~TestClass()
{
cout<<"Destroyed"<<endl;
}
int main() {
for (int i = 0; i < 5; ++i)
{
TestClass c;
}
return 0;
}
In this code TestClass c will be recreated 5 times.
Yes. Any variable defined within a scope { }:
if (true) { // Forgive me father for this heresy :)
int a = 0;
}
a = 1; // error: a is not in scope here
is automatically deallocated once it goes out of scope.
This is also true in this case:
for (int i = 0; i < 10; ++i) {
// Do stuff
...
}
i = 1; // error: i is not in scope here
The scope of i is limited to the for loop, even if it was not declared inside a pair of { }.
As pointed out before, in truth there's no need to create/destroy a local variable repeatedly.
That would be an unnecessary waste of CPU time.
Example, compiled with "gcc -S -masm=intel -fverbose-asm"
int crazyloop (int n) {
if (n > 0) {
int i;
for (i = 0; i < n; i++)
;
}
return n;
}
the corresponding assembly listing is:
_Z9crazyloopi:
.LFB0:
.cfi_startproc
push rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp #,
.cfi_def_cfa_register 6
mov DWORD PTR [rbp-20], edi # n, n
cmp DWORD PTR [rbp-20], 0 # n,
jle .L2 #,
mov DWORD PTR [rbp-4], 0 # i,
.L4:
mov eax, DWORD PTR [rbp-4] # tmp89, i
cmp eax, DWORD PTR [rbp-20] # tmp89, n
jge .L2 #,
add DWORD PTR [rbp-4], 1 # i,
jmp .L4 #
.L2:
mov eax, DWORD PTR [rbp-20] # D.3136, n
pop rbp #
.cfi_def_cfa 7, 8
ret
.cfi_endproc
The interesting part lies in the references to register RBP.
Once set, it does not change. The variable i is always at [rbp-4]. (Variations of the code, with more variables, etc., gave the same results: there is no repeated allocation/deallocation, i.e., no modification of the stack-top position.)
It is the most sensible thing to do: think of a loop that iterates trillions of times.
Would another compiler do it differently? Possibly, but why on earth would it?
Might security be a concern? Unlikely; in that case the programmer should simply overwrite the variable before letting it vanish.
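For that security case, here is a sketch of "overwrite before letting it vanish" (the volatile pointer keeps the compiler from optimizing the dead stores away; the names are made up):
#include <cstddef>

void use_secret()
{
    char secret[32] = {};
    // ... fill and use secret ...
    volatile char* p = secret;
    for (std::size_t i = 0; i < sizeof secret; ++i)
        p[i] = 0; // volatile writes must actually be performed
}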
Yes
If you want to have access to the variable after the loop, you should declare it outside of the loop:
while(...) {
int i = 0;
i++; // valid
}
i++; // invalid, i doesn't exist in this context
With i outside of the loop:
int i = 0;
while(...) {
i++; // valid
}
i++; // valid
The lifespan of a variable is limited to the context {...} in which it was created.
If we were considering an object, its destructor would be called when execution reaches the closing }.
Yes, they are destroyed when they go out of scope. Note that this isn't specific to variables in the loop. This rule applies to all variables with automatic storage duration.
A variable remains alive only within its scope. Outside that scope the variable doesn't even exist, never mind accessing its value.
for(;;)
{
int var; // Scope of this variable is within this for loop only.
}
// Outside this for loop, the variable `var` doesn't exist.
Variables declared inside the loop have their own scope.
You can do this when you know you will not use the variable outside of that scope.
The rest is the compiler's job. :)

Performance difference between accessing the member of a heap and a stack object?

Currently I'm using the '->' operator to dereference members inside a class. My question is whether it is faster than normal member access. For example:
Class* myClsPtr = new Class();
myClsPtr->foo(bar);
Vs.
Class myCls;
myCls.foo(bar);
Can I use both ways without a performance difference?
First,
Class myCls = new Class();
is invalid code... Let us assume you meant
Class myCls;
There will be pretty much no noticeable difference, but you could benchmark it yourself by iterating a million times in a loop and calling either variant while timing the execution of each.
I have just made a quick and dirty benchmark on my laptop with one hundred million iterations, as follows:
Stack Object
struct MyStruct
{
int i;
};
int main()
{
MyStruct stackObject;
for (int i = 0; i < 100000000; ++i)
stackObject.i = 0;
return 0;
}
and then I ran:
g++ main.cpp && time ./a.out
the result is:
real 0m0.301s
user 0m0.303s
sys 0m0.000s
Heap Object
struct MyStruct
{
int i;
};
int main()
{
MyStruct *heapObject = new MyStruct();
for (int i = 0; i < 100000000; ++i)
heapObject->i = 5;
return 0;
}
and then I ran:
g++ main.cpp && time ./a.out
the result is:
real 0m0.253s
user 0m0.250s
sys 0m0.000s
As you can see, the heap object is slightly faster on my machine for 100 million iterations. Even on my machine, the difference would be unnoticeable for significantly fewer iterations. One thing that stands out is that, although the results vary slightly between runs, the heap-object version always performs better on my laptop. Do not take this as a guarantee, however.
As with so many performance questions, the answer is complicated and variable. The potential sources of slowness using the heap are:
Time to allocate and deallocate objects.
The possibility that the object is not in the cache.
Both of these mean an object on the heap might be slow at first. But this won't matter much if you use the object many times in a tight loop: soon the object will end up in the CPU cache whether it lives on the heap or the stack.
A related issue is whether objects that contain other objects should use pointers or copies. If speed is the only issue, it is probably better to store copies, because each new pointer lookup is a potential cache-miss.
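In other words, all else being equal, the first layout below tends to be friendlier to the cache than the second (a sketch; the names are made up):
struct Point { int x, y; };

struct ByValue {
    Point p;  // stored inline: one contiguous block, no extra lookup
};

struct ByPointer {
    Point* p; // extra indirection: each lookup is a potential cache miss
};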
Since a->b is equivalent to (*a).b (and that's indeed what the compiler must create, at least logically), -> could at most be slower than ., if there is any difference at all. In practice the compiler will likely store a's address in a register and add the offset of b immediately, skipping the (*a) part and effectively reducing it to a.b internally.
With -O3, gcc 4.8.2 eliminates the whole loop, by the way. It even does that if we return the last MyStruct::i from main; the loop is side-effect free and the end value is trivially computable. Just another benchmarking remark.
And then it's not really about the object being on the heap; it's about using an address vs. using the object directly. The logic would be the same for the same object:
MyStruct m;
MyStruct* mp = &m;
and then run your two loops, with m or mp resepctively. The position (in terms of which memory page it is on) of an object may matter a lot more than whether you access it directly or via a pointer because locality tends to be important in modern architectures (with caches and parallelism). If some memory is already in a cached memory location (the stack may well be cached) it's much faster to access than some location which must be loaded into the cache first (some arbitrary heap location). In either loop the memory where the object resides will likely stay cached because not much else happens there, but in more realistic scenarios (iterating over pointers in a vector: where do the pointers point to? Scattered or contiguous memory?) these considerations will far outweigh the cheap dereferencing.
I found the results puzzling, so I investigated a little further. First I enhanced the example program by using chrono and adding one test which accesses the local variable (instead of memory on the heap) through a pointer. That made sure that any timing difference was caused not by the location of the object but by the access method.
Second I added a dummy member to the struct because I noticed that the direct member destination used an offset to the stack pointer which I suspected could be the culprit; the pointer version accessed the memory through a register without offset. The dummy leveled the field there. It didn't make a difference though.
Access through a pointer was significantly faster for both the heap and the local object. Here's the source:
#include<chrono>
#include<iostream>
using namespace std;
using namespace std::chrono;
struct MyStruct { /* offset for i */ int dummy; int i; };
int main()
{
MyStruct *heapPtr = new MyStruct;
MyStruct localObj;
MyStruct *localPtr = &localObj;
///////////// ptr to heap /////////////////////
auto t1 = high_resolution_clock::now();
for (int i = 0; i < 100000000; ++i)
{
heapPtr->i = i;
}
auto t2 = high_resolution_clock::now();
cout << "heap ptr: "
<< duration_cast<milliseconds>(t2-t1).count()
<< " ms" << endl;
////////////////// local obj ///////////////////////
t1 = high_resolution_clock::now();
for (int i = 0; i < 100000000; ++i)
{
localObj.i = i;
}
t2 = high_resolution_clock::now();
cout << "local: "
<< duration_cast<milliseconds>(t2-t1).count()
<< " ms" << endl;
////////////// ptr to local /////////////////
t1 = high_resolution_clock::now();
for (int i = 0; i < 100000000; ++i)
{
localPtr->i = i;
}
t2 = high_resolution_clock::now();
cout << "ptr to local: "
<< duration_cast<milliseconds>(t2-t1).count()
<< " ms" << endl;
/////////// have a side effect ///////////////
return heapPtr->i + localObj.i;
}
Here is a typical run. Differences between heap and local ptr are random in both directions.
heap ptr: 217 ms
local: 236 ms
ptr to local: 206 ms
Here is the disassembly of the pointer access and the direct access. I assume that heapPtr's stack offset is 0x38, so the first mov moves its contents, i.e. the address of the heap object it points to, into %rax. This serves as the destination address in the third mov (with a 4-byte offset due to the preceding dummy member).
The second mov gets i's value (i is apparently at stack offset 0x4c, which lines up if you count all the intervening definitions) into %edx (the last mov can have at most one memory operand, which is the object, so the value of i must go through a register).
The last mov writes i's value, now in register %edx, to the object's address, now in %rax, plus an offset of 4 because of the dummy.
heapPtr->i = i;
3e: 48 8b 45 38 mov 0x38(%rbp),%rax
42: 8b 55 4c mov 0x4c(%rbp),%edx
45: 89 50 04 mov %edx,0x4(%rax)
As was to be expected, the direct access is shorter. The variable's value (a different local i, this time at stack offset 0x48) is loaded into register %eax, which is then written to the address at stack offset -0x60 (I don't know why some local objects are stored at positive offsets and others at negative ones). The bottom line is that this is one instruction shorter than the pointer access; basically, the first instruction of the pointer access, which loads the pointer's value into an address register, is missing. That is exactly what we would expect: that's the dereferencing. Nonetheless, the direct access takes more time. I have no idea why. Since I excluded most possibilities, I must assume that either using %rbp is slower than using %rax (unlikely) or that a negative offset slows the access down. Is that so?
localObj.i = i;
d6: 8b 45 48 mov 0x48(%rbp),%eax
d9: 89 45 a0 mov %eax,-0x60(%rbp)
It should be noted that gcc moves the assignment out of the loop when optimization is turned on, so this is in a way a phantom problem for people concerned about performance. Additionally, these small differences will be drowned out by anything "real" happening in the loops. But it is still unexpected.

C and C++: Array element access pointer vs int

Is there a performance difference between doing myarray[i] and storing the address of myarray[i] in a pointer?
Edit: The pointers are all calculated during an unimportant step of my program where performance is not a criterion. During the critical parts the pointers remain static and are not modified. The question is whether these static pointers are faster than using myarray[i] all the time.
For this code:
int main() {
int a[100], b[100];
int * p = b;
for ( unsigned int i = 0; i < 100; i++ ) {
a[i] = i;
*p++ = i;
}
return a[1] + b[2];
}
when built with -O3 optimisation in g++, the statement:
a[i] = i;
produced the assembly output:
mov %eax,(%ecx,%eax,4)
and this statement:
*p++ = i;
produced:
mov %eax,(%edx,%eax,4)
So in this case there was no difference between the two. However, this is not and cannot be a general rule - the optimiser might well generate completely different code for even a slightly different input.
It will probably make no difference at all. The compiler will usually be smart enough to know when you are using an expression more than once and create a temporary itself, if appropriate.
Compilers can do surprising optimizations; the only way to know is to read the generated assembly code.
With GCC, use -S, with -masm=intel for Intel syntax.
With VC++, use /FA (IIRC).
You should also enable optimizations: -O2 or -O3 with GCC, and /O2 with VC++.
I prefer using myarray[i], since it is clearer and the compiler has an easier time compiling it to optimized code.
When using pointers, it is harder for the compiler to optimize the code, since it's harder to know exactly what you're doing with the pointer.
There should not be much difference, but by using indexing you avoid all sorts of pitfalls that the compiler's optimizer is prone to (aliasing being the most important one), so I'd say the indexing case should be easier for the compiler to handle. This doesn't mean you shouldn't take care of the aforementioned things before the loop, but pointers in a loop generally just add to the complexity.
Yes. With a pointer, the address isn't recalculated from the initial address of the array; it is accessed directly. So you get a small performance improvement if you save the address in a pointer.
But the compiler will usually optimize the code and use a pointer in both cases (for static arrays).
For dynamic arrays (created with new), a pointer can offer more performance, as the compiler cannot fully optimize array accesses at compile time.
There will be no substantial difference. Premature optimization is the root of all evil - get a profiler before checking micro-optimizations like this. Also, the myarray[i] is more portable to custom types, such as a std::vector.
Okay, so your question is: what's faster?
int main(int argc, char **argv)
{
int array[20];
array[0] = 0;
array[1] = 1;
int *value_1 = &array[1];
printf("%d", *value_1);
printf("%d", array[1]);
printf("%d", *(array + 1));
}
Like someone else already pointed out, compilers can do clever optimizations. Of course this depends on where an expression is used, but normally you shouldn't care about these subtle differences. Any assumption you make can be proven wrong by the compiler. Today you shouldn't need to care about such differences.
For example, the above code produces the following (snippet only):
mov [ebp+var_54], 1 #store 1
lea eax, [ebp+var_58] # load the address of array[0]
add eax, 4 # add 4 (size of int)
mov [ebp+var_5C], eax
mov eax, [ebp+var_5C]
mov eax, [eax]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000 # points to %d
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
Short answer: the only way to know for sure is to code up both versions and compare performance. I would personally be surprised if there were a measurable difference unless you were doing a lot of array accesses in a really tight loop. If this is something that happens once or twice over the lifetime of the program, or depends on user input, it's not worth worrying about.
Remember that the expression a[i] is evaluated as *(a+i), which is an addition plus a dereference, whereas *p is just a dereference. Depending on how the code is structured, though, it may not make a difference. Assume the following:
int a[N]; // for any arbitrary N > 1
int *p = a;
size_t i;
for (i = 0; i < N; i++)
printf("a[%d] = %d\n", i, a[i]);
for (i = 0; i < N; i++)
printf("*(%p) = %d\n", (void*) p, *p++);
Now we're comparing a[i] to *p++, which is a dereference plus a postincrement (in addition to the i++ in the loop control); that may turn out to be a more expensive operation than the array subscript. Not to mention we've introduced another variable that's not strictly necessary; we're trading a little space for what may or may not be an improvement in speed. It really depends on the compiler, the structure of the code, optimization settings, OS, and CPU.
Worry about correctness first, then worry about readability/maintainability, then worry about safety/reliability, then worry about performance. Unless you're failing to meet a hard performance requirement, focus on making your intent clear and easy to understand. It doesn't matter how fast your code is if it gives you the wrong answer or performs the wrong action, or if it crashes horribly at the first hint of bad input, or if you can't fix bugs or add new features without breaking something.
Yes: storing a pointer to myarray[i] will perform better (if used at a large scale...).
Why?
It saves you an addition and possibly a multiplication (or a shift).
Many compilers may optimize that for you in the case of static memory allocation.
If you are using dynamic memory allocation, the compiler cannot optimize it, because the address is only known at runtime!