Member versus global array access performance - c++

Consider the following situation:
class MyFoo {
public:
    MyFoo();
    ~MyFoo();
    void doSomething(void);
private:
    unsigned short things[10];
};

class MyBar {
public:
    MyBar(unsigned short* globalThings);
    ~MyBar();
    void doSomething(void);
private:
    unsigned short* things;
};

MyFoo::MyFoo() {
    int i;
    for (i = 0; i < 10; i++) this->things[i] = i;
}

MyBar::MyBar(unsigned short* globalThings) {
    this->things = globalThings;
}

void MyFoo::doSomething() {
    int i, j;
    j = 0;
    for (i = 0; i < 10; i++) j += this->things[i];
}

void MyBar::doSomething() {
    int i, j;
    j = 0;
    for (i = 0; i < 10; i++) j += this->things[i];
}

int main(int argc, char* argv[]) {
    unsigned short gt[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    MyFoo* mf = new MyFoo();
    MyBar* mb = new MyBar(gt);
    mf->doSomething();
    mb->doSomething();
}
Is there an a priori reason to believe that mf->doSomething() will run faster than mb->doSomething()? Does that change if the executable is 100MB?

Because anything can modify your gt array, there may be some optimizations performed on MyFoo that are unavailable to MyBar (though, in this particular example, I don't see any)
Since gt lives on the stack in main, while things lives in the heap (along with mf, and the other parts of mb), there may be some memory access and caching issues dealing with things. But if you'd created mf locally (MyFoo mf = MyFoo()), then that would not be an issue (i.e. things and gt would be on an equal footing in that regard)
The size of the executable shouldn't make any difference. The size of the data might, but for the most part, after the first access, both arrays will be in the CPU cache and there should be no difference.

There's little reason to believe one will be noticeably faster than the other. If gt (for example) was large enough for it to matter, you might get slightly better performance from:
int j = std::accumulate(gt, gt+10, 0);
With only 10 elements, however, a measurable difference seems quite unlikely.
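For reference, a complete, compilable version of that suggestion might look like the sketch below (the <numeric> header supplies std::accumulate; the sanity check at the end is purely illustrative):
#include <numeric>   // for std::accumulate

int main() {
    unsigned short gt[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    // Sums the ten elements into an int; equivalent to the hand-written loop.
    int j = std::accumulate(gt, gt + 10, 0);
    return j == 55 ? 0 : 1;   // sanity check
}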

MyFoo::doSomething can be expected to be marginally faster than MyBar::doSomething.
This is because when things is stored locally in an array, we just need to dereference this to get to things and we can access the array immediately. When things is stored externally, we first need to dereference this and then we need to dereference things before we can access the array. So we have two load instructions.
I have compiled your source into assembler (using -O0) and the loop for MyFoo::doSomething looks like this:
jmp .L14
.L15:
movl -4(%ebp), %edx
movl 8(%ebp), %eax //Load this into %eax
movzwl (%eax,%edx,2), %eax //Load this->things[i] into %eax
movzwl %ax, %eax
addl %eax, -8(%ebp)
addl $1, -4(%ebp)
.L14:
cmpl $9, -4(%ebp)
setle %al
testb %al, %al
jne .L15
Now for MyBar::doSomething we have:
jmp .L18
.L19:
movl 8(%ebp), %eax //Load this
movl (%eax), %eax //Load this->things
movl -4(%ebp), %edx
addl %edx, %edx
addl %edx, %eax
movzwl (%eax), %eax //Load this->things[i]
movzwl %ax, %eax
addl %eax, -8(%ebp)
addl $1, -4(%ebp)
.L18:
cmpl $9, -4(%ebp)
setle %al
testb %al, %al
jne .L19
As can be seen from the above, there is the double load. The problem may be compounded if this and this->things have a large difference in address: they may then live in different cache lines (or even pages), and the CPU may have to do two pulls from main memory before this->things[i] can be accessed. When the array is part of the same object, fetching this brings this->things into the cache at the same time.
Caveat: the optimizer may be able to provide some shortcuts that I have not thought of, though.
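One such shortcut is hoisting the loop-invariant load of this->things out of the loop, which an optimizing build will usually do on its own since nothing in the loop can change the member pointer. Written by hand, the idea looks roughly like this (an illustration of the transformation, not actual compiler output; the local variable name is arbitrary):
void MyBar::doSomething() {
    // The member pointer cannot change inside the loop, so it can be
    // loaded once up front instead of being re-read every iteration.
    unsigned short* local = this->things;
    int j = 0;
    for (int i = 0; i < 10; i++)
        j += local[i];
}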

Most likely the extra dereference (of MyBar, which has to fetch the value of the member pointer) is meaningless performance-wise, especially if the data array is very large.

It could be somewhat slower. The question is simply how often you access. What you should consider is that your machine has a fixed cache. When MyFoo is loaded in to have DoSomething called on it, the processor can just load the whole array into cache and read it. However, in MyBar, the processor first must load the pointer, then load the address it points to. Of course, in your example main, they're all probably in the same cache line or close enough anyway, and for a larger array, the number of loads won't increase substantially with that one extra dereference.
However, in general, this effect is far from ignorable. When you consider dereferencing a pointer, that cost is pretty much zero compared to actually loading the memory it points to. If the pointer points to already-loaded memory, then the difference is negligible. If it doesn't, you have a cache miss, which is very bad and expensive. In addition, the pointer introduces aliasing issues, which basically means the compiler can apply far less aggressive optimizations around it.
Allocate within-object whenever possible.
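As a concrete sketch of that advice (the Widget class and its members are made up purely for illustration), keeping the data inside the object avoids the extra indirection entirely:
#include <array>

class Widget {
public:
    int sum() const {
        int total = 0;
        for (unsigned short v : values)   // data sits right inside the object
            total += v;
        return total;
    }
private:
    std::array<unsigned short, 10> values{};  // allocated within the object, no extra pointer
};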

Related

What does the compiler do in assembly when optimizing code? ie -O2 flag

So when you add an optimization flag when compiling your C++, it runs faster, but how does this work? Could someone explain what really goes on in the assembly?
It means you're making the compiler do extra work / analysis at compile time, so you can reap the rewards of a few extra precious cpu cycles at runtime. Might be best to explain with an example.
Consider a loop like this:
const int n = 5;
for (int i = 0; i < n; ++i)
cout << "bleh" << endl;
If you compile this without optimizations, the compiler will not do any extra work for you: the assembly generated for this code snippet will likely be a literal translation into compare and jump instructions (which isn't the fastest, just the most straightforward).
However, if you compile WITH optimizations, the compiler can easily unroll this loop since it knows the upper bound can't ever change because n is const (i.e. it can copy the repeated code 5 times directly instead of comparing / checking for the terminating loop condition).
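Conceptually, the optimized output behaves as if the source had been written like the following (a hand-written sketch of the unrolled form, not the literal assembly the compiler emits):
#include <iostream>
using namespace std;

int main() {
    // What the optimizer is effectively allowed to produce for the loop above:
    cout << "bleh" << endl;
    cout << "bleh" << endl;
    cout << "bleh" << endl;
    cout << "bleh" << endl;
    cout << "bleh" << endl;
    return 0;
}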
Here's another example with an optimized function call. Below is my whole program:
#include <stdio.h>
static int foo(int a, int b) {
return a * b;
}
int main(int argc, char** argv) {
fprintf(stderr, "%d\n", foo(10, 15));
return 0;
}
If I compile this code without optimizations using gcc foo.c on my x86 machine, my assembly looks like this:
movq %rsi, %rax
movl %edi, -4(%rbp)
movq %rax, -16(%rbp)
movl $10, %eax ; these are my parameters to
movl $15, %ecx ; the foo function
movl %eax, %edi
movl %ecx, %esi
callq _foo
; .. about 20 other instructions ..
callq _fprintf
Here, it's not optimizing anything. It's loading the registers with my constant values and calling my foo function. But look what happens if I recompile with the -O2 flag:
movq (%rax), %rdi
leaq L_.str(%rip), %rsi
movl $150, %edx
xorb %al, %al
callq _fprintf
The compiler is so smart that it doesn't even call foo anymore. It just inlines its return value (note the movl $150, %edx above: 10 * 15 computed at compile time).
Most of the optimization happens in the compiler's intermediate representation before the assembly is generated. You should definitely check out Agner Fog's Software optimization resources. Chapter 8 of the 1st manual describes optimizations performed by the compiler with examples.

Atomic 64 bit writes with GCC

I've gotten myself into a confused mess regarding multithreaded programming and was hoping someone could come and slap some understanding in me.
After doing quite a bit of reading, I've come to the understanding that I should be able to set the value of a 64 bit int atomically on a 64 bit system [1].
I found a lot of this reading difficult though, so I thought I would try to make a test to verify it. So I wrote a simple program with one thread which would set a variable to one of two values:
bool switcher = false;
while(true)
{
if (switcher)
foo = a;
else
foo = b;
switcher = !switcher;
}
And another thread which would check the value of foo:
while (true)
{
__uint64_t blah = foo;
if ((blah != a) && (blah != b))
{
cout << "Not atomic! " << blah << endl;
}
}
I set a = 1844674407370955161; and b = 1144644202170355111;. I run this program and get no output warning me that blah is not a or b.
Great, looks like it probably is an atomic write... but then I changed the first thread to assign the constants directly, like so:
bool switcher = false;
while(true)
{
if (switcher)
foo = 1844674407370955161;
else
foo = 1144644202170355111;
switcher = !switcher;
}
I re-run, and suddenly:
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
What's changed? Either way I'm assigning a large number to foo - does the compiler handle a constant number differently, or have I misunderstood everything?
Thanks!
1: Intel CPU documentation, section 8.1, Guaranteed Atomic Operations
2: GCC Development list discussing that GCC doesn't guarantee it in the documentation, but the kernel and other programs rely on it
Disassembling the loop, I get the following code with gcc:
.globl _switcher
_switcher:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $0, -4(%rbp)
L2:
cmpl $0, -4(%rbp)
je L3
movq _foo@GOTPCREL(%rip), %rax
movl $-1717986919, (%rax)
movl $429496729, 4(%rax)
jmp L5
L3:
movq _foo@GOTPCREL(%rip), %rax
movl $1486032295, (%rax)
movl $266508246, 4(%rax)
L5:
cmpl $0, -4(%rbp)
sete %al
movzbl %al, %eax
movl %eax, -4(%rbp)
jmp L2
LFE2:
So it would appear that gcc uses the 32-bit movl instruction with 32-bit immediate values. There is an instruction (movq) that can move a 64-bit register to memory (or memory to a 64-bit register), but it cannot move a full 64-bit immediate value to a memory address, so the compiler is forced either to load the value into a temporary register and then store that register to memory, or to use two movl instructions. You can try to force it to use a register by using a temporary variable, but this may not work.
References:
mov
movq
http://www.x86-64.org/documentation/assembly.html
immediate values inside instructions remain 32 bits.
There is no way for the compiler to do the assignment of a 64 bit constant atomically, except by first filling a register and then moving that register to the variable. That is probably more costly than assigning directly to the variable, and since atomicity is not required by the language, the atomic solution is not chosen.
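If atomicity is actually required, the portable route (assuming a C++11-capable compiler is available, which the question does not state) is to request it explicitly instead of relying on the code generator's instruction selection. A minimal sketch; the writer/reader split mirrors the two threads in the question, and the function names are just illustrative:
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> foo{0};

void writer(bool switcher) {
    // The store is guaranteed to be atomic, whether the value is a
    // compile-time constant or a variable.
    foo.store(switcher ? 1844674407370955161ULL : 1144644202170355111ULL,
              std::memory_order_relaxed);
}

std::uint64_t reader() {
    return foo.load(std::memory_order_relaxed);   // atomic read
}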
The Intel CPU documentation is right: aligned 8-byte reads/writes are always atomic on recent hardware (even under 32 bit operating systems).
What you don't tell us is whether you are running 64 bit hardware with a 32 bit operating system. If so, the 8 byte write will most likely be split into two 4 byte writes by the compiler.
Just have a look at the relevant section in the object code.

gcc stack optimization

Hi I have a question on possible stack optimization by gcc (or g++)..
Sample code under FreeBSD (does UNIX variance matter here?):
void main() {
char bing[100];
..
string buffer = ....;
..
}
What I found in gdb for a coredump of this program is that the address of bing is actually lower than that of buffer (namely, &bing[0] < &buffer). I think this is totally contrary to what the textbook says. Could there be some compiler optimization that re-organizes the stack layout in such a way?
This seems to be the only possible explanation, but I'm not sure.
In case you're interested, the coredump is due to bing overflowing into buffer (but that also confirms &bing[0] < &buffer).
Thanks!
Compilers are free to organise stack frames (assuming they even use stacks) any way they wish.
They may do it for alignment reasons, or for performance reasons, or for no reason at all. You would be unwise to assume any specific order.
If you hadn't invoked undefined behavior by overflowing the buffer, you probably never would have known, and that's the way it should be.
A compiler can not only re-organise your variables, it can optimise them out of existence if it can establish they're not used. With the code:
#include <stdio.h>
int main (void) {
char bing[71];
int x = 7;
bing[0] = 11;
return 0;
}
Compare the normal assembler output:
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $80, %esp
movl %gs:20, %eax
movl %eax, 76(%esp)
xorl %eax, %eax
movl $7, (%esp)
movb $11, 5(%esp)
movl $0, %eax
movl 76(%esp), %edx
xorl %gs:20, %edx
je .L3
call __stack_chk_fail
.L3:
leave
ret
with the insanely optimised:
main:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
popl %ebp
ret
Notice anything missing from the latter? Yes, there are no stack manipulations to create space for either bing or x. They don't exist. In fact, the entire code sequence boils down to:
set return code to 0.
return.
A compiler is free to layout local variables on the stack (or keep them in register or do something else with them) however it sees fit: the C and C++ language standards don't say anything about these implementation details, and neither does POSIX or UNIX. I doubt that your textbook told you otherwise, and if it did, I would look for a new textbook.
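If you are curious what your particular compiler actually did, you can simply print the addresses of the locals. A quick sketch that mirrors the question's bing and buffer (the ordering you observe is an implementation detail and may change with flags or compiler versions):
#include <cstdio>
#include <string>

int main() {
    char bing[100];
    std::string buffer = "hello";
    // Whichever address is lower is purely the compiler's choice.
    std::printf("bing   at %p\n", static_cast<void*>(bing));
    std::printf("buffer at %p\n", static_cast<void*>(&buffer));
    return 0;
}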

How do I declare an array created using malloc to be volatile in c++

I presume that the following will give me 10 volatile ints
volatile int foo[10];
However, I don't think the following will do the same thing.
volatile int* foo;
foo = malloc(sizeof(int)*10);
Please correct me if I am wrong about this and how I can have a volatile array of items using malloc.
Thanks.
int volatile * foo;
read from right to left "foo is a pointer to a volatile int"
so whatever int you access through foo, the int will be volatile.
P.S.
int * volatile foo; // "foo is a volatile pointer to an int"
!=
volatile int * foo; // "foo is a pointer to a volatile int"
In the first case the pointer foo itself is volatile; in the second it is the ints it points to that are. Writing the qualifier on the left like that is really just a leftover exception to the general right-to-left rule.
The lesson to be learned is to get in the habit of using
char const * foo;
instead of the more common
const char * foo;
if you want more complicated things like "pointer to function returning pointer to int" to make any sense.
P.S., and this is a biggy (and the main reason I'm adding an answer):
I note that you included "multithreading" as a tag. Do you realize that volatile does little/nothing of good with respect to multithreading?
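Putting that together for the original question, a minimal sketch of the malloc version might look like this (in C++ the result of malloc must be cast; new[] would serve just as well):
#include <cstdlib>

int main() {
    // malloc returns void*, which in C++ must be cast to the target type;
    // the volatile qualifier lives on the pointed-to ints, not on foo itself.
    int volatile * foo = static_cast<int volatile *>(std::malloc(sizeof(int) * 10));
    foo[0] = 42;               // volatile write
    int x = foo[0];            // volatile read
    (void)x;
    std::free(const_cast<int *>(foo));   // cast away volatile so it converts to void*
    return 0;
}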
volatile int* foo;
is the way to go. The volatile type qualifier works just like the const type qualifier. If you wanted a pointer to constant integers you would write:
const int* foo;
whereas
int* const foo;
is a constant pointer to an integer: the integer it points to can be changed, but the pointer itself cannot. volatile works the same way.
Yes, that will work. There is nothing different about the actual memory that is volatile. It is just a way to tell the compiler how to interact with that memory.
I think the second declares the pointer to be volatile, not what it points to. To get that, I think it should be
int * volatile foo;
This syntax is acceptable to gcc, but I'm having trouble convincing myself that it does anything different.
I found a difference with gcc -O3 (full optimization). For this (silly) test code:
volatile int v [10];
int * volatile p;
int main (void)
{
v [3] = p [2];
p [3] = v [2];
return 0;
}
With volatile, and omitting (x86) instructions which don't change:
movl p, %eax
movl 8(%eax), %eax
movl %eax, v+12
movl p, %edx
movl v+8, %eax
movl %eax, 12(%edx)
Without volatile, it skips reloading p:
movl p, %eax
movl 8(%eax), %edx ; different since p being preserved
movl %edx, v+12
; 'p' not reloaded here
movl v+8, %edx
movl %edx, 12(%eax) ; p reused
After many more science experiments trying to find a difference, I conclude there is no difference. volatile turns off all optimizations related to the variable which would reuse a subsequently set value. At least with x86 gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33). :-)
Thanks very much to wallyk; using his method I was able to devise some code and generate assembly that shows the difference between the different pointer declarations.
Using the code below and compiling with -O3:
int main (void)
{
while(p[2]);
return 0;
}
When p is simply declared as a plain pointer, we get stuck in a loop that is impossible to get out of. Note that even if this were a multithreaded program and a different thread wrote p[2] = 0, the program would never break out of the while loop, because the load has been optimized away.
int * p;
============
LCFI1:
movq _p(%rip), %rax
movl 8(%rax), %eax
testl %eax, %eax
jne L6
xorl %eax, %eax
leave
ret
L6:
jmp L6
notice that the only instruction for L6 is to goto L6.
==
when p is a volatile pointer
int * volatile p;
==============
L3:
movq _p(%rip), %rax
movl 8(%rax), %eax
testl %eax, %eax
jne L3
xorl %eax, %eax
leave
ret
here, the pointer p gets reloaded each loop iteration and as a consequence the array item also gets reloaded. However, this would not be correct if we wanted an array of volatile integers, since the following would then be possible:
int* volatile p;
..
..
int* j;
j = &p[2];
while(*j);
and would result in the loop that would be impossible to terminate in a multithreaded program.
==
Finally, this is the correct solution, as tony nicely explained.
int volatile * p;
LCFI1:
movq _p(%rip), %rdx
addq $8, %rdx
.align 4,0x90
L3:
movl (%rdx), %eax
testl %eax, %eax
jne L3
leave
ret
In this case the address of p[2] is kept in a register and not reloaded from memory, but the value of p[2] is reloaded from memory on every loop cycle.
also note that
int volatile * p;
..
..
int* j;
j = &p[2];
while(*j);
will generate a compile error, because assigning &p[2] (an int volatile *) to a plain int * discards the volatile qualifier.
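For completeness, keeping the qualifier on j makes that last fragment compile and preserves the per-iteration reload. A small sketch (the function name is illustrative, and it assumes, as in the examples above, that some other writer eventually stores 0 into p[2]):
int volatile * p;               // file-scope pointer, as in the examples above

void wait_for_clear() {
    int volatile * j = &p[2];   // qualifier kept, so this compiles
    while (*j)                  // *j is re-read from memory on every pass
        ;                       // spins until another writer stores 0 into p[2]
}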

Does the evil cast get trumped by the evil compiler?

This is not academic code or a hypothetical question. The original problem was converting code from HP11 to HP1123 Itanium. Basically it boils down to a compile error on HP1123 Itanium. It has me really scratching my head when reproducing it on Windows for study. I have stripped all but the most basic aspects... You may have to press control D to exit a console window if you run it as is:
#include "stdafx.h"
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
char blah[6];
const int IAMCONST = 3;
int *pTOCONST;
pTOCONST = (int *) &IAMCONST;
(*pTOCONST) = 7;
printf("IAMCONST %d \n",IAMCONST);
printf("WHATISPOINTEDAT %d \n",(*pTOCONST));
printf("Address of IAMCONST %x pTOCONST %x\n",&IAMCONST, (pTOCONST));
cin >> blah;
return 0;
}
Here is the output
IAMCONST 3
WHATISPOINTEDAT 7
Address of IAMCONST 35f9f0 pTOCONST 35f9f0
All I can say is what the heck? Is it undefined to do this? It is the most counterintuitive thing I have seen for such a simple example.
Update:
Indeed, after searching for a while, the menu Debug >> Windows >> Disassembly showed exactly the optimization that was described below.
printf("IAMCONST %d \n",IAMCONST);
0024360E mov esi,esp
00243610 push 3
00243612 push offset string "IAMCONST %d \n" (2458D0h)
00243617 call dword ptr [__imp__printf (248338h)]
0024361D add esp,8
00243620 cmp esi,esp
00243622 call #ILT+325(__RTC_CheckEsp) (24114Ah)
Thank you all!
Looks like the compiler is optimizing
printf("IAMCONST %d \n",IAMCONST);
into
printf("IAMCONST %d \n",3);
since you said that IAMCONST is a const int.
But since you're taking the address of IAMCONST, it has to actually be located on the stack somewhere, and the constness can't be enforced, so the memory at that location (*pTOCONST) is mutable after all.
In short: you cast away the constness; don't do that. Poor, defenseless C...
Addendum
Using GCC for x86, with -O0 (no optimizations), the generated assembly
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $36, %esp
movl $3, -12(%ebp)
leal -12(%ebp), %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
movl $7, (%eax)
movl -12(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
call printf
movl -8(%ebp), %eax
movl (%eax), %eax
movl %eax, 4(%esp)
movl $.LC1, (%esp)
call printf
copies from *(bp-12) on the stack to printf's arguments. However, using -O1 (as well as -Os, -O2, -O3, and other optimization levels),
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $20, %esp
movl $3, 4(%esp)
movl $.LC0, (%esp)
call printf
movl $7, 4(%esp)
movl $.LC1, (%esp)
call printf
you can clearly see that the constant 3 is used instead.
If you are using Visual Studio's CL.EXE, /Od disables optimization. This varies from compiler to compiler.
Be warned that the C specification allows the C compiler to assume that the target of any int * pointer never overlaps the memory location of a const int, so you really shouldn't be doing this at all if you want predictable behavior.
The constant value IAMCONST is being inlined into the printf call.
What you're doing is at best wrong and in all likelihood is undefined by the C++ standard. My guess is that the C++ standard leaves the compiler free to inline a const primitive which is local to a function declaration. The reason being that the value should not be able to change.
Then again, it's C++ where should and can are very different words.
You are lucky the compiler is doing the optimization. An alternative treatment would be to place the const integer into read-only memory, whereupon trying to modify the value would cause a core dump.
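For instance, a const object at namespace scope typically does land in a read-only section, and the same cast then tends to crash at runtime instead of silently printing 7 (the behavior is undefined either way; the crash is merely what many platforms happen to do, and GLOBALCONST is an illustrative name):
#include <cstdio>

const int GLOBALCONST = 3;      // commonly placed in a read-only data section

int main() {
    int *p = (int *) &GLOBALCONST;
    *p = 7;                     // undefined behavior; often a segmentation fault here
    std::printf("%d\n", GLOBALCONST);
    return 0;
}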
Writing to a const object through a cast that removes the const is undefined behavior - so at the point where you do this:
(*pTOCONST) = 7;
all bets are off.
From the C++ standard 7.1.5.1 (The cv-qualifiers):
Except that any class member declared mutable (7.1.1) can be modified, any attempt to modify a const
object during its lifetime (3.8) results in undefined behavior.
Because of this, the compiler is free to assume that the value of IAMCONST will not change, so it can optimize away the access to the actual storage. In fact, if the address of the const object is never taken, the compiler may eliminate the storage for the object altogether.
Also note that (again in 7.1.5.1):
A variable of non-volatile const-qualified integral or enumeration type initialized by an integral constant expression can be used in integral constant expressions (5.19).
Which means IAMCONST can be used in compile-time constant expressions (i.e., to provide a value for an enumeration or the size of an array). What would it even mean to change that at runtime?
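For example, a sketch of what the quoted wording permits:
const int IAMCONST = 3;

int table[IAMCONST];              // array bound: must be a compile-time constant
enum { TWICE = IAMCONST * 2 };    // enumerator value: also a constant expression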
It doesn't matter if the compiler optimizes or not. You asked for trouble and you're lucky you got the trouble yourself instead of waiting for customers to report it to you.
"All I can say is what the heck? Is it undefined to do this? It is the most counterintuitive thing I have seen for such a simple example."
If you really believe that then you need to switch to a language you can understand, or change professions. For the sake of yourself and your customers, stop using C or C++ or C#.
const int IAMCONST = 3;
You said it.
int *pTOCONST;
pTOCONST = (int *) &IAMCONST;
Guess why the compiler complained if you omitted your evil cast. The compiler might have been telling the truth before you lied to it.
"Does the evil cast get trumped by the evil compiler?"
No. The evil cast gets trumped by itself. Whether or not your compiler tried to tell you the truth, the compiler was not evil.