With a modern compiler, is it as expensive to dereference a pointer a second time, when the data it points to was dereferenced recently?
int * ptr = new int();
... lots of stuff...
*ptr = 1; // may need to load the memory into the cpu
*ptr = 2; // accessed again, can I assume this will usually be loaded and cost nothing extra?
What if the pointer addresses a variable on the stack, can I assume reading/writing through a pointer to a stack variable costs the same as reading/writing directly to the variable?
int var;
int * ptr = &var;
*ptr = 0; // will this cost the same as if I just said var = 0; ?
And finally, does this extend to more complicated things, such as manipulating a base object on the stack through an interface?
Base baseObject;
Derived * derivedObject = &baseObject;
derivedObject->value = 42; // will this have the same cost as if I just--
derivedObject->doSomething(); // --manipulated baseObject directly?
Edit: I'm asking this to gain a deeper understanding; this is less a problem to be solved than it is a request for insight. Please don't worry about "premature-optimization" or other practical concerns, just give me all the rope you can :)
This question contains a number of ambiguities.
A simple rule of thumb is that dereferencing something will always have the same cost, except when it doesn't.
There are a number of factors in the cost of a dereference: whether the destination is in cache, whether it is paged out, and what code the compiler generated.
For the code snippet
Obj* p = new Obj;
// <elided> //
p->something = 1;
looking at this source code we can't tell whether the executable will still have p's value loaded, whether *p is in cache, or whether *p has even been accessed.
Obj* p = new Obj;
p->something = 1;
We still can't be sure whether *p is paged/cached, but most modern compilers/optimizers will not emit code that retrieves p and stores it and then fetches it again.
In practice on modern hardware, you really shouldn't be concerned with it, and if you are, start by looking at the assembly.
I'll use two ends of the spectrum:
struct Obj { int something; int other; };
Obj* f() {
    Obj* p = new Obj;
    p->something = 1;
    p->other = 2;
    return p;
}
extern void fn2(Obj**);
Obj* h() {
    Obj* p = new Obj;
    fn2(&p);
    p->something = 1;
    fn2(&p);
    p->other = 2;
    return p;
}
This produces
f():
subq $8, %rsp
movl $8, %edi
call operator new(unsigned long)
movl $1, (%rax)
movl $2, 4(%rax)
addq $8, %rsp
ret
and
h():
subq $24, %rsp
movl $8, %edi
call operator new(unsigned long)
leaq 8(%rsp), %rdi
movq %rax, 8(%rsp)
call fn2(Obj**)
movq 8(%rsp), %rax
leaq 8(%rsp), %rdi
movl $1, (%rax)
call fn2(Obj**)
movq 8(%rsp), %rax
movl $2, 4(%rax)
addq $24, %rsp
ret
Here the compiler has to preserve and restore the pointer to dereference it after the call, but that's a bit unfair because the pointer could be modified by the called function.
Obj* h() {
    Obj* p = new Obj;
    fn2(nullptr);
    p->something = 1;
    fn2(nullptr);
    p->other = 2;
    return p;
}
produces
h():
pushq %rbx
movl $8, %edi
call operator new(unsigned long)
xorl %edi, %edi
movq %rax, %rbx
call fn2(Obj**)
xorl %edi, %edi
movl $1, (%rbx)
call fn2(Obj**)
movq %rbx, %rax
movl $2, 4(%rbx)
popq %rbx
ret
we're still seeing some register shenanigans, but it's hardly expensive.
As for your questions about pointers to the stack, a good optimizer will be able to eliminate those, but again you have to consult the assembly generated by your chosen compiler for your particular platform.
struct Obj { int something; int other; };
void fn(Obj*);
void f()
{
    Obj o;
    Obj* p = &o;
    p->something = 1;
    p->other = 1;
    fn(p);
}
produces the following where p has basically been eliminated.
f():
subq $24, %rsp
movq %rsp, %rdi
movl $1, (%rsp)
movl $1, 4(%rsp)
call fn(Obj*)
addq $24, %rsp
ret
Of course, if we passed &p to something, the compiler wouldn't be able to elide it entirely, but it still might be smart enough to avoid using it when it didn't absolutely have to.
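For illustration, here's a minimal sketch of that case (fn3 is a hypothetical function, not from the code above, that receives the pointer's address):

struct Obj { int something; int other; };
extern void fn3(Obj**);   // hypothetical callee; only the declaration matters here

void g() {
    Obj o;
    Obj* p = &o;
    fn3(&p);           // &p escapes, so p must live in memory across the call
    p->something = 1;  // the compiler has to reload p here...
    p->other = 2;      // ...but it can reuse that reloaded value for this store
}

Whether the reload actually appears depends on the compiler and target, so again: check the assembly.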
Trust the compiler.
You can be pretty sure the compiler will generate code that does the least amount of work possible, taking into account the peculiarities of the CPU architecture and anything else it can see.
With a modern compiler, is it as expensive to dereference a pointer a
second time, when the data it points to was dereferenced recently?
With full optimization, the compiler might be able to rearrange the code (depending on the code), or perhaps stash the pointer into a register, or perhaps stash the value in a register ... so maybe.
One might look at the generated assy code to confirm.
I consider this premature optimization.
Also, if you are thinking of cache: possibly (but not guaranteed), when the accesses are close together in time and both memory addresses fall in the same cache block, the two dereferences through the pointer will each hit the cache without a miss.
Any writes will be placed in cache and delivered to memory when the cache hardware gets around to it, or when a cache miss forces a flush to memory.
What if the pointer addresses a variable on the stack, can I assume
reading/writing through a pointer to a stack variable costs the same
as reading/writing directly to the variable?
I doubt that you could or should assume anything. You might inspect the generated assembly to see what your compiler did with this code on this target architecture, with this compiler version, these build options, etc. There are hundreds if not thousands of variables that might affect the code generation.
Note that data cache works for stack accesses, also.
Again, I consider this premature optimization.
And finally, does this extend to more complicated things, such as
manipulating a base object on the stack through an interface?
In general, the compiler does a good job. So, in that sense, possibly. Not guaranteed.
I think using move semantics (a C++ feature) is valuable, but that may not relate to your questions here.
Hardware cache is probably more important than any amount of cycle counts you may wish to manually count (or simulate). I was impressed how much data cache (for automatic variables and dynamic variables) improved the performance on an embedded system. But the code cache was impressive too. I would not want to do without either of them.
By premature optimization I mean that
a) Humans are notoriously unable to understand (or 'guess') where the hot spots are in their programs, i.e. the idea that 20% of the code consumes 80% of the cycles. That is why there are tools to help pinpoint them.
b) I have always heard that the better algorithms outperform other choices, and I would say that is usually true. The better algorithms are what you should be learning. From your SO rep, you probably know more than I do.
However, I feel readability is the more appropriate criterion for evaluation. The feedback from former colleagues I like to hear is, "...we still use your code," because that statement suggests that it works, has not given them too much trouble, is fast enough, and (most important) is readable.
c) counting cycles in any code can only be attempted with simulation. I've done it for an embedded military processor.
To simulate cache actions, you actually have to have usable code to evaluate, have to understand the interactions between processor and cache, and have to know cache block sizes of both data and instruction cache.
Choosing the faster code (as a criterion) is ... well, if the first version meets the requirements, your customer (boss, team lead, etc.) will probably be unwilling to wait for, or pay for, faster code.
I've been on one big project to fix 'a system' that was deemed too slow (predecessor choices, not mine) ... management chose the path that was safe to understand and to estimate costs for: re-design the processor card with about 10x more processing cycles and 32x more RAM. The software team refactored the code and added the newest features as best we could. From overloaded, the new 'system' ran at 1/3 duty cycle, a big improvement that was visible in every command response time.
For examples 1 and 2, it depends on the context. If you are just doing stores and never using the results, the compiler will simply discard them. The last example is a compilation error: you cannot point a Derived* at a Base object.
Check for yourself:
https://godbolt.org/g/pMiOfj
If Derived actually extends Base (the case from the example in the comments): again, the answer depends on how much information the compiler has. If it can see that it is just an object on the stack, it will optimize, and both forms will be equivalent. https://godbolt.org/g/jIKQJv
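A minimal sketch of that corrected case (pointing a Base* at a Derived object on the stack, which is the valid direction; since the dynamic type is visible here, the compiler can usually devirtualize the call):

struct Base { int value = 0; virtual void doSomething() {} };
struct Derived : Base { void doSomething() override {} };

void f() {
    Derived derivedObject;
    Base* basePtr = &derivedObject;   // valid: Base* to a Derived object
    basePtr->value = 42;              // same cost as derivedObject.value = 42
    basePtr->doSomething();           // likely devirtualized: the dynamic type is known
}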
Note: You don't have to worry about such details; the compiler knows better how to optimize. Use whichever form is more readable. This is premature optimization.
Related
I'm not familiar with the implementation of C++ compilers and I'm writing a C++ snippet like this (for learning):
#include <vector>
void vector8_inc(std::vector<unsigned char>& v) {
    for (std::size_t i = 0; i < v.size(); i++) {
        v[i]++;
    }
}

void vector32_inc(std::vector<unsigned int>& v) {
    for (std::size_t i = 0; i < v.size(); i++) {
        v[i]++;
    }
}

int main() {
    std::vector<unsigned char> my{ 10, 11, 2, 3 };
    vector8_inc(my);
    std::vector<unsigned int> my2{ 10, 11, 2, 3 };
    vector32_inc(my2);
}
By generating its assembly code and disassembling its binary, I found the segments for vector8_inc and vector32_inc:
g++ -S test.cpp -o test.a -O2
_Z11vector8_incRSt6vectorIhSaIhEE:
.LFB853:
.cfi_startproc
endbr64
movq (%rdi), %rdx
cmpq 8(%rdi), %rdx
je .L1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
addb $1, (%rdx,%rax)
movq (%rdi), %rdx
addq $1, %rax
movq 8(%rdi), %rcx
subq %rdx, %rcx // line C: calculate size of the vector
cmpq %rcx, %rax
jb .L3
.L1:
ret
.cfi_endproc
......
_Z12vector32_incRSt6vectorIjSaIjEE:
.LFB854:
.cfi_startproc
endbr64
movq (%rdi), %rax // line A
movq 8(%rdi), %rdx
subq %rax, %rdx // line C': calculate size of the vector
movq %rdx, %rcx
shrq $2, %rcx // Line B: ← What is this used for?
je .L6
addq %rax, %rdx
.p2align 4,,10
.p2align 3
.L8:
addl $1, (%rax)
addq $4, %rax
cmpq %rax, %rdx
jne .L8
.L6:
ret
.cfi_endproc
......
But in fact, these two functions are "inlined" within main(), so the instructions shown above will never be executed, even though they are compiled and loaded into memory (I verified this by adding a breakpoint at line A and checking the memory):
main() contains a duplicate of the logic from the segments above.
So here are my questions:
What are these two segments (_Z12vector32_incRSt6vectorIjSaIjEE, _Z11vector8_incRSt6vectorIhSaIhEE) used for, and why are they emitted at all if they are never executed?
Does the instruction at Line B in _Z12vector32_incRSt6vectorIjSaIjEE (i.e. shrq $2, %rcx) mean 'shift the size of the vector right by 2 bits'? If so, what is it used for?
Why is size() computed on every iteration of the loop in vector8_inc, but computed once before the loop in vector32_inc? (See Line C and Line C', under g++'s -O2 optimization level.)
First, function names are a lot more readable if you demangle them, with a tool such as c++filt:
_Z12vector32_incRSt6vectorIjSaIjEE
-> vector32_inc(std::vector<unsigned int, std::allocator<unsigned int> >&)
_Z11vector8_incRSt6vectorIhSaIhEE
-> vector8_inc(std::vector<unsigned char, std::allocator<unsigned char> >&)
Q1
You defined those functions as global (external linkage). The compiler compiles one module at a time. It doesn't know whether this file is your entire program, or if you will link it with other object files later. In the latter case, some function from another module might call vector8_inc or vector32_inc, and so the compiler needs to create an out-of-line version to be available for such a call.
A traditional linker links code at the level of modules, and isn't able to selectively exclude functions from individual object files. It has to include your test.o module in the link, so that means the binary contains all the code from that module, including the out-of-line vector_inc functions. You know they can never be called, but the compiler didn't know that and the linker doesn't check.
Here are three ways you can avoid having them included in your binary:
Declare those functions as static in your source code. This ensures they cannot be called from other modules, so if the compiler can inline all calls from within this module, it need not emit the out-of-line versions (see the sketch after this list).
Compile with -fwhole-program which tells GCC to assume that functions cannot be called from anywhere it can't see.
Link using -flto to enable link time optimization. This provides more data to the linker, so that it can be smart and do things like omit individual functions that are never called (and much more). When I used -flto, again I no longer saw the out-of-line code in the binary.
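A minimal sketch of the first option, applied to the question's code (with internal linkage, once every call in this file is inlined, no out-of-line copy has to survive):

#include <vector>
#include <cstddef>

// static gives the function internal linkage: no other module can call it.
static void vector8_inc(std::vector<unsigned char>& v) {
    for (std::size_t i = 0; i < v.size(); i++) {
        v[i]++;
    }
}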
Q2
Shift right by 2 is division by 4. It looks like the std::vector class, in GCC's implementation, contains a start pointer (loaded into %rax) and a finish pointer (loaded into %rdx). Subtracting these pointers yields the number of bytes in the vector. For a vector of unsigned int, each element is 4 bytes in size, so dividing by 4 yields the number of elements.
This value is only used as a test of whether the vector is empty, by comparing it to zero. In this case the loop is skipped and the function does nothing. (shr will set the zero flag if the result is zero, and je jumps if the zero flag is set.) So the only way the shift matters, instead of just comparing the pointer difference to 0, is if the pointer difference were 1, 2 or 3. That should not happen, since pointers to unsigned int should be 4-byte aligned (unless the library is doing clever things like storing metadata in the low bits). Thus I think the shift is actually unnecessary, and this may be a minor missed optimization.
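In rough C++ terms, the prologue computes something like the following (a sketch; the member names are illustrative, not GCC's real internals):

#include <cstddef>

struct VecGuts {
    unsigned int* start;    // what gets loaded into %rax
    unsigned int* finish;   // what gets loaded into %rdx
};

std::size_t element_count(const VecGuts& v) {
    std::size_t bytes = reinterpret_cast<const char*>(v.finish) -
                        reinterpret_cast<const char*>(v.start);   // subq
    return bytes / sizeof(unsigned int);                          // shrq $2
}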
Q3
This is most likely to handle aliasing.
There is a general rule in C++ that, roughly speaking, writing through a pointer of one type may not modify an object of another type. So in vector32_inc, the compiler knows that writing to elements of the vector, through a pointer to unsigned int, cannot modify the vector's metadata (which apparently is objects of some other type, most likely pointers). Therefore it is safe to hoist the size() computation out of the loop, since it cannot change over the course of the loop's execution. Likewise, the pointer used to iterate over the object can be kept in a register throughout.
However, character types have a special exemption from this rule, partly so that things like memcpy can work. So when you do vec[i]++, writing through a pointer to unsigned char, the compiler has to account for the possibility that the pointer may actually point to some of the vector metadata. (This should not actually happen if vector is being used properly, but the compiler doesn't know that.) If it did, then vec[i]++ might modify that metadata in memory. The next loop iteration would then be supposed to use the newly modified metadata, so it has to be reloaded from memory just in case. Note that the vector's start pointer is also reloaded, and the new value used, instead of keeping it in a register throughout.
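One way to sidestep that, as a sketch (assuming nothing inside the loop resizes the vector), is to hoist the pointer and the size into locals; writes through an unsigned char* cannot be assumed to alias locals whose address never escapes, so nothing needs to be reloaded:

#include <vector>
#include <cstddef>

void vector8_inc_hoisted(std::vector<unsigned char>& v) {
    unsigned char* data = v.data();
    std::size_t n = v.size();          // read the metadata once, up front
    for (std::size_t i = 0; i < n; i++) {
        data[i]++;                     // can no longer force n or data to be reloaded
    }
}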
Suppose you have a class that has private members which are accessed a lot in a program (such as in a loop which has to be fast). Imagine I have defined something like this:
class Foo
{
public:
    Foo(unsigned set)
        : vari(set)
    {}

    const unsigned& read_vari() const { return vari; }

private:
    unsigned vari;
};
The reason I would like to do it this way is because, once the class is created, "vari" shouldn't be changed anymore. Thus, to minimize bug occurrence, "it seemed like a good idea at the time".
However, if I now need to call this function millions of times, I was wondering if there is an overhead and a slowdown instead of simply using:
struct Foo
{
    unsigned vari;
};
So, was my first impulse right in using a class, to avoid anyone mistakenly changing the value of the variable after it has been set by the constructor?
Also, does this introduce a "penalty" in the form of a function call overhead. (Assuming I use optimization flags in the compiler, such as -O2 in GCC)?
They should come out to be the same. Remember that frustrating time you tried to use operator[] on a vector and gdb just replied "optimized out"? That is what will happen here: the compiler will not emit a function call; it will access the variable directly.
Let's have a look at the following code
struct foo {
    int x;
    int& get_x() {
        return x;
    }
};

int direct(foo& f) {
    return f.x;
}

int fnc(foo& f) {
    return f.get_x();
}
This was compiled with g++ test.cpp -o test.s -S -O2. The -S flag tells the compiler to "stop after the stage of compilation proper; do not assemble" (quote from the g++ man page). This is what the compiler gives us:
_Z6directR3foo:
.LFB1026:
.cfi_startproc
movl (%rdi), %eax
ret
and
_Z3fncR3foo:
.LFB1027:
.cfi_startproc
movl (%rdi), %eax
ret
As you can see, no function call was made in the second case, and the two are identical, meaning there is no performance overhead in using the accessor method.
bonus: what happens if optimizations are turned off? same code, here are the results:
_Z6directR3foo:
.LFB1022:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movl (%rax), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
and
_Z3fncR3foo:
.LFB1023:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi
call _ZN3foo5get_xEv #<<<call to foo.get_x()
movl (%rax), %eax
leave
.cfi_def_cfa 7, 8
ret
As you can see, without optimizations the struct is faster than the accessor, but who ships code without optimizations?
You can expect identical performance. A great many C++ classes rely on this - for example, C++11's list::size() const can be expected to trivially return a data member. (This contrasts with vector, where the implementations I've looked at calculate size() as the difference between the pointer data members corresponding to begin() and end(), ensuring typical iterator usage is as fast as possible at the cost of potentially slower indexed iteration, if the optimiser can't determine that size() is constant across loop iterations.)
There's typically no particular reason to return by const reference for a type like unsigned that should fit in a CPU register anyway, but as it's inlined the compiler doesn't have to take that literally (for an out-of-line version it would likely be implemented by returning a pointer that has to be dereferenced). (The atypical reason is to allow taking the address of the variable, which is why say vector::operator[](size_t) const needs to return a const T& rather than a T, even if T is small enough to fit in a register.)
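For a register-sized type like unsigned, a by-value getter is the more conventional signature; here is a sketch of the same class with that one change (with optimization on, both versions should compile to the same code):

class Foo
{
public:
    explicit Foo(unsigned set) : vari(set) {}

    unsigned read_vari() const { return vari; }  // by value: no reference to chase

private:
    unsigned vari;
};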
There is only one way to tell with certainty which one is faster in your particular program built with your particular tools with your particular optimisation flags on your particular platform — by measuring both variants.
Having said that, chances are good that the binaries will be identical, instruction for instruction.
As others have said, optimizers these days are relied on to boil out abstraction (especially in C++, which is more or less built to take advantage of that) and they're very, very good.
But you might not need the getter for this.
struct Foo {
    Foo(unsigned set) : vari(set) {}
    unsigned const vari;
};
const doesn't forbid initialization.
given:
#include <stdio.h>

class A
{
    friend class B;
private:
    void func();
} GlobalA;

void A::func()
{
    printf("A::func()");
}

class B
{
public:
    void func();
};

void B::func()
{
    GlobalA.func();
}

int main()
{
    B b;
    b.func();
    getchar();
}
So really all B::func() does is call A::func(). Is there a better way to do this? Or does the compiler just call A::func() directly when it compiles?
CONSTRAINTS:
Class A creates threads and is used by multiple other classes. It is a global IO class to manage sockets/pipes, so I don't believe any type of inheritance would go over well.
NOTE: If this is a googleable problem, please let me know, as I did not know what to search for.
In fact B::func() does something more subtle: it does not call A::func directly, it calls GlobalA.func(), where GlobalA is an instance of class A.
So GlobalA is a singleton (expressed in a very "raw" way, as a single global instance).
Whatever number of B instances you create, they will always call the same A instance (GlobalA).
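If you want the same single-instance idea in a less "raw" form, one common alternative (just a sketch, not what the posted code does) is a function-local static:

class A
{
    friend class B;
private:
    void func();
public:
    static A& instance()
    {
        static A theInstance;   // constructed once, on first use
        return theInstance;
    }
};

// B::func() would then forward to the shared instance:
// void B::func() { A::instance().func(); }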
Check out the generated assembly from your compiler (I used GCC 4.7 -O3):
For A::func()
_ZN1A4funcEv:
.LFB31:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
movl $.LC0, 4(%esp)
movl $1, (%esp)
call __printf_chk
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
And B::func():
_ZN1B4funcEv:
.LFB32:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
movl $.LC0, 4(%esp)
movl $1, (%esp)
call __printf_chk
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
They're identical - the compiler has done the smart thing for you, for free, behind the scenes. In this case your example was all in the same translation unit, which makes it trivial for the compiler to decide if it's worth doing. (There are most likely cases where it wouldn't be worth it, and the compiler has a pretty good set of heuristics to help it figure out what's best for any given target.)
If they were in different translation units it becomes quite a lot harder to do. Some compilers will still manage the same optimisations but not all. You can of course ensure that it remains within the same translation unit for every case by defining functions like these as inline which lets you specify them in the header files.
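As a sketch of that last point (assuming a header split like this; the names follow the question's code), defining the forwarder in the header makes it implicitly inline and visible to every caller:

// a.h (sketch)
class B;
class A
{
    friend class B;
private:
    void func();
};
extern A GlobalA;

// b.h (sketch): the forwarding call is defined in the class body, so it is
// implicitly inline and every translation unit that includes it can inline it.
class B
{
public:
    void func() { GlobalA.func(); }
};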
The moral of the story is don't sweat over tiny details - writing code that makes sense and is maintainable is far more important.
This is a common pattern: namely a bridge.
From my experience it is always inlined (g++ since version 4.3 at least).
See Flexo's answer, the call to A member function is indeed inlined in the sample code you posted.
I've gotten myself into a confused mess regarding multithreaded programming and was hoping someone could come and slap some understanding in me.
After doing quite a bit of reading, I've come to the understanding that I should be able to set the value of a 64 bit int atomically on a 64 bit system [1].
I found a lot of this reading difficult though, so thought I would try to make a test to verify this. So I wrote a simple program with one thread which would set a variable into one of two values:
bool switcher = false;
while (true)
{
    if (switcher)
        foo = a;
    else
        foo = b;
    switcher = !switcher;
}
And another thread which would check the value of foo:
while (true)
{
    __uint64_t blah = foo;
    if ((blah != a) && (blah != b))
    {
        cout << "Not atomic! " << blah << endl;
    }
}
I set a = 1844674407370955161; and b = 1144644202170355111;. I run this program and get no output warning me that blah is not a or b.
Great, looks like it probably is an atomic write...but then, I changed the first thread to set a and b directly, like so:
bool switcher = false;
while (true)
{
    if (switcher)
        foo = 1844674407370955161;
    else
        foo = 1144644202170355111;
    switcher = !switcher;
}
I re-run, and suddenly:
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
What's changed? Either way I'm assigning a large number to foo - does the compiler handle a constant number differently, or have I misunderstood everything?
Thanks!
1: Intel CPU documentation, section 8.1, Guaranteed Atomic Operations
2: GCC Development list discussing that GCC doesn't guarantee it in the documentation, but the kernel and other programs rely on it
Disassembling the loop, I get the following code with gcc:
.globl _switcher
_switcher:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $0, -4(%rbp)
L2:
cmpl $0, -4(%rbp)
je L3
movq _foo@GOTPCREL(%rip), %rax
movl $-1717986919, (%rax)
movl $429496729, 4(%rax)
jmp L5
L3:
movq _foo@GOTPCREL(%rip), %rax
movl $1486032295, (%rax)
movl $266508246, 4(%rax)
L5:
cmpl $0, -4(%rbp)
sete %al
movzbl %al, %eax
movl %eax, -4(%rbp)
jmp L2
LFE2:
So it would appear that gcc uses the 32-bit movl instruction with 32-bit immediate values. There is an instruction movq that can move a 64-bit register to memory (or memory to a 64-bit register), but it cannot move a 64-bit immediate value directly to a memory address, so the compiler is forced either to use a temporary register and then move that register to memory, or to use movl. You can try to force it to use a register by using a temporary variable, but this may not work.
References:
mov
movq
http://www.x86-64.org/documentation/assembly.html
immediate values inside instructions remain 32 bits.
There is no way for the compiler to do the assignment of a 64-bit constant atomically, except by first filling a register and then moving that register to the variable. That is probably more costly than assigning directly to the variable, and since atomicity is not required by the language, the atomic solution is not chosen.
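If you actually need the guarantee rather than an accident of code generation, the portable route is to ask for atomicity explicitly. A sketch using C++11's std::atomic (which may postdate the toolchain in the question):

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> foo{0};

void writer()
{
    bool switcher = false;
    while (true)
    {
        // a single 8-byte atomic store, however the constant gets materialized
        foo.store(switcher ? 1844674407370955161ULL
                           : 1144644202170355111ULL,
                  std::memory_order_relaxed);
        switcher = !switcher;
    }
}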
The Intel CPU documentation is right: aligned 8-byte reads/writes are always atomic on recent hardware (even on 32-bit operating systems).
What you don't tell us is whether you are using 64-bit hardware on a 32-bit system. If so, the 8-byte write will most likely be split into two 4-byte writes by the compiler.
Just have a look at the relevant section in the object code.
Consider the following situation:
class MyFoo {
public:
    MyFoo();
    ~MyFoo();
    void doSomething(void);
private:
    unsigned short things[10];
};

class MyBar {
public:
    MyBar(unsigned short* globalThings);
    ~MyBar();
    void doSomething(void);
private:
    unsigned short* things;
};

MyFoo::MyFoo() {
    int i;
    for (i = 0; i < 10; i++) this->things[i] = i;
}

MyBar::MyBar(unsigned short* globalThings) {
    this->things = globalThings;
}

void MyFoo::doSomething() {
    int i, j;
    j = 0;
    for (i = 0; i < 10; i++) j += this->things[i];
}

void MyBar::doSomething() {
    int i, j;
    j = 0;
    for (i = 0; i < 10; i++) j += this->things[i];
}
int main(int argc, char* argv[]) {
    unsigned short gt[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    MyFoo* mf = new MyFoo();
    MyBar* mb = new MyBar(gt);
    mf->doSomething();
    mb->doSomething();
}
Is there an a priori reason to believe that mf->doSomething() will run faster than mb->doSomething()? Does that change if the executable is 100MB?
Because anything can modify your gt array, there may be some optimizations performed on MyFoo that are unavailable to MyBar (though, in this particular example, I don't see any).
Since gt lives locally (we used to call that the DATA segment, but I'm not sure if that still applies), while things lives in the heap (along with mf and the other parts of mb), there may be some memory access and caching issues dealing with things. But if you'd created mf locally (MyFoo mf = MyFoo();), that would not be an issue (i.e. things and gt would be on an equal footing in that regard).
The size of the executable shouldn't make any difference. The size of the data might, but for the most part, after the first access, both arrays will be in the CPU cache and there should be no difference.
There's little reason to believe one will be noticeably faster than the other. If gt (for example) was large enough for it to matter, you might get slightly better performance from:
int j = std::accumulate(gt, gt+10, 0);
With only 10 elements, however, a measurable difference seems quite unlikely.
MyFoo::DoSomething can be expected to be marginally faster than MyBar::DoSomething
This is because when things is stored locally in an array, we just need to dereference this to get to things and we can access the array immediately. When things is stored externally, we first need to dereference this and then we need to dereference things before we can access the array. So we have two load instructions.
I have compiled your source into assembler (using -O0) and the loop for MyFoo::DoSomething looks like:
jmp .L14
.L15:
movl -4(%ebp), %edx
movl 8(%ebp), %eax //Load this into %eax
movzwl (%eax,%edx,2), %eax //Load this->things[i] into %eax
movzwl %ax, %eax
addl %eax, -8(%ebp)
addl $1, -4(%ebp)
.L14:
cmpl $9, -4(%ebp)
setle %al
testb %al, %al
jne .L15
Now for MyBar::doSomething we have:
jmp .L18
.L19:
movl 8(%ebp), %eax //Load this
movl (%eax), %eax //Load this->things
movl -4(%ebp), %edx
addl %edx, %edx
addl %edx, %eax
movzwl (%eax), %eax //Load this->things[i]
movzwl %ax, %eax
addl %eax, -8(%ebp)
addl $1, -4(%ebp)
.L18:
cmpl $9, -4(%ebp)
setle %al
testb %al, %al
jne .L19
As can be seen from the above, there is the double load. The problem may be compounded if this and this->things have a large difference in address: they will then live in different cache pages, and the CPU may have to do two pulls from main memory before this->things can be accessed. When the array is part of the same object, fetching this brings this->things into cache at the same time.
Caveat: the optimizer may be able to provide some shortcuts that I have not thought of, though.
Most likely the extra dereference (of MyBar, which has to fetch the value of the member pointer) is meaningless performance-wise, especially if the data array is very large.
It could be somewhat slower. The question is simply how often you access. What you should consider is that your machine has a fixed cache. When MyFoo is loaded in to have DoSomething called on it, the processor can just load the whole array into cache and read it. However, in MyBar, the processor first must load the pointer, then load the address it points to. Of course, in your example main, they're all probably in the same cache line or close enough anyway, and for a larger array, the number of loads won't increase substantially with that one extra dereference.
However, in general, this effect is far from ignorable. When you consider dereferencing a pointer, that cost is pretty much zero compared to actually loading the memory it points to. If the pointer points to some already-loaded memory, then the difference is negligible. If it doesn't, you have a cache miss, which is very bad and expensive. In addition, the pointer introduces issues of aliasing, which basically means that your compiler can perform much less optimistic optimizations on it.
Allocate within-object whenever possible.
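As a sketch of that advice applied to the question's shape (the class name here is mine, not from the question): keeping the storage inside the object means one load of this reaches the data, and the data shares the owner's cache lines.

#include <array>

class MyFooInline {
public:
    int sum() const {
        int j = 0;
        for (unsigned short v : things) j += v;  // no second pointer to chase
        return j;
    }
private:
    std::array<unsigned short, 10> things{};     // storage lives inside the object
};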