Dot operator cost c/c++ [closed] - c++

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 12 years ago.
We all know about the -> vs . speed difference for accessing members in C/C++, but I'm having a hard time finding any clues about the actual cost of the simple dot operator.
I imagine it's something like address-of-struct + offset, and I assume the offset is the sum of the sizeofs of all preceding members. Is this (roughly) correct?
Then, compared to ->, how much faster is it? Twice as fast?
(Having seen some asm here on SO where . access is a single instruction, I guess there is some magic about it.)
Also, how much slower is it compared to a local variable?
Thank you.
EDIT:
I failed to ask it correctly, I guess.
To try to clear things up:
By "-> vs ." I meant "using a pointer to access the struct" vs "direct member access" - (link).
And then I was just curious: "OK, and what about the dot access itself? It surely costs something." So I asked the question.
"Dot operator cost c/c++" itself might be an absurd/nonsensical/naive question, but it still gets the answers I was looking for. Can't say it better now.
Thanks

We all know about the -> vs . speed difference for accessing members in C/C++, but I'm having a hard time finding any clues about the actual cost of the simple dot operator.
The "we all" apparently doesn't include me. I'm not aware of any significant difference (especially in C++) between -> and ..
I imagine it's something like address-of-struct + offset, and I assume the offset is the sum of the sizeofs of all preceding members. Is this (roughly) correct?
Yes, modulo any padding the compiler inserts between members for alignment.
Then, compared to ->, how much faster is it? Twice as fast? (Having seen some asm here on SO where . access is a single instruction, I guess there is some magic about it.)
Both -> and . involve calculating an effective address, and that is the most expensive operation (besides the actual memory access). If the pointer (on the left of ->) is used very often (e.g. this), then it is highly likely to already be cached by the compiler in a CPU register, effectively negating any possible difference between -> and ..
Well, this is a pointer; everything belonging to the object inside a method is effectively prefixed with this->, yet C++ programs haven't slowed to a crawl.
Obviously, if . is applied to a reference, then it is 100% equivalent of ->.
Also, how much slower is it compared to a local variable?
Hard to evaluate. Essentially difference in meta assembler would be: two asm ops to access the local variable (add the offset of the variable on the stack to the stack pointer; access the value) vs. three asm ops to access attribute of an object via pointer (load the pointer to the object; add the offset; access the value). But due to compiler optimizations the difference is rarely noticeable.
Often it is the difference between local and global variables that stands out: the address of a local variable or an object's attribute has to be computed, while every global variable has a unique address calculated at link time.
But the overhead of field/attribute access is really negligible and minuscule compared to, e.g., the overhead of a system call.

Any decent compiler will calculate the address of a struct field at compile time, so the cost of . should be zero.
In other words, access to a struct field using . is as costly as access to a variable.

It depends on a huge number of things.
The . operator can vary from being cheaper than accessing a local variable or using -> to being much more expensive than either.
This isn't a sensible question to ask.

I would say the difference in cost isn't in the operators themselves, but in the cost of accessing the left-side object.
ptr->variable
should produce asm output similar to
(*ptr).variable // yeah I am using a '.' because it's faster...
therefore your question is kind of nonsensical as it is.
I think I understand what you meant though, so I'll try to answer that instead.
The operators themselves are nearly costless. They only involve computing an address. This address is expressed as the address of the object plus an offset (likely fixed at compile time and therefore a constant in the program for each field). The real cost comes from actually fetching the bytes at this address.
Now, I don't understand why it would be more costly to use -> vs . since they effectively do the same thing. There can be a difference of access, though, in this case:
struct A { int x; };

void function(A& external)
{
    A local;
    external.x = local.x;
}
In this case, accessing external.x is likely to be more costly because it requires accessing memory outside the scope of the function, and therefore the compiler cannot know in advance whether the memory will already have been fetched into the processor cache (or a register), etc.
On the other hand, since local.x is local and stored on the stack (or a register), the compiler may optimize away the fetch part of the code and access local.x directly.
But as you notice, there is no difference between using -> vs .; the difference is in using a local variable (stored on the stack) vs using a pointer/reference to an externally supplied object, about which the compiler cannot make assumptions.
Finally, it's important to note that if the function were inlined, the compiler could optimize its use at the caller's site and effectively use a slightly different implementation there. But don't try to inline everything: first of all, the compiler will probably ignore your hints, and if you really force it you may actually lose performance.

Related

Use local variables or access multiple times a struct value ( C++ )

In JS it is good practice to create a variable for reuse, instead of accessing the value in a deep object structure:
for (var i = 0, l = arr.length; i < l; ++i) {
var someValueINeedOftenHere = arr[i].path.to.value;
// do several things with this var..
}
So instead of finding the value in this deep object structure every time, we store it locally and can reuse it over and over again. This should be good practice, not only because it lets you write cleaner code, but also because of performance.
So, when I am writing C++ code and I have to iterate over a vector which holds a lot of structs/objects - is it the same there, or does it not matter?
Generally speaking, in C/C++ it doesn't matter. In C and C++, the memory layout of every structure is known at compile-time. When you type arr[i].path.to.value, that's going to be essentially the same as *(&arr[0] + i * (something) + offset_1 + offset_2 + offset_3), and all that will get simplified at compile-time to something like *(&arr[0] + i * (something) + something). And those something's will be computed by the compiler and hard-coded into the binary, so effectively looking up arr[i].path.to is not faster than arr[i].path.to.value.
This is not mandated by the standard or anything as far as I know, but it's how most compilers will actually work.
If you want to be sure in some specific case, you can look at godbolt and see what assembly it cooks up: http://gcc.godbolt.org/
Note that I'm assuming that when you make the local variable, you are taking a reference to the value arr[i].path.to.value, which is most similar to what you do in javascript. If you actually copy the value to a new variable then that will create some copying overhead. I don't think that copying it would be advantageous w.r.t. cache misses unless the usage pattern is pretty complicated. Once you access arr[i].path.to.value once, all the stuff around it is going to be in the cache, and there's no reason that copying it onto the stack would make anything faster.
It doesn't matter if none of arr, arr[i], path, to, value are/involve references/pointers, as pointed out by Chris Beck.
However, if any of arr, arr[i], path, to, value is/involves a reference/pointer then accessing them will possibly be a cache miss. Accessing them many times makes it multiple cache misses (probably).
Storing a reference/pointer directly to value might therefore be more efficient than getting multiple cache misses while chasing the pointer(s) to get to value. Bear in mind the compiler will probably optimize things like this anyway.
Storing value as a copy might be more efficient if it avoids cache misses, and it's relatively light to copy, and you don't need to modify the original value. Again, the compiler will most probably optimize these cases as well if it's clear that doing so will give a performance improvement.
It all depends. Only optimize if it's been proven to be a problem.

Performance impact of objects

I am a beginner programmer with some experience at c and c++ programming. I was assigned by the university to make a physics simulator, so as you might imagine there's a big emphasis on performance.
My questions are the following:
1. How many assembly instructions does an instance data member access through a pointer translate to (e.g. vector->x)?
2. Is it much more than, say, another approach where you simply access the memory through a char* (at the same memory location as variable x), or is it the same?
3. Is there a big impact on performance, compiler-wise, if I use an object to access that memory location, or if I just access it directly?
4. Another question regarding the subject would be whether accessing heap memory is faster than stack memory access.
C++ is a compiled language. Accessing a memory location through a pointer is the same regardless of whether it's a pointer to an object or a char* - it's one instruction in either case. There are a couple of spots where C++ adds overhead, but it always buys you some flexibility. For example, invoking a virtual function requires an extra level of indirection. However, you would need the same indirection anyway if you were to emulate the virtual function with function pointers, or you would spend a comparable number of CPU cycles if you were to emulate it with a switch or a sequence of ifs.
In general, you should not start optimizing before you know what part of your code to optimize. Usually only a small part of your code is responsible for the bulk of the CPU time used by your program. You do not know what part to optimize until you profile your code. Almost universally it's programmer's code, not the language features of C++, that is responsible for the slowdown. The only way to know for sure is to profile.
On x86, a pointer access is typically one extra instruction above and beyond what you normally need to perform the operation (e.g. y = object->x; would be one load of the address in object, one load of the value of x, and one store to y - in x86 assembler both loads and stores are mov instructions with a memory operand). Sometimes it's "zero" instructions, because the compiler can optimise away the load of the object pointer. On other architectures, it really comes down to how the architecture works - some architectures have very limited ways of accessing memory and/or loading addresses into pointers, etc., making it awkward to access pointers.
Exactly the same number of instructions - this applies to all of them.
As #2 - objects in themselves have no impact at all.
Heap memory and stack memory are the same kind of memory. One answer says that "stack memory is always in the cache", which is true near the top of the stack, where all the activity goes on. But if you have an object that was created in main, and a pointer to it is passed down through several layers of function calls before being dereferenced, there is an obvious chance that this memory hasn't been touched for a long while, so there is no real difference there either. The big differences are that heap memory is plentiful while the stack is limited, and that running out of heap allows limited recovery, while running out of stack is an immediate end of execution [without tricks that aren't very portable].
If you look at class as a synonym for struct in C (which aside from some details, they really are), then you will realize that class and objects are not really adding any extra "effort" to the code generated.
Of course, used correctly, C++ can make it much easier to write code where you deal with things that are "do this in a very similar way, but subtly differently". In C, you often end up with :
void drawStuff(Shape *shapes, int count)
{
    for (int i = 0; i < count; i++)
    {
        switch (shapes[i].shapeType)
        {
        case Circle:
            ... code to draw a circle ...
            break;
        case Rectangle:
            ... code to draw a rectangle ...
            break;
        case Square:
            ...
            break;
        case Triangle:
            ...
            break;
        }
    }
}
In C++, we can make this decision once, at object creation time, and your drawStuff becomes:
void drawStuff(std::vector<Shape*> shapes)
{
    for (auto s : shapes)
    {
        s->Draw();
    }
}
"Look Ma, no switch..." ;)
(Of course, you do need a switch or something to select which object to create, but once the choice is made, assuming your objects and the surrounding architecture are well defined, everything should work "magically" like the above example.)
Finally, if performance is IMPORTANT, then run benchmarks, run profiling and check where the code is spending its time. Don't optimise too early (but if you have strict performance criteria for something, keep an eye on it, because deciding in the last week of a project that you need to dramatically reorganise your data and code because performance sucks due to some bad decision is also not the best of ideas!). And don't optimise for individual instructions; look at where the time is spent, and come up with better algorithms WHERE you need to. (In the above example, using const std::vector<Shape*>& shapes will effectively pass a pointer to the shapes vector instead of copying the entire thing - which may make a difference if there are a few thousand elements in shapes.)
It depends on your target architecture. A struct in C (and a class in C++) is just a block of memory containing the members in sequence. An access to such a field through a pointer means adding an offset to the pointer and loading from there. Many architectures allow a load to specify an offset to the target address directly, meaning that there is no performance penalty there; but even on extreme RISC machines that don't have that, adding the offset should be so cheap that the load completely shadows it.
Stack and heap memory are really the same thing. Just different areas. Their basic access speed is therefore the same. The main difference is that the stack will most likely already be in the cache no matter what, whereas heap memory might not be if it hasn't been accessed lately.
Variable. On most processors, instructions are translated to something called microcode, similar to how Java bytecode is translated to processor-specific instructions before you run it. How many actual instructions you get differs between processor manufacturers and models.
Same as above - it depends on processor internals most of us know little about.
1+2. What you should be asking is how many clock cycles these operations take. On modern platforms the answer is one. It does not matter how many instructions they are; a modern processor has optimizations to make both run in one clock cycle. I will not go into detail here. In other words, when talking about CPU load there is no difference at all.
Here you have the tricky part. While there is no difference in how many clock cycles the instruction itself takes, it needs to have data from memory before it can run - and that can take a HUGE amount of clock cycles. Actually, someone showed a few years ago that even in a very optimized program an x86 processor spends at least 50% of its time waiting for memory access.
When you use stack memory you are effectively doing the same thing as creating an array of structs. (Instructions are not duplicated for the data unless you have virtual functions.) This keeps the data contiguous, and if you do sequential access you will get optimal cache hits. When you use heap memory you will create an array of pointers, and each object will have its own memory. That memory will NOT be contiguous, so sequential access will cause a lot of cache misses. And cache misses are what really slow your application down and should be avoided at all cost.
I do not know exactly what you are doing, but in many cases even using objects is much slower than plain arrays. An array of objects is laid out [object1][object2] etc. If you do something like the pseudocode "for each object o { o.setX(o.getX() + 1) }", you only access one variable, so your sequential access will jump over the other variables in each object and get more cache misses than if your X variables were laid out contiguously in their own array. And if you have code that uses all the variables in your object, an array of objects will not be slower than separate arrays; the different arrays will just be loaded into different cache blocks.
While plain arrays are faster in C++, they are MUCH faster in other languages like Java, where you should NEVER store bulk data in objects - Java objects use more memory and are always stored on the heap. This is the most common mistake C++ programmers make in Java, before complaining that Java is slow. If they know how to write optimal programs, however, they store data in arrays, which are as fast in Java as in C++.
What I usually do is write a class to store the data, which contains arrays. Even if you use the heap, it's just one object, which becomes as fast as using the stack. Then I have something like "class myitem { private: int pos; mydata data; public: int getVar1() { return data.getVar1(pos); } }". I am not writing out all of the code here, just illustrating how I do this. Then, when I iterate through it, the iterator class does not actually return a new myitem instance for each item; it increases the pos value and returns the same object. This means you get a nice OO API while you actually only have a few objects and nicely contiguous arrays. This pattern is the fastest pattern in C++, and if you don't use it in Java you will know pain.
The fact that we get multiple function calls does not really matter. Modern processors have something called branch prediction, which hides the cost of the vast majority of those calls. Long before the code actually runs, the branch predictor will have figured out what the chains of calls do and effectively collapsed them in the generated microcode.
Also, even if all the calls did run, each would take far fewer clock cycles than the memory accesses they require, which, as I pointed out, makes memory layout the only issue that should bother you.

Understanding stack frame of function call in C/C++? [closed]

Closed 9 years ago.
I am new to C/C++ and assembly language as well.
This could also be a very basic question.
I am trying to understand how stack frames are built and which variables (params) are pushed to the stack, and in what order.
Some search results showed that the C/C++ compiler decides based on the operations performed within a function. For example, if the function is supposed to just increment the value of the passed int param by 1 and return (similar to the ++ operator), it would put all the params of the function and local variables within the function in registers and perform the addition. I am wondering which register is used for return/pass by value? How are references returned? What is the difference between eax, ebx, ecx and edx?
I am requesting a book/blog/link or any kind of material to understand how registers, the stack and heap references are used, built and destroyed during function calls - and also how the main function is stored.
Thanks in advance.
Your question is borderline here; Programmers.SE could be a better place.
A good book for understanding the concepts of a stack etc. might be Queinnec's Lisp In Small Pieces (it explains quite well what a stack is for Lisp). SICP is also a good book to read.
D. Knuth's books and MMIX are also a good read.
Read carefully Wikipedia Call stack page.
In theory, no call stack is needed, and some languages and implementations (e.g. old SML/NJ) did not use any stack (but allocated the call frame in the garbage collected heap). See A.Appel's old paper Garbage Collection Can be Faster than Stack Allocation (and learn more about garbage collection in general).
Usually C and C++ implementations have a stack (and often use the hardware stack). Some C local variables might not have any stack location (because they have been optimized away, or are kept in a register). Sometimes the stack location of a C local variable may change (the compiler would use one call stack slot for some occurrences, and another call stack slot for other occurrences of the same local variable). And of course some temporary values may be compiled like your local variables (so stay in a register, or in one stack slot then another, etc.). When optimizing, the compiler can do weird tricks with variables.
On some old machines, like the IBM/360 or IBM z/series, there is no hardware stack; the stack used by the C compiler is a software convention (e.g. some register is dedicated to that usage, without specific hardware support).
Think about the execution (or interpretation) of a recursively defined function (like the good old factorial, naively coded). Read about recursion (in general, in computer science), primitive recursive functions, lambda calculus, denotational semantics, stack automata, register allocation, tail calls, continuations, ABIs, interrupts, POSIX signals, sigaltstack(2), getcontext(2), longjmp(3), etc.
Read also books about Computer Architecture. In practice, the call stack is so important that several hardware resources (including the stack pointer register, often the call frame base pointer register, and perhaps hidden machinery e.g. cache related) are dedicated to it on common processors.
You could also look at the intermediate representations used by the GCC compiler. Then use -fdump-tree-all or the GCC MELT probe. If looking at the generated assembly be sure to pass -S -fverbose-asm to your gcc command.
See also the linux assembly howto.
I gave a lot of links. It is difficult to answer better, because I have no idea of your background.
I am trying to understand how stack frames are built and which variables (params) are pushed to the stack, and in what order?
This is dependent on the architecture of the processor. However, typically, the stack grows from a high address towards a lower address (if we look at memory addresses as numeric values). One stack frame is "whatever this function puts on the stack".
The "stuff" that typically gets put on the stack is:
Return address back to the calling function.
A frame-pointer, pointing to the stack-frame at the start of the call.
Saved registers that need to be "preserved" for when this function returns.
Local variables.
Parameters to the "next" function in the call-stack.
the C/C++ compiler decides based on the operations performed within a function. For example, if the function is supposed to just increment the value of the passed int param by 1 and return (similar to the ++ operator), it would put all the params of the function and local variables within the function in registers and perform the addition. I am wondering which register is used for return/pass by value? How are references returned?
The compiler has rules for how parameters are passed, and for regular function calls [that is, not "inlined" functions] the parameters are always passed in the same order, in the same combination of registers and stack memory. If that weren't the case, the compiler would have to know exactly what the function does before it could decide how to pass the arguments.
Different processor architectures have different rules. x86-32 typically has one or two registers used for input parameters, and typically one register for the return value. x86-64 (in the System V ABI) uses up to six registers for passing the first six integer arguments to the function. Any further arguments are passed on the stack.
Returning a reference is no different from returning any other value; the value returned in this case is the address of the object being referred to. In x86-32, return values go in EAX. In x86-64, return values go in RAX. On ARM, R0 is used for the return value. On the 29K, R96 is used for the return value.

Reasons to not pass simple types by reference?

As far as I understand, you should not pass simple types by reference in C++ because it does not improve performance; it's even bad for performance(?). At least that's what I managed to gather from the net.
But I can't find out why it's bad for performance. Is it because it's quicker for C++ to just copy a simple type than it is to look up the variable, or what is it?
If you create a reference, it's:
pointer to memory location -> memory location
If you use a value, it's:
memory location
Since something has to be copied either way (the reference or the value), passing by reference does not improve performance; one extra lookup has to be made. So it theoretically "worsens" performance, but not by any amount you'll ever notice.
It's a good question and the fact that you're asking it shows that you're paying attention to your code. However, the good news is that in this particular case, there's an easy way out.
This excellent article by Dave Abrahams answers all your questions and then some: Want Speed? Pass by Value.
Honestly, a summary would not do the link justice; it's a real must-read. To make a long story short, though: your compiler is smart enough to do it correctly, and if you try doing it manually you may prevent your compiler from performing certain optimizations.
Internally, references are treated as pointers, which need to be dereferenced each time you access them. That obviously requires additional operations.
For example, in 32-bit mode, an int is 4 bytes and an int* pointer is 4 bytes.
Passing a 4-byte int directly is quicker than passing a 4-byte pointer and then loading the int through that pointer.

Pointer vs Variable speed in C++

At a job interview I was asked: "In C++, how do you access a variable faster - through the normal variable identifier or through a pointer?" I must say I did not have a good technical answer to the question, so I took a wild guess.
I said that the access time will probably be the same, as a normal variable/identifier is a pointer to the memory address where the value is stored - just like a pointer. In other words, that in terms of speed they both have the same performance, and that pointers are only different because we can specify the memory address we want them to point to.
The interviewer did not seem very convinced/satisfied with my answer (although he did not say anything, just carried on asking something else), so I thought I'd come and ask SO'ers whether my answer was accurate, and if not, why (from a theoretical and technical POV).
When you access a "variable", you look up the address, and then fetch the value.
Remember - a pointer IS a variable. So actually, you:
a) look up the address (of the pointer variable),
b) fetch the value (the address stored at that variable)
... and then ...
c) fetch the value at the address pointed to.
So yes, accessing via "pointer" (rather than directly) DOES involve (a bit) of extra work and (slightly) longer time.
Exactly the same thing occurs whether or not it's a pointer variable (C or C++) or a reference variable (C++ only).
But the difference is enormously small.
A variable does not have to live in main memory. Depending on the circumstances, the compiler can store it in a register for all or part of its life, and accessing a register is much faster than accessing RAM.
Let's ignore optimization for a moment, and just think about what the abstract machine has to do to reference a local variable vs. a variable through a (local) pointer. If we have local variables declared as:
int i;
int *p;
when we reference the value of i, the unoptimized code has to go get the value that is (say) at 12 past the current stack pointer and load it into a register so we can work with it. Whereas when we reference *p, the same unoptimized code has to go get the value of p from 16 past the current stack pointer, load it into a register, and then go get the value that the register points to and load it into another register so we can work with it as before. The first part of the work is the same, but the pointer access conceptually involves an additional step that needs to be done before we can work with the value.
That was, I think, the point of the interview question - to see if you understood the fundamental difference between the two types of access. You were thinking that the local variable access involved a kind of lookup, and it does - but the pointer access involves that very same type of lookup to get to the value of the pointer before we can start to go after the thing it is pointing to. In simple, unoptimized terms, the pointer access is going to be slower because of that extra step.
Now with optimization, it may happen that the two times are very close or identical. It is true that if other recent code has already used the value of p to reference another value, you may already find p in a register, so that the lookup of *p via p takes the same time as the lookup of i via the stack pointer. By the same token, though, if you have recently used the value of i, you may already find it in a register. And while the same might be true of the value of *p, the optimizer can only reuse its value from the register if it is sure that p hasn't changed in the mean time. It has no such problem reusing the value of i. In short, while accessing both values may take the same time under optimization, accessing the local variable will almost never be slower (except in really pathological cases), and may very well be faster. That makes it the correct answer to the interviewer's question.
In the presence of memory hierarchies, the difference in time may get even more pronounced. Local variables are going to be located near each other on the stack, which means that you are very likely to find the address you need already in main memory and in the cache the first time you access it (unless it is the very first local variable you access in this routine). There is no such guarantee with the address the pointer points to. Unless it was recently accessed, you may need to wait for a cache miss, or even a page fault, to access the pointed-to address, which could make it slower by orders of magnitude vs. the local variable. No, that won't happen all the time - but it's a potential factor that could make a difference in some cases, and that too is something that could be brought up by a candidate in response to such a question.
Now what about the question other commenters have raised: how much does it matter? It's true, for a single access, the difference is going to be tiny in absolute terms, like a grain of sand. But you put enough grains of sand together and you get a beach. And though (to continue the metaphor) if you are looking for someone who can run quickly down a beach road, you don't want someone who will obsess about sweeping every grain of sand off the road before he or she can start running, you do want someone who will be aware when he or she is running through knee-deep dunes unnecessarily. Profilers won't always rescue you here - in these metaphorical terms, they are much better at recognizing a single big rock that you need to run around than noticing lots of little grains of sand that are bogging you down. So I would want people on my team who understand these issues at a fundamental level, even if they rarely go out of their way to use that knowledge. Don't stop writing clear code in the quest for microoptimization, but be aware of the kinds of things that can cost performance, especially when designing your data structures, and have a sense of whether you are getting good value for the price you are paying. That's why I think this was a reasonable interview question, to explore the candidate's understanding of these issues.
What paulsm4 and LaC said + a little asm:
int y = 0;
mov dword ptr [y],0
y = x;
mov eax,dword ptr [x] ; Fetch x to register
mov dword ptr [y],eax ; Store it to y
y = *px;
mov eax,dword ptr [px] ; Fetch address of x
mov ecx,dword ptr [eax] ; Fetch x
mov dword ptr [y],ecx ; Store it to y
Not that it matters much in practice. On the other hand, the pointer version is also harder to optimize (e.g. the compiler can't keep the value in a CPU register, as the pointer just points to some place in memory). So optimized code for y = x; could look like this:
mov dword ptr [y], ebx - if we assume that local var x was stored in ebx
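To reproduce listings like the above yourself, a minimal translation unit such as this hypothetical one can be compiled with `g++ -S` (or `cl /FAs`) and the output inspected; the exact mnemonics will vary with compiler and optimization level:

```cpp
int x = 42;        // globals, so the compiler can't optimize them away
int *px = &x;

int copy_direct()
{
    int y = x;     // typically: one load, one store
    return y;
}

int copy_via_pointer()
{
    int y = *px;   // typically: load px, then load *px, then store
    return y;
}
```

At -O2 both functions often collapse to a single load plus return, which is exactly the "compiler caches the pointer" effect described above.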
I think the interviewer was looking for you to mention the word register. As in, if you declare a variable as a register variable the compiler will do its utmost to ensure that it is stored in a register on the CPU.
A bit of chat around bus access and negotiation for other types of variables and pointers alike would have helped to frame it.
paulsm4 and LaC have already explained it nicely, along with other members. I want to emphasize the effect of paging when the pointer points to something in the heap that has been paged out.
Local variables are available either on the stack or in a register, while a pointer may point to an address that is not in the cache, and paging will certainly slow down the access.
A variable holds a value of a certain type, and accessing the variable means getting this value, from memory or from a register. When getting the value from memory, we need to get its address from somewhere - most of the time it has to be loaded into a register (sometimes it can be part of the load instruction itself, but this is quite rare).
A pointer keeps an address of a value; this value has to be in memory, the pointer itself can be in memory or in a register.
I would expect that on average access via a pointer will be slower than accessing the value through a variable.
Your analysis ignores the common scenario in which the pointer itself is a memory variable which must also be accessed.
There are many factors that affect the performance of software, but if you make certain simplifying assumptions about the variables involved (notably that they are not cached in any way), then each level of pointer indirection requires an additional memory access.
int a = 1234; // luggage combination
int *b = &a;
int **c = &b;
...
int e = a; // one memory access
int e = *b; // two memory accesses
int e = **c; // three memory accesses
So the short answer to "which is faster" is: ignoring compiler and processor optimizations which might be occurring, it is faster to access the variable directly.
In a best-case scenario, where this code is executed repeatedly in a tight loop, the pointer value would likely be cached into a CPU register or at worst into the processor's L1 cache. In such a case, it is likely that a first-level pointer indirection is as fast or faster than accessing the variable directly since "directly" probably means via the "stack pointer" register (plus some offset). In both cases you are using a CPU register as a pointer to the value.
There are other scenarios that could affect this analysis, such as for global or static data where the variable's address is hard-coded into the instruction stream. In such a scenario, the answer may depend on the specifics of the processor involved.
I think the key part of the question is "access a variable". To me, if a variable is in scope, why would you create a pointer to it (or a reference) to access it? Using a pointer or a reference would only make sense if the variable was in itself a data structure of some sort or if you were accessing it in some non-standard way (like interpreting an int as a float).
Using a pointer or a reference would be faster only in very specific circumstances. Under general circumstances, it seems to me that you would be trying to second guess the compiler as far as optimization is concerned and my experience tells me that unless you know what you're doing, that's a bad idea.
It would even depend on the keywords involved. The const keyword might very well mean that the variable is optimized away entirely at compile time, which is faster than any pointer. The register keyword does not guarantee that the variable is stored in a register. So how do you know whether it's faster or not? I think the answer is that it depends, because there is no one-size-fits-all answer.
I think a better answer might be that it depends on where the pointer is pointing to. Note that a variable might already be in the cache, whereas a pointer dereference might incur a fetch penalty. It's similar to the linked list vs. vector performance tradeoff. A vector is cache friendly because all of its memory is contiguous. A linked list, however, since it contains pointers, might incur a cache penalty because its memory is potentially scattered all over the place.