Is dereferencing a pointer noticeably slower than just accessing that value directly? I suppose my question is: how fast is the dereference operator?
Going through a pointer indirection can be much slower because of how a modern CPU works, but it has little to do with runtime memory.
Instead, speed is affected by prediction and cache.
Prediction is easy when the pointer has not been changed or when it is changed in predictable ways (for example, increment or decrement by four in a loop). This allows the CPU to essentially run ahead of the actual code execution, figure out what the pointer value is going to be, and load that address into cache. Prediction becomes impossible when the pointer value is built by a complex expression like a hash function.
Cache comes into play because the pointer might point into memory that isn't in cache and it will have to be fetched. This is minimized if prediction works, but if prediction is impossible then in the worst case you can have a double impact: the pointer is not in cache and the pointer target is not in cache either. In that worst case the CPU stalls twice.
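As a rough illustration of the two cases (my own sketch, not from the answer above), compare a strided walk over an array, where the next address is trivially predictable, with a pointer-chasing walk, where each address depends on the previous load; the Node type and function names are invented for the example.

#include <cstddef>

// Invented node type: each element only says where the next one is, so the
// CPU cannot compute the next address until the current load completes.
struct Node {
    Node* next;
    int   value;
};

// Predictable: the address of a[i+1] is a[i]'s address plus sizeof(int),
// so the hardware prefetcher can run ahead of the loop.
long sum_strided(const int* a, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Unpredictable: every iteration waits on the previous load. If neither the
// pointer nor its target is in cache, this is the "double impact" case.
long sum_chased(const Node* head) {
    long s = 0;
    for (const Node* p = head; p != nullptr; p = p->next)
        s += p->value;
    return s;
}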
If the pointer is used for a function pointer, the CPU's branch predictor comes into play. In C++ virtual tables, the function values are all constant and the predictor has it easy. The CPU will have the code ready to run and in the pipeline when execution goes through the indirect jump. But, if it is an unpredictable function pointer the performance impact can be heavy because the pipeline will need to be flushed which wastes 20-40 CPU cycles with each jump.
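To make the branch-predictor point concrete, here is a minimal sketch (again my own, with invented names): a virtual call whose target is effectively constant is easy to predict, while a call through a function pointer picked differently on every iteration is the case that keeps flushing the pipeline.

#include <cstddef>
#include <vector>

struct Shape {
    virtual void draw() const {}
    virtual ~Shape() = default;
};

// The vtable slot never changes, so the indirect call usually resolves to the
// same target and the branch predictor has it easy.
void render(const std::vector<Shape*>& shapes) {
    for (const Shape* s : shapes)
        s->draw();
}

using Handler = void (*)(int);

// The target changes unpredictably from one iteration to the next, so each
// mispredicted indirect call costs a pipeline flush (assumes table is non-empty).
void dispatch(const std::vector<Handler>& table, const std::vector<int>& events) {
    for (int e : events)
        table[static_cast<std::size_t>(e) % table.size()](e);
}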
Depends on stuff like:
whether the "directly accessed" value is in a register already, or on the stack (that's also a pointer indirection)
whether the target address is in cache already
the cache architecture, bus architecture etc.
I.e., there are too many variables to usefully speculate about without narrowing it down.
If you really want to know, benchmark it on your specific hardware.
It requires one extra memory access:
read the address stored in the pointer variable
read the value at that address
This is not necessarily the same as two simple operations, because it may also take more time if the address being accessed is not already loaded into the cache.
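A tiny sketch of the extra access (my own illustration; the comments describe what an unoptimised build would typically do):

int read_both() {
    int value = 42;
    int* ptr = &value;

    int direct   = value;   // one access: read value's memory location
    int indirect = *ptr;    // two accesses: read ptr to get the address, then read the int stored there
    return direct + indirect;
}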
Assuming you're dealing with a real pointer (not a smart pointer of some sort), the dereference operation doesn't consume (data) memory at all. It does (potentially) involve an extra memory reference though: one to load the pointer itself, the other to access the data pointed to by the pointer.
If you're using a pointer in a tight loop, however, it'll normally be loaded into a register for the duration. In this case, the cost is mostly in terms of extra register pressure (i.e., if you use a register to store that pointer, you can't use it to store something else at the same time). If you have an algorithm that would otherwise exactly fill the registers, and keeping the pointer in a register forces something else to spill to memory, that can make a difference. At one time, that was a pretty big loss, but with most modern CPUs (with more registers and on-board cache) that's rarely a big issue. The obvious exception would be an embedded CPU with fewer registers and no cache (and without on-chip memory).
The bottom line is that it's usually pretty negligible, often below the threshold where you can even measure it dependably.
It does. It costs an extra fetch.
Accessing a variable by value, the variable is directly read from its memory location.
Accessing the same through pointer adds an overhead of fetching the address of the variable from the pointer and then reading the value from that memory location.
Of course, this assumes that the variable is not placed in a register, which it would be in some scenarios such as tight loops. I believe the question is asking about the overhead assuming no such scenarios.
Related
I've been reading CppCoreGuidelines F.15 and I don't understand the following sentences from the table of parameter passing:
"Cheap" ≈ a handful of hot int copies
"Moderate cost" ≈ memcpy hot/contiguous ~1KB and no allocation
What does "hot int copy" mean?
"Hot" in this case likely refers to the likelihood of being cached. A particular piece of memory is "cold" if it is likely not in the cache, due to not having been touched recently within this thread of execution. Conversely, a piece of memory is "hot" if it likely has been touched recently, or is contiguous with memory that has been recently touched.
So it's talking about the cost of doing a memory copy of something that is currently in the cache and is therefore cheap in terms of actual memory bandwidth.
For example, consider a function that returns an array<int, 50>. If the values in that array were generated by the function itself, then those integers are "hot", since they're still almost certainly in the cache. So returning it by value is considered OK.
However, if there is some data structure that contains such a type, this function could have simply retrieved a pointer to that object. Returning it by value means doing several uncached memory accesses, since you have to copy to the return value. That is less than ideal from a memory cache perspective, so perhaps returning a pointer to the array would be more appropriate.
Obviously, uncached accesses will happen either way, but in the latter case, the caller gets to decide which accesses to perform and which not to perform.
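A small sketch of the two cases (the names make_sequence, Table and view_entries are mine, not from the guideline):

#include <array>
#include <numeric>

// "Hot": the function just produced the values, so they are almost certainly
// still in cache; returning by value is a cheap, hot copy.
std::array<int, 50> make_sequence() {
    std::array<int, 50> a{};
    std::iota(a.begin(), a.end(), 0);
    return a;
}

struct Table {
    std::array<int, 50> entries;   // may not have been touched recently ("cold")
};

// "Cold": copying out of an existing structure can mean uncached reads, so
// handing back a pointer lets the caller decide which elements to touch.
const std::array<int, 50>* view_entries(const Table& t) {
    return &t.entries;
}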
I am wondering how a std::vector of pointers to objects affects the performance of a program, vs. using a std::vector directly containing the objects. Specifically I am referring to the speed of the program.
I was taught to use std::vector over other STL containers such as std::list for its speed, due to all of its data being stored contiguously in memory rather than being fragmented. This meant that iterating over the elements was fast; however, my thinking is that if my vector contains pointers to the objects, then the objects themselves can still be stored anywhere in memory and only the pointers are stored contiguously. I am wondering how this affects the performance of a program when it comes to iterating over the vector and accessing the objects.
My current project design uses a vector of pointers so that I can take advantage of virtual functions; however, I'm unsure whether this is worth the speed hit I may encounter when my vector becomes very large. Thanks for your help!
If you need the polymorphism, as people have said, you should store pointers to the base. If, later, you decide this code is hot and needs its CPU cache usage optimised, you can do that, say, by making the objects fit cleanly into cache lines and/or by using a custom allocator to ensure locality of the dereferenced data.
Slicing is when you store objects by value as Base and copy-construct or assign a Derived into them: the Derived will be sliced, because the copy constructor or assignment only takes a Base and ignores any data in Derived; there isn't enough space allocated in a Base to hold the full size of a Derived. I.e. if Base is 8 bytes and Derived is 16, there's only room for Base's 8 bytes in the destination value, even if you provided a copy constructor/assignment that explicitly took a Derived.
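A minimal sketch of slicing (Base and Derived are invented for the example):

#include <iostream>
#include <vector>

struct Base {
    virtual const char* name() const { return "Base"; }
    virtual ~Base() = default;
};

struct Derived : Base {
    int extra = 42;                              // lost when sliced
    const char* name() const override { return "Derived"; }
};

int main() {
    std::vector<Base> by_value;
    by_value.push_back(Derived{});               // sliced: only the Base subobject is copied
    std::cout << by_value[0].name() << '\n';     // prints "Base"

    Derived d;
    std::vector<Base*> by_pointer{&d};           // no slicing through a pointer
    std::cout << by_pointer[0]->name() << '\n';  // prints "Derived"
}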
I should say it's really not worth thinking about data cache behaviour if you're using virtual dispatch heavily in a way that won't be elided by the optimiser. An instruction cache miss is far more devastating than a data cache miss, and virtual calls can cause instruction cache misses because the CPU has to look up the vtable pointer before loading the function into the instruction cache, so it can't preemptively load it.
CPUs tend to preload as much data as they can into caches: if you load an address, the entire cache line it sits in (~64 bytes) will be loaded, and often the lines before and after it will be loaded too, which is why people are so keen on data locality.
So in your vector-of-pointers scenario, when you load the first pointer you'll get lots of pointers into the cache at once; loading through each pointer will trigger a cache miss and pull in the data around that object. If your actual particles are 16 bytes and local to each other, you won't lose much beyond this; if they're all over the heap and large, you will churn the cache on every iteration and only be relatively OK while working on a single particle.
Traditionally, particle systems tend to be very hot and like to pack data tightly; it's common to see 16-byte plain-old-data particles which you iterate over linearly with very predictable branching, meaning you can generally rely on four particles per cache line and have the prefetcher stay well ahead of your code.
I should also say that CPU caches are CPU dependent, and I'm focusing on Intel x86 here. ARM, for example, tends to be quite a bit behind Intel: the pipeline is less complex and the prefetcher less capable, so cache misses can be less devastating.
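A rough sketch of the tightly packed particle layout described a couple of paragraphs up (my own example; the 16-byte size is an assumption that holds on typical implementations):

#include <cstddef>
#include <vector>

// 16-byte plain-old-data particle: four per 64-byte cache line.
struct Particle {
    float x, y, dx, dy;
};
static_assert(sizeof(Particle) == 16, "assumed a 16-byte POD particle");

// Contiguous data plus a linear loop with predictable branching lets the
// prefetcher stay well ahead of the code.
void integrate(std::vector<Particle>& particles, float dt) {
    for (Particle& p : particles) {
        p.x += p.dx * dt;
        p.y += p.dy * dt;
    }
}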
I'm really confused. When I write
int *ptr;
is this just a normal variable that holds the address of another variable, or is it some more complex thing located in CPU registers for direct access?
I need a clear answer: is a pointer a variable or not?
A pointer is a variable, in that it can be changed to point to other instances of the same data type.
In most processors, it is represented as an address within the processor's full address range.
The code generated by the compiler may emit code to load the value of the pointer variable into a register, then emit code to operate on the register. One operation would be to dereference the pointer. In other words, the compiler emits code to load a register with the value at the address represented by the pointer. This is also known as indirection.
Although direct access is faster than indirect access, the difference in execution time is usually negligible. For example, if a direct access takes 50 nanoseconds and an indirection takes 60 nanoseconds, the difference is 10 nanoseconds; your program would need to perform 100,000 or more indirections to make a noticeable difference. There are special cases where this kind of optimization is necessary, but not for most applications. The time spent waiting for user input or I/O from a hard drive makes the difference between direct and indirect memory access insignificant.
The fastest "variable" accesses are listed in order:
Processor Register
Direct fetching from the data cache.
Direct fetching from memory on the chip, but outside the CPU core.
Indirect fetching from memory on the chip, but outside the CPU core.
Direct fetching from memory off of the System On a Chip.
Indirect fetching from memory off of the System On a Chip.
Fetching data from an I/O port.
If you think that indirection is still of concern, profile your code. For high accuracy:
Find a test point (TP) on the hardware, an LED, or some place you can connect an oscilloscope probe to.
Assert the test point.
Perform the operations at least 100,000 iterations.
Deassert the test point.
Measure the width of the pulse shown on the oscilloscope.
Another method is to read the system clock, perform 1E09 iterations, read clock again. Subtract the two clock readings.
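For the clock-based variant, a rough sketch using std::chrono (my own illustration; the iteration count is arbitrary, and volatile is used only as a blunt way to stop the compiler from deleting the loop):

#include <chrono>
#include <cstdio>

int main() {
    constexpr long iterations = 100000000L;      // pick something large enough to measure
    volatile int value = 42;
    volatile int* ptr = &value;
    long sum = 0;

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        sum += *ptr;                             // the indirect access under test
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::printf("%ld iterations took %f s (sum=%ld)\n", iterations, elapsed.count(), sum);
}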
I am a beginner programmer with some experience at c and c++ programming. I was assigned by the university to make a physics simulator, so as you might imagine there's a big emphasis on performance.
My questions are the following:
1. How many assembly instructions does an instance data member access through a pointer translate to (e.g. vector->x)?
2. Is it much more than, say, another approach where you simply access the memory through a char* (at the same memory location as variable x), or is it the same?
3. Is there a big impact on performance, compiler-wise, if I use an object to access that memory location or if I just access it directly?
4. Another question regarding the subject: is accessing heap memory faster than stack memory access?
C++ is a compiled language. Accessing a memory location through a pointer is the same regardless of whether it is a pointer to an object or a char*: it's one instruction in either case. There are a couple of spots where C++ adds overhead, but it always buys you some flexibility. For example, invoking a virtual function requires an extra level of indirection. However, you would need the same indirection anyway if you were to emulate the virtual function with function pointers, or you would spend a comparable number of CPU cycles if you were to emulate it with a switch or a sequence of ifs.
In general, you should not start optimizing before you know what part of your code to optimize. Usually only a small part of your code is responsible for the bulk of the CPU time used by your program. You do not know which part to optimize until you profile your code. Almost universally it's the programmer's code, not the language features of C++, that is responsible for the slowdown. The only way to know for sure is to profile.
On x86, a pointer access is typically one extra instruction, above and beyond what you normally need to perform the operation (e.g. y = object->x; would be one load of the address in object, one load of the value of x, and one store to y; in x86 assembler both loads and the store are mov instructions with a memory operand). Sometimes it's "zero" instructions, because the compiler can optimise away the load of the object pointer. On other architectures it really comes down to how the architecture works: some have very limited ways of accessing memory and/or loading addresses into pointers, which can make pointer access awkward. (A small sketch of the y = object->x case follows after these points.)
Exactly the same number of instructions - this applies for all
As #2 - objects in themselves have no impact at all.
Heap memory and stack memory are the same kind of memory. One answer says that "stack memory is always in the cache", which is true if it's near the top of the stack, where all the activity goes on; but if you have an object created in main that is passed around by pointer through several layers of function calls and then accessed through that pointer, there is a good chance that its memory hasn't been used in a while, so there is no real difference there either. The big difference is that heap memory is plentiful while the stack is limited, and that running out of heap allows some limited recovery while running out of stack is an immediate end of execution (without tricks that aren't very portable).
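As a small illustration of the first point, here is a sketch of y = object->x with the kind of instruction sequence an unoptimised x86-64 build might produce; the comments are indicative only, not output from any particular compiler:

struct Vec { float x, y, z; };

void copy_x(const Vec* object, float* y) {
    *y = object->x;
    // Roughly, without optimisation:
    //   mov   rax, [object]   ; load the pointer value
    //   movss xmm0, [rax]     ; load x (offset 0 within Vec) through the pointer
    //   mov   rax, [y]
    //   movss [rax], xmm0     ; store the value into y
}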
If you look at class as a synonym for struct in C (which aside from some details, they really are), then you will realize that class and objects are not really adding any extra "effort" to the code generated.
Of course, used correctly, C++ can make it much easier to write code where you deal with things that are "do this in a very similar way, but subtly differently". In C, you often end up with:
void drawStuff(Shape *shapes, int count)
{
    for (int i = 0; i < count; i++)
    {
        switch (shapes[i].shapeType)
        {
        case Circle:
            /* ... code to draw a circle ... */
            break;
        case Rectangle:
            /* ... code to draw a rectangle ... */
            break;
        case Square:
            /* ... */
            break;
        case Triangle:
            /* ... */
            break;
        }
    }
}
In C++, we can make this choice at object creation time, and your "drawStuff" becomes:
void drawStuff(std::vector<Shape*> shapes)
{
    for (auto s : shapes)
    {
        s->Draw();
    }
}
"Look Ma, no switch..." ;)
(Of course, you do need a switch or something to do the selection of which object to create, but once choice is made, assuming your objects and the surrounding architecture are well defined, everything should work "magically" like the above example).
Finally, if performance IS important, then run benchmarks, run profiling, and check where the code is spending its time. Don't optimise too early (but if you have strict performance criteria for something, keep an eye on it, because deciding in the last week of a project that you need to dramatically re-organise your data and code because performance sucks due to some bad decision is also not the best of ideas!). And don't optimise for individual instructions; look at where the time is spent, and come up with better algorithms WHERE you need to. (In the above example, using const std::vector<Shape*>& shapes will effectively pass a pointer to the shapes vector passed in, instead of copying the entire thing, which may make a difference if there are a few thousand elements in shapes.)
It depends on your target architecture. A struct in C (and a class in C++) is just a block of memory containing the members in sequence. An access to such a field through a pointer means adding an offset to the pointer and loading from there. Many architectures allow a load to specify an offset to the target address directly, meaning there is no performance penalty there; but even on extreme RISC machines that don't have that, adding the offset should be so cheap that the load completely shadows it.
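A quick illustration of the pointer-plus-offset idea (my own sketch; the offsets assume the usual layout with no surprising padding):

#include <cstddef>

struct Body {
    double mass;       // offset 0
    double position;   // offset 8
    double velocity;   // offset 16
};
static_assert(offsetof(Body, velocity) == 16, "assumed layout");

double get_velocity(const Body* b) {
    // Conceptually: load from (address held in b) + offsetof(Body, velocity).
    // Most architectures fold that constant offset into the load instruction itself.
    return b->velocity;
}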
Stack and heap memory are really the same thing. Just different areas. Their basic access speed is therefore the same. The main difference is that the stack will most likely already be in the cache no matter what, whereas heap memory might not be if it hasn't been accessed lately.
Variable. On most processors, instructions are translated into something called microcode, similar to how Java bytecode is translated into processor-specific instructions before it runs. How many actual instructions you get differs between processor manufacturers and models.
Same as above, it depends on processor internals most of us know little about.
1+2. What you should be asking is how many clock cycles these operations take. On modern platforms the answer is one. It does not matter how many instructions they are; a modern processor has optimizations that make both run in one clock cycle. I will not go into detail here. In other words, when talking about CPU load there is no difference at all.
Here you have the tricky part. While there is no difference in how many clock cycles the instruction itself takes, it needs data from memory before it can run, and that can take a HUGE amount of clock cycles. Someone showed a few years ago that even a very optimized program on an x86 processor spends at least 50% of its time waiting for memory.
When you use stack memory you are effectively doing the same thing as creating an array of structs. The data is contiguous (and instructions are not duplicated unless you have virtual functions), so if you do sequential access you will get optimal cache hits. When you use heap memory you will create an array of pointers, and each object will have its own memory. That memory will NOT be contiguous, so sequential access will cause a lot of cache misses. And cache misses are what really make your application slower and should be avoided at all cost.
I do not know exactly what you are doing, but in many cases even using objects is much slower than plain arrays. An array of objects is laid out [object1][object2] and so on. If you do something like the pseudocode "for each object o { o.setX(o.getX() + 1) }", you touch only one member, so your sequential access jumps over the other members of each object and gets more cache misses than if your X values were packed in their own array. And if you have code that uses all the members of your object, separate plain arrays will not be slower than an array of objects; they will just load the different arrays into different cache blocks.
While plain arrays are faster in C++, they are MUCH faster in other languages like Java, where you should NEVER store bulk data in objects, since Java objects use more memory and are always stored on the heap. This is the most common mistake C++ programmers make in Java, and then they complain that Java is slow. If they knew how to write optimal C++ programs they would store data in arrays, which are as fast in Java as in C++.
What I usually do is write a class to store the data, which contains arrays. Even if you use the heap, it's just one object, which becomes as fast as using the stack. Then I have something like "class myitem { private: int pos; mydata data; public: int getVar1() { return data.getVar1(pos); } }". I am not writing out all the code here, just illustrating how I do it. Then when I iterate through it, the iterator class does not actually return a new myitem instance for each item; it increases the pos value and returns the same object. This means you get a nice OO API while you actually have only a few objects and nicely aligned arrays. This pattern is the fastest one in C++, and if you don't use it in Java you will know pain.
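A fleshed-out sketch of that pattern (ItemStore and ItemCursor are my names; this is one way to realise the idea described above, not the author's exact code):

#include <cstddef>
#include <vector>

// All bulk data lives in contiguous per-field arrays inside one object.
class ItemStore {
public:
    explicit ItemStore(std::size_t n) : var1_(n), var2_(n) {}
    std::size_t size() const { return var1_.size(); }
    int  var1(std::size_t i) const { return var1_[i]; }
    void set_var1(std::size_t i, int v) { var1_[i] = v; }
    int  var2(std::size_t i) const { return var2_[i]; }
private:
    std::vector<int> var1_, var2_;
};

// A lightweight cursor giving an object-like API without one object per item.
class ItemCursor {
public:
    explicit ItemCursor(ItemStore& s) : store_(&s), pos_(0) {}
    int  getVar1() const      { return store_->var1(pos_); }
    void setVar1(int v) const { store_->set_var1(pos_, v); }
    void advance()            { ++pos_; }   // reuse the same cursor instead of allocating
    bool done() const         { return pos_ >= store_->size(); }
private:
    ItemStore*  store_;
    std::size_t pos_;
};

void increment_all(ItemStore& store) {
    for (ItemCursor it(store); !it.done(); it.advance())
        it.setVar1(it.getVar1() + 1);        // touches only the contiguous var1 array
}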
The fact that we get multiple function calls does not really matter. Modern processors have branch prediction, which hides the cost of the vast majority of those calls: long before the code actually executes, the predictor has figured out where the chain of calls goes and keeps the pipeline fed.
Also, even if every call did cost something, each would take far fewer clock cycles than the memory accesses it requires, which, as I pointed out, makes memory layout the only issue that should bother you.
This is probably language agnostic, but I'm asking from a C++ background.
I am hacking together a ring buffer for an embedded system (AVR, 8-bit). Let's assume:
const uint8_t size = /* something > 0 */;
uint8_t buffer[size];
uint8_t write_pointer;
There's this neat trick of &ing the write and read pointers with size-1 to do an efficient, branchless rollover if the buffer's size is a power of two, like so:
// value = buffer[write_pointer];
write_pointer = (write_pointer+1) & (size-1);
If, however, the size is not a power of two, the fallback would probably be a compare of the pointer (i.e. index) to the size and do a conditional reset:
// value = buffer[write_pointer];
if (++write_pointer == size) write_pointer ^= write_pointer;
Since the reset occurs rather rarely, this should be easy for any branch prediction.
This assumes, though, that the pointers need to advance forward in memory. While this is intuitive, it requires a load of size in every iteration. I assume that reversing the order (advancing backwards) would yield better CPU instructions (i.e. jump if not zero) in the regular case, since size is only required during the reset.
// value = buffer[--write_pointer];
if (write_pointer == 0) write_pointer = size;
So, TL;DR, my question is: does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward), or is this a valid optimization?
You have an 8-bit AVR with a cache? And branch prediction?
How does forward or backwards matter as far as caches are concerned? A hit or miss on a cache happens anywhere within the cache line: beginning, middle, end, random, sequential, it doesn't matter. You can work from the back to the front of a cache line or from the front to the back at the same cost (all other things held constant): the first miss causes a fill, then the line is in cache and you can access any of its items in any pattern at lower latency until it is evicted.
On a microcontroller like that you want to make the effort, even at the cost of throwing away some memory, to size your circular buffer to a power of two so that you can mask. There is no cache, and instruction fetches are painful because they likely come from flash that may be slower than the processor clock, so you do want to reduce the number of instructions executed, or make execution a little more deterministic (the same number of instructions every loop until the task is done). There might be a pipeline that would appreciate the masking rather than an if-then-else.
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?
The cache doesn't care: a miss on any item in the line causes a fill, and once it's in the cache, any pattern of access (random, sequential forward or backward, or just pounding on the same address) takes less time because it's in faster memory, until it's evicted. Evictions won't come from neighboring cache lines; they will come from cache lines larger powers of two away, so whether the next cache line you pull is at a higher or lower address, the cost is the same.
Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward)
Why do you think that you will have a cache miss? You will have a cache miss whenever you access memory that is not in the cache, whether you are moving forward or backward.
There are a number of points which need clarification:
That size needs to be loaded each and every time (it is const, and therefore immutable, so the compiler is free to keep it in a register or fold it into the instruction stream).
That your code is correct. For example, with a 0-based index (as used in C/C++ for array access) the value 0 is a valid index into the buffer and the value size is not. Similarly, there is no need to XOR when you could simply assign 0; equally, a modulo operator would work (write_pointer = (write_pointer + 1) % size).
What happens in the general case with virtual memory (i.e. the logically adjacent addresses might be all over the place in the real memory map), paging (stuff may well be cached on a page-by-page basis) and other factors (pressure from external processes, interrupts)
In short: this is the kind of optimisation that leads to more foot-related injuries than genuine performance improvements. Additionally, it is almost certainly the case that you would get much, much better gains from vectorised code (SIMD).
EDIT: And in interpreted or JIT'ed languages it might be a tad optimistic to assume you can rely on the use of JNZ and the like at all. At which point the question is: how much of a difference is there really between loading size and comparing, versus comparing with 0?
As usual, when performing any form of manual code optimization, you must have extensive in-depth knowledge of the specific hardware. If you don't have that, then you should not attempt manual optimizations, end of story.
Thus, your question is filled with various strange assumptions:
First, you assume that write_pointer = (write_pointer+1) & (size-1) is more efficient than something else, such as the XOR example you posted. You are just guessing here; you will have to disassemble the code and see which yields fewer CPU instructions.
Because when writing code for a tiny, primitive 8-bit MCU, there is not much going on in the core to speed up your code. I don't know AVR8 in detail, but it seems likely that you have a small instruction pipeline and that's it. It is quite unlikely that you have much in the way of branch prediction, and very unlikely that you have a data and/or instruction cache. Read the friendly CPU core manual.
As for marching backwards through memory, it is unlikely to have any impact at all on your program's performance. On old, poor compilers you would get slightly more efficient code if the loop condition was a comparison against zero instead of against a value; on modern compilers, this shouldn't be an issue. As for cache memory concerns, I doubt you have any cache memory to worry about.
The best way to write efficient code on 8-bit MCUs is to stick to 8-bit arithmetic whenever possible and to avoid 32-bit arithmetic like the plague. And forget you ever heard about something called floating point. This is what will make your program efficient, you are unlikely to find any better way to manually optimize your code.