I previously asked a question on Stack Overflow (if you are interested, here is the link: Passing by reference "advanced" concept? ).
Interestingly, one of the answers intrigued me, and I felt it deserved a separate question.
const int& x = 40;
If 40 happens to be a value in the CPU cache (an rvalue), then by writing that line, would you have just reserved cache memory to hold the number 40 for the lifetime of your process? And isn't that a bad thing?
Thank you
The literal 40 almost certainly lives in read-only memory, possibly encoded directly in the instructions (for small values there are typically instructions that can set a register or memory location directly; bigger values would live somewhere as a constant). It doesn't live "in the cache". When you create a const reference to it, a temporary is constructed wherever the compiler sees fit to keep temporaries (probably on the stack). Whether this lives in any cache is up to the system.
If the address of this temporary is never taken, it may actually not even be created: All rules in the C++ standard are governed by the "as if"-rule. As a result, the reference and the literal would be identical. If the address of the const reference is ever taken, the compiler needs to decide where to put the object and you may, indeed, see a small performance impact.
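A minimal sketch of what this means in practice, assuming you do take the reference's address (the variable names here are mine):

#include <iostream>

int main() {
    const int& x = 40;        // a temporary int holding 40 is materialized;
                              // its lifetime is extended to match the reference
    std::cout << &x << '\n';  // taking the address forces the compiler to give
                              // the temporary a real location, typically the stack
}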
You can't reserve space on the cache from your program
It isn't really in your control. The cache-control decisions are made by the cache's own controller, which studies temporal and spatial locality, among other things, to decide which cache lines to replace and which to keep.
There are usually multiple copies of your data: in the different cache levels and in the virtual-memory address space (which maps to physical memory plus swap).
The way memory is managed is far more complex than that. The system generates a virtual address every time it deals with memory.
This virtual address is translated into a physical address. The access may then be satisfied from a cache, from physical memory, and so on; it does not necessarily map to one piece of memory. If the page has been swapped out, the access causes a page fault and that page is loaded back into memory (through multiple levels).
The low level operations like cache management are not affected by your decisions at this level.
Related
I've been reading CppCoreGuidelines F.15 and I don't understand the following sentences from the table of parameter passing:
"Cheap" ≈ a handful of hot int copies
"Moderate cost" ≈ memcpy hot/contiguous ~1KB and no allocation
What does "hot int copy" mean?
"Hot" in this case likely refers to the likelihood of being cached. A particular piece of memory is "cold" if it is likely not in the cache, due to not having been touched recently within this thread of execution. Conversely, a piece of memory is "hot" if it likely has been touched recently, or is contiguous with memory that has been recently touched.
So it's talking about the cost of doing a memory copy of something that is currently in the cache and is therefore cheap in terms of actual memory bandwidth.
For example, consider a function that returns an array<int, 50>. If the values in that array were generated by the function itself, then those integers are "hot", since they're still almost certainly in the cache. So returning it by value is considered OK.
However, if there is some data structure that contains such a type, this function could have simply retrieved a pointer to that object. Returning it by value means doing several uncached memory accesses, since you have to copy to the return value. That is less than ideal from a memory cache perspective, so perhaps returning a pointer to the array would be more appropriate.
Obviously, uncached accesses will happen either way, but in the latter case, the caller gets to decide which accesses to perform and which not to perform.
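A hypothetical illustration of the two cases (the types and function names are mine, not from the guideline):

#include <array>
#include <cstddef>

// Case 1: the function generates the values itself, so the array is "hot"
// (almost certainly still in cache) and returning it by value is fine.
std::array<int, 50> make_squares() {
    std::array<int, 50> a{};
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = static_cast<int>(i * i);
    return a;   // copies/moves data that was just written, i.e. cached
}

struct Table {
    std::array<int, 50> values;   // data that may well be cold
};

// Case 2: the data already lives inside another structure; returning a
// pointer avoids copying memory that may not be in the cache at all, and
// lets the caller decide which elements to actually touch.
const std::array<int, 50>* find_values(const Table& t) {
    return &t.values;
}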
I am wondering how a std::vector of pointers to objects affects the performance of a program, vs. using a std::vector directly containing the objects. Specifically I am referring to the speed of the program.
I was taught to prefer std::vector over other STL containers such as std::list for its speed, because all of its data is stored contiguously in memory rather than being fragmented. This means that iterating over the elements is fast; however, my thinking is that if my vector contains pointers to the objects, then the objects themselves can still be stored anywhere in memory and only the pointers are stored contiguously. I am wondering how this affects the performance of a program when it comes to iterating over the vector and accessing the objects.
My current project design uses a vector of pointers so that I can take advantage of virtual functions; however, I'm unsure whether this is worth the speed hit I may encounter when my vector becomes very large. Thanks for your help!
If you need the polymorphism, as people have said, you should store pointers to the base. If, later, you decide this code is hot and its CPU cache usage needs optimising, you can do that, say, by making the objects fit cleanly in cache lines and/or with a custom allocator to ensure locality of the dereferenced data.
Slicing is when you store objects by value as Base and copy-construct or assign a Derived into them: the Derived will be sliced, because the copy constructor or assignment operator only takes a Base and will ignore any data in Derived; there isn't enough space in a Base to hold the full size of a Derived. I.e. if Base is 8 bytes and Derived is 16, there is only room for Base's 8 bytes in the destination value, even if you provided a copy constructor or assignment operator that explicitly took a Derived.
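A short, self-contained sketch of slicing (hypothetical Base/Derived types):

#include <iostream>
#include <vector>

struct Base {
    int a = 0;
    virtual ~Base() = default;
    virtual const char* name() const { return "Base"; }
};

struct Derived : Base {
    int extra = 42;   // this member is lost when the object is sliced
    const char* name() const override { return "Derived"; }
};

int main() {
    Derived d;
    Base b = d;                      // slicing: only the Base subobject is copied
    std::cout << b.name() << '\n';   // prints "Base" - the Derived part is gone

    std::vector<Base> by_value;
    by_value.push_back(Derived{});   // slices as well: the vector stores only Base

    std::vector<Base*> by_pointer;
    by_pointer.push_back(new Derived{});              // no slicing, dispatch works
    std::cout << by_pointer.front()->name() << '\n';  // prints "Derived"
    delete by_pointer.front();
}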
I should say it's really not worth thinking about data cache locality if you're using virtual dispatch heavily in ways that won't be elided by the optimiser. An instruction cache miss is far more devastating than a data cache miss, and virtual dispatch can cause instruction cache misses because the CPU has to look up the vtable pointer before loading the function into the instruction cache, so it can't preemptively load it.
CPUs tend to preload as much data as they can into caches: if you load from an address, an entire cache line (~64 bytes) will be loaded, and often the cache lines before and after it as well, which is why people are so keen on data locality.
So in your vector-of-pointers scenario, when you load the first pointer you'll get lots of pointers into the cache at once; loading through each pointer will then trigger a cache miss and load up the data around that object. If your actual particles are 16 bytes and local to each other, you won't lose much beyond this. If they're scattered all over the heap and large, each iteration will churn the cache heavily, even though the work on an individual particle remains relatively OK.
Traditionally, particle systems tend to be very hot and like to pack data tightly; it's common to see 16-byte plain-old-data particle systems which you iterate over linearly with very predictable branching, meaning you can generally rely on four particles per cache line and have the prefetcher stay well ahead of your code.
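A sketch of that kind of tightly packed layout (the exact fields are made up):

#include <cstddef>
#include <vector>

// 16-byte plain-old-data particle: four of them fit in one 64-byte cache line.
struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity
};
static_assert(sizeof(Particle) == 16, "expected a 16-byte particle");

void integrate(std::vector<Particle>& particles, float dt) {
    // Linear walk over contiguous data with predictable branching:
    // the hardware prefetcher can stay well ahead of this loop.
    for (std::size_t i = 0; i < particles.size(); ++i) {
        particles[i].x += particles[i].vx * dt;
        particles[i].y += particles[i].vy * dt;
    }
}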
I should also say that CPU caches are CPU dependent, and I'm focusing on Intel x86 here. ARM, for example, tends to be quite a bit behind Intel: the pipeline is less complex and the prefetcher less capable, so cache misses can be less devastating.
I am a beginner programmer with some experience in C and C++ programming. I was assigned by my university to make a physics simulator, so as you might imagine there's a big emphasis on performance.
My questions are the following:
1. How many assembly instructions does an instance data member access through a pointer translate to (e.g. for vector->x)?
2. Is it much more than, say, another approach where you simply access the memory through a char* (at the same memory location as variable x), or is it the same?
3. Is there a big impact on performance, compiler-wise, if I use an object to access that memory location rather than accessing it directly?
4. Another question regarding the subject: is accessing heap memory faster than stack memory access?
C++ is a compiled language. Accessing a memory location through a pointer is the same regardless of whether that's a pointer to an object or a char* - it's one instruction in either case. There are a couple of spots where C++ adds overhead, but it always buys you some flexibility. For example, invoking a virtual function requires an extra level of indirection. However, you would need the same indirection anyway if you were to emulate the virtual function with function pointers, or you would spend a comparable number of CPU cycles if you were to emulate it with a switch or a sequence of ifs.
In general, you should not start optimizing before you know which part of your code to optimize. Usually only a small part of your code is responsible for the bulk of the CPU time used by your program, and you do not know which part until you profile. Almost universally it's the programmer's code, not the language features of C++, that is responsible for the slowdown. The only way to know for sure is to profile.
1. On x86, a pointer access is typically one extra instruction, above and beyond what you normally need to perform the operation (e.g. y = object->x; would be one load of the address in object, one load of the value of x, and one store to y - in x86 assembler both loads and stores are mov instructions with a memory operand). Sometimes it's "zero" instructions, because the compiler can optimise away the load of the object pointer. On other architectures it really comes down to how the architecture works - some have very limited ways of accessing memory and/or loading addresses into pointers, making pointer access awkward. A sketch of typical generated code is shown after this list.
2. Exactly the same number of instructions - this applies in all cases.
3. As #2 - objects in themselves have no impact at all.
4. Heap memory and stack memory are the same kind of memory. One answer says that "stack memory is always in the cache", which is true if it's near the top of the stack, where all the activity goes on; but if an object created in main is passed around by pointer through several layers of function calls and then accessed through that pointer, there is an obvious chance that this memory hasn't been used for a long while, so there is no real difference there either. The big differences are that heap memory is plentiful while the stack is limited, and that running out of heap allows limited recovery while running out of stack is an immediate end of execution [without tricks that aren't very portable].
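A rough sketch of point 1; the assembly in the comments is what unoptimised x86-64 output typically looks like, not a guarantee:

struct Object {
    int x;
};

int read_x(Object* object) {
    // Typically something like:
    //   mov rax, [rbp-8]   ; load the pointer 'object' from the stack
    //   mov eax, [rax]     ; load object->x (offset 0)
    // With optimisation the pointer usually already sits in a register,
    // leaving a single load.
    return object->x;
}

int read_via_char(char* raw) {
    // Reading the same bytes through a char* at the same address compiles to
    // the same kind of load; the pointer's type does not change the instruction
    // (assuming raw really does point at the int).
    return *reinterpret_cast<int*>(raw);
}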
If you look at class as a synonym for struct in C (which, aside from some details, it really is), then you will realize that classes and objects do not really add any extra "effort" to the generated code.
Of course, used correctly, C++ can make it much easier to write code where you deal with things that are "do this in a very similar way, but subtly differently". In C, you often end up with:
void drawStuff(Shape *shapes, int count)
{
for (int i = 0; i < count; i++)
{
switch (shapes[i].shapeType)
{
case Circle:
... code to draw a circle ...
break;
case Rectangle:
... code to draw a rectangle ...
break;
case Square:
...
break;
case Triangle:
...
break;
}
}
}
In C++, we can make this choice at object creation time, and your "drawStuff" becomes:
void drawStuff(std::vector<Shape*> shapes)
{
for(auto s : shapes)
{
s->Draw();
}
}
"Look Ma, no switch..." ;)
(Of course, you do need a switch or something to select which object to create, but once that choice is made, assuming your objects and the surrounding architecture are well defined, everything should work "magically" like the above example; a sketch of such a factory follows.)
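A hypothetical sketch of that one remaining switch, confined to the creation step (minimal stand-ins for the Shape classes from the example above):

#include <iostream>
#include <memory>

struct Shape     { virtual ~Shape() = default; virtual void Draw() = 0; };
struct Circle    : Shape { void Draw() override { std::cout << "circle\n"; } };
struct Rectangle : Shape { void Draw() override { std::cout << "rectangle\n"; } };

enum class ShapeType { Circle, Rectangle };

// The one place the switch survives: choosing which object to create.
// After this, drawStuff from the example needs no switch at all.
std::unique_ptr<Shape> makeShape(ShapeType type) {
    switch (type) {
        case ShapeType::Circle:    return std::make_unique<Circle>();
        case ShapeType::Rectangle: return std::make_unique<Rectangle>();
    }
    return nullptr;
}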
Finally, if performance is IMPORTANT, then run benchmarks, run profiling and check where the code is spending its time. Don't optimise too early (but if you have strict performance criteria for something, keep an eye on it, because deciding in the last week of a project that you need to dramatically reorganise your data and code because performance sucks due to some bad decision is also not the best of ideas!). And don't optimise for individual instructions; look at where the time is spent, and come up with better algorithms WHERE you need to. (In the above example, using const std::vector<Shape*>& shapes will effectively pass a pointer to the shapes vector, instead of copying the entire thing - which may make a difference if there are a few thousand elements in shapes.)
It depends on your target architecture. A struct in C (and a class in C++) is just a block of memory containing the members in sequence. An access to such a member through a pointer means adding an offset to the pointer and loading from there. Many architectures allow a load to specify an offset from the target address directly, meaning there is no performance penalty there; but even on extreme RISC machines that don't have that, adding the offset should be so cheap that the load completely shadows it.
Stack and heap memory are really the same thing, just different areas. Their basic access speed is therefore the same. The main difference is that the stack will most likely already be in the cache no matter what, whereas heap memory might not be if it hasn't been accessed lately.
It varies. On most processors, instructions are translated into something called microcode, similar to how Java bytecode is translated into processor-specific instructions before you run it. How many actual instructions you get differs between processor manufacturers and models.
Same as above; it depends on processor internals most of us know little about.
1+2. What you should be asking is how many clock cycles these operations take. On modern platforms the answer is one. It does not matter how many instructions they are; a modern processor has optimizations that make both run in one clock cycle. I will not go into detail here. In other words, when talking about CPU load there is no difference at all.
Here is the tricky part. While there is no difference in how many clock cycles the instruction itself takes, it needs to have its data from memory before it can run, and that can take a HUGE number of clock cycles. Someone showed a few years ago that even with a very optimized program, an x86 processor spends at least 50% of its time waiting for memory access.
When you use stack memory you are effectively doing the same thing as creating an array of structs. The data is laid out contiguously (and instructions are not duplicated either, unless you have virtual functions), so if you do sequential access you will get optimal cache hits. When you use heap memory you will create an array of pointers, and each object will have its own memory. That memory will NOT be contiguous, so sequential access will get a lot of cache misses. And cache misses are what really make your application slower; they should be avoided at all cost.
I do not know exactly what you are doing, but in many cases even using objects is much slower than plain arrays. An array of objects is laid out as [object1][object2], etc. If you do something like the pseudocode "for each object o { o.setX(o.getX() + 1) }", you only access one member variable, so your sequential access jumps over the other variables in each object and gets more cache misses than if your X variables were packed in their own array. And if you have code that uses all the variables in your object, separate plain arrays will not be slower than the object array; they will simply load into different cache blocks.
While plain arrays are faster in C++, they are MUCH faster in other languages like Java, where you should NEVER store bulk data in objects, as Java objects use more memory and are always stored on the heap. This is the most common mistake that C++ programmers make in Java, and then they complain that Java is slow. If they know how to write optimal C++ programs, however, they store data in arrays, which are about as fast in Java as in C++.
What I usually do is write a class to store the data, and that class contains arrays. Even if you use the heap, it's just one object, which becomes about as fast as using the stack. Then I have something like "class myitem { int pos; mydata& data; public: int getVar1() { return data.getVar1(pos); } }". I am not writing out all of the code here, just illustrating how I do this; a fuller sketch follows below. Then when I iterate through it, the iterator class does not actually return a new myitem instance for each item; it increments the pos value and returns the same object. This means you get a nice OO API while you actually only have a few objects and nicely packed arrays. This is the fastest pattern in C++, and if you don't use it in Java you will know pain.
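A rough, self-contained sketch of that pattern as I read it (the mydata/myitem names follow the snippet above; the member names are mine):

#include <cstddef>
#include <vector>

// All the bulk data lives in a few contiguous arrays ("structure of arrays").
class mydata {
public:
    explicit mydata(std::size_t n) : var1(n) {}
    int  getVar1(std::size_t pos) const      { return var1[pos]; }
    void setVar1(std::size_t pos, int value) { var1[pos] = value; }
    std::size_t size() const                 { return var1.size(); }
private:
    std::vector<int> var1;   // further arrays would sit alongside this one
};

// A lightweight view that gives an OO-style API over one "row" of the arrays.
class myitem {
public:
    myitem(mydata& d, std::size_t p) : data(d), pos(p) {}
    int  getVar1() const    { return data.getVar1(pos); }
    void setVar1(int value) { data.setVar1(pos, value); }
    void advance()          { ++pos; }   // iteration reuses this same object
private:
    mydata&     data;
    std::size_t pos;
};

int main() {
    mydata data(1000);
    myitem item(data, 0);
    for (std::size_t i = 0; i < data.size(); ++i, item.advance())
        item.setVar1(item.getVar1() + 1);   // sequential walk over a packed array
}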
The fact that we get multiple function calls does not really matter much. Modern processors have branch prediction, which removes the cost of the vast majority of those calls: long before the code actually runs, the branch predictor will have worked out where the chain of calls leads and kept the pipeline fed.
Also, even if all the calls did run, each would take far fewer clock cycles than the memory accesses they require, which, as I pointed out, makes memory layout the only issue that should bother you.
Is dereferencing a pointer notably slower than just accessing that value directly? I suppose my question is: how fast is the dereference operator?
Going through a pointer indirection can be much slower because of how a modern CPU works. But it has nothing much to do with runtime memory.
Instead, speed is affected by prediction and cache.
Prediction is easy when the pointer has not been changed or when it is changed in predictable ways (for example, increment or decrement by four in a loop). This allows the CPU to essentially run ahead of the actual code execution, figure out what the pointer value is going to be, and load that address into cache. Prediction becomes impossible when the pointer value is built by a complex expression like a hash function.
Cache comes into play because the pointer might point into memory that isn't in cache and it will have to be fetched. This is minimized if prediction works but if prediction is impossible then in the worst case you can have a double impact: the pointer is not in cache and the pointer target is not in cache either. In that worst-case the CPU would stall twice.
If the pointer is used as a function pointer, the CPU's branch predictor comes into play. With C++ virtual tables the function addresses are all constant, so the predictor has it easy: the CPU will have the code ready to run and in the pipeline when execution goes through the indirect jump. But if it is an unpredictable function pointer, the performance impact can be heavy, because the pipeline will need to be flushed, which wastes 20-40 CPU cycles with each jump.
Depends on stuff like:
whether the "directly accessed" value is in a register already, or on the stack (that's also a pointer indirection)
whether the target address is in cache already
the cache architecture, bus architecture etc.
i.e., too many variables to usefully speculate about without narrowing things down.
If you really want to know, benchmark it on your specific hardware.
It requires one more memory access:
read the address stored in the pointer variable
read the value at that address
This may amount to more than two simple operations, because it can also take extra time if the address accessed is not already loaded in the cache.
Assuming you're dealing with a real pointer (not a smart pointer of some sort), the dereference operation doesn't consume (data) memory at all. It does (potentially) involve an extra memory reference though: one to load the pointer itself, the other to access the data pointed to by the pointer.
If you're using a pointer in a tight loop, however, it'll normally be loaded into a register for the duration. In this case, the cost is mostly in terms of extra register pressure (i.e., if you use a register to store that pointer, you can't use it to store something else at the same time). If you have an algorithm that would otherwise exactly fill the registers, enregistering a pointer can push something else out to memory, and that can make a difference. At one time that was a pretty big loss, but with most modern CPUs (with more registers and on-board cache) it's rarely a big issue. The obvious exception would be an embedded CPU with few registers and no cache (and without on-chip memory).
The bottom line is that it's usually pretty negligible, often below the threshold where you can even measure it dependably.
It does. It costs an extra fetch.
Accessing a variable by value, the variable is directly read from its memory location.
Accessing the same through pointer adds an overhead of fetching the address of the variable from the pointer and then reading the value from that memory location.
Of course, this assumes that the variable is not placed in a register, which it would be in some scenarios such as tight loops. I believe the question is asking about the overhead assuming no such scenario.
I'm building an application which will have dynamically allocated objects of type A, each with a dynamically allocated member (v), similar to the class below
class A {
int a;
int b;
int* v;
};
where:
The memory for v will be allocated in the constructor.
v will be allocated once when an object of type A is created and will never need to be resized.
The size of v will vary across all instances of A.
The application will potentially have a huge number of such objects and will mostly need to stream a large number of them through the CPU, but will only need to perform very simple computations on the member variables.
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
Or are there any techniques to aid memory access, such as a prefetching scheme? For example, get an object of type A and operate on the other member variables whilst prefetching v.
If the size of v or an acceptable maximum size could be known at compile time would replacing v with a fixed sized array like int v[max_length] lead to better performance?
The target platforms are standard desktop machines with x86/AMD64 processors, Windows or Linux OSes and compiled using either GCC or MSVC compilers.
If you have a good reason to care about performance...
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
If they are both allocated with 'new', then it is likely that they will be near one another. However, the current state of memory can drastically affect this outcome; it depends significantly on what you've been doing with memory. If you just allocate a thousand of these things one after another, then the later ones will almost certainly be "nearly contiguous".
If the A instance is on the stack, it is highly unlikely that its 'v' will be nearby.
If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
Allocate space for both, then placement new them into that space. It's dirty, but it should typically work:
// n = the number of ints this instance's v needs; assumes A has a constructor taking an int*.
// Requires <new> for placement new and <cstdlib> for malloc/free.
char* p = static_cast<char*>(malloc(sizeof(A) + n * sizeof(int)));
int*  v = reinterpret_cast<int*>(p + sizeof(A));
A*    a = new (p) A(v);
// time passes
a->~A();
free(p);
Or are there any techniques to aid memory access such as pre-fetching scheme?
Prefetching is compiler- and platform-specific, but many compilers have intrinsics available to do it. Mind, it won't help a lot if you're going to access that data right away; for prefetching to be of any value you often need to issue it hundreds of cycles before you want the data. That said, it can be a huge boost to speed. The intrinsic would look something like __pf(my_a->v); a concrete sketch follows.
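As a concrete (hedged) example: GCC and Clang expose __builtin_prefetch, and on x86 there is _mm_prefetch in <xmmintrin.h>. A sketch, with the prefetch distance purely illustrative:

#include <cstddef>

struct A { int a; int b; int* v; };   // the class from the question

void process(A** objects, std::size_t count) {
    const std::size_t distance = 8;   // how far ahead to prefetch; tune by measuring
    for (std::size_t i = 0; i < count; ++i) {
        if (i + distance < count)
            __builtin_prefetch(objects[i + distance]->v);  // hint only, never faults
        // ... do the actual work on objects[i] and objects[i]->v ...
    }
}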
If the size of v or an acceptable maximum size could be known at compile time
would replacing v with a fixed sized array like int v[max_length] lead to better
performance?
Maybe. If the fixed size buffer is usually close to the size you'll need, then it could be a huge boost in speed. It will always be faster to access one A instance in this way, but if the buffer is unnecessarily gigantic and largely unused, you'll lose the opportunity for more objects to fit into the cache. I.e. it's better to have more smaller objects in the cache than it is to have a lot of unused data filling the cache up.
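For the fixed-size variant, a minimal sketch (max_length is whatever acceptable maximum you can justify; 64 here is arbitrary):

#include <cstddef>

constexpr std::size_t max_length = 64;

// No separate allocation: the buffer sits inside the object, so A and its data
// are always contiguous and pulled in by the same cache lines. The flip side is
// that if max_length is much larger than typical usage, the unused tail wastes
// cache space that other objects could have occupied.
struct A {
    int a = 0;
    int b = 0;
    std::size_t used = 0;   // how much of v is actually in use
    int v[max_length];
};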
The specifics depend on what your design and performance goals are. An interesting discussion about this, with a "real-world" specific problem on a specific bit of hardware with a specific compiler, see The Pitfalls of Object Oriented Programming (that's a Google Docs link for a PDF, the PDF itself can be found here).
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
Yes, that is likely.
What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
cachegrind, shark.
If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
Yes, you could allocate them together, but you should probably see if it's an issue first. You could use arena allocation, for example, or write your own allocators.
Or are there any techniques to aid memory access, such as a prefetching scheme? For example, get an object of type A and operate on the other member variables whilst prefetching v.
Yes, you could do this. The best thing to do would be to allocate regions of memory used together near each other.
If the size of v or an acceptable maximum size could be known at compile time would replacing v with a fixed sized array like int v[max_length] lead to better performance?
It might or might not. It would at least make v local to the other struct members.
Write code.
Profile.
Optimize.
If you need to stream a large number of these through the CPU and do very little calculation on each one, as you say, why are we doing all this memory allocation?
Could you just have one copy of the structure, and one (big) buffer for v, read your data into it (in binary, for speed), do your very little calculation, and move on to the next one?
The program should spend almost 100% of time in I/O.
If you pause it several times while it's running, you should see it almost every time in the process of calling a system routine like FileRead. Some profilers might give you this information, except they tend to be allergic to I/O time.
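A minimal sketch of the reuse-one-buffer idea described above (the binary record format and field names are assumptions, not from the question):

#include <cstddef>
#include <cstdio>
#include <vector>

// Assumed record layout: int a, int b, int n, followed by n ints for v.
void stream_file(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return;

    int header[3];       // a, b, n
    std::vector<int> v;  // one big buffer, reused for every record
    while (std::fread(header, sizeof(int), 3, f) == 3) {
        const std::size_t n = static_cast<std::size_t>(header[2]);
        v.resize(n);     // grows a few times early on, then is simply reused
        if (std::fread(v.data(), sizeof(int), n, f) != n) break;
        // ... do the very small calculation on header[0], header[1] and v ...
    }
    std::fclose(f);
}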