I've been reading CppCoreGuidelines F.15 and I don't understand the following sentences from the table of parameter passing:
"Cheap" ≈ a handful of hot int copies
"Moderate cost" ≈ memcpy hot/contiguous ~1KB and no allocation
What does "hot int copy" mean?
"Hot" in this case likely refers to the likelihood of being cached. A particular piece of memory is "cold" if it is likely not in the cache, due to not having been touched recently within this thread of execution. Conversely, a piece of memory is "hot" if it likely has been touched recently, or is contiguous with memory that has been recently touched.
So it's talking about the cost of doing a memory copy of something that is currently in the cache and is therefore cheap in terms of actual memory bandwidth.
For example, consider a function that returns an array<int, 50>. If the values in that array were generated by the function itself, then those integers are "hot", since they're still almost certainly in the cache. So returning it by value is considered OK.
However, if the array already lives inside some long-lived data structure, the function could simply have returned a pointer to it. Returning it by value instead means doing several uncached memory accesses, since the data has to be copied into the return value. That is less than ideal from a memory cache perspective, so perhaps returning a pointer to the array would be more appropriate.
Obviously, uncached accesses will happen either way, but in the latter case, the caller gets to decide which accesses to perform and which not to perform.
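Here is a minimal sketch of the two situations; the names make_squares and Table are made up for illustration:

#include <array>

// "Hot" case: the function just wrote these ints itself, so they are almost
// certainly still in cache; copying them out is cheap (and often elided by RVO).
std::array<int, 50> make_squares() {
    std::array<int, 50> a{};
    for (int i = 0; i < 50; ++i)
        a[i] = i * i;
    return a;
}

// "Cold" case: the data already lives in a long-lived structure. Returning it
// by value would copy 50 possibly-uncached ints on every call; handing out a
// reference (or pointer) lets the caller touch only what it actually needs.
struct Table {
    std::array<int, 50> values;
    const std::array<int, 50>& get() const { return values; }
};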
Related
class C { ... };
std::vector<C> vc;
std::vector<C*> pvc;
std::vector<std::unique_ptr<C>> upvc;
Depending on the size of C, either the approach storing by value or the approach storing by pointer will be more efficient.
Is it possible to approximately know what this size is (on both a 32 and 64 bit platform)?
Yes, it is possible - benchmark it. Due to how CPU caches work these days, things are not simple anymore.
Check out this lecture about linked lists by Bjarne Stroustrup:
https://www.youtube.com/watch?v=YQs6IC-vgmo
Here is an excellent lecture by Scott Meyers about CPU caches: https://www.youtube.com/watch?v=WDIkqP4JbkE
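If you do want to measure it yourself, a rough micro-benchmark along these lines is a reasonable starting point. This is only a sketch: no warm-up, no repeated runs, and the element type C with an int[16] payload is just an assumption.

#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

struct C { int data[16]; };

// Times a callable once, in microseconds.
template <class F>
long long time_us(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    const std::size_t n = 1000000;
    std::vector<C> vc(n);                  // objects stored by value, contiguous
    std::vector<std::unique_ptr<C>> upvc;  // objects allocated individually
    upvc.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        upvc.push_back(std::make_unique<C>());

    long long sum1 = 0, sum2 = 0;
    auto by_value   = time_us([&] { for (auto& c : vc)   sum1 += c.data[0]; });
    auto by_pointer = time_us([&] { for (auto& p : upvc) sum2 += p->data[0]; });

    std::cout << "by value:   " << by_value   << " us (checksum " << sum1 << ")\n"
              << "by pointer: " << by_pointer << " us (checksum " << sum2 << ")\n";
}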
Let's look at the details of each example before drawing any conclusions.
Vector of Objects
A vector of objects takes an initial performance hit: when an object is added to the vector, it makes a copy. The vector will also make copies when it needs to expand its reserved memory. Larger objects take more time to copy, as do complex or compound objects.
Accessing the objects is very efficient - only one dereference. If your vector can fit inside a processor's data cache, this will be very efficient.
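If you know the element count up front, reserving capacity avoids the reallocation copies mentioned above. A small sketch, with the element type C assumed:

#include <vector>

struct C { int data[16]; };

int main() {
    std::vector<C> vc;
    vc.reserve(1000);          // capacity set once, so no reallocation copies below
    for (int i = 0; i < 1000; ++i)
        vc.emplace_back();     // constructs each C in place
}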
Vector of Raw Pointers
This may have an initialization performance hit: if the objects live in dynamic memory, that memory must be allocated first.
Copying a pointer into a vector is not dependent on the object size. This may be a performance savings depending on the object size.
Accessing the objects takes a performance hit. There are two dereferences before you get to the object, and most processors don't follow pointers when loading their data cache. This may be a performance hit because the processor may have to reload the data cache when dereferencing the pointer to the object.
Vector of Smart Pointers
A little more costly in performance than a raw pointer, but the items will automatically be deleted when the vector is destroyed. With raw pointers, the objects must be deleted before the vector is destroyed, or a memory leak is created.
Summary
The safest version is to store copies in the vector, but it takes performance hits depending on the size of the object and how often the reserved memory area has to be reallocated. A vector of pointers takes a performance hit because of the double dereferencing, but doesn't incur extra copying costs because pointers are a consistent size. A vector of smart pointers may take additional performance hits compared to a vector of raw pointers.
The real truth can be found by profiling the code. The performance savings of one data structure versus another may disappear when waiting for I/O operations, such as networking or file I/O.
Operations on the data structures may need to be performed a huge number of times for the savings to be significant. For example, if the difference between the worst-performing data structure and the best is 10 nanoseconds per operation, you need at least 1E+6 operations before the total difference even reaches 10 milliseconds. If a whole second is what counts as significant, expect to access the data structures on the order of 1E+8 times.
I suggest picking one data structure and moving on. Your time developing the code is worth more than the time that the program runs. Safety and Robustness are also more important. An unsafe program will consume more of your time fixing issues than a safe and robust version.
For a Plain Old Data (POD) type, a vector of that type is always more efficient than a vector of pointers to that type at least until sizeof(POD) > sizeof(POD*).
Almost always, the same is true for a POD type at least until sizeof(POD) > 2 * sizeof(POD*), due to superior memory locality and lower total memory usage compared to dynamically allocating the pointed-to objects.
This kind of analysis will hold true up until sizeof(POD) crosses some threshold for your architecture, compiler and usage that you would need to discover experimentally through benchmarking. The above only puts lower bounds on that size for POD types.
It is difficult to say anything definitive about all non-POD types as their operations (e.g. - default constructor, copy constructors, assignment, etc.) can be as inexpensive as a POD's or arbitrarily more expensive.
C++ standard section 3.7.3.1 says that
"The order, contiguity, and initial value of storage allocated by
successive calls to an allocation function is unspecified."
What do "order" and "contiguity" mean? Why are they unspecified? Why is the initial value also unspecified?
order
Means that the allocator is not constrained to returning addresses that are steadily increasing or decreasing (or follow any other pattern). This makes sense, as memory is usually recycled and reused many times during a program's lifetime. It would be possible to require that the order of storage be defined in some way, but there would be very little gain (if any) as opposed to considerable overhead, both in code execution and memory efficiency.
In other words, if B is allocated after A, and C is allocated after B, then A, B, and C may appear in any ordering (ABC, BAC, CAB, CBA, ...) in memory. You know that each of the addresses of A, B, and C is valid, but nothing more.
As pointed out by @Deduplicator, there exists a well-known issue in concurrent programming called "the ABA problem". This is, indirectly, a consequence of the fact that a newly allocated object could in principle have any address (note that this is somewhat advanced).
The ABA problem occurs when you use a compare-exchange instruction to atomically modify memory locations, for example in a lockfree list or queue. Your assumption is that the compare-exchange detects when someone else has concurrently modified the pointer that you're trying to modify. However, the truth is that it only detects whether the bit pattern of the address is different. Now it may happen that you free an object and push another object (asking the allocator for storage) to the container and the allocator gives you back the exact same address -- this is absolutely legal. Another thread uses the compare-exchange instruction, and it passes just fine. Bang!
To guard against this, lockfree structures often pair each pointer with a built-in version counter (tag).
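A minimal sketch of that idea, assuming a 64-bit counter packed next to the pointer. This only illustrates the technique: such a 16-byte atomic may not be lock-free on every platform, and a real implementation needs far more care.

#include <atomic>
#include <cstdint>

struct Node { int value; Node* next; };

// A "tagged" head pointer: the tag is bumped on every successful update, so
// even if an address is freed and handed out again, the bit pattern that
// compare_exchange compares will still differ.
struct TaggedPtr {
    Node*         ptr;
    std::uint64_t tag;
};

std::atomic<TaggedPtr> head{TaggedPtr{nullptr, 0}};

void push(Node* n) {
    TaggedPtr old_head = head.load();
    TaggedPtr new_head;
    do {
        n->next      = old_head.ptr;
        new_head.ptr = n;
        new_head.tag = old_head.tag + 1;   // same address, different tag: ABA defeated
    } while (!head.compare_exchange_weak(old_head, new_head));
}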
contiguity
This is a similar constraint, which basically says that there may or may not be "holes" between whatever the allocator returns in subsequent allocations.
This means, if you allocate an object of size 8, and then allocate another object, the allocator might return address A and address A+8. But it might also return address A+10 instead for the second allocation at its own discretion.
This is something that regularly happens with almost every allocation because allocators often store metadata in addition to the actual object, and usually organize memory in terms of "buckets" down to a particular size (most often 16 bytes).
So if you allocate an integer (usually 4 bytes) or a pointer (usually 4 or 8 bytes), then there will be a "hole" between this and the next thing you allocate.
Again, it would be possible to require that an allocator return objects contiguously, but this would mean a serious performance impact compared to the current scheme, which is relatively cheap.
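You can observe what your particular allocator actually does, though nothing about the output is guaranteed. A small sketch:

#include <iostream>

int main() {
    // Nothing about the relative positions of these allocations is guaranteed;
    // this only shows what your allocator happens to do, including any "holes".
    int*  a = new int(1);
    int*  b = new int(2);
    char* c = new char[8];

    std::cout << "a: " << static_cast<void*>(a) << '\n'
              << "b: " << static_cast<void*>(b) << '\n'
              << "c: " << static_cast<void*>(c) << '\n';

    delete a;
    delete b;
    delete[] c;
}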
initial value
This means no more and no less than that you must properly initialize your objects and cannot assume that they have any particular value. No, memory will not be zero-initialized automatically[1].
Requiring the allocator to zero-initialize memory would be possible, but it would make allocation less efficient than it could be (although the cost would not be as severe as for the other two requirements).
[1] Well, in some cases, it will be. The operating system will normally zero out all pages before giving a new page to a process for the first time. This is done for security, so no secret/confidential data is leaked. That is, however, an implementation detail which is out of the scope of the C++ standard.
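A quick way to see the difference in the language itself (a sketch; reading the indeterminate value would be undefined behavior, so it stays commented out):

#include <iostream>

int main() {
    int* p = new int;     // default-initialized: the value is indeterminate
    int* q = new int();   // value-initialized: guaranteed to be 0

    std::cout << *q << '\n';   // fine, prints 0
    // std::cout << *p;        // don't: reading an indeterminate value

    delete p;
    delete q;
}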
It essentially means that the new operator can allocate memory from wherever in the system it deems appropriate, and that you should not rely on any ordering of allocations in your program.
Let's discuss a case where I have a huge std::vector. I need to iterate over all elements and call a print function. There are two cases: either I store my objects in the vector, so the objects are next to each other in memory, or I allocate my objects on the heap and store pointers to them in the vector. In that case the objects are scattered all over the RAM.
When copies of the objects are stored in std::vector<A>, and the CPU brings data from RAM into its cache, it brings a whole chunk of memory that contains multiple elements of the vector. So when you iterate over the elements and call a function on each, you know that several elements will be processed before the CPU has to go back to RAM for the next chunk. That is good, because the CPU wastes few cycles waiting.
What about the case of std::vector<A*>? When the CPU brings in a chunk of pointers, is it easy for it to obtain the objects behind those pointers? Or does it have to request each object from RAM, causing cache misses and wasted CPU cycles? Is that worse than the case above in terms of performance?
At least in a typical case, when the CPU fetches a pointer (or a number of pointers) from memory, it will not automatically fetch the data to which those pointers refer.
So, in the case of the vector of pointers, when you load the item that each of those pointers refers to, you'll typically get a cache miss, and access will be substantially slower than if they were stored contiguously. This is particularly true when/if each item is relatively small, so a number of them could fit in a single cache line (for some level of cache--keep in mind that a current processor will often have two or three levels of cache, each of which might have a different line size).
It may, however, be possible to mitigate this to some degree. You can overload operator new for a class to control allocations of objects of that class. Using this, you can at least keep objects of that class together in memory. That doesn't guarantee that the items in a particular vector will be contiguous, but could improve locality enough to make a noticeable improvement in speed.
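A sketch of that idea: a class-scoped operator new/delete that hands out memory from a simple fixed-size arena. The Arena and Thing names are invented for illustration, the arena is assumed single-threaded, and a real pool would need proper exhaustion handling and deallocation.

#include <cstddef>
#include <new>
#include <vector>

// Very naive fixed-size arena, purely illustrative.
class Arena {
    alignas(std::max_align_t) unsigned char buf_[1 << 16];
    std::size_t used_ = 0;
public:
    void* allocate(std::size_t n) {
        n = (n + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
        if (used_ + n > sizeof(buf_)) throw std::bad_alloc{};
        void* p = buf_ + used_;
        used_ += n;
        return p;
    }
    void deallocate(void*, std::size_t) noexcept { /* no-op: everything freed at once */ }
};

Arena g_arena;

struct Thing {
    int payload[8];

    // Class-scoped operator new/delete: every `new Thing` now draws from the
    // arena, so the objects end up close together in memory.
    static void* operator new(std::size_t n) { return g_arena.allocate(n); }
    static void  operator delete(void* p, std::size_t n) noexcept {
        g_arena.deallocate(p, n);
    }
};

int main() {
    std::vector<Thing*> v;
    for (int i = 0; i < 100; ++i)
        v.push_back(new Thing{});   // all Things come from the same arena
}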
Also note that the vector allocates its data via an Allocator object (which defaults to std::allocator<T>, which, in turn, uses new). Although the interface is kind of a mess so it's harder than you'd generally like, you can define an allocator to act differently if you wish. This won't generally have much effect on a single vector, but if (for example) you have a number of vectors (each of fixed size) and want them to use memory next to each other, you could do that via the allocator object.
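And here is a matching sketch of a minimal custom allocator that draws from a user-supplied buffer. BufferAllocator is a made-up name; it ignores alignment concerns and never really frees, so it only demonstrates the mechanism.

#include <cstddef>
#include <new>
#include <vector>

template <class T>
struct BufferAllocator {
    using value_type = T;

    unsigned char* buf;
    std::size_t    size;
    std::size_t*   used;   // shared cursor, so copies allocate from the same buffer

    BufferAllocator(unsigned char* b, std::size_t s, std::size_t* u)
        : buf(b), size(s), used(u) {}

    template <class U>
    BufferAllocator(const BufferAllocator<U>& o) : buf(o.buf), size(o.size), used(o.used) {}

    T* allocate(std::size_t n) {
        std::size_t bytes = n * sizeof(T);
        if (*used + bytes > size) throw std::bad_alloc{};
        T* p = reinterpret_cast<T*>(buf + *used);
        *used += bytes;
        return p;
    }
    void deallocate(T*, std::size_t) noexcept {}   // no-op in this sketch
};

template <class T, class U>
bool operator==(const BufferAllocator<T>& a, const BufferAllocator<U>& b) { return a.buf == b.buf; }
template <class T, class U>
bool operator!=(const BufferAllocator<T>& a, const BufferAllocator<U>& b) { return !(a == b); }

int main() {
    alignas(std::max_align_t) unsigned char storage[4096];
    std::size_t used = 0;

    BufferAllocator<int> alloc(storage, sizeof(storage), &used);
    std::vector<int, BufferAllocator<int>> v1(alloc), v2(alloc);
    v1.reserve(64);   // both vectors draw from the same 4 KB buffer,
    v2.reserve(64);   // so their element storage ends up next to each other
}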
If I store my objects in the vector, the objects will be next to each other in memory; or I allocate my objects on the heap
Regardless of using std::vector<A> or std::vector<A *>, the inner buffer of the vector will be allocated on the heap. You could, though, use an efficient memory pool to manage allocations and deletions, but you're still going to work with data on the heap.
Is it bad compared with the case above in the aspect of performance?
In the case of using std::vector<A *> without specialized memory management, you may get lucky and have the individual allocations land nicely next to each other in memory, but it is generally better to have the contiguous storage that std::vector<A> provides. With the vector of pointers, reallocating the whole vector moves fewer bytes (since pointers are usually smaller than regular structs), but it will suffer from poor locality of the memory accesses.
When it brings in a chunk of pointers, is it easy for the CPU to obtain the objects behind those pointers?
No, it isn't. The CPU doesn't know they're pointers (everything the CPU sees is just a bunch of bits, no semantics involved) until it executes an instruction that dereferences them.
Or does it have to request each object from RAM, causing cache misses and wasted CPU cycles?
That's right. The CPU will try to load the data that a cached pointer refers to, but that data is likely located somewhere far away from recently accessed memory, so it will be a cache miss.
Is it bad compared with the case above in the aspect of performance?
If the only thing you care about is accessing elements, then yes, it's bad. Yet in some cases a vector of pointers is preferable. Namely, if your objects don't support moving (C++11 isn't mainstream yet) then copying the vector becomes more expensive. Even if you don't copy your vector, you may not know the number of stored elements in advance, so you can't call reserve(n) beforehand. Then all your objects will be copied whenever the vector exhausts its capacity and is forced to reallocate.
But in the end it depends on the concrete type. If your objects are small (tiny structs, ints or floats) then it's obviously better to work with them by value, because the overhead of pointers would be too big.
I previously asked a question on Stack Overflow (if you are interested, here's the link: Passing by reference "advanced" concept?).
Interestingly, one of the answers intrigued me and I felt it deserves a separate question.
const int& x = 40;
If 40 happens to be a value in the CPU cache (an rvalue), then would you, by writing that line, have just reserved cache memory to hold the number 40 for the lifetime of your process? And isn't that a bad thing?
Thank you
The literal 40 almost certainly lives in some read-only memory, or directly in the machine code (for small values there are typically instructions that can set a register or memory location to an immediate value; bigger values would live somewhere as a constant). It doesn't live "in the cache". When you create a const reference to it, a temporary is constructed wherever the compiler sees fit to keep temporaries (probably on the stack). Whether this lives in any cache is up to the system.
If the address of this temporary is never taken, it may actually not even be created: All rules in the C++ standard are governed by the "as if"-rule. As a result, the reference and the literal would be identical. If the address of the const reference is ever taken, the compiler needs to decide where to put the object and you may, indeed, see a small performance impact.
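For instance, a tiny sketch:

int main() {
    const int& x = 40;   // a temporary int holding 40 is materialized
    const int* p = &x;   // taking its address forces the temporary to live somewhere,
                         // typically a stack slot; nothing is "reserved" in the cache
    return *p;
}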
You can't reserve space in the cache from your program.
It isn't really in your control. Cache-replacement decisions are made by the cache's own controller, which uses temporal and spatial locality, among other things, to decide which cache lines to replace and which to keep.
There are usually multiple copies of your data, on different caches and the virtual-memory address-space (maps to physical memory + swap).
The way memory is managed is far more complex than that. The system generates a virtual address every time it deals with memory.
This virtual address is translated into a physical address. The data it refers to may currently sit in a cache, in physical memory, or in swap; it does not necessarily map to one fixed piece of storage. If it has been swapped out, a page fault occurs and that page is loaded back into memory (possibly through multiple levels).
The low level operations like cache management are not affected by your decisions at this level.
Is dereferencing a pointer notably slower than just accessing that value directly? I suppose my question is - how fast is the dereference operator?
Going through a pointer indirection can be much slower because of how a modern CPU works, but it has little to do with memory consumption at run time.
Instead, speed is affected by prediction and cache.
Prediction is easy when the pointer has not been changed or when it is changed in predictable ways (for example, increment or decrement by four in a loop). This allows the CPU to essentially run ahead of the actual code execution, figure out what the pointer value is going to be, and load that address into cache. Prediction becomes impossible when the pointer value is built by a complex expression like a hash function.
Cache comes into play because the pointer might point into memory that isn't in cache and it will have to be fetched. This is minimized if prediction works but if prediction is impossible then in the worst case you can have a double impact: the pointer is not in cache and the pointer target is not in cache either. In that worst-case the CPU would stall twice.
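As an illustration, both loops below do the same arithmetic, but the first walks memory with a trivially predictable address stream, while the second, driven by a pre-shuffled index array, defeats prefetching and tends to miss in cache once the data outgrows it. The names are made up; this is just a sketch.

#include <cstddef>
#include <vector>

long long sum_sequential(const std::vector<int>& data) {
    long long s = 0;
    for (int v : data)                      // next address = previous + sizeof(int)
        s += v;
    return s;
}

long long sum_scattered(const std::vector<int>& data,
                        const std::vector<std::size_t>& shuffled_indices) {
    long long s = 0;
    for (std::size_t i : shuffled_indices)  // next address is effectively random
        s += data[i];
    return s;
}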
If the pointer is used for a function pointer, the CPU's branch predictor comes into play. In C++ virtual tables, the function values are all constant and the predictor has it easy. The CPU will have the code ready to run and in the pipeline when execution goes through the indirect jump. But, if it is an unpredictable function pointer the performance impact can be heavy because the pipeline will need to be flushed which wastes 20-40 CPU cycles with each jump.
Depends on stuff like:
whether the "directly accessed" value is in a register already, or on the stack (that's also a pointer indirection)
whether the target address is in cache already
the cache architecture, bus architecture etc.
i.e., too many variables to usefully speculate about without narrowing it down.
If you really want to know, benchmark it on your specific hardware.
It requires one more memory access:
read the address stored into the pointer variable
read the value at the address read
This may amount to more than two simple operations, because it can also take extra time if the address being accessed is not already loaded in the cache.
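In code, the difference is simply the following (a sketch; in practice the optimizer will often keep the pointer, or even the value, in a register):

int main() {
    int  value = 42;
    int* ptr   = &value;

    int a = value;    // one load: the value itself
    int b = *ptr;     // conceptually two loads: first the pointer, then the pointee
    return a + b;
}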
Assuming you're dealing with a real pointer (not a smart pointer of some sort), the dereference operation doesn't consume (data) memory at all. It does (potentially) involve an extra memory reference though: one to load the pointer itself, the other to access the data pointed to by the pointer.
If you're using a pointer in a tight loop, however, it'll normally be loaded into a register for the duration. In this case, the cost is mostly in terms of extra register pressure (i.e., if you use a register to store that pointer, you can't use it to store something else at the same time). If you have an algorithm that would otherwise exactly fill the registers, but enregistering a pointer makes something spill to memory, it can make a difference. At one time, that was a pretty big loss, but with most modern CPUs (with more registers and on-board cache) that's rarely a big issue. The obvious exception would be an embedded CPU with fewer registers and no cache (and without on-chip memory).
The bottom line is that it's usually pretty negligible, often below the threshold where you can even measure it dependably.
It does. It costs an extra fetch.
When accessing a variable by value, the variable is read directly from its memory location.
Accessing the same variable through a pointer adds the overhead of first fetching the address from the pointer and then reading the value from that memory location.
Of course, this assumes that the variable is not placed in a register, which it would be in some scenarios like tight loops. I believe the question seeks the overhead assuming no such scenario.