Is every element access in std::vector a cache miss? - c++

It's known that std::vector holds its data on the heap, so the vector object itself and its first element have different addresses. On the other hand, std::array is a lightweight wrapper around a raw array, and its address is equal to its first element's address.
Let's assume that the collections are big enough to fill the whole L1 cache with int32 values. On my machine, with 384 kB of L1 cache, that's 98304 numbers.
If I iterate over the std::vector, it seems that I always access the address of the vector object first and only then the element's address. These addresses are probably not in the same cache line, so every element access would be a cache miss.
But if I iterate over a std::array, the addresses are in the same cache line, so it should be faster.
I tested with VS2013 with full optimization and std::array is approx 20% faster.
Am I right in my assumptions?
Update (to avoid creating a second, similar topic): in this code I have an array and a local variable:
#include <array>

void test(std::array<int, 10>& arr)
{
    int m{ 42 };
    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}
In the loop I'm accessing both the array and a stack variable, which are placed far from each other in memory. Does that mean that on every iteration I'll access different memory locations and miss the cache?

Many of the things you've said are correct, but I do not believe that you're seeing cache misses at the rate that you believe you are. Rather, I think you're seeing other effects of compiler optimizations.
You are right that when you look up an element in a std::vector, there are two memory reads: first, a read of the pointer to the elements; second, a read of the element itself. However, if you do multiple sequential reads on the std::vector, then chances are that only the very first read will miss the cache on the elements, and all successive reads will either hit the cache or be misses you would incur with a plain array as well. Memory caches are optimized for locality of reference, so whenever a single address is pulled into the cache, a large number of adjacent memory addresses are pulled in as well. As a result, if you iterate over the elements of a std::vector, most of the time you won't have any cache misses at all, and the performance should look quite similar to that of a regular array. It's also worth remembering that the cache stores many different memory locations, not just one, so the fact that you're reading both something on the stack (the std::vector's internal pointer) and something on the heap (the elements), or two different elements on the stack, won't by itself cause a cache miss.
Something to keep in mind is that cache misses are extremely expensive compared to cache hits - often 10x slower - so if you were indeed seeing a cache miss on each element of the std::vector you wouldn't see a gap of only 20% in performance. You'd see something a lot closer to a 2x or greater performance gap.
So why, then, are you seeing a difference in performance? One big factor that you haven't yet accounted for is that if you use a std::array<int, 10>, then the compiler can tell at compile-time that the array has exactly ten elements in it and can unroll or otherwise optimize the loop you have to eliminate unnecessary checks. In fact, the compiler could in principle replace the loop with 10 sequential blocks of code that all write to a specific array element, which might be a lot faster than repeatedly branching backwards in the loop. On the other hand, with equivalent code that uses std::vector, the compiler can't always know in advance how many times the loop will run, so chances are it can't generate code that's as good as the code it generated for the array.
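To make that concrete, here is a sketch (purely illustrative, not actual compiler output) of what the compiler is in principle allowed to turn the question's test function into when the trip count is the compile-time constant 10:

#include <array>

// Conceptually equivalent to the original loop over std::array<int, 10>,
// fully unrolled because the element count is known at compile time.
void test_unrolled(std::array<int, 10>& arr)
{
    const int m = 42;
    arr[0] = 0 * m;  arr[1] = 1 * m;  arr[2] = 2 * m;
    arr[3] = 3 * m;  arr[4] = 4 * m;  arr[5] = 5 * m;
    arr[6] = 6 * m;  arr[7] = 7 * m;  arr[8] = 8 * m;
    arr[9] = 9 * m;
}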
Then there's the fact that the code you've written here is so small that any attempt to time it is going to have a ton of noise. It would be difficult to assess how fast this is reliably, since something as simple as just putting it into a for loop would mess up the cache behavior compared to a "cold" run of the method.
Overall, I wouldn't attribute this to cache misses, since I doubt there's any appreciably different number of them. Rather, I think this is compiler optimization on arrays whose sizes are known statically compared with optimization on std::vectors whose sizes can only be known dynamically.

I think it has nothing to do with cache misses.
You can think of std::array as a wrapper around a raw array, i.e. int arr[10], and of vector as a wrapper around a dynamic array, i.e. new int[10]. They should have the same performance. However, when you access a vector, you operate on the dynamic array through a pointer. Normally the compiler can optimize code that uses arrays better than code that uses pointers, and that might be the reason for your test result: std::array is faster.
You can run a test replacing std::array with int arr[10]. Although std::array is just a wrapper around int arr[10], you might get even better performance (in some cases the compiler can do better optimization with a raw array). You can also run a test replacing the vector with new int[10]; they should have equal performance.
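For instance, a minimal sketch of those comparison variants (the function names are mine; each does the same work as the question's loop):

#include <array>
#include <cstddef>
#include <vector>

// Four variants doing the same work; compare the generated code and timings.
void fill_raw(int (&arr)[10])                 { for (int i = 0; i < 10; ++i) arr[i] = i * 42; }
void fill_std_array(std::array<int, 10>& a)   { for (int i = 0; i < 10; ++i) a[i] = i * 42; }
void fill_heap(int* p, int n)                 { for (int i = 0; i < n; ++i)  p[i] = i * 42; }   // p from new int[10]
void fill_vector(std::vector<int>& v)         { for (std::size_t i = 0; i < v.size(); ++i) v[i] = static_cast<int>(i) * 42; }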
For your second question: the local variable m will be kept in a register (if the code is optimized properly), so there will be no access to the memory location of m during the for loop. It won't be a cache-miss problem either.

Related

Faster alternative to push_back (size is known)

I have a float vector. As I process certain data, I push it back. I always know what the size will be while declaring the vector.
For the largest case, it is 172,490,752 floats. This takes about eleven seconds just to push_back everything.
Is there a faster alternative, like a different data structure or something?
If you know the final size, then reserve() that size after you declare the vector. That way it only has to allocate memory once.
Also, you may experiment with using emplace_back() although I doubt it will make any difference for a vector of float. But try it and benchmark it (with an optimized build of course - you are using an optimized build - right?).
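A minimal sketch of that suggestion (the element count is the one from the question; the values pushed are just placeholders):

#include <cstddef>
#include <vector>

int main()
{
    const std::size_t n = 172490752;          // final size known in advance
    std::vector<float> v;
    v.reserve(n);                             // one allocation up front, no reallocation later
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<float>(i));   // placeholder data
}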
The usual way of speeding up a vector when you know the size beforehand is to call reserve on it before using push_back. This eliminates the overhead of reallocating memory and copying the data every time the previous capacity is filled.
Sometimes for very demanding applications this won't be enough. Even though push_back won't reallocate, it still needs to check the capacity every time. There's no way to know how bad this is without benchmarking, since modern processors are amazingly efficient when a branch is always/never taken.
You could try resize instead of reserve and use array indexing, but resize value-initializes (zeroes) every element; this is a waste if you know you're going to set a new value for every element anyway.
An alternative would be to use std::unique_ptr<float[]> and allocate the storage yourself.
Another option is ::boost::container::stable_vector. Notice that allocating a contiguous block of 172 million * 4 bytes (several hundred MB) might easily fail and requires quite a lot of page juggling; stable_vector avoids the single huge contiguous allocation by placing elements in separately allocated nodes. You may also want to populate it in parallel.
You could use a custom allocator which avoids value-initialising (zeroing) all the elements, as discussed in this answer, in conjunction with ordinary element access:
const size_t N = 172490752;
std::vector<float, uninitialised_allocator<float>> vec(N);
for (size_t i = 0; i != N; ++i)
    vec[i] = the_value_for(i);
This avoids (i) value-initializing (zeroing) all elements, (ii) checking the capacity on every push, and (iii) reallocation, while at the same time preserving all the convenience of using std::vector (rather than std::unique_ptr<float[]>). However, the allocator is part of the vector's type, so you will need to use generic code rather than code that expects plain std::vector<float>.
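For reference, a minimal sketch of such an allocator adaptor (this follows the widely used "default-init allocator" pattern; the exact allocator from the linked answer may differ in details):

#include <memory>
#include <new>
#include <type_traits>
#include <utility>

// Allocator adaptor that turns value-initialization into default-initialization,
// so that vector's fill-constructor leaves trivial elements (like float) uninitialized.
template <typename T, typename A = std::allocator<T>>
class uninitialised_allocator : public A
{
    using a_t = std::allocator_traits<A>;
public:
    template <typename U>
    struct rebind
    {
        using other = uninitialised_allocator<U, typename a_t::template rebind_alloc<U>>;
    };

    using A::A;

    // construct with no arguments: default-initialization (a no-op for float)
    template <typename U>
    void construct(U* p) noexcept(std::is_nothrow_default_constructible<U>::value)
    {
        ::new (static_cast<void*>(p)) U;
    }

    // any other construction is forwarded to the underlying allocator
    template <typename U, typename... Args>
    void construct(U* p, Args&&... args)
    {
        a_t::construct(static_cast<A&>(*this), p, std::forward<Args>(args)...);
    }
};

With an adaptor like this, vec(N) in the snippet above allocates the storage but leaves the floats uninitialized until you assign them.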
I have two answers for you:
As previous answers have pointed out, using reserve to allocate the storage beforehand can be quite helpful, but:
push_back (or emplace_back) itself carries a performance penalty: on every call it has to check whether the vector must be reallocated. If you already know the number of elements you will insert, you can avoid this penalty by setting the elements directly with the access operator [].
So the most efficient way I would recommend is:
Initialize the vector with the 'fill'-constructor:
std::vector<float> values(172490752, 0.0f);
Set the entries directly using the access operator:
values[i] = some_float;
++i;
The reason push_back is slow is that it will need to copy all the data several times as the vector grows, and even when it doesn’t need to copy data it needs to check. Vectors grow quickly enough that this doesn’t happen often, but it still does happen. A rough rule of thumb is that every element will need to be copied on average once or twice; the earlier elements will need to be copied a lot more, but almost half the elements won’t need to be copied at all.
You can avoid the copying, but not the checks, by calling reserve on the vector when you create it, ensuring it has enough space. You can avoid both the copying and the checks by creating it with the right size from the beginning, by giving the number of elements to the vector constructor, and then inserting using indexing as Tobias suggested; unfortunately, this also goes through the vector an extra time initializing everything.
If you know the number of floats at compile time and not just runtime, you could use an std::array, which avoids all these problems. If you only know the number at runtime, I would second Mark’s suggestion to go with std::unique_ptr<float[]>. You would create it with
size_t size = /* Number of floats */;
auto floats = std::unique_ptr<float[]>{new float[size]};
You don’t need to do anything special to delete this; when it goes out of scope it will free the memory. In most respects you can use it like a vector, but it won’t automatically resize.

What is the theoretical impact of direct index access with "high" memory usage vs. "shifted" index access with "low" memory usage?

Well, I am really curious which practice is better to keep. I know it (probably?) does not make any performance difference at all (even in performance-critical applications?), but I am more curious about the impact on the generated code with optimization in mind (and, for the sake of completeness, also on "performance", if it makes any difference).
So the problem is as following:
Element indexes range from A to B, where A > 0 and B > A (e.g., A = 1000 and B = 2000).
To store information about each element there are a few possible solutions, two of those which use plain arrays include direct index access and access by manipulating the index:
example 1
//declare the array with less memory, "just" 1000 elements, all elements used
std::array<T, B-A> Foo;
//but make accessing by index slower?
//accessing index N where B > N >= A
Foo[N-A];
example 2
//or declare the array with more memory, 2000 elements, 50% elements not used, not very "efficient" for memory
std::array<T, B> Foo;
//but make accessing by index faster?
//accessing index N where B > N >= A
Foo[N];
I'd personally go for #2 because I really like performance, but I think in reality:
Will the compiler take care of both situations?
What is the impact on optimizations?
What about performance?
Does it matter at all?
Or is this just the next "micro-optimization" thing that no human being should worry about?
Is there some recommended tradeoff ratio between memory usage and speed?
Accessing any array by index involves multiplying the index by the element size and adding the result to the base address of the array itself.
Since we are already adding one number to another, the adjustment for Foo[N-A] can easily be done by adjusting the base address down by A * sizeof(T) before adding N * sizeof(T), rather than actually calculating (N-A) * sizeof(T) at run time.
In other words, any decent compiler should completely hide this subtraction, assuming A is a constant value.
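As an illustration (using the A = 1000, B = 2000 values from the question):

#include <array>
#include <cstddef>

constexpr std::size_t A = 1000, B = 2000;
std::array<int, B - A> Foo;

int get(std::size_t N)
{
    // &Foo[N - A] == Foo.data() + (N - A)
    //             == (Foo.data() - A) + N
    // Because A is a compile-time constant, the compiler can fold the "- A"
    // into the base address it uses, so this costs the same as Foo[N] in example 2.
    return Foo[N - A];
}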
If it's not a constant (say, you are using std::vector instead of std::array), then you will indeed subtract A from N at some point in the code. It is still pretty cheap to do: most modern processors can do this in one cycle with no extra latency for the result, so at worst it adds a single clock cycle to the access.
Of course, if the numbers are 1000-2000, it probably makes very little difference in the whole scheme of things: either the total time to process them is nearly nothing, or it's a lot because you are doing complicated stuff. But if you were to make it a million elements, offset by half a million, it may make the difference between a simple and a complex method of allocating them, or some such.
Also, as Hans Passant implies: modern OSes with virtual memory handling don't back memory that isn't actually used with "real" (physical) memory. At work I was investigating a strange crash on a board that has 2GB of RAM, and when viewing the memory usage, it showed that this one application had allocated 3GB of virtual memory. The board does not have a swap disk (it's an embedded system). It turned out that some code was simply allocating large chunks of memory that weren't filled with anything, and it only stopped working when it reached 3GB (32-bit processor, 3+1GB memory split between user and kernel space). So even for LARGE lumps of memory, if you only use half of it, the untouched part won't actually take up any RAM as long as you never access it.
As ALWAYS when it comes to performance, compilers and such: if it's important, do not trust "the internet" to tell you the answer. Set up a test with the code you actually intend to use, using the actual compiler(s) and processor type(s) you plan to build your code with and for, and run benchmarks. Some compiler may well have a misfeature (on processor type XYZ9278) that makes it produce horrible code for a case that most other compilers handle "with no overhead at all".

Cache performance degradation due to physical layout of data

Each memory address "maps" to its own cache set in the CPU cache(s), based on a modulo operation on the address.
Is there a way in which accessing two identically-sized arrays, like so:
int* array1; //How does the alignment affect the possibility of cache collisions?
int* array2;
for(int i=0; i<array1.size(); i++){
x = array1[i] * array2[i]; //Can these ever not be loaded in cache at same time?
}
can cause a performance decrease because array1[i] and array2[i] map to the same cache set (the same modulo result)? Or would this actually be a performance increase, because only one cache line would have to be loaded to obtain two data locations?
Would somebody be able to give an example of the above showing performance changes due to cache mappings, including how the alignment of the arrays would affect this?
(The reason for my question is that I am trying to understand when a performance problem occurs due to data alignment/address mappings to the same cache line, which causes one of the pieces of data to not be stored in the cache)
NB: I may have mixed up the terms cache "line" and "set" - please feel free to correct.
Right now your code doesn't make much sense, as you didn't allocate any memory for the arrays. The pointers are just two uninitialized variables sitting on the stack and pointing at nothing. Also, an int* doesn't have a size() function.
Assuming you fix all that: if you do the allocation yourself, you can decide whether or not to lay the data out contiguously. You could allocate 2*N integers through one pointer and have the other pointer point to the middle of that region, as in the sketch below.
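A minimal sketch of that idea (N and the element values are illustrative):

#include <cstddef>

int main()
{
    const std::size_t N = 1024;
    int* block  = new int[2 * N]();   // one contiguous, zero-initialized allocation
    int* array1 = block;              // first half
    int* array2 = block + N;          // second half, immediately after the first

    long long x = 0;
    for (std::size_t i = 0; i < N; ++i)
        x += array1[i] * array2[i];   // the two accesses are exactly N * sizeof(int) apart

    delete[] block;
    return static_cast<int>(x);       // returned so the loop isn't optimized away
}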
The main consideration here is this - if the arrays are small enough as to not wrap around your desired cache level, having them mapped contiguously will avoid having to share the same cache sets between them. This may improve performance since simultaneous accesses to the same sets are often non-optimal due to HW considerations.
The thrashing consideration (will the two arrays throw each other's lines out of the cache?) is not really a problem, as most caches today enjoy some level of associativity: the arrays can map to the same sets but live in different cache ways. If the arrays are too big and together exceed the total number of ways, their address ranges wrap around the cache-set mapping several times, in which case it doesn't really matter how they're aligned; you're still going to collide with some lines of the other array.
For example, if you had 4 sets and 2 ways in the cache and tried mapping two arrays of 64 ints with an alignment offset, you'd still fill out your entire cache:
          way0          way1
set 0     array1[0]     array2[32]
set 1     array1[16]    array2[48]
set 2     array1[32]    array2[0]
set 3     array1[48]    array2[16]
but as mentioned above - accesses within the same iteration would go to different sets, which may have some benefit.

std::list vs std::vector iteration

It is said that iterating through a vector (as in reading all its elements) is faster than iterating through a list, because of better cache utilization.
Is there any resource on the web that would quantify how much this impacts performance?
Also, would it be better to use a custom linked list whose elements are preallocated so that they are consecutive in memory?
The idea behind that is that I want to store the elements in a certain order that won't change. I still need to be able to insert elements in the middle quickly at run time, but most of them will still be consecutive, because the order won't change.
Does the fact that the elements are consecutive have an impact on the cache, or does it not improve anything because I'll still call list_element->next instead of ++list_element?
The main difference between a vector and a list is that in a vector the elements are constructed one after another inside a preallocated buffer, while in a list the elements are constructed one by one, each in its own allocation.
As a consequence, the elements in a vector are guaranteed to occupy contiguous memory, while the elements of a list (barring specific situations, like a custom allocator that works that way) aren't guaranteed to be contiguous and can be "sparse" around the memory.
Now, the processor operates through a cache (which can be orders of magnitude faster than main RAM) that holds copies of chunks of main memory, so consecutive elements will very probably sit in the same cached chunk and be moved into the cache together when iteration begins. As iteration proceeds, almost everything happens in the cache without further movement of data or further access to the slower RAM.
With lists, since the elements are scattered around memory, "going to the next one" means following a pointer to an address that may be nowhere near the previous one, so the cache has to be refilled at many iteration steps, accessing the slower RAM much more often.
The performance difference depends greatly on the processor, on the type of memory used for both the main RAM and the cache, and on the way std::allocator (and ultimately operator new and malloc) is implemented, so a general number cannot be given.
(Note: a great difference means slow RAM relative to the cache, but it may also mean a bad implementation of list.)
The efficiency gains from cache locality due to the compact representation of data structures can be rather dramatic. In the case of vectors compared to lists, the compact representation can be better not only for reads but even for insertion (which requires shifting elements in a vector) up to on the order of 500K elements, for one particular architecture, as demonstrated in Figure 3 of this article by Bjarne Stroustrup:
http://www2.research.att.com/~bs/Computer-Jan12.pdf
(Publisher site: http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2011.353)
I think that if this is a critical factor for your program, you should profile it on your architecture.
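If you do profile it, a minimal timing sketch could look like this (the element count is arbitrary, and std::accumulate just stands in for "read every element"):

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Times a full traversal of a container, in microseconds.
template <typename Container>
long long traversal_time_us(const Container& c, long long& sum_out)
{
    auto t0 = std::chrono::steady_clock::now();
    sum_out = std::accumulate(c.begin(), c.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main()
{
    const int n = 1000000;
    std::vector<int> v(n, 1);
    std::list<int>   l(n, 1);
    long long sv = 0, sl = 0;
    std::cout << "vector: " << traversal_time_us(v, sv) << " us (sum " << sv << ")\n"
              << "list:   " << traversal_time_us(l, sl) << " us (sum " << sl << ")\n";
}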
I'm not sure I can explain it perfectly, but here's my view (I'm thinking along the lines of the translated machine instructions):
Vector iterator (contiguous memory):
When you increment a vector iterator, the size of the element type (known at compile time) is simply added to the iterator value so that it points to the next object. On most CPUs this is anything from one to three instructions at most.
List iterator (linked list, http://www.sgi.com/tech/stl/List.html):
When you increment a list iterator, the location of the forward link is found by adding an offset to the address of the currently pointed-to node, and that link is then loaded from memory as the new value of the iterator. There is an extra memory access involved, so this is slower than the vector iterator increment.
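A stripped-down sketch of the two increments described above (real iterators are class types, but the underlying operations are essentially these):

// Vector iterator: advancing is pointer arithmetic on contiguous storage.
int* advance_vector_like(int* it)
{
    return it + 1;              // one add of sizeof(int)
}

// List node: advancing means loading the 'next' pointer from memory.
struct Node
{
    int   value;
    Node* next;
};

Node* advance_list_like(Node* it)
{
    return it->next;            // an extra memory load to follow the link
}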

std::sort on container of pointers

I want to explore the performance differences for multiple dereferencing of data inside a vector of new-ly allocated structs (or classes).
struct Foo
{
    int val;
    // some variables
};
std::vector<Foo*> vectorOfFoo;
// Foo objects are new-ed and pushed in vectorOfFoo
for (int i = 0; i < N; ++i)
{
    Foo* f = new Foo;
    vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector I would like to enhance locality of reference through the many iterator dereferences. For example, I very often have to perform a double nested loop such as
for (std::vector<Foo*>::iterator iter1 = vectorOfFoo.begin(); iter1 != vectorOfFoo.end(); ++iter1)
{
    int somevalue = (*iter1)->val;
}
Obviously, if the pointers inside vectorOfFoo point to objects that are very far apart, I think locality of reference is somewhat lost.
What about the performance if I sort the vector before iterating over it? Should I then get better performance from the repeated dereferencing?
Am I guaranteed that consecutive new calls allocate objects which are close to each other in memory?
Just to answer your last question: no, there is no guarantee whatsoever about where new allocates memory. The allocations can be distributed throughout memory. Depending on the current fragmentation of the memory you may be lucky that they are sometimes close to each other, but no guarantee is - or, actually, can be - given.
If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
But that's pointless without profiling.
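A minimal sketch of the pool idea (not a full pool allocator; it simply owns all the Foo objects in one contiguous block, so the pointers in vectorOfFoo point at adjacent memory):

#include <cstddef>
#include <vector>

struct Foo
{
    int val;
    // some variables
};

int main()
{
    const std::size_t N = 100000;
    std::vector<Foo>  pool(N);       // one contiguous block owns every Foo
    std::vector<Foo*> vectorOfFoo;   // the original pointer vector, now pointing into the pool
    vectorOfFoo.reserve(N);
    for (std::size_t i = 0; i < N; ++i)
        vectorOfFoo.push_back(&pool[i]);
    // Note: 'pool' must not be resized afterwards, or the pointers become dangling.
}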
It depends on many factors.
First, it depends on how the objects that are pointed to from the vector were allocated. If they were allocated on different pages, you cannot help it except by fixing the allocation part and/or trying to use software prefetching.
You can generally check what virtual addresses malloc gives out, but as part of a larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
In the case of a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node your process is running on. Otherwise, no matter what you do, the memory will be coming from the other node and you cannot do much about it, except migrate your program back to its "home" node.
You have to check the stride needed to jump from one object to the next. The prefetcher can recognize strides within a 512-byte window. If the stride is greater, you are talking about random memory access from the prefetcher's point of view; it will then shut itself off so as not to evict your data from the cache, and the best you can do is try software prefetching, which may or may not help (always test it).
So if sorting the vector of pointers means the objects they point to get visited one after another with a relatively small stride, then yes, you will improve the memory access speed by making it friendlier for the prefetch hardware.
You also have to make sure that the cost of sorting the vector doesn't outweigh the gain, as in the sketch below.
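A sketch of what that sort looks like (std::less<Foo*> gives a well-defined total order over pointers, whereas a raw '<' on pointers into unrelated allocations is unspecified):

#include <algorithm>
#include <functional>
#include <vector>

struct Foo { int val; };

// Order the pointers by the addresses they hold, so iteration walks memory roughly front to back.
void sort_by_address(std::vector<Foo*>& v)
{
    std::sort(v.begin(), v.end(), std::less<Foo*>());
}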
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is a tricky business and things can get worse even though in theory the performance should have improved. There are many tools that can help you profile memory access, for example cachegrind; Intel's VTune does the same, and there are many others. So don't guess: experiment and verify the results.