std::sort on container of pointers - c++

I want to explore the performance impact of repeatedly dereferencing data held in a vector of individually new-ed structs (or classes).
struct Foo
{
    int val;
    // some variables
};

std::vector<Foo*> vectorOfFoo;

// Foo objects are new-ed and pushed into vectorOfFoo
for (int i = 0; i < N; i++)
{
    Foo* f = new Foo;
    vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector, I would like to improve locality of reference across the many iterator dereferences. For example, I very often have to perform a doubly nested loop whose inner level looks like this:
for (std::vector<Foo*>::iterator iter1 = vectorOfFoo.begin(); iter1 != vectorOfFoo.end(); ++iter1)
{
    int somevalue = (*iter1)->val;
}
Obviously, if the pointers inside vectorOfFoo point far apart from one another, locality of reference is lost.
What about performance if I sort the vector before iterating over it? Should I expect the repeated dereferences to get faster?
Am I guaranteed that consecutive calls to new allocate objects that are close together in the memory layout?

Just to answer your last question: no, there is no guarantee whatsoever about where new allocates memory. The allocations can be scattered throughout memory. Depending on the current fragmentation of the heap you may be lucky and sometimes find them close to each other, but no guarantee is - or, actually, can be - given.

If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
But that's pointless without profiling.
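A minimal sketch of the pool idea, reusing the question's Foo (the names and the fixed capacity are illustrative assumptions; a real pool would also recycle freed slots and consider thread safety):

#include <cstddef>
#include <vector>

class FooPool {
    std::vector<Foo> slots_;  // one contiguous block backing every Foo
public:
    explicit FooPool(std::size_t capacity) { slots_.reserve(capacity); }
    Foo* create() {
        // Stay within the reserved capacity: a reallocation would move the
        // storage and invalidate every pointer handed out so far.
        slots_.push_back(Foo());
        return &slots_.back();
    }
};

Pointers obtained from consecutive create() calls are adjacent by construction, so iterating them in order walks memory sequentially.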

It depends on many factors.
First, it depends on how the objects pointed to from the vector were allocated. If they were allocated on different pages, you cannot help it other than by fixing the allocation part and/or trying software prefetching.
You can generally check what virtual addresses malloc gives out, but as part of a larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
In the case of a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node your process is running on. Otherwise, no matter what you do, the memory will come from the other node, and there is not much you can do in that case except migrate your program back to its "home" node.
You have to check the stride needed to jump from one object to the next. The hardware prefetcher can recognize a stride within a 512-byte window; if the stride is greater, you have effectively random memory access from the prefetcher's point of view. It will then shut itself off so as not to evict your data from the cache, and the best you can do there is to try software prefetching, which may or may not help (always test it).
So if sorting the vector of pointers leaves the pointed-to objects placed contiguously one after another with a relatively small stride - then yes, you will improve the memory access speed by making it friendlier for the prefetch hardware.
You also have to make sure that sorting the vector doesn't end up a net loss: the sort itself has a cost that the faster iteration has to amortize.
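If it does pay off, the sorting step itself is a one-liner (a sketch using the question's vectorOfFoo; std::less<Foo*> is used because it guarantees a total order over pointers, which the built-in < does not for pointers into unrelated allocations):

#include <algorithm>
#include <functional>

// Order the pointers by address so iteration walks memory monotonically.
std::sort(vectorOfFoo.begin(), vectorOfFoo.end(), std::less<Foo*>());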
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
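For instance, a sketch of the allocate-all-at-once variant using the question's setup (N as in the question):

// One allocation holds every Foo; the pointer vector still works as before.
std::vector<Foo> storage(N);
std::vector<Foo*> vectorOfFoo;
vectorOfFoo.reserve(N);
for (int i = 0; i < N; i++)
    vectorOfFoo.push_back(&storage[i]);
// Caveat: storage must outlive vectorOfFoo and must never be resized,
// or every stored pointer is invalidated.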
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is tricky business, and things can get worse even though in theory performance should have improved. There are many tools that can help you profile memory access - cachegrind, for example; Intel's VTune does the same, and there are many others. So don't guess: experiment and verify the results.

Related

Keep vector members of class contiguous with class instance

I have a class that implements two simple, pre-sized stacks; they are stored as vector members of the class, pre-sized by the constructor. They are small, cache-line-friendly objects.
Those two stacks are constant in size, persisted and updated lazily, and are often accessed together by some computationally cheap methods that, however, can be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in good shape (the code is clean and does what it's supposed to do), and all sizes are kept under control (64k to 128k in most cases for the whole chain of ops including results, rarely getting close to 256k, so at worst an L2 lookup and often an L1 one).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve {
private:
    std::vector<ControlPoint> m_controls;
    std::vector<Segment> m_segments;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    std::vector<unsigned int> m_sgSampleCount;
    unsigned int m_maxIter;
    unsigned int m_iterSamples;
    float m_lengthTolerance;
    float m_length;
};
Curve::Curve() {
    m_controls = std::vector<ControlPoint>(CONTROL_CAP);
    m_segments = std::vector<Segment>(CONTROL_CAP - 3);
    m_cvCount = 0;
    m_sgCount = 0;
    m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP - 3);  // was accidentally declared as a local, shadowing the member
    m_maxIter = 3;
    m_iterSamples = 20;
    m_lengthTolerance = 0.001f;
    m_length = 0.0f;
}

Curve::~Curve() {}
Bear with the verbosity, please - I'm trying to educate myself and make sure I'm not operating on some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data. Keep in mind this is on Intel CPUs (Ivy Bridge and a few Haswell) with GCC 4.4; I have no other use cases for this:
I'm assuming that if the actual storage of controls and segments is contiguous with the Curve instance, that's an ideal scenario for the cache (size-wise, the lot easily fits on my target CPUs).
A related assumption is that if the vectors are distant from the Curve instance, and from each other, then as methods alternately access the contents of those two members there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (data is pulled for the entire stretch of cache size from the address first looked up on a new operation, and not in convenient multiple segments of appropriate size), or am I misunderstanding the caching mechanism - can the cache pull and preserve multiple smaller stretches of RAM?
2) Following the above: so far, by pure circumstance, all my tests end up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally, instancing the class reserves only the space for the object itself, and the vectors are then allocated in the next free contiguous chunk available, which is not guaranteed to be anywhere near my Curve instance if that happened to land in some small emptier niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I take it that to guarantee performance I'd have to write an allocator of sorts that makes the class object itself large enough on instancing, then copy the vectors' contents in there myself and refer to those from then on.
I can probably hack my way to something like that if it's the only way through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about it. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc, it's not guaranteed contiguous" - that one I already have down :) ).
If the Curve instance fits into a cache line and the data of the two vectors also fits into a cache line each, the situation is not that bad: you then have four constant cache lines. If every element were accessed indirectly and randomly positioned in memory, every access to an element might cost you a fetch operation, which is avoided in that case. If Curve and its elements together fit into fewer than four cache lines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class, and you would not have the dynamic allocation at all (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator to put the vector contents in contiguous storage with the Curve instance.
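A sketch of that std::array variant, assuming CONTROL_CAP is a compile-time constant (its use as a vector size in the constructor suggests it is):

#include <array>

class Curve {
private:
    // Element storage now lives inside the Curve object itself: no heap
    // allocation, no pointer indirection, contiguous with the other members.
    std::array<ControlPoint, CONTROL_CAP> m_controls;
    std::array<Segment, CONTROL_CAP - 3> m_segments;
    std::array<unsigned int, CONTROL_CAP - 3> m_sgSampleCount;
    // ... the scalar members stay as before ...
};

With this layout, the stacks and the counters are contiguous with the Curve instance by construction, so the four-cache-line situation described above is guaranteed rather than left to allocator luck.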
BTW: Short style remark:
Curve::Curve()
{
    m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
    m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
    ...
}
...should be written like this:
Curve::Curve() :
    m_controls(CONTROL_CAP),
    m_segments(CONTROL_CAP - 3)
{
    ...
}
This is called "initializer list", search for that term for further explanations. Also, a default-initialized element, which you provide as second parameter, is already the default, so no need to specify that explicitly.

c++ Alternative implementation to avoid shifting between RAM and SWAP memory

I have a program that uses dynamic programming to calculate some information. The problem is that, theoretically, the memory used grows exponentially. Some filters that I use limit this space, but for a big input they still can't prevent my program from running out of RAM.
The program runs on 4 threads. When I run it with a really big input, I noticed that at some point the program starts to use swap memory, because my RAM is not big enough. The consequence is that my CPU usage decreases from about 380% to 15% or lower.
There is only one variable that uses up the memory, the following data structure:
Edit (added the type; it uses the CLN library):
class My_Map {
    typedef std::pair<double, short> key;
    typedef cln::cl_I value;
public:
    tbb::concurrent_hash_map<key, value>* map;
    My_Map() { map = new tbb::concurrent_hash_map<key, value>(); }  // was <myType>, which doesn't compile
    ~My_Map() { delete map; }
    // some functions for operations on the map
};
In my main program I use this data structure as a global variable:
My_Map* container = new My_Map();
Question:
Is there a way to avoid this shifting of memory between swap and RAM? I thought pushing all the memory onto the heap would help, but it seems not to. So I don't know whether it is possible to fully use the swap memory or do something else; this shifting of memory just costs a lot of time, and the CPU usage decreases dramatically.
If you have 1 GB of RAM and a program that uses 2 GB, then you're going to have to find somewhere else to store the excess data - obviously. The default OS way is to swap, but the alternative is to manage your own 'swapping' using a memory-mapped file.
You open a file and allocate a block of virtual memory in it, then you bring pages of the file into RAM to work on. The OS manages this for you for the most part, but you should think about your access patterns so that you finish working with a set of pages while they are in memory, instead of revisiting them after they have been paged out.
On Windows you use CreateFileMapping(); on Linux and Mac you use mmap().
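A minimal Linux/Mac sketch of that idea (the file name and the thin error handling are placeholder assumptions):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Back a big block of virtual memory with a file, so the OS pages it in and
// out on demand instead of pushing the whole heap through system swap.
void* map_scratch(std::size_t size) {
    int fd = open("scratch.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0) return 0;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) { close(fd); return 0; }
    void* base = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the established mapping keeps the file alive
    return base == MAP_FAILED ? 0 : base;
}
// when finished: munmap(base, size);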
The OS is working properly - it doesn't distinguish between stack and heap when swapping. It pages out whatever you don't seem to be using and loads whatever you ask for.
There are a few things you could try:
consider whether myType can be made smaller - e.g. using int8_t or even width-appropriate bitfields instead of int, using pointers to pooled strings instead of worst-case-length character arrays, using offsets into arrays where they're smaller than pointers, etc. If you show us the type, maybe we can suggest things.
think about your paging - if you have many objects on one memory page (likely 4k), they will all need to stay in memory if any one of them is being used, so try to get objects that will be used around the same time onto the same memory page. This may involve hashing to small arrays of related myType objects, or even moving all your data into a packed array if possible (binary searching can be pretty quick anyway). Naively used hash tables tend to flay memory because similar objects are put in completely unrelated buckets.
serialisation/deserialisation with compression is a possibility: instead of letting the OS swap out full myType memory, you may be able to proactively serialise the data into a more compact form and then deserialise it only when needed
consider whether you need to process all the data simultaneously... if you can batch up the work in such a way that you get all of "group A" out of the way using less memory, you can then move on to "group B"
UPDATE: now that you've posted your actual data types...
Sadly, using short might not help much, because sizeof(key) needs to be 16 anyway for alignment of the double; if you don't need the precision, you could consider float. Another option would be to create an array of separate maps...
tbb::concurrent_hash_map<double,value> map[65536];
You can then index as map[my_short][my_double]. It could be better or worse, but it's easy to try, so you might as well benchmark it...
For cl_I, a 2-minute dig suggests the data is stored in a union - presumably the word is used for small values and one of the pointers when necessary... that looks like a pretty good design - hard to improve on.
If numbers tend to repeat a lot (a big if), you could experiment with e.g. keeping a registry of big cl_I values with a bi-directional mapping to packed integer ids, which you would store in My_Map::map - fussy though. To explain: say you get 987123498723489 - you push_back it onto a vector<cl_I>, then in a hash_map<cl_I, int> map 987123498723489 to that index (i.e. vector.size() - 1). Keep going as new numbers are encountered. You can always map from an int id back to a cl_I by direct indexing into the vector, and the other direction is an O(1) amortised hash table lookup.
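A sketch of that registry (BigInt stands in for cln::cl_I; the Hash functor is an assumption you would have to supply, since std::hash has no cl_I specialisation):

#include <unordered_map>
#include <vector>

template <typename BigInt, typename Hash>
class BigIntRegistry {
    std::vector<BigInt> values_;                 // id -> value: direct indexing
    std::unordered_map<BigInt, int, Hash> ids_;  // value -> id: amortised O(1)
public:
    int intern(const BigInt& v) {
        auto it = ids_.find(v);
        if (it != ids_.end()) return it->second;  // seen before: reuse its id
        values_.push_back(v);
        int id = static_cast<int>(values_.size()) - 1;
        ids_.emplace(v, id);
        return id;
    }
    const BigInt& lookup(int id) const { return values_[id]; }
};

My_Map::map would then store the small int ids instead of full cl_I values.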

std::vector<A> vs std::vector<A*> difference for CPU

Let's discuss a case where I have a huge std::vector. I need to iterate over all elements and call a print function on each. There are two cases: either I store my objects in the vector themselves, and the objects will be next to each other in memory, or I allocate my objects on the heap and store only the pointers in the vector. In the second case the objects may be distributed all over the RAM.
In the case where copies of the objects are stored in std::vector<A>, when the CPU brings data from RAM into its cache it brings a whole chunk of memory, which contains multiple elements of the vector. So when you iterate over the elements and call a function on each, you know that several elements will be processed before the CPU has to go back to RAM for the next chunk of data. This is good, because the CPU doesn't waste many cycles waiting.
What about the case of std::vector<A*>? When the CPU brings in a chunk of pointers, is it easy for it to obtain the objects behind those pointers? Or must it request from RAM each object on which you call some function, so there will be cache misses and wasted CPU cycles? Is that bad for performance compared with the case above?
At least in a typical case, when the CPU fetches a pointer (or a number of pointers) from memory, it will not automatically fetch the data to which those pointers refer.
So, in the case of the vector of pointers, when you load the item that each of those pointers refers to, you'll typically get a cache miss, and access will be substantially slower than if the items were stored contiguously. This is particularly true when each item is relatively small, so that a number of them could fit in a single cache line (for some level of cache - keep in mind that a current processor will often have two or three levels of cache, each of which might have a different line size).
It may, however, be possible to mitigate this to some degree. You can overload operator new for a class to control allocations of objects of that class. Using this, you can at least keep objects of that class together in memory. That doesn't guarantee that the items in a particular vector will be contiguous, but could improve locality enough to make a noticeable improvement in speed.
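A hedged sketch of such a class-level operator new, carving instances out of a single static arena so that consecutively created objects land next to each other (not production code: the arena size is arbitrary, freed slots are never recycled, and there is no thread safety):

#include <cstddef>
#include <new>

class A {
public:
    static void* operator new(std::size_t size) {
        static alignas(std::max_align_t) char arena[1 << 20];  // 1 MiB store
        static std::size_t offset = 0;
        // Round each object up so the next one stays suitably aligned.
        std::size_t step = (size + alignof(std::max_align_t) - 1)
                         & ~(alignof(std::max_align_t) - 1);
        if (offset + step > sizeof arena) throw std::bad_alloc();
        void* p = arena + offset;
        offset += step;
        return p;
    }
    static void operator delete(void*) {}  // arena is reclaimed wholesale
    // ... data members ...
};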
Also note that the vector allocates its data via an Allocator object (which defaults to std::allocator<T>, which in turn uses new). Although the interface is kind of a mess, so it's harder than you'd generally like, you can define an allocator that acts differently if you wish. This won't generally have much effect on a single vector, but if (for example) you have a number of vectors (each of fixed size) and want them to use memory next to each other, you could do that via the allocator object.
either I store my objects in the vector themselves, and the objects will be next to each other in memory, or I allocate my objects on the heap
Regardless of whether you use std::vector<A> or std::vector<A*>, the inner buffer of the vector will be allocated on the heap. You could, though, use an efficient memory pool to manage allocations and deletions, but you're still going to work with data on the heap.
Is that bad for performance compared with the case above?
When using std::vector<A*> without specialized memory management, you may get lucky and have the separate allocations come out nicely packed in memory, but generally you're better off with the contiguous storage std::vector<A> gives you. Reallocating the whole vector is cheaper with std::vector<A*> (pointers are usually smaller than the structs themselves), but every element access suffers from poor locality of reference.
When the CPU brings in a chunk of pointers, is it easy for it to obtain the objects behind those pointers?
No, it isn't. The CPU doesn't know they're pointers (everything the CPU sees is just a bunch of bits, no semantics involved) until it executes an instruction that dereferences one.
Or must it request from RAM each object on which you call some function, so there will be cache misses and wasted CPU cycles?
That's right. The CPU will try to load the data behind a cached pointer, but that data is likely to be located far away from recently accessed memory, so it will be a cache miss.
Is that bad for performance compared with the case above?
If the only thing you care about is accessing elements, then yes, it's bad. Yet in some cases a vector of pointers is preferable. Namely, if your objects don't support moving (C++11 isn't mainstream yet), then copying the vector becomes more expensive. Even if you don't copy your vector, you may not know the number of stored elements in advance, so you can't call reserve(n) beforehand; then all your objects will be copied when the vector exhausts its capacity and is forced to resize.
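When the count is known in advance, though, that is a one-line fix (expected_count is a placeholder):

std::vector<A> v;
v.reserve(expected_count);  // one up-front allocation: push_back never
                            // reallocates, so elements are never copied over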
But in the end it depends on the concrete type. If your objects are small (tiny structs, ints or floats), then it's obviously better to work with them by value, because the overhead of the pointers would be too big.

How can heap allocation hurt hardware cache hit ratio?

I have done some tests to investigate the relation between heap allocations and hardware cache behaviour. The empirical results are enlightening, but also likely to be misleading, especially across different platforms and complex/nondeterministic use cases.
There are two scenarios I am interested in: bulk allocation (to implement a custom memory pool) and consecutive individual allocations (trusting the OS).
Below are two example allocation tests in C++:
// Consecutive individual allocations
for (auto i = 1000000000; i > 0; i--)
{
    int* ptr = new int(0);
    store_ptr_in_some_container(ptr);
}

//////////////////////////////////////

// Bulk allocation
int* ptr = new int[1000000000];
distribute_indices_to_owners(ptr, 1000000000);
My questions are these:
When I iterate over all of them for a read-only operation, how is the CPU cache likely to partition itself?
Despite the empirical results (a visible performance boost from the bulk solution), what happens when some other, relatively very small bulk allocation overrides the cache from the previous allocations?
Is it reasonable to mix the two in order to avoid code bloat and maintain code readability?
Where do std::vector, std::list, std::map and std::set stand in these concepts?
A general-purpose heap allocator has a difficult set of problems to solve. It needs to ensure that released memory can be recycled, must support arbitrarily sized allocations, and must strongly avoid heap fragmentation.
This always adds extra overhead to each allocation - book-keeping the allocator needs. At a minimum it must store the size of the block so it can properly reclaim it when the allocation is released, and almost always an offset or pointer to the next block in a heap segment; allocation sizes are also typically rounded up beyond what was requested to avoid fragmentation problems.
This overhead of course affects cache efficiency: you can't help pulling it into the L1 cache when the element is small, even though you never use it. An array allocated in one big gulp has zero overhead per element, and you have a hard guarantee that the elements are adjacent in memory, so iterating the array sequentially is going to be as fast as the memory subsystem can support.
Not so for the general-purpose allocator: with such very small allocations the overhead is likely to be 100 to 200%, and there is no guarantee of sequential access either once the program has been running for a while and elements have been reallocated. Notably, reallocation is an operation your big array cannot support, so be careful not to automatically assume that allocating giant arrays that cannot be released for a long time is necessarily better.
So yes, in this artificial scenario it is very likely you'll be ahead with the big array.
Scratch std::list from that quoted list of collection classes; it has very poor cache efficiency, since the next element is typically in an entirely random place in memory. std::vector is best - just an array under the hood. std::map is usually implemented with a red-black tree, about as good as can reasonably be done, but the access pattern you use matters, of course. The same goes for std::set.

Access cost of dynamically created objects with dynamically allocated members

I'm building an application which will have dynamically allocated objects of type A, each with a dynamically allocated member (v), similar to the class below:
class A {
    int a;
    int b;
    int* v;
};
where:
The memory for v will be allocated in the constructor.
v will be allocated once when an object of type A is created and will never need to be resized.
The size of v will vary across all instances of A.
The application will potentially have a huge number of such objects and will mostly need to stream a large number of them through the CPU, performing only very simple computations on the member variables.
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
If such fragmentation is a performance issue, are there any techniques that would allow A and v to be allocated in a contiguous region of memory?
Or are there any techniques to aid memory access, such as a prefetching scheme? For example, fetch an object of type A and operate on the other member variables while prefetching v.
If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?
The target platforms are standard desktop machines with x86/AMD64 processors, Windows or Linux OSes and compiled using either GCC or MSVC compilers.
If you have a good reason to care about performance...
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
If they are both allocated with new, then it is likely that they will be near one another. However, the current state of memory can drastically affect this outcome; it depends significantly on what you've been doing with memory. If you just allocate a thousand of these things one after another, then the later ones will almost certainly be "nearly contiguous".
If the A instance is on the stack, it is highly unlikely that its 'v' will be nearby.
If such fragmentation is a performance issue, are there any techniques that would allow A and v to be allocated in a contiguous region of memory?
Allocate space for both, then placement-new the A into that space. It's dirty, but it should typically work:
// One block big enough for the A plus its n ints. Note that sizeof(A::v)
// would only give the size of the pointer, so the element count is explicit.
char* p = static_cast<char*>(malloc(sizeof(A) + n * sizeof(int)));
int* v = reinterpret_cast<int*>(p + sizeof(A));
A* a = new (p) A(v);   // assumes A has a constructor taking int*
// time passes
a->~A();
free(p);
Or are there any techniques to aid memory access, such as a prefetching scheme?
Prefetching is compiler- and platform-specific, but many compilers have intrinsics available for it. Mind - it won't help much if you're going to access the data right away; for prefetching to be of any value you often need to issue it hundreds of cycles before you want the data. That said, it can be a huge boost to speed. With GCC the intrinsic looks something like __builtin_prefetch(my_a->v).
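A sketch of how that can look with GCC/Clang's __builtin_prefetch, using the question's A (assuming v is accessible; process() and the look-ahead distance of 8 are placeholders to tune):

#include <cstddef>
#include <vector>

void process(A* a);  // hypothetical per-object work

void walk(std::vector<A*>& objs) {
    for (std::size_t i = 0; i < objs.size(); ++i) {
        if (i + 8 < objs.size())
            __builtin_prefetch(objs[i + 8]->v);  // request the buffer early
        process(objs[i]);                        // work on the current object
    }
}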
If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?
Maybe. If the fixed-size buffer is usually close to the size you need, it could be a huge boost in speed. It will always be faster to access one A instance this way, but if the buffer is unnecessarily gigantic and largely unused, you'll lose the opportunity for more objects to fit into the cache. I.e. it's better to have more smaller objects in the cache than a lot of unused data filling it up.
The specifics depend on your design and performance goals. For an interesting discussion about this, with a "real-world" specific problem on a specific bit of hardware with a specific compiler, see the presentation The Pitfalls of Object Oriented Programming.
Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
Yes, that is likely.
What tools and techniques can be used to test if this fragmentation is a performance bottleneck?
cachegrind, Shark.
If such fragmentation is a performance issue, are there any techniques that would allow A and v to be allocated in a contiguous region of memory?
Yes, you could allocate them together, but you should probably see whether it's an issue first. You could use arena allocation, for example, or write your own allocator.
Or are there any techniques to aid memory access, such as a prefetching scheme? For example, fetch an object of type A and operate on the other member variables while prefetching v.
Yes, you could do this. The best thing to do is to allocate regions of memory that are used together near each other.
If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?
It might or might not. It would at least make v local to the struct members.
Write code.
Profile.
Optimize.
If you need to stream a large number of these through the CPU and do very little calculation on each one, as you say, why do all this memory allocation?
Could you just have one copy of the structure, and one (big) buffer for v? Read your data into it (in binary, for speed), do your little calculation, and move on to the next record.
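A sketch of that single-buffer streaming loop (the record layout, the file format, and compute() are assumptions standing in for the real application):

#include <cstdint>
#include <fstream>
#include <vector>

struct Record { int a; int b; std::uint32_t v_len; };      // hypothetical header
void compute(const Record& r, const std::vector<int>& v);  // hypothetical work

void stream_records(const char* path) {
    std::ifstream in(path, std::ios::binary);
    Record rec;
    std::vector<int> v;  // one buffer, reused for every record
    while (in.read(reinterpret_cast<char*>(&rec), sizeof rec)) {
        v.resize(rec.v_len);  // capacity stabilises after the biggest record
        in.read(reinterpret_cast<char*>(v.data()), rec.v_len * sizeof(int));
        compute(rec, v);
    }
}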
The program should then spend almost 100% of its time in I/O.
If you pause it several times while it's running, you should see it almost every time in the process of calling a system routine like FileRead. Some profilers might give you this information, except they tend to be allergic to I/O time.