I have a class as follows:
typedef struct grid_cell_type {
    int x;
    int y;
    grid_cell_type(int x0, int y0){
        x = x0;
        y = y0;
    }
} grid_cell;
I'll be pumping approximately 100 million of these through a queue.
Right now, this happens as follows:
my_queue.push(new grid_cell(x0,y0));
Allocating each of these objects individually seems like it is probably not as quick as some form of mass allocation.
Any thoughts as to the best strategy to pursue here?
These are small and self-contained objects - put them directly in the queue instead of putting the pointers.
In fact, on a 64-bit system and assuming int is 32-bit (which it is, for example, under Visual C++), the pointer will be as large as the object itself! So even if you have a bulk allocator, you still pay this price.
The general memory allocator will not just be expensive time-wise; it will also have a per-object overhead, which in this case will dwarf the object itself (this does not apply to a bulk allocator).
While you could devise a fairly efficient "bulk" allocation scheme, I think it's simpler to sidestep the issue and altogether avoid the individual object allocations.
--- EDIT ---
You can push elements to the std::queue like this:
struct grid_cell {
    grid_cell(int x0, int y0) {
        x = x0;
        y = y0;
    }
    int x;
    int y;
};
// ...
std::queue<grid_cell> q;
q.push(grid_cell(0, 0));
q.push(grid_cell(0, 1));
q.push(grid_cell(0, 2));
q.push(grid_cell(1, 0));
q.push(grid_cell(1, 1));
q.push(grid_cell(1, 2));
For the std::priority_queue, you'd need to decide how you want to order the elements.
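For example, a minimal sketch (using the grid_cell struct above; the comparator name is made up, and ordering by (x, y) is purely illustrative):
#include <queue>
#include <vector>

struct grid_cell_cmp {
    bool operator()(const grid_cell& a, const grid_cell& b) const {
        // lexicographic (x, y) ordering; with std::priority_queue this
        // puts the cell with the LARGEST (x, y) on top
        if (a.x != b.x) return a.x < b.x;
        return a.y < b.y;
    }
};

std::priority_queue<grid_cell, std::vector<grid_cell>, grid_cell_cmp> pq;
// pq.push(grid_cell(1, 2)); pq.top() is then the cell with the largest (x, y).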
--- EDIT 2 ---
@Richard: Your code is quite different.
For each push, your code would allocate a new block of dynamic memory, construct the object in it (i.e. assign x and y) and then push the pointer to that block of memory to the queue.
My code constructs the object directly in its "slot" within the larger memory block that was pre-allocated by the queue itself. And as you already noted, a few big allocations are better than many small ones.
Your code:
is prone to memory leaks,
pays for extra storage for the pointers,
is prone to memory fragmentation, and
incurs a per-object overhead, as I already mentioned.
A specialized bulk allocator could remove the last two problems but why not remove them all?
--- EDIT 3 ---
As for speed, general dynamic memory allocation is expensive (around 40-50 machine instructions for the best allocators).
The specialized block allocator would be much faster, but you still have an issue of memory latency: keeping everything nicely together is guaranteed to achieve better cache locality and be much more suitable for CPU's prefetching logic than repeatedly "jumping" between the queue and the actual objects by de-referencing pointers.
You could do one big array of them and allocate out of it.
int allocation_index = 0;
// Note: array-new requires grid_cell_type to have a default constructor,
// and 100*1000*1000 gives the 100 million cells mentioned in the question.
grid_cell_type* cells = new grid_cell_type[100*1000*1000];
my_queue.push(&cells[allocation_index++]);
You'll then avoid the overhead of 100 million little calls to new. Cleanup is then as simple as delete [] cells;.
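If you'd rather not add a default constructor to grid_cell_type, a rough sketch of the same idea using a raw buffer plus placement new (illustrative only; my_queue, x0 and y0 are reused from the question, cells and next_slot are made-up names):
#include <cstddef>
#include <new>

// One big raw buffer; each cell is constructed in place as it is needed.
std::size_t next_slot = 0;
grid_cell* cells = static_cast<grid_cell*>(
    ::operator new(std::size_t(100) * 1000 * 1000 * sizeof(grid_cell)));

// per element:
my_queue.push(new (&cells[next_slot++]) grid_cell(x0, y0));

// grid_cell has a trivial destructor, so cleanup is just releasing the buffer:
::operator delete(cells);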
EDIT: In this particular case, what Branko said is probably your best bet. Assuming you're using std::queue, it will automatically allocate the memory you need. What I suggested would be better suited for larger objects.
Related
I know that local variables are stored on the stack in order.
But when I dynamically allocate variables on the heap in C++ like this:
int * a = new int{1};
int * a2 = new int{2};
int * a3 = new int{3};
int * a4 = new int{4};
Question 1: are these variables stored in contiguous memory locations?
Question 2: if not, is it because dynamic allocation stores variables at random locations in heap memory?
Question 3: so does dynamic allocation increase the likelihood of cache misses and have poor spatial locality?
Part 1: Are separate allocations contiguous?
The answer is probably not. How dynamic allocation occurs is implementation dependent. If you allocate memory like in the above example, two separate allocations might be contiguous, but there is no guarantee of this happening (and it should never be relied on to occur).
Different C++ implementations use different algorithms for deciding how memory is allocated.
Part 2: Is allocation random?
Somewhat; but not entirely. Memory doesn’t get allocated in an intentionally random fashion. Oftentimes memory allocators will try to allocate blocks of memory near each other in order to minimize page faults and cache misses, but it’s not always possible to do so.
Allocation happens in two stages:
The allocator asks for a large chunk of memory from the OS
The allocator takes pieces of that large chunk and returns them whenever you call new, until you ask for more memory than it has left to give, in which case it asks the OS for another large chunk.
This second stage is where an implementation can make an attempt to give you memory that's near other recent allocations; however, it has little control over the first stage (and the OS usually just provides whatever memory is available, without any knowledge of other allocations by your program).
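To make the two stages concrete, here is a toy sketch of the mechanism (names are made up; alignment, growth, and freeing are deliberately ignored):
#include <cstddef>
#include <cstdlib>

struct Chunk {
    unsigned char* base;   // start of the big block obtained "from the OS"
    std::size_t    used;   // how many bytes have been handed out so far
    std::size_t    size;   // total size of the block
};

// Stage 1: ask the OS for one large block (malloc stands in for mmap here).
Chunk make_chunk(std::size_t size) {
    Chunk c = { static_cast<unsigned char*>(std::malloc(size)), 0, size };
    return c;
}

// Stage 2: carve pieces out of that block on each request. Consecutive
// requests come out adjacent, which is why nearby allocations are common.
void* chunk_alloc(Chunk& c, std::size_t n) {
    if (c.used + n > c.size)
        return nullptr;    // exhausted: a real allocator would get another chunk
    void* p = c.base + c.used;
    c.used += n;
    return p;
}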
Part 3: avoiding cache misses
If cache misses are a bottleneck in your code,
Try to reduce the amount of indirection (by having arrays store objects by value, rather than by pointer);
Ensure that the memory you’re operating on is as contiguous as the design permits (so use a std::array or std::vector, instead of a linked list, and prefer a few big allocations to lots of small ones); and
Try to design the algorithm so that it has to jump around in memory as little as possible.
A good general principle is to just use a std::vector of objects unless you have a good reason to use something fancier. Because of its better cache locality, std::vector is faster at inserting and deleting elements than std::list, even for dozens or hundreds of elements.
Finally: try to take advantage of the stack. Unless there's a good reason for something to be a pointer, just declare it as a variable that lives on the stack. When possible,
Prefer to use MyClass x{}; instead of MyClass* x = new MyClass{};, and
Prefer std::vector<MyClass> instead of std::vector<MyClass*>.
By extension, if you can use static polymorphism (i.e., templates), use that instead of dynamic polymorphism.
IMHO this is operating-system-specific and depends on the C++ standard library implementation.
new ultimately uses lower-level virtual memory allocation services, allocating several pages at once using system calls like mmap and munmap. The implementation of new could reuse previously freed memory space when relevant.
The implementation of new could use various and different strategies for "large" and "small" allocations.
In the example you gave, the first new results in a system call for memory allocation (usually several pages); the allocated memory could be large enough that subsequent new calls result in contiguous allocations. But this depends on the implementation.
In short:
1. Not at all (there is padding due to alignment, heap housekeeping data, allocated chunks may be reused, etc.).
2. Not at all (AFAIK, heap algorithms are deterministic, without any randomness).
3. Generally yes (e.g., memory pooling might help here).
I have a class that implements two simple, pre-sized stacks; they are stored as members of the class, as vectors pre-sized by the constructor. They are small, cache-line-size-friendly objects.
Those two stacks are constant in size, persisted and updated lazily, and are often accessed together by some computationally cheap methods that can, however, be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in a good state (the code is clean and does what it's supposed to do), and all sizes are kept under control (64K to 128K in most cases for the whole chain of ops including results; rarely do they get close to 256K, so at worst an L2 lookup and often L1).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
    std::vector<ControlPoint> m_controls;
    std::vector<Segment> m_segments;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    std::vector<unsigned int> m_sgSampleCount;
    unsigned int m_maxIter;
    unsigned int m_iterSamples;
    float m_lengthTolerance;
    float m_length;
};
Curve::Curve(){
    m_controls = std::vector<ControlPoint>(CONTROL_CAP);
    m_segments = std::vector<Segment>(CONTROL_CAP - 3);
    m_cvCount = 0;
    m_sgCount = 0;
    m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP - 3); // assign the member (not a shadowing local)
    m_maxIter = 3;
    m_iterSamples = 20;
    m_lengthTolerance = 0.001f;
    m_length = 0.0f;
}
Curve::~Curve(){}
Bear with the verbosity, please; I'm trying to educate myself and make sure I'm not operating on some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data; keep in mind this is on Intel CPUs (Ivy Bridge and a few Haswell) and with GCC 4.4, and I have no other use cases for this:
I'm assuming that if the actual storage of controls and segments is contiguous with the instance of Curve, that's an ideal scenario for the cache (size-wise the lot can easily fit on my target CPUs).
A related assumption is that if the vectors are distant from the instance of the Curve, and from each other, then as methods alternately access the contents of those two members, there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (i.e., data is pulled in as one cache-line-sized stretch starting from the address first looked up on a new operation, and not as multiple conveniently sized smaller segments), or am I misunderstanding the caching mechanism, and can the cache pull and preserve multiple smaller stretches of RAM?
2) Following on from the above: so far, by pure circumstance, all my tests end up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally, instancing the class reserves only the space for that object, and the vectors are then allocated in the next free contiguous chunk available, which is not guaranteed to be anywhere near my Curve instance if the instance previously found a small, emptier niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I understand that to guarantee performance I'd have to write an allocator of sorts to make sure the class object itself is large enough when instanced, and then copy the vectors into it myself and refer to those from then on.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like that. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc it's not guaranteed contiguous", that one I already have down :) ).
If the Curve instances fit into a cache line and the data of the two vectors also fits in a cache line each, the situation is not that bad, because you then have four constant cache lines. If every element were accessed indirectly and randomly positioned in memory, every access to an element might cost you a fetch operation, which is avoided in that case. In the case that both Curve and its elements fit into fewer than four cache lines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class and not have the dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator that puts the vector content in contiguous storage with the Curve instance.
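To make that concrete, here is a sketch of the class with the containers embedded (reusing ControlPoint, Segment and CONTROL_CAP from the question, and assuming CONTROL_CAP is a compile-time constant):
#include <array>

class Curve {
private:
    std::array<ControlPoint, CONTROL_CAP>     m_controls;      // embedded, no separate heap block
    std::array<Segment, CONTROL_CAP - 3>      m_segments;      // embedded, no separate heap block
    std::array<unsigned int, CONTROL_CAP - 3> m_sgSampleCount; // embedded as well
    unsigned int m_cvCount;  // number of controls currently in use
    unsigned int m_sgCount;  // number of segments currently in use
    // ... remaining members as in the original class
public:
    Curve() : m_cvCount(0), m_sgCount(0) {}
};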
BTW: Short style remark:
Curve::Curve()
{
    m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
    m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
    ...
}
...should be written like this:
Curve::Curve():
    m_controls(CONTROL_CAP),
    m_segments(CONTROL_CAP - 3)
{
    ...
}
This is called a "member initializer list"; search for that term for further explanations. Also, a default-initialized element, which you provide as the second parameter, is already the default, so there's no need to specify it explicitly.
I have done some tests to investigate the relation between heap allocations and hardware cache behaviour. The empirical results are enlightening, but also likely to be misleading, especially across different platforms and complex/non-deterministic use cases.
There are two scenarios I am interested in: bulk allocation (to implement a custom memory pool) and consecutive individual allocations (trusting the OS).
Below are two example allocation tests in C++:
// Consecutive individual allocations
for (auto i = 1000000000; i > 0; i--) {
    int *ptr = new int(0);
    store_ptr_in_some_container(ptr);
}

//////////////////////////////////////

// Bulk allocation
int *ptr = new int[1000000000];
distribute_indices_to_owners(ptr, 1000000000);
My questions are these:
When I iterate over all of them for a read-only operation, how is the CPU cache likely to partition itself?
Despite the empirical results (a visible performance boost for the bulk solution), what happens when some other, relatively small bulk allocation overwrites the cached data from the previous allocations?
Is it reasonable to mix the two in order to avoid code bloat and maintain code readability?
Where do std::vector, std::list, std::map and std::set stand with respect to these concepts?
A general purpose heap allocator has a difficult set of problems to solve. It needs to ensure that released memory can be recycled, must support arbitrarily sized allocations and strongly avoid heap fragmentation.
This always includes extra overhead for each allocation: book-keeping that the allocator needs. At a minimum it must store the size of the block so it can properly reclaim it when the allocation is released, and almost always an offset or pointer to the next block in a heap segment. Allocation sizes are also typically larger than requested, to avoid fragmentation problems.
This overhead of course affects cache efficiency; you can't help pulling it into the L1 cache when the element is small, even though you never use it. By contrast, you have zero overhead for each array element when you allocate the array in one big gulp, and you have a hard guarantee that each element is adjacent in memory, so iterating the array sequentially is going to be as fast as the memory sub-system can support.
Not so for the general-purpose allocator: with such very small allocations the overhead is likely to be 100 to 200%, and there is no guarantee of sequential access either once the program has been running for a while and array elements have been reallocated. Notably, reallocation is an operation your big array cannot support, so be careful not to automatically assume that allocating giant arrays that cannot be released for a long time is necessarily better.
So yes, in this artificial scenario it is very likely you'll be ahead with the big array.
Scratch std::list from that quoted list of collection classes; it has very poor cache efficiency since the next element is typically at an entirely random place in memory. std::vector is best, just an array under the hood. std::map is usually implemented with a red-black tree, as good as can reasonably be done, but the access pattern you use of course matters. The same goes for std::set.
I want to explore the performance differences for multiple dereferencing of data inside a vector of new-ly allocated structs (or classes).
struct Foo
{
    int val;
    // some variables
};
std::vector<Foo*> vectorOfFoo;
// Foo objects are new-ed and pushed in vectorOfFoo
for (int i = 0; i < N; i++)
{
    Foo *f = new Foo;
    vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector, I would like to enhance locality of reference through the many iterator dereferences; for example, I very often have to perform a doubly nested loop such as
for (std::vector<Foo*>::iterator iter1 = vectorOfFoo.begin(); iter1 != vectorOfFoo.end(); ++iter1)
{
    int somevalue = (*iter1)->val;
}
Obviously, if the pointers inside vectorOfFoo point to objects that are far apart in memory, I think locality of reference is somewhat lost.
What about performance if I sort the vector before iterating over it? Should I expect better performance from the repeated dereferencing?
Am I guaranteed that consecutive calls to new allocate objects which are close together in memory?
Just to answer your last question: no, there is no guarantee whatsoever about where new allocates memory. The allocations can be scattered throughout memory. Depending on the current fragmentation of the memory, you may be lucky and they are sometimes close to each other, but no guarantee is - or, actually, can be - given.
If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
But that's pointless without profiling.
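For what it's worth, a bare-bones sketch of such a pool for the Foo objects in the question might look like this (assuming Foo stays a simple aggregate; growth, alignment and error handling omitted):
#include <cstddef>
#include <vector>

class FooPool {
    union Slot {
        Foo   object;     // live object while the slot is allocated
        Slot* next_free;  // link in the free list while the slot is unused
    };

public:
    explicit FooPool(std::size_t capacity)
        : slots_(capacity), free_head_(nullptr)
    {
        // Thread every slot onto the free list.
        for (std::size_t i = 0; i < capacity; ++i) {
            slots_[i].next_free = free_head_;
            free_head_ = &slots_[i];
        }
    }

    Foo* allocate() {
        if (!free_head_) return nullptr;      // pool exhausted
        Slot* s = free_head_;
        free_head_ = s->next_free;
        return &s->object;
    }

    void release(Foo* f) {
        Slot* s = reinterpret_cast<Slot*>(f); // the object shares the slot's address
        s->next_free = free_head_;
        free_head_ = s;
    }

private:
    std::vector<Slot> slots_;  // one contiguous block holding every object
    Slot* free_head_;          // head of the list of unused slots
};
vectorOfFoo could then be filled from pool.allocate() instead of new Foo, which keeps every object inside one contiguous block. But again: profile first.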
It depends on many factors.
First, it depends on how your objects that are being pointed to from the vector were allocated. If they were allocated on different pages then you cannot help it but fix the allocation part and/or try to use software prefetching.
You can generally check what virtual addresses malloc gives out, but as part of a larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
In the case of a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node on which your process is running. Otherwise, no matter what you do, the memory will be coming from the other node, and you cannot do much in that case except migrate your program back to its "home" node.
You have to check the stride that is needed in order to jump from one object to another. The prefetcher can recognize a stride within a 512-byte window. If the stride is greater, you are talking about random memory access from the prefetcher's point of view; it will then shut off so as not to evict your data from the cache, and the best you can do there is to try software prefetching, which may or may not help (always test it).
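As a sketch of that last point (GCC/Clang-specific __builtin_prefetch, reusing Foo and the pointer vector from the question; the look-ahead distance of 8 is a guess that needs measuring):
#include <cstddef>
#include <vector>

long sum_values(const std::vector<Foo*>& vectorOfFoo)
{
    long total = 0;
    for (std::size_t i = 0; i < vectorOfFoo.size(); ++i) {
        // Prefetch the object 8 iterations ahead while working on the current one.
        if (i + 8 < vectorOfFoo.size())
            __builtin_prefetch(vectorOfFoo[i + 8], 0 /* read */, 3 /* high temporal locality */);
        total += vectorOfFoo[i]->val;
    }
    return total;
}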
So if sorting the vector of pointers makes the objects they point to sit one after another with a relatively small stride, then yes, you will improve memory access speed by making it more friendly for the prefetch hardware.
You also have to make sure that sorting that vector doesn't cost more than it gains.
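If you do try it, sorting the pointers by address is a one-liner; std::less<Foo*> gives a well-defined ordering even for pointers to unrelated objects, and whether it pays off depends on how many traversals follow the sort:
#include <algorithm>
#include <functional>

// Sort the pointers by address so that traversal order matches memory order.
std::sort(vectorOfFoo.begin(), vectorOfFoo.end(), std::less<Foo*>());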
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is a tricky business, and things can get worse even though in theory the performance should have improved. There are many tools that can help you profile memory access, for example cachegrind; Intel's VTune does the same, and there are many others. So don't guess; experiment and verify the results.
Are there large gains in PERFORMANCE to be made by allocating heap memory in advance and filling it incrementally?
Consider the VERY simplified example below:
byte * heapSpace = (byte *) malloc(1000000); // byte being, e.g., a typedef for unsigned char
int currentWriteSpot = 0;
struct A {
    int x;
    byte * extraSpace;
    int extraSpaceLength;
};
//a1 needs 10 bytes of extra storage space:
A a1;
a1.x = 2;
a1.extraSpace = heapSpace + currentWriteSpot;
a1.extraSpaceLength = 10;
currentWriteSpot += 10;
//a2 needs 120 bytes of extra storage space:
A a2;
a2.x = 24;
a2.extraSpace = heapSpace + currentWriteSpot;
a2.extraSpaceLength = 120;
currentWriteSpot += 120;
// ... many more elements added
for ( ... ) {
//loop contiguously over the allocated elements, manipulating contents stored at "extraSpace"
}
free (heapSpace);
VS:
...
a1.extraSpace = malloc ( 10 );
a2.extraSpace = malloc ( 120 );
a3...
a4...
...
//do stuff
free (a1.extraSpace);
free (a2.extraSpace);
free ...
free ...
free ...
Or is this likely to simply add complexity without significant gains in performance?
Thanks folks!
First of all, doing this does not increase complexity; it decreases it. Because you have already determined at the beginning of your operation that malloc was successful, you don't need any further checks for failure, which would at least have to free the allocations already made and perhaps reverse other changes to various objects' states.
One of the other benefits, as you've noted, is performance. This will be a much bigger issue in multi-threaded programs where calls to malloc could result in lock contention.
Perhaps a more important benefit is avoiding fragmentation. If the entire data object is allocated together rather than in small pieces, freeing it will definitely return usable contiguous space of the entire size to the free memory pool to be used by later allocations. On the other hand, if you allocate each small piece separately, there's a good possibility that they won't be contiguous.
In addition to reducing fragmentation, allocating all the data as a single contiguous block also avoids per-allocation overhead (at least 8-16 bytes per allocation are wasted) and improves data locality for cache purposes.
By the way, if you're finding this sort of allocation strategy overly complex, you might try making a few functions to handle it for you, or using an existing library like GNU obstack.
The reason you would want to do this is if you need to guarantee consistent allocation times (where 'consistent' != 'fastest'). The biggest example is the draw loop of a game or other paint operation - it's far more important for it not to "hiccup" than getting an extra 2 FPS at the expense of consistency.
If all you want is to complete an operation as fast as possible, the Win7 LFH is quite fast, and is already doing this optimization for you (this tip is from the days back when the heap manager typically sucked and was really slow). That being said, I could be completely wrong - always profile your workload and see what works and what doesn't.
Generally it is best to let the memory manager do this kind of thing, but some extreme cases (e.g. LOTS of small allocations and deallocations) can be better handled with your own implementation, i.e. you grab one big chunk of memory and allocate/deallocate from it as required. Generally such cases are going to be very simplified ones (e.g. your own sparse matrix implementation) where you can apply domain-specific optimizations that the standard memory manager cannot. E.g. in the sparse matrix example, each chunk of memory is going to be the same size. This makes garbage collection relatively simple - chunks of deallocated memory do not need to be joined - just a simple "in use" flag is required, etc., etc.
You should only request from the memory manager as many blocks of memory as you need to be separately controllable - in an ideal world where we have infinite optimization time, of course. If you have several A objects that do not need separate lifetimes, then do not allocate them separately.
Of course, whether or not this is actually worth the more intensive optimization time, is another question.