Using realloc in C++

std::realloc is dangerous in C++ if the malloc'd memory contains non-POD types. It seems the only problem is that std::realloc won't invoke the objects' copy/move constructors or destructors if it cannot grow the memory in situ and has to relocate the block.
A trivial workaround would be a try_realloc function. Instead of malloc'ing new memory when the block cannot be grown in situ, it would simply return false. In that case new memory could be allocated, the objects copied (or moved) into it, and finally the old memory freed.
This seems supremely useful. std::vector could make great use of this, possibly avoiding all copies/reallocations.
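Roughly what I have in mind, as a caller-side sketch; try_realloc itself is hypothetical (it doesn't exist in any C API I know of), and the grow helper is just for illustration:

#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical API: returns true only if the block could be grown in situ.
bool try_realloc(void* block, std::size_t new_size);

template <typename T>
T* grow(T* data, std::size_t count, std::size_t new_count)
{
    if (try_realloc(data, new_count * sizeof(T)))
        return data;                                      // grown in place, objects untouched

    // Fallback: allocate, move-construct, destroy the old objects, free.
    T* fresh = static_cast<T*>(std::malloc(new_count * sizeof(T)));
    std::uninitialized_move(data, data + count, fresh);   // run move constructors
    std::destroy(data, data + count);                     // run destructors on the old objects
    std::free(data);
    return fresh;                                         // (error handling omitted)
}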
Preemptive flame retardant: technically that is the same big-O performance, but if vector growth is a bottleneck in your application, a 2x speed-up is nice even if the big-O remains unchanged.
BUT, I cannot find any C API that works like try_realloc.
Am I missing something? Is try_realloc not as useful as I imagine? Is there some hidden bug that makes try_realloc unusable?
Better yet, is there some less-documented API that performs like try_realloc?
NOTE: I'm obviously in library/platform-specific territory here. I'm not worried, since try_realloc is inherently an optimization.
Update:
Following Steve Jessop's comments on whether vector would be more efficient using realloc, I wrote up a proof of concept to test. The realloc-vector simulates a vector's growth pattern but has the option to realloc instead. I ran the program up to a million elements in the vector.
For comparison a vector must allocate 19 times while growing to a million elements.
The results: if the realloc-vector is the only thing using the heap, the results are awesome, 3-4 allocations while growing to a million elements.
If the realloc-vector is used alongside a vector that grows at 66% of the speed of the realloc-vector, the results are less promising, allocating 8-10 times during growth.
Finally, if the realloc-vector is used alongside a vector that grows at the same rate, the realloc-vector allocates 17-18 times, barely saving one allocation over the standard vector behavior.
I don't doubt that a hacker could game allocation sizes to improve the savings, but I agree with Steve that the tremendous effort to write and maintain such an allocator isn't worth the gain.

vector generally grows in large increments. You can't do that repeatedly without relocating, unless you carefully arrange things so that there's a large extent of free addresses just above the internal buffer of the vector (which in effect requires assigning whole pages, because obviously you can't have other allocations later on the same page).
So I think that in order to get a really good optimization here, you need more than a "trivial workaround" that does a cheap reallocation if possible - you have to somehow do some preparation to make it possible, and that preparation costs you address space. If you only do it for certain vectors, ones that indicate they're going to become big, then it's fairly pointless, because they can indicate with reserve() that they're going to become big. You can only do it automatically for all vectors if you have a vast address space, so that you can "waste" a big chunk of it on every vector.
As I understand it, the reason that the Allocator concept has no reallocation function is to keep it simple. If std::allocator had a try_realloc function, then either every Allocator would have to have one (which in most cases couldn't be implemented, and would just have to return false always), or else every standard container would have to be specialized for std::allocator to take advantage of it. Neither option is a great Allocator interface, although I suppose it wouldn't be a huge effort for implementers of almost all Allocator classes just to add a do-nothing try_realloc function.
If vector is slow due to re-allocation, deque might be a good replacement.

You could implement something like the try_realloc you proposed, using mmap with MAP_ANONYMOUS and MAP_FIXED and mremap with MREMAP_FIXED.
Edit: just noticed that the man page for mremap even says:
mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).
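A rough Linux-only sketch of that idea (the function name is mine, and it assumes the block came from an anonymous mmap rather than malloc):

#define _GNU_SOURCE            // mremap() is a Linux/glibc extension
#include <sys/mman.h>
#include <cstddef>

// Try to grow a mapping in place. Without MREMAP_MAYMOVE, mremap() either
// extends the mapping at its current address or fails, so a false return
// is the cue to fall back to allocate/move/free.
bool try_grow_in_place(void* block, std::size_t old_size, std::size_t new_size)
{
    void* result = mremap(block, old_size, new_size, 0 /* no MREMAP_MAYMOVE */);
    return result != MAP_FAILED;
}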

realloc in C is hardly more than a convenience function; it has very little benefit for performance/reducing copies. The main exception I can think of is code that allocates a big array then reduces the size once the size needed is known - but even this might require moving data on some malloc implementations (ones which segregate blocks strictly by size) so I consider this usage of realloc really bad practice.
As long as you don't constantly reallocate your array every time you add an element, but instead grow the array exponentially (e.g. by 25%, 50%, or 100%) whenever you run out of space, manually allocating new memory, copying, and freeing the old block will yield roughly the same performance as using realloc (and identical behavior with respect to memory fragmentation). This is surely the approach that C++ STL implementations use, so I think your whole concern is unfounded.
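For a trivially copyable element type, that manual grow looks roughly like this (names are illustrative, not a library API; the caller passes a non-null buffer):

#include <cstddef>
#include <cstdlib>
#include <cstring>

// Grow a buffer of ints by ~50% with plain malloc/memcpy/free - essentially
// what realloc does anyway when it cannot extend the block in place.
int* grow_buffer(int* old_data, std::size_t old_cap, std::size_t* new_cap)
{
    *new_cap = old_cap + old_cap / 2 + 8;
    int* fresh = static_cast<int*>(std::malloc(*new_cap * sizeof(int)));
    if (!fresh) return nullptr;                          // caller keeps old_data on failure
    std::memcpy(fresh, old_data, old_cap * sizeof(int)); // copy the existing elements
    std::free(old_data);                                 // release the old block
    return fresh;
}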
Edit: The one (rare but not unheard-of) case where realloc is actually useful is for giant blocks on systems with virtual memory, where the C library interacts with the kernel to relocate whole pages to new addresses. The reason I say this is rare is because you need to be dealing with very big blocks (at least several hundred kB) before most implementations will even enter the realm of dealing with page-granularity allocation, and probably much larger (several MB maybe) before entering and exiting kernelspace to rearrange virtual memory is cheaper than simply doing the copy. Of course try_realloc would not be useful here, since the whole benefit comes from actually doing the move inexpensively.

Related

Pointer versus non-pointer

I read in many places, including Effective C++, that it is better to store data on the stack and not as a pointer to the data.
I can understand doing this with small objects, because the number of new and delete calls is also reduced, which reduces the chance of a memory leak. Also, the pointer can take more space than the object itself.
But with large objects, where copying them will be expensive, is it not better to store them in a smart pointer?
Because with many operations on the large object there will be a few object copies, which are very expensive (I am not counting the getters and setters).
Let's focus purely on efficiency. There's no one-size-fits-all, unfortunately. It depends on what you are optimizing for. There's a saying, always optimize the common case. But what is the common case? Sometimes the answer lies in understanding your software's design inside out. Sometimes it's unknowable even at the high level in advance because your users will discover new ways to use it that you didn't anticipate. Sometimes you will extend the design and reveal new common cases. So optimization, but especially micro-optimization, is almost always best applied in hindsight, based on both this user-end knowledge and with a profiler in your hand.
The few times you can usually have really good foresight about the common case is when your design is forcing it rather than responding to it. For example, if you are designing a class like std::deque, then you're forcing the common case write usage to be push_fronts and push_backs rather than insertions to the middle, so the requirements give you decent foresight as to what to optimize. The common case is embedded into the design, and there's no way the design would ever want to be any different. For higher-level designs, you're usually not so lucky. And even in the cases where you know the broad common case in advance, knowing the micro-level instructions that cause slowdowns is too often incorrectly guessed, even by experts, without a profiler. So the first thing any developer should be interested in when thinking about efficiency is a profiler.
But here are some tips if you do run into a hotspot with a profiler.
Memory Access
Most of the time, the biggest micro-level hotspots, if you have any, will relate to memory access. So if you have a large object that is just one contiguous block, with all its members accessed in some tight loop, that layout will aid performance.
For example, if you have an array of 4-component mathematical vectors you're sequentially accessing in a tight algorithm, you'll generally fare far, far better if they're contiguous like so:
x1,y1,z1,w1,x2,y2,z2,w2...xn,yn,zn,wn
... with a single-block structure like this (all in one contiguous block):
x
y
z
w
This is because the machine will fetch this data into a cache line which will have the adjacent vectors' data inside of it when it's all tightly packed and contiguous in memory like this.
You can very quickly slow down the algorithm if you used something like std::vector here to represent each individual 4-component mathematical vector, where every single individual one stores the mathematical components in a potentially completely different place in memory. Now you could potentially have a cache miss with each vector. In addition, you're paying for additional members since it's a variable-sized container.
std::vector is a "2-block" object that often looks like this when we use it for a mathematical 4-vector:
size
capacity
ptr --> [x y z w] another block
It also stores an allocator but I'll omit that for simplicity.
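To make the contrast concrete (types made up for illustration):

#include <vector>

struct Vec4 { float x, y, z, w; };               // "1-block": 16 bytes, no indirection

std::vector<Vec4> packed(1000);                  // x1 y1 z1 w1 x2 y2 z2 w2 ... contiguous

std::vector<std::vector<float>> scattered(       // each inner vector: size/capacity/ptr,
    1000, std::vector<float>(4, 0.0f));          // components in 1000 separate heap blocks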
On the flip side, if you have a big "1-block" object where only some of its members get accessed in those tight, performance-critical loops, then it might be better to make it into a "2-block" structure. Say you have some Vertex structure where the most-accessed part of it is the x/y/z position, but it also has a less commonly-accessed list of adjacent vertices. In that case, it might be better to hoist the adjacency data out and store it elsewhere in memory, perhaps even completely outside of the Vertex class itself (or keep merely a pointer to it). Your common-case, performance-critical algorithms, which don't touch that data, will then be able to access more contiguous vertices in a single cache line, since the vertices are smaller and the rarely-accessed data lives elsewhere in memory.
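A rough sketch of that hot/cold split (field names are just illustrative):

#include <cstdint>
#include <vector>

struct Vertex {          // hot data only: 12 bytes, packs densely into cache lines
    float x, y, z;
};

struct VertexAdjacency { // cold data, stored elsewhere
    std::vector<std::uint32_t> neighbors;
};

struct Mesh {
    std::vector<Vertex> vertices;            // traversed by the tight loops
    std::vector<VertexAdjacency> adjacency;  // indexed only when actually needed
};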
Creation/Destruction Overhead
When rapid creation and destruction of objects is a concern, you can also do better to create each object in a contiguous memory block. The fewer separate memory blocks per object, the faster it'll generally go (since whether or not this stuff is going on the heap or stack, there will be fewer blocks to allocate/deallocate).
Free Store/Heap Overhead
So far I've been talking more about contiguity than stack vs. heap, and it's because stack vs. heap relates more to client-side usage of an object rather than an object's design. When you're designing the representation of an object, you don't know whether it's going on the stack or heap. What you do know is whether it's going to be fully contiguous (1 block) or not (multiple blocks).
But naturally, if it's not contiguous, then at least part of it is going on the heap, and heap allocations and deallocations can be enormously expensive relative to the hardware stack. However, you can often mitigate this overhead with efficient O(1) fixed allocators. They serve a more specialized purpose than malloc or free, but I would suggest concerning yourself less with the stack vs. heap distinction and more with the contiguity of an object's memory layout.
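Here is a bare-bones sketch of the kind of O(1) fixed allocator I mean: a free list threaded through one preallocated buffer (no thread safety, no growth; block_size must be at least sizeof(void*) and a multiple of alignof(void*)):

#include <cstddef>
#include <vector>

class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t capacity)
        : storage_(block_size * capacity)
    {
        // Thread the free list through the raw storage itself.
        for (std::size_t i = 0; i < capacity; ++i) {
            void* block = storage_.data() + i * block_size;
            *static_cast<void**>(block) = free_list_;
            free_list_ = block;
        }
    }
    void* allocate() {                      // O(1): pop the free-list head
        if (!free_list_) return nullptr;
        void* block = free_list_;
        free_list_ = *static_cast<void**>(block);
        return block;
    }
    void deallocate(void* block) {          // O(1): push back onto the free list
        *static_cast<void**>(block) = free_list_;
        free_list_ = block;
    }
private:
    std::vector<char> storage_;
    void* free_list_ = nullptr;
};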
Copy/Move Overhead
Last but not least, if you are copying/swapping/moving objects a lot, then the smaller they are, the cheaper this is going to be. So you might sometimes want to sort pointers or indices to big objects, for example, instead of the objects themselves, since even move construction is expensive for a type T where sizeof(T) is a large number.
So move constructing something like the "2-block" std::vector here, which is not contiguous (its dynamic contents are contiguous, but that's a separate block) and stores its bulky data in a separate memory block, is actually going to be cheaper than move constructing a "1-block" 4x4 matrix that is contiguous. It's because there's no such thing as a cheap shallow copy if the object is just one big memory block rather than a tiny one with a pointer to another. One of the funny trends that arises is that objects which are cheap to copy are expensive to move, and ones which are very expensive to copy are cheap to move.
However, I would not let copy/move overhead impact your object implementation choices, because the client can always add a level of indirection there if he needs it for a particular use case that taxes copies and moves. When you're designing for memory-layout micro-efficiency, the first thing to focus on is contiguity.
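For example, a sketch of sorting indices instead of the big objects themselves (BigObject and its key are placeholders):

#include <algorithm>
#include <cstdint>
#include <vector>

struct BigObject { char payload[4096]; int key; };

// Sort 4-byte indices by a key instead of shuffling the big objects around.
std::vector<std::uint32_t> sorted_order(const std::vector<BigObject>& objects)
{
    std::vector<std::uint32_t> indices(objects.size());
    for (std::uint32_t i = 0; i < indices.size(); ++i)
        indices[i] = i;
    std::sort(indices.begin(), indices.end(),
              [&](std::uint32_t a, std::uint32_t b) {
                  return objects[a].key < objects[b].key;   // compare through the index
              });
    return indices;                                         // the objects themselves never move
}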
Optimization
The rule for optimization is this: if you have no code or no tests or no profiling measurements, don't do it. As others have wisely suggested, your number one concern is always productivity (which includes maintainability, safety, clarity, etc). So instead of trapping yourself in hypothetical what-if scenarios, the first thing to do is to write the code, measure it twice, and change it if you really have to do so. It's better to focus on how to design your interfaces appropriately so that if you do have to change anything, it'll just affect one local source file.
The reality is that this is a micro-optimisation. You should write the code to make it readable, maintainable and robust. If you worry about speed, you use a profiling tool to measure the speed. You find things that take more time than they should, and then and only then do you worry about speed optimisation.
An object should obviously only exist once. If you make multiple copies of an object that is expensive to copy you are wasting time. You also have different copies of the same object, which is in itself not a good thing.
"Move semantics" avoids expensive copying in cases where you didn't really want to copy anything but just move an object from here to there. Google for it; it is quite an important thing to understand.
What you said is essentially correct. However, move semantics alleviate the concern about object copying in a large number of cases.

Can you predict where in memory a vector might move when growing?

I'm learning about C++ and have a conceptual question. Let's say I have a vector. I know that my vector is stored in contiguous memory, but let's say my vector keeps growing and runs out of room to keep the memory contiguous. How can I predict where in memory the vector will go? I'm excluding the option of using functions that tell the vector where it should be in memory.
If it "runs out of room to keep the memory contiguous", then it simply won't grow. Attempting to add items past the currently allocated size will (typically) result in its throwing an exception (though technically, it's up to the allocator object to decide what to do--it's responsible for memory allocation, and responding when that's not possible.
Note, however, that this could result from running out of address space (especially on a 32-bit machine) rather than running out of actual memory. A typical virtual memory manager can reallocate physical pages (e.g., 4 KB or 8 KB chunks) and write data to the paging file if necessary to free physical memory if needed--but when/if there's not enough contiguous address space, there's not much that can be done.
The answer depends highly on your allocation strategy, but in general, the answer is no. Most allocators do not provide you with information where the next allocation will occur. If you were writing a custom allocator, then you could potentially make this information accessible, but doing so is not necessarily a good idea unless your use case specifically requires this knowledge.
The realloc function is the only C function which will attempt to grow your memory in place, and it makes no guarantees that it will do so.
Neither new nor malloc provide any information for where the "next" allocation will take place. You could potentially guess, if you knew the exact implementation details for your specific compiler, but this would be very unwise to rely on in a real program. Regarding specifically the std::allocator used for std::vector, it also does not provide details about where future allocations will take place.
Even if you could predict it in a particular situation, it would be extremely fragile - all it takes to "break" where the next allocation is made is for one function you call to change so that it makes another call to new or malloc [unless you are using a very specific allocation method, which is different from the "usual" method].
If you KNOW that you need a certain size, you can use std::vector::resize() to set the size of the vector [or std::vector<int> vec(10000); to create one pre-sized to 10000 elements, for example] - which of course is not guaranteed to work, but it does guarantee that you never need "enough space to hold 3x the current content", which is what happens with std::vector when you grow it using push_back (the old block and the new, larger block both exist during the copy) [and if you are REALLY unlucky, your vector will occupy space for 2*n elements while leaving n-1 of them unused, because your size was n and you added ONE more element, which doubled the capacity to 2*n when you only actually required one more slot].
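A closely related option, if you know the size (or a good upper bound) in advance, is reserve(), which does the one allocation up front. For example:

#include <vector>

std::vector<int> make_filled()
{
    std::vector<int> vec;
    vec.reserve(10000);          // one allocation; capacity is fixed up front
    for (int i = 0; i < 10000; ++i)
        vec.push_back(i);        // size grows, but no reallocation ever happens
    return vec;
}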
The internal workings of STL containers are kept private for good reasons. You should never be accessing any container elements through any mechanism other than the appropriate iterators; and it is not possible to acquire one of those on an element that does not yet exist.
You could, however, supply an allocator and use that to deterministically place future allocations.
Can you predict where in memory a vector might move when growing?
As others like EJP, Jerry and Mats have said, you cannot determine the location of a "grown" vector until after it grows. There are some corner cases, like the allocator providing a block of memory that's larger than required so that the vector does not actually move after a grow. But it's not something you should depend on.
In general, stacks grow down and heaps grow up. This is an artifact from the old memory days. Your code segment was sandwiched between them, which ensured that a runaway stack or heap would overwrite the program's own code segment and eventually cause an illegal instruction. So you might be able to guess that the new vector is going to be higher in memory than the old vector, because the vector is probably using heap memory. But it's not really useful information.
If you are devising a strategy for locating elements after a grow, then use an index and not an iterator. Iterators are invalidated after inserts and deletes (including the grow).
For example, suppose you are parsing the vector and you are looking for the data that follows -----BEGIN CERTIFICATE-----. Once you know the offset of the data (say, the 27th byte in the vector), you can always relocate it in constant time with v.begin() + 26. If you only have part of the certificate and later add the tail of the data and the -----END CERTIFICATE----- (and the vector grows), then the data is still located at v.begin() + 26.
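In code, that index-based bookkeeping might look like this (the offset is just the one from the example above):

#include <cstddef>
#include <vector>

void append_and_relocate(std::vector<char>& v)
{
    std::size_t offset = 26;             // position recorded while parsing
    v.insert(v.end(), 1000, 'x');        // growth may reallocate and move the buffer
    char c = *(v.begin() + offset);      // still valid: recomputed from the stored index
    (void)c;                             // an iterator saved before the insert would be dangling
}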
No, in practical terms you can't predict where it will go if it has to move due to resizing. However, it isn't so random that you could use it as a random number generator (;

How can heap allocation hurt hardware cache hit ratio?

I have done some tests to investigate the relation between heap allocations and hardware cache behaviour. Empirical results are enlightening but also likely to be misleading, especially between different platforms and complex/indeterministic use cases.
There are two scenarios I am interested in: bulk allocation (to implement a custom memory pool) and consecutive individual allocations (trusting the OS).
Below are two example allocation tests in C++
//Consecutive individual allocations
for (auto i = 1000000000; i > 0; i--) {
    int *ptr = new int(0);               // one tiny heap block per element
    store_ptr_in_some_container(ptr);
}
//////////////////////////////////////
//Bulk allocation
int *ptr = new int[1000000000];          // one contiguous block for all elements
distribute_indices_to_owners(ptr, 1000000000);
My questions are these:
When I iterate over all of them for a read-only operation, how will the CPU cache memory likely partition itself?
Despite the empirical results (a visible performance boost from the bulk solution), what happens when some other, relatively small bulk allocation evicts the cache lines filled by the previous allocations?
Is it reasonable to mix the two in order to avoid code bloat and maintain code readability?
Where do std::vector, std::list, std::map, and std::set stand in these concepts?
A general purpose heap allocator has a difficult set of problems to solve. It needs to ensure that released memory can be recycled, must support arbitrarily sized allocations and strongly avoid heap fragmentation.
This will always include extra overhead for each allocation, book-keeping that the allocator needs. At a minimum it must store the size of the block so it can properly reclaim it when the allocation is released, and almost always an offset or pointer to the next block in a heap segment; allocation sizes are typically larger than requested to avoid fragmentation problems.
This overhead of course affects cache efficiency; you can't help getting it into the L1 cache when the element is small, even though you never use it. You have zero overhead for each array element when you allocate the array in one big gulp. And you have a hard guarantee that each element is adjacent in memory, so iterating the array sequentially is going to be as fast as the memory sub-system can support.
Not the case for the general-purpose allocator; with such very small allocations the overhead is likely to be 100 to 200%. And there's no guarantee of sequential access either once the program has been running for a while and array elements were reallocated - notably an operation that your big array cannot support, so be careful not to automatically assume that allocating giant arrays that cannot be released for a long time is necessarily better.
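A rough illustration of the two access patterns being compared (function names are mine):

#include <cstddef>
#include <vector>

// Per-element allocations: every dereference can land on a different
// cache line, and each block carries allocator bookkeeping next to it.
long long sum_scattered(const std::vector<int*>& ptrs)
{
    long long total = 0;
    for (int* p : ptrs) total += *p;
    return total;
}

// One bulk array: no per-element overhead, strictly sequential access.
long long sum_bulk(const int* data, std::size_t n)
{
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}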
So yes, in this artificial scenario it is very likely you'll be ahead with the big array.
Scratch std::list from that quoted list of collection classes, it has very poor cache efficiency as the next element is typically at an entirely random place in memory. std::vector is best, just an array under the hood. std::map is usually done with a red-black tree, as good as can reasonably be done but the access pattern you use matters of course. Same for std::set.

Should I use boost fast pool allocator for following?

I have a server that, throughout the course of 24 hours, keeps adding new items to a set. Elements are not deleted over the 24-hour period; new elements just keep getting inserted.
Then at end of period the set is cleared, and new elements start getting added again for another 24 hours.
Do you think a fast pool allocator would be useful here as to reuse the memory and possibly help with fragmentation?
The set grows to around 1 million elements. Each element is about 1k.
It's highly unlikely …but you are of course free to test it in your program.
For a collection of that size and allocation pattern (more! more! more! + grow! grow! grow!), you should use an array of vectors. Just keep it in contiguous blocks and reserve() when they are created and you never need to reallocate/resize or waste space and bandwidth traversing lists. vector is going to be best for your memory layout with a collection that large. Not one big vector (which would take a long time to resize), but several vectors, each which represent chunks (ideal chunk size can vary by platform -- I'd start with 5MB each and measure from there). If you follow, you see there is no need to resize or reuse memory; just create an allocation every few minutes for the next N objects -- there is no need for high frequency/speed object allocation and recreation.
Reaching for a pool allocator suggests you want a lot of objects with discontiguous allocations and lots of inserts and deletes, like a list of big allocations - which is bad here for a few reasons. If you want to create an implementation which optimizes for contiguous allocation at this size, just aim for the blocks-with-vectors approach. Allocation and lookup will both be close to minimal. At that point, allocation times should be tiny (relative to the other work you do). Then you will also have nothing unusual or surprising about your allocation patterns. However, the fast pool allocator suggests treating this collection like a list, which will have terrible performance for this problem.
Once you implement that block+vector approach, a better performance comparison (at that point) would be to compare boost's pool_allocator vs std::allocator. Of course, you could test all three, but memory fragmentation is likely going to be reduced far more by that block of vectors approach, if you implement it correctly. Reference:
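A minimal sketch of that blocks-of-vectors idea (the chunk size and the Element payload are assumptions, sized to match the ~1k elements in the question):

#include <cstddef>
#include <memory>
#include <vector>

struct Element { char data[1024]; };                  // ~1k per element, as in the question

// Fixed-capacity chunks, each reserve()'d once, so no chunk ever reallocates.
class ChunkedStore {
public:
    void add(const Element& e) {
        if (chunks_.empty() || chunks_.back()->size() == kChunkElements) {
            chunks_.push_back(std::make_unique<std::vector<Element>>());
            chunks_.back()->reserve(kChunkElements);  // one allocation per chunk (~5 MB)
        }
        chunks_.back()->push_back(e);                 // never moves existing elements
    }
    void clear() { chunks_.clear(); }                 // the end-of-day reset
private:
    static constexpr std::size_t kChunkElements = 5000;
    std::vector<std::unique_ptr<std::vector<Element>>> chunks_;
};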
If you are seriously concerned about performance, use fast_pool_allocator when dealing with containers such as std::list, and use pool_allocator when dealing with containers such as std::vector.

Why is 'unbounded_array' more efficient than 'vector'?

It says here that
The unbounded array is similar to a std::vector in that it can grow in size beyond any fixed bound. However, unbounded_array is aimed at optimal performance. Therefore unbounded_array does not model a Sequence like std::vector does.
What does this mean?
As a Boost developer myself, I can tell you that it's perfectly fine to question the statements in the documentation ;-)
From reading those docs, and from reading the source code (see storage.hpp), I can say that it's somewhat correct, given some assumptions about the implementation of std::vector at the time that code was written. That code dates to 2000 initially, and perhaps as late as 2002, which means that at the time many standard library implementations did not do a good job of optimizing destruction and construction of objects in containers. The claim about the non-resizing is easily refuted by using a vector with an initially large capacity. The claim about speed, I think, comes entirely from the fact that unbounded_array has special code for eliding dtors & ctors when the stored objects have trivial implementations of them. Hence it can avoid calling them when it has to rearrange things, or when it's copying elements. Compared to really recent standard library implementations it's not going to be faster, as newer implementations tend to take advantage of things like move semantics to do even more optimizations.
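In modern C++ terms, that ctor/dtor-eliding trick looks roughly like this (a sketch of the general technique, not Boost's actual code):

#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Relocate n elements from src into uninitialized dst, skipping per-element
// constructor/destructor calls entirely when the type is trivially copyable.
template <typename T>
void relocate(T* dst, T* src, std::size_t n)
{
    if constexpr (std::is_trivially_copyable_v<T>) {
        std::memcpy(dst, src, n * sizeof(T));          // no ctor/dtor calls at all
    } else {
        std::uninitialized_move(src, src + n, dst);    // move-construct into dst
        std::destroy(src, src + n);                    // destroy the moved-from sources
    }
}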
It appears to lack insert and erase methods. As these may be "slow," i.e. their performance depends on size() in the vector implementation, they were omitted to prevent the programmer from shooting himself in the foot.
insert and erase are required by the standard for a container to be called a Sequence, so unlike vector, unbounded_array is not a sequence.
No efficiency is gained by failing to be a sequence, per se.
However, it is more efficient in its memory allocation scheme, by avoiding a concept of vector::capacity and always having the allocated block exactly the size of the content. This makes the unbounded_array object smaller and makes the block on the heap exactly as big as it needs to be.
As I understood it from the linked documentation, it is all about allocation strategy. std::vector, AFAIK, postpones allocation until necessary and then might allocate some reasonable chunk of memory; unbounded_array seems to allocate more memory early and therefore might allocate less often. But this is only a guess based on the statement in the documentation that it allocates more memory than might be needed and that allocation is more expensive.