Why do I need std::get_temporary_buffer?

For what purpose should I use std::get_temporary_buffer? The Standard says the following:
Obtains a pointer to storage sufficient to store up to n adjacent T objects.
I thought that the buffer would be allocated on the stack, but that is not true. According to the C++ Standard this buffer is actually not temporary. What advantages does this function have over the global function ::operator new, which doesn't construct the objects either? Am I right that the following statements are equivalent?
int* x;
x = std::get_temporary_buffer<int>( 10 ).first;               // may yield fewer than 10
x = static_cast<int*>( ::operator new( 10 * sizeof(int) ) );  // always 10, or throws
Does this function only exist as syntactic sugar? And why is temporary in its name?
One use case was suggested in Dr. Dobb's Journal, July 01, 1996, for implementing algorithms:
If no buffer can be allocated, or if it is smaller than requested, the algorithm still works correctly; it merely slows down.

Stroustrup says in "The C++ Programming Language" (§19.4.4, SE):
The idea is that a system may keep a number of fixed-sized buffers ready for fast allocation so that requesting space for n objects may yield space for more than n. It may also yield less, however, so one way of using get_temporary_buffer() is to optimistically ask for a lot and then use what happens to be available.
[...] Because get_temporary_buffer() is low-level and likely to be optimized for managing temporary buffers, it should not be used as an alternative to new or allocator::allocate() for obtaining longer-term storage.
He also starts the introduction to the two functions with:
Algorithms often require temporary space to perform acceptably.
... but doesn't seem to provide a definition of temporary or longer-term anywhere.
An anecdote in "From Mathematics to Generic Programming" mentions that Stepanov provided a bogus placeholder implementation in the original STL design, however:
To his surprise, he discovered years later that all the major vendors that provide STL implementations are still using this terrible implementation [...]

Microsoft's standard library guy says the following (here):
Could you perhaps explain when to use 'get_temporary_buffer'
It has a very specialized purpose. Note that it doesn't throw exceptions, like new (nothrow), but it also doesn't construct objects, unlike new (nothrow).
It's used internally by the STL in algorithms like stable_partition(). This happens when there are magic words like N3126 25.3.13 [alg.partitions]/11: stable_partition() has complexity "At most (last - first) * log(last - first) swaps, but only linear number of swaps if there is enough extra memory." When the magic words "if there is enough extra memory" appear, the STL uses get_temporary_buffer() to attempt to acquire working space. If it can, then it can implement the algorithm more efficiently. If it can't, because the system is running dangerously close to out-of-memory (or the ranges involved are huge), the algorithm can fall back to a slower technique.
99.9% of STL users will never need to know about get_temporary_buffer().
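That acquire-optimistically-and-fall-back pattern looks roughly like the following sketch (do_work and the algorithm bodies are placeholders, not actual STL internals):
#include <cstddef>
#include <memory>    // std::get_temporary_buffer, deprecated in C++17
#include <utility>

template <typename T>
void do_work( T* first, T* last )
{
    std::ptrdiff_t n = last - first;
    std::pair<T*, std::ptrdiff_t> buf = std::get_temporary_buffer<T>( n );
    if ( buf.second >= n )
    {
        // enough scratch space: run the fast, buffer-assisted version
    }
    else
    {
        // little or no space: fall back to the slower in-place version
    }
    if ( buf.first )
        std::return_temporary_buffer( buf.first );   // never free with delete
}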

The standard says it allocates storage for up to n elements.
In other words, your example might return a buffer big enough for 5 objects only.
It does seem pretty difficult to imagine a good use case for this though. Perhaps if you're working on a very memory-constrained platform, it's a convenient way to get "as much memory as possible".
But on such a constrained platform, I'd imagine you'd bypass the memory allocator as much as possible, and use a memory pool or something you have full control over.

For what purpose should I use std::get_temporary_buffer?
The function is deprecated in C++17, so the correct answer is now "for no purpose, do not use it".

ptrdiff_t request = 12;
pair<int*, ptrdiff_t> p = get_temporary_buffer<int>( request );
int* base = p.first;
ptrdiff_t respond = p.second;
assert( is_valid( base, base + respond ) );
Here respond may be less than request. By contrast:
size_t require = 12;
int* base = static_cast<int*>( ::operator new( require * sizeof(int) ) );
assert( is_valid( base, base + require ) );
Here the usable size of base must be greater than or equal to require.

Perhaps (just a guess) it has something to do with memory fragmentation. If you keep allocating and deallocating temporary memory, but each time you allocate some long-term memory after allocating the temporary block and before deallocating it, you may end up with a fragmented heap (I guess).
So get_temporary_buffer could be intended as a bigger-than-you-would-need chunk of memory that is allocated once (perhaps with many chunks ready to accept multiple requests); each time you need memory you just take one of the chunks, so the heap doesn't get fragmented.

Related

Utilize memory past the end of a std::vector using a custom overallocating allocator

Let's say I have an allocator my_allocator that will always allocate memory for n+x (instead of n) elements when allocate(n) is called.
Can I safely assume that memory in the range [data()+n, data()+n+x) (for a std::vector<T, my_allocator<T>>) is accessible/valid for further use (i.e. placement new or SIMD loads/stores in the case of fundamentals), as long as there is no reallocation?
Note: I'm aware that everything past data()+n-1 is uninitialized storage. The use case would be a vector of fundamental types (which do not have a constructor anyway) using the custom allocator to avoid special corner cases when throwing SIMD intrinsics at the vector. my_allocator shall allocate storage that is 1.) properly aligned and 2.) of a size that is a multiple of the used register size.
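For concreteness, such an over-allocating, aligning allocator could look roughly like this (a sketch only: my_allocator is the hypothetical name from the question, the alignment and rounding constants are illustrative, and C++17 aligned operator new is assumed):
#include <cstddef>
#include <new>

template <typename T, std::size_t Align = 32, std::size_t Multiple = 32>
struct my_allocator
{
    using value_type = T;

    my_allocator() = default;
    template <typename U>
    my_allocator( const my_allocator<U, Align, Multiple>& ) {}

    T* allocate( std::size_t n )
    {
        // round the byte count up to the next multiple of the register
        // size, so [data()+n, data()+n+x) exists as real storage
        std::size_t bytes = ( n * sizeof(T) + Multiple - 1 ) / Multiple * Multiple;
        return static_cast<T*>( ::operator new( bytes, std::align_val_t( Align ) ) );
    }
    void deallocate( T* p, std::size_t )
    {
        ::operator delete( p, std::align_val_t( Align ) );
    }
};

template <typename T, typename U, std::size_t A, std::size_t M>
bool operator==( const my_allocator<T, A, M>&, const my_allocator<U, A, M>& ) { return true; }
template <typename T, typename U, std::size_t A, std::size_t M>
bool operator!=( const my_allocator<T, A, M>&, const my_allocator<U, A, M>& ) { return false; }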
To make things a little bit more clear:
Let's say I have two vectors and I want to add them:
std::vector<double, my_allocator<double>> a(n), b(n);
// fill them ...
auto c = a + b;
assert(c.size() == n);
If my_allocator yields suitably aligned storage, and if sizeof(double)*(n+x) is always a multiple of the used SIMD register size (and thus a multiple of the number of values per register), I assume that I can do something like
for(size_t i=0u; i<(n+x); i+=y)
{ // where y is the number of doubles per register and a divisor of (n+x)
auto ma = _aligned_load(a.data() + i);
auto mb = _aligned_load(b.data() + i);
_aligned_store(c.data() + i, _simd_add(ma, mb));
}
where I don't have to care about any special case like unaligned loads or a scalar tail loop for some n that is not divisible by y.
But still the vectors only contain n values and can be handled like vectors of size n.
Stepping back a moment, if the problem you are trying to solve is to allow the underlying memory to be processed effectively by SIMD intrinsics or unrolled loops, or both, you don't necessarily need to allocate memory beyond the used amount just to "round off" the allocation size to a multiple of vector width.
There are various approaches used to handle this situation, and you mentioned a couple, such as special lead-in and lead-out code to handle the leading and trailing portions. There are actually two distinct problems here: handling the fact that the data isn't a multiple of the vector width, and handling (possibly) unaligned starting addresses. Your over-allocation method is tackling the first issue, but there's probably a better way...
Most SIMD code in practice can simply read beyond the end of the processed region. Some might argue that this is technically UB - but when using SIMD intrinsics you are already venturing beyond the walls of Standard C++. In fact, this technique is already widely used in the standard library and so it is implicitly endorsed by compiler and library maintainers. It is also a standard method for handling SIMD codes in general, so you can be pretty sure it's not going to suddenly break.
The key to making it work is the observation that if you can validly read even a single byte at some location N, then a naturally aligned read of any size[1] won't trigger a fault. Of course, you still need to ignore or otherwise handle the data you read beyond the end of the officially allocated area - but you'll need to do that anyway with your "allocate extra" approach, right? Depending on the algorithm, you may mask away the invalid data, or exclude invalid data after the SIMD portion is done (i.e., if you are searching for a byte, a match found after the allocated area is the same as "not found").
To make this work, you need to be reading in an aligned fashion, but that's probably something you already want to do, I think. You can either arrange to have your memory allocated aligned in the first place, or do an overlapping read at the start (i.e., one unaligned read first, then all aligned reads, with the first aligned read overlapping the unaligned portion), or use the same trick as the tail to read before the array (with the same reasoning as to why this is safe). Furthermore, there are various tricks to request aligned memory without needing to write your own allocator.
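As a concrete illustration of that technique, here is a sketch for the byte-search case. It assumes x86 SSE2 and the GCC/Clang __builtin_ctz intrinsic; it rounds the start down to alignment, masks off bytes before the buffer, and treats any match past the end as "not found":
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

const char* find_byte( const char* p, std::size_t n, char value )
{
    if ( n == 0 ) return nullptr;   // the premise requires at least one valid byte
    const __m128i needle = _mm_set1_epi8( value );
    // round the start down to a 16-byte boundary; the first aligned load
    // may cover bytes before p, which are masked off below
    const char* chunk = reinterpret_cast<const char*>(
        reinterpret_cast<std::uintptr_t>( p ) & ~std::uintptr_t( 15 ) );
    for ( ;; chunk += 16 )
    {
        __m128i data = _mm_load_si128( reinterpret_cast<const __m128i*>( chunk ) );
        unsigned mask = _mm_movemask_epi8( _mm_cmpeq_epi8( data, needle ) );
        if ( chunk < p )                      // first iteration only:
            mask &= ~0u << ( p - chunk );     // ignore matches before the buffer
        if ( mask )
        {
            const char* hit = chunk + __builtin_ctz( mask );
            return hit < p + n ? hit : nullptr;   // past-the-end match = not found
        }
        if ( chunk + 16 >= p + n )            // this load covered the last byte
            return nullptr;
    }
}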
Overall, my recommendation is to try to avoid writing a custom allocator. Unless the code is fairly tightly contained, you may run into various pitfalls, including other code making wrong assumptions about how your memory was allocated and the various other pitfalls Leon mentions in his answer. Furthermore, using a custom allocator disables a bunch of optimizations used by the standard container algorithms, unless you use it everywhere, since many of them apply only to containers using the same allocator.
Furthermore, when I was actually implementing custom allocators[2], I found that it was a nice idea in theory, but a bit too obscure to be well-supported in an identical fashion across all the compilers. Now the compilers have become a lot more compliant over time (I'm looking mostly at you, Visual Studio), and template support has also improved, so perhaps that's not an issue, but I feel it still falls into the category of "do it only if you must".
Keep in mind also that custom allocators don't compose well - you only get the one! If someone else on your project wants to use a custom allocator for your container for some other reason, they won't be able to do it (although you could coordinate and create a combined allocator).
This question I asked earlier - also motivated by SIMD - covers a lot of the ground about the safety of reading past the end (and, implicitly, before the beginning), and is probably a good place to start if you are considering this.
[1] Technically, the restriction is any aligned read up to the page size, which at 4K or larger is plenty for any of the current vector-oriented general purpose ISAs.
[2] In this case, I was doing it not for SIMD, but basically to avoid malloc() and to allow partially on-stack and contiguous fast allocations for containers with many small nodes.
For your use case you shouldn't have any doubts. However, if you decide to store anything useful in the extra space and will allow the size of your vector to change during its lifetime, you will probably run into problems dealing with the possibility of reallocation - how are you going to transfer the extra data from the old allocation to the new allocation given that reallocation happens as a result of separate calls to allocate() and deallocate() with no direct connection between them?
EDIT (addressing the code added to the question)
In my original answer I meant that you shouldn't have any problem accessing the extra bytes allocated by your allocator in excess of what was requested. However, writing data in the memory range, that is outside the range currently utilized by the vector object but belongs to the range that would be spanned by the unmodified allocation, asks for trouble. An implementation of std::vector is free to request from the allocator more memory than would be exposed through its size()/capacity() functions and store auxiliary data in the unused area. Though this is highly theoretical, not accounting for that possibility means opening a door into undefined behavior.
Consider the following possible layout of the vector's allocation:
---====================++++++++++------.........
=== - (1) used capacity of the vector
+++ - (2) unused capacity of the vector
--- - (3) overallocated by the vector (but not shown as part of its capacity)
... - (4) overallocated by your allocator
You MUST NOT write anything in regions 2 (+++) and 3 (---). All your writes must be constrained to region 4 (...), otherwise you may corrupt important bits.

Does an allocation hint get used?

I was reading Why is there no reallocation functionality in C++ allocators? and Is it possible to create an array on the heap at run-time, and then allocate more space whenever needed?, which clearly state that reallocation of a dynamic array of objects is impossible.
However, in The C++ Standard Library by Josuttis, it states that an allocator has a member function allocate with the following signature:
pointer allocator::allocate(size_type num, allocator<void>::pointer hint = 0)
where the hint has an implementation-defined meaning, which may be used to help improve performance.
Are there any implementations that take advantage of this?
I have gained significant performance advantages for iteration times on small scalar types in my plf::colony C++ container by using hints with std::allocator under Visual Studio 2010-2013 (iteration speed increased by ~21%), and much smaller speedups under GCC 5.1. So it's safe to say that with those compilers and std::allocator, it makes a difference. But the difference will be compiler-dependent, and I am not aware of the ratio of hint-ignoring to hint-observing allocators.
I'm not sure about specific implementations, but note that the allocator isn't allowed to return the hint pointer value before it's been passed to deallocate. So that can't be used as a primitive operation to form a reallocate.
The Standard says the hint must have been returned by a previous call to allocate. It says "The use of [the hint] is unspecified, but it is intended as an aid to locality." So if you're allocating and releasing a sequence of similar-sized blocks on one thread, you might pass the previously-freed value to avoid cache contention between microprocessor caches.
Otherwise, when CPU B sees that you're using memory addresses still in CPU A's cache (even if that memory contains objects that were already destroyed, as far as C++ is concerned), it must forward the junk data over the bus. Better to let CPU A and B each reuse their own respective cached addresses.
C++11 states, in 20.6.9.1 allocator members:
4 - [ Note: In a container member function, the address of an adjacent element is often a good choice to pass for the hint argument. — end note ]
[...]
6 - [...] The use of hint is unspecified, but intended as an aid to locality if an implementation so desires.
Allocating new elements adjacent or close to existing elements in memory can aid performance by improving locality; because they are usually cached together, nearby elements will tend to travel together up the memory hierarchy and will not evict each other.
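A minimal sketch of passing such a hint (the two-argument allocate overload was deprecated in C++17 and removed in C++20, so this assumes an older dialect; the implementation is free to ignore the hint entirely):
#include <memory>

int main()
{
    std::allocator<int> a;
    int* first = a.allocate( 100 );
    // hint with an existing, still-live allocation: the implementation may
    // place the new block nearby to improve locality, or ignore the hint
    int* second = a.allocate( 100, first );
    a.deallocate( second, 100 );
    a.deallocate( first, 100 );
}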

Efficient Array Reallocation in C++

How would I efficiently resize an array allocated using some standards-conforming C++ allocator? I know that no facilities for reallocation are provided in the C++ allocator interface, but did the C++11 revision enable us to work with them more easily? Suppose that I have a class foo with a copy-assignment operator foo& operator=(const foo& x) defined. If x.size() > this->size(), I'm forced to:
1. Call allocator.destroy() on all elements in the internal storage of foo.
2. Call allocator.deallocate() on the internal storage of foo.
3. Allocate a new buffer with enough room for x.size() elements.
4. Use std::uninitialized_copy to populate the storage.
Is there some way to more easily reallocate the internal storage of foo without having to go through all of this? I could provide an actual code sample if you think that it would be useful, but I feel that it would be unnecessary here.
Based on a previous question, the approach that I took for handling large arrays that could grow and shrink with reasonable efficiency was to write a container similar to a deque that broke the array down into multiple pages of smaller arrays. So for example, say we have an array of n elements, we select a page size p, and create 1 + n/p arrays (pages) of p elements. When we want to re-allocate and grow, we simply leave the existing pages where they are, and allocate the new pages. When we want to shrink, we free the totally empty pages.
The downside is that array access is slightly slower: given an index i, you need the page = i / p and the offset into the page = i % p to get the element. I find this is still very fast, however, and it provides a good solution. Theoretically, std::deque should do something very similar, but for the cases I tried with large arrays it was very slow. See comments and notes on the linked question for more details.
There is also a memory inefficiency: given n elements, we are always holding up to p - n % p elements in reserve, since we only ever allocate or deallocate complete pages. This was the best solution I could come up with in the context of large arrays with the requirement for resizing and fast access; while I don't doubt there are better solutions, I'd love to see them.
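A minimal sketch of that paged layout (paged_array and its members are illustrative names, not an actual library; copy control and shrinking are omitted):
#include <cstddef>
#include <vector>

template <typename T, std::size_t PageSize = 4096>
class paged_array
{
    std::vector<T*> pages_;   // each page holds exactly PageSize elements
    std::size_t size_ = 0;
public:
    T& operator[]( std::size_t i )
    {
        // page index and offset within the page
        return pages_[i / PageSize][i % PageSize];
    }
    void grow( std::size_t n )
    {
        // growing allocates whole new pages; existing pages stay where
        // they are, so no elements are ever copied
        while ( pages_.size() * PageSize < n )
            pages_.push_back( new T[PageSize]() );
        size_ = n;
    }
    std::size_t size() const { return size_; }
    ~paged_array() { for ( T* p : pages_ ) delete[] p; }
};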
A similar problem also arises if x.size() > this->size() in foo& operator=(foo&& x).
No, it doesn't. You just swap.
There is no function that will resize in place or return 0 on failure (to resize). I don't know of any operating system that supports that kind of functionality beyond telling you how big a particular allocation actually is.
Operating systems do, however, have the support needed to implement realloc; but realloc falls back to a copy when it cannot resize in place.
So you can't have it, because the C++ language would not be implementable on most current operating systems if you had to add a standard function to do it.
There are the C++11 rvalue reference and move constructors.
There's a great video talk on them.
Even if reallocation existed, in a copy-assignment operator you could only avoid step 2 from your question. In the case of a growing internal buffer, however, reallocation could save all four operations.
Is the internal buffer of your array contiguous? If so, see the answer to your linked question.
If not, a hashed array tree or array list may be your choice to avoid reallocation.
Interestingly, the default allocator for g++ is smart enough to use the same address for consecutive deallocations and allocations of larger sizes, as long as there is enough unused space after the end of the initially-allocated buffer. While I haven't tested what I'm about to claim, I doubt that there is much of a time difference between malloc/realloc and allocate/deallocate/allocate.
This leads to a potentially very dangerous, nonstandard shortcut that may work if you know that there is enough room after the current buffer so that a reallocation would not result in a new address:
1. Deallocate the current buffer without calling allocator.destroy().
2. Allocate a new, larger buffer and check the returned address.
3. If the new address equals the old address, proceed happily; otherwise, you have lost your data.
4. Call allocator.construct() for the elements in the newly-allocated space.
I wouldn't advocate using this for anything other than satisfying your own curiosity, but it does work on g++ 4.6.
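For the curious, those steps come out roughly as follows (this is undefined behavior by the Standard and destroys your data if the block moves; a curiosity-only sketch, not advice):
#include <cassert>
#include <memory>
#include <new>

int main()
{
    std::allocator<int> alloc;
    int* old_buf = alloc.allocate( 100 );
    // ... use old_buf[0..99] ...
    alloc.deallocate( old_buf, 100 );       // (1) free without destroying
    int* new_buf = alloc.allocate( 200 );   // (2) request a larger block
    assert( new_buf == old_buf );           // (3) hope it did not move!
    for ( int i = 100; i < 200; ++i )       // (4) construct only the new tail
        new ( new_buf + i ) int();
    alloc.deallocate( new_buf, 200 );
}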

Why can't the runtime environment decide to apply delete or delete[] instead of the programmer?

I've read that the delete[] operator is needed because the runtime environment does not keep information about whether the allocated block is an array of objects that require destructor calls, but it does keep information about where in memory the allocated block is stored, and also, of course, the size of the block.
It would take just one more bit of meta data to remember if destructors need to be called on delete or not, so why not just do that?
I'm pretty sure there's a good explanation, I'm not questioning it, I just wish to know it.
I think the reason is that C++ doesn't force you into anything you don't want. Tracking this would add extra metadata, and if someone didn't use it, that extra overhead would be forced upon them, in contrast to the design goals of the C++ language.
When you want the capability you described, C++ does provide a way. It's called std::vector, and you should nearly always prefer it, another sort of container, or a smart pointer over raw new and delete.
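For illustration, here is what that preference looks like in practice (a trivial sketch):
#include <vector>

void with_raw_new()
{
    int* a = new int[100];   // the runtime must remember "array of 100"
    delete[] a;              // ...and you must remember to write delete[]
}

void with_vector()
{
    std::vector<int> v( 100 );   // the bookkeeping is the vector's problem
}                                // storage reclaimed here automatically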
C++ lets you be as efficient as possible, so if implementations had to track the number of elements in a block, that would be an extra 4 bytes used per block.
This could be useful to a lot of people, but it also prevents total efficiency for people who don't mind writing [].
It's similar to the difference between C++ and Java. Java can be much faster to program because you never have to worry about garbage collection, but C++, if programmed correctly, can be more efficient and use less memory, because it doesn't have to store any of those variables and you can decide when to delete memory blocks.
It basically comes down to the language design not wanting to put too many restrictions on implementors. Many C++ runtimes use malloc() for ::operator new() and free() (more or less) for ::operator delete(). Standard malloc/free don't provide the bookkeeping necessary for recording a number of elements, and provide no way of determining the malloc'd size at free time. Adding another level of memory manipulation between new Foo and malloc for every single object is, from the C/C++ point of view, a pretty big jump in complexity/abstraction. Among other things, adding this overhead to every object would interfere with memory management approaches that are designed around knowing the sizes of objects.
There are two things that need be cleared up here.
First: the assumption that malloc keeps the precise size you asked for.
Not really. malloc only cares about providing a block that is large enough. Although for efficiency reasons it probably won't overallocate much, it will still probably give you a block of a "standard" size, for example 2^n bytes. Therefore the real size (as in, the number of objects actually allocated) is effectively unknown.
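That rounding can actually be observed with glibc's nonstandard malloc_usable_size() (a glibc-specific call; other platforms offer different equivalents or none):
#include <malloc.h>   // malloc_usable_size (glibc)
#include <cstdio>
#include <cstdlib>

int main()
{
    void* p = std::malloc( 100 );
    // typically prints a value >= 100 (e.g. 104 on glibc), showing the
    // block was rounded up to one of the allocator's bucket sizes
    std::printf( "%zu\n", malloc_usable_size( p ) );
    std::free( p );
}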
Second: the "extra bit" required
Indeed, the information required for a given object to know whether it is part of an array or not would logically be just an extra bit.
As far as implementation is concerned, though: where would you actually put that bit?
The memory allocated for the object itself should probably not be touched; the object is using it, after all. So?
On some platforms this could be kept in the pointer itself (some platforms ignore a portion of the bits), but this is not portable.
So it would require extra storage, at least a byte, except that with alignment issues it could well amount to 8 bytes.
Demonstration: (not convincing as noted by sth, see below)
// A plain array of doubles:
+-------+-------+-------
| 0 | 1 | 2
+-------+-------+-------
// A tentative to stash our extra bit
+-------++-------++-------++
| 0 || 1 || 2 ||
+-------++-------++-------++
// A correction since we introduced alignment issues
// Note: double's alignment is most probably its own size
+-------+-------+-------+-------+-------+-------
| 0 | bit | 1 | bit | 2 | bit
+-------+-------+-------+-------+-------+-------
Humpf!
EDIT
Therefore, on most platforms (where addresses do matter), you would need to "extend" each pointer, and actually double its size (alignment issues).
Is it acceptable for all pointers to be twice as large only so that you can stash that extra bit? For most people, I guess it would be. But C++ is not designed for most people; it is primarily designed for people who care about performance, whether speed or memory, and as such this is not acceptable.
END OF EDIT
So what is the correct answer? The correct answer is that recovering information that the type system lost is costly. Unfortunately.

Why is 'unbounded_array' more efficient than 'vector'?

It says here that:
The unbounded array is similar to a std::vector in that it can grow in size beyond any fixed bound. However unbounded_array is aimed at optimal performance. Therefore unbounded_array does not model a Sequence like std::vector does.
What does this mean?
As a Boost developer myself, I can tell you that it's perfectly fine to question the statements in the documentation ;-)
From reading those docs, and from reading the source code (see storage.hpp), I can say that it's somewhat correct, given some assumptions about the implementation of std::vector at the time that code was written. That code initially dates to 2000, and perhaps as late as 2002, which means that at the time many standard library implementations did not do a good job of optimizing destruction and construction of objects in containers. The claim about the non-resizing is easily refuted by using a vector with an initially large capacity. The claim about speed, I think, comes entirely from the fact that unbounded_array has special code for eliding dtors & ctors when the stored objects have trivial implementations of them. Hence it can avoid calling them when it has to rearrange things, or when it's copying elements. Compared to really recent standard library implementations it's not going to be faster, as new implementations tend to take advantage of things like move semantics to do even more optimizations.
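A sketch of that ctor/dtor-elision trick in modern terms (the general technique, not Boost's actual storage.hpp code):
#include <type_traits>

template <typename T>
void destroy_range( T* first, T* last )
{
    // for trivially destructible types (int, double, PODs...) the compiler
    // removes this loop entirely, which is the optimization in question
    if ( !std::is_trivially_destructible<T>::value )
    {
        for ( ; first != last; ++first )
            first->~T();
    }
}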
It appears to lack insert and erase methods. As these may be "slow", i.e. their performance depends on size() in the vector implementation, they were omitted to prevent the programmer from shooting himself in the foot.
insert and erase are required by the standard for a container to be called a Sequence, so unlike vector, unbounded_array is not a sequence.
No efficiency is gained by failing to be a sequence, per se.
However, it is more efficient in its memory allocation scheme, by avoiding a concept of vector::capacity and always having the allocated block exactly the size of the content. This makes the unbounded_array object smaller and makes the block on the heap exactly as big as it needs to be.
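The difference in allocation strategy is easy to observe (a sketch; the exact capacity growth is implementation-defined):
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> v;
    for ( int i = 0; i < 8; ++i )
    {
        v.push_back( i );
        // capacity() >= size(): vector keeps spare room so that push_back
        // is amortized O(1); an exact-size scheme like unbounded_array's
        // would reallocate and copy the content on every growth instead
        std::printf( "size=%zu capacity=%zu\n", v.size(), v.capacity() );
    }
}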
As I understood it from the linked documentation, it is all about allocation strategy. std::vector, afaik, postpones allocation until necessary, and then might allocate some reasonable chunk of memory; unbounded_array seems to allocate more memory early, and therefore might allocate less often. But this is only a guess based on the statement in the documentation that it allocates more memory than might be needed, and that the allocation is more expensive.