Does an allocation hint get used?

I was reading Why is there no reallocation functionality in C++ allocators? and Is it possible to create an array on the heap at run-time, and then allocate more space whenever needed?, which clearly state that reallocation of a dynamic array of objects is impossible.
However, The C++ Standard Library by Josuttis states that an allocator, such as the default std::allocator, has a member function allocate with the following signature
pointer allocator::allocate(size_type num, allocator<void>::pointer hint = 0)
where the hint has an implementation-defined meaning, which may be used to help improve performance.
Are there any implementations that take advantage of this?

I have gained significant performance advantages for iteration times on small scalar types in my plf::colony C++ container using hints with std::allocator under Visual Studio 2010-2013 (iteration speed increased by ~21%), and much smaller speedups under GCC 5.1. So it's safe to say that with those compilers and std::allocator, the hint makes a difference; but the difference will be compiler-dependent. I am not aware of the ratio of hint-ignoring to hint-observing allocators.

I'm not sure about specific implementations, but note that the allocator isn't allowed to return the hint pointer value before it's been passed to deallocate. So that can't be used as a primitive operation to form a reallocate.
The Standard says the hint must have been returned by a previous call to allocate. It says "The use of [the hint] is unspecified, but it is intended as an aid to locality." So if you're allocating and releasing a sequence of similar-sized blocks on one thread, you might pass the previously-freed value to avoid cache contention between microprocessor caches.
Otherwise, when CPU B sees that you're using memory addresses still in CPU A's cache (even if that memory contains objects that were destroyed as far as C++ is concerned), it must forward the junk data over the bus. Better to let CPU A and CPU B each reuse their own respective cached addresses.

C++11 states, in 20.6.9.1 allocator members:
4 - [ Note: In a container member function, the address of an adjacent element is often a good choice to pass for the hint argument. — end note ]
[...]
6 - [...] The use of hint is unspecified, but intended as an aid to locality if an implementation so desires.
Allocating new elements adjacent or close to existing elements in memory can aid performance by improving locality; because they are usually cached together, nearby elements will tend to travel together up the memory hierarchy and will not evict each other.
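As a minimal sketch (pre-C++17 std::allocator, where the two-argument allocate overload still exists), passing a previous allocation as the hint looks like this; whether it has any effect at all is up to the implementation:

#include <memory>

int main() {
    std::allocator<int> alloc;
    int* first = alloc.allocate(100);
    // The second argument is only a locality hint; an implementation may
    // try to place this block near 'first', or ignore the hint entirely.
    int* second = alloc.allocate(100, first);
    alloc.deallocate(second, 100);
    alloc.deallocate(first, 100);
}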

Related

what happens in memory for move in C++

I am reading the move section of C++ Primer and am confused about the move implementation.
Let us say we have a vector with 4 elements occupying 4 consecutive memory locations, MEM[0~3]. The capacity of the vector is 4.
Assume MEM[4] is not available because it is occupied by another thread or program, or for whatever other reason. Is this possible?
Now we need to add another element. Because we must maintain consecutive memory, we can only find another piece of consecutive memory that can host 8 vector entries, for example MEM[5~12]. In that case we copy the contents from MEM[0~3] to MEM[5~8] and then add the new element at MEM[9], right?
There is no way we could reuse the old MEM[0~3] and increase capacity while maintaining consecutive addresses.
If it were a linked list, I could understand the move. But for an array-like container, I am a bit confused. Please help explain a bit. Thanks.
This is completely outside the scope of the C++ standard.
There's nothing in the standard that prohibits this optimization. It's certainly possible that a particular C++ implementation could determine that a std::vector which has already used up its reserve()d capacity can be expanded in place, without allocating larger storage and moving the existing contents of the vector into it.
If so, that would be a very plausible and sensible optimization, but nothing in the C++ standard requires it either. Given that std::vector's growth strategy already guarantees amortized constant-time insertion, it is reasonable to conclude that tracking memory allocation at such a level of detail would produce only marginal gains in exchange for larger bookkeeping overhead, and might actually incur more overhead overall.
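A quick way to see the usual behaviour (a minimal sketch; the exact capacities are implementation-defined) is to watch data() change when growth forces a new allocation:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4};   // size 4; capacity is typically 4 here
    const int* before = v.data();
    v.push_back(5);                   // capacity exceeded: allocate a new block,
                                      // move/copy the elements, free the old block
    const int* after = v.data();
    std::cout << std::boolalpha << "storage moved: " << (before != after) << '\n';
}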

Utilize memory past the end of a std::vector using a custom overallocating allocator

Let's say I have an allocator my_allocator that will always allocate memory for n+x (instead of n) elements when allocate(n) is called.
Can I safely assume that memory in the range [data()+n, data()+n+x) (for a std::vector<T, my_allocator<T>>) is accessible/valid for further use (i.e. placement new, or SIMD loads/stores in the case of fundamental types), as long as there is no reallocation?
Note: I'm aware that everything past data()+n-1 is uninitialized storage. The use case would be a vector of fundamental types (which do not have constructors anyway) using the custom allocator to avoid special corner cases when throwing SIMD intrinsics at the vector. my_allocator shall allocate storage that is 1) properly aligned and 2) of a size that is a multiple of the SIMD register size.
To make things a little bit more clear:
Let's say I have two vectors and I want to add them:
std::vector<double, my_allocator<double>> a(n), b(n);
// fill them ...
auto c = a + b;
assert(c.size() == n);
If my_allocator now returns aligned storage and sizeof(double)*(n+x) is always a multiple of the SIMD register size (and thus a multiple of the number of values per register), I assume that I can do something like
for(size_t i = 0u; i < (n + x); i += y)
{   // where y is the number of doubles per register and a divisor of (n+x)
    auto ma = _aligned_load(a.data() + i);
    auto mb = _aligned_load(b.data() + i);
    _aligned_store(c.data() + i, _simd_add(ma, mb));
}
where I don't have to care about any special case like unaligned loads or a scalar backlog from some n that is not divisible by y.
But still the vectors only contain n values and can be handled like vectors of size n.
Stepping back a moment, if the problem you are trying to solve is to allow the underlying memory to be processed effectively by SIMD intrinsics or unrolled loops, or both, you don't necessarily need to allocate memory beyond the used amount just to "round off" the allocation size to a multiple of vector width.
There are various approaches used to handle this situation, and you mentioned a couple, such as special lead-in and lead-out code to handle the leading and trailing portions. There are actually two distinct problems here - handling the fact that the data length isn't a multiple of the vector width, and handling (possibly) unaligned starting addresses. Your over-allocation method is tackling the first issue - but there's probably a better way...
Most SIMD code in practice can simply read beyond the end of the processed region. Some might argue that this is technically UB - but when using SIMD intrinsics you are already venturing beyond the walls of Standard C++. In fact, this technique is already widely used in the standard library and so it is implicitly endorsed by compiler and library maintainers. It is also a standard method for handling SIMD codes in general, so you can be pretty sure it's not going to suddenly break.
The key to making it work is the observation that if you can validly read even a single byte at some location, then a naturally aligned read of any size1 containing that byte won't trigger a fault. Of course, you still need to ignore or otherwise handle the data you read beyond the end of the officially allocated area - but you'll need to do that anyway with your "allocate extra" approach, right? Depending on the algorithm, you may mask away the invalid data, or exclude invalid data after the SIMD portion is done (i.e., if you are searching for a byte and you find one after the allocated area, it's the same as "not found").
To make this work, you need to be reading in an aligned fashion, but that's probably something you already want to do I think. You can either arrange to have your memory allocated aligned in the first place, or do an overlapping read at the start (i.e., one unaligned read first, then all aligned with the first aligned read overlapping the unaligned portion), or use the same trick as the tail to read before the array (with the same reasoning as to why this is safe). Furthermore, there are various tricks to request aligned memory without needing to write your own allocator.
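As an illustration of the read-past-the-end approach (a sketch only; it assumes SSE2, a 16-byte-aligned buffer, and the GCC/Clang __builtin_ctz intrinsic), a byte search might look like this:

#include <immintrin.h>
#include <cstddef>

// Search for 'needle' 16 bytes at a time. The aligned loads may read up to
// 15 bytes past the end of the buffer, but never cross into another page,
// so they cannot fault; match bits beyond 'n' are simply masked away.
std::ptrdiff_t find_byte(const unsigned char* p, std::size_t n, unsigned char needle) {
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(needle));
    for (std::size_t i = 0; i < n; i += 16) {
        __m128i chunk = _mm_load_si128(reinterpret_cast<const __m128i*>(p + i));
        unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, pattern));
        if (n - i < 16)
            mask &= (1u << (n - i)) - 1;   // ignore lanes beyond the end
        if (mask)
            return static_cast<std::ptrdiff_t>(i) + __builtin_ctz(mask);
    }
    return -1;   // not found
}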
Overall, my recommendation is to try to avoid writing a custom allocator. Unless the code is fairly tightly contained, you may run into various pitfalls, including other code making wrong assumptions about how your memory was allocated and the various other pitfalls Leon mentions in his answer. Furthermore, using a custom allocator disables a bunch of optimizations used by the standard container algorithms, unless you use it everywhere, since many of them apply only to containers using the same allocator.
Furthermore, when I was actually implementing custom allocators2 , I found that it was a nice idea in theory, but a bit too obscure to be well-supported in an identical fashion across all the compilers. Now the compilers have become a lot more compliant over time (I'm looking mostly at you, Visual Studio), and template support has also improved, so perhaps that's not an issue, but I feel it still falls into the category of "do it only if you must".
Keep in mind also that custom allocators don't compose well - you only get the one! If someone else on your project wants to use a custom allocator for your container for some other reason, they won't be able to do it (although you could coordinate and create a combined allocator).
This question I asked earlier - also motivated by SIMD - covers a lot of the ground about the safety of reading past the end (and, implicitly, before the beginning), and is probably a good place to start if you are considering this.
1 Technically, the restriction is any aligned read up to the page size, which at 4K or larger is plenty for any of the current vector-oriented general purpose ISAs.
2 In this case, I was doing it not for SIMD, but basically to avoid malloc() and to allow partially on-stack and contiguous fast allocations for containers with many small nodes.
For your use case you shouldn't have any doubts. However, if you decide to store anything useful in the extra space and will allow the size of your vector to change during its lifetime, you will probably run into problems dealing with the possibility of reallocation - how are you going to transfer the extra data from the old allocation to the new allocation given that reallocation happens as a result of separate calls to allocate() and deallocate() with no direct connection between them?
EDIT (addressing the code added to the question)
In my original answer I meant that you shouldn't have any problem accessing the extra bytes allocated by your allocator in excess of what was requested. However, writing data to memory that is outside the range currently utilized by the vector object but within the range spanned by the original allocation is asking for trouble. An implementation of std::vector is free to request from the allocator more memory than it exposes through its size()/capacity() functions and store auxiliary data in the unused area. Though this is highly theoretical, not accounting for that possibility means opening a door into undefined behavior.
Consider the following possible layout of the vector's allocation:
---====================++++++++++------.........
=== - used capacity of the vector
+++ - unused capacity of the vector
--- - overallocated by the vector (but not shown as part of its capacity)
... - overallocated by your allocator
You MUST NOT write anything in the +++ and --- regions. All your writes must be constrained to the ... region, otherwise you may corrupt important bits.

Performance of container of objects vs performance of container of pointers

class C { ... };
std::vector<C> vc;
std::vector<C*> pvc;
std::vector<std::unique_ptr<C>> upvc;
Depending on the size of C, either the approach storing by value or the approach storing by pointer will be more efficient.
Is it possible to approximately know what this size is (on both a 32 and 64 bit platform)?
Yes, it is possible - benchmark it. Due to how CPU caches work these days, things are not simple anymore.
Check out this lecture about linked lists by Bjarne Stroustrup:
https://www.youtube.com/watch?v=YQs6IC-vgmo
Here is an excellent lecture by Scott Meyers about CPU caches: https://www.youtube.com/watch?v=WDIkqP4JbkE
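To make "benchmark it" concrete, a minimal sketch (C++14; the payload size and element count are arbitrary) that compares traversing values against traversing owning pointers could look like this:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

struct C { double payload[4]; };   // hypothetical element type

template <class F>
long long time_ns(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    constexpr std::size_t n = 1'000'000;
    std::vector<C> by_value(n);
    std::vector<std::unique_ptr<C>> by_pointer;
    for (std::size_t i = 0; i < n; ++i)
        by_pointer.push_back(std::make_unique<C>());

    volatile double sink = 0;   // keep the sums from being optimized away
    std::cout << "by value:   " << time_ns([&] {
        double s = 0;
        for (const auto& c : by_value) s += c.payload[0];
        sink = s;
    }) << " ns\n";
    std::cout << "by pointer: " << time_ns([&] {
        double s = 0;
        for (const auto& p : by_pointer) s += p->payload[0];
        sink = s;
    }) << " ns\n";
}

In a real benchmark you would repeat each measurement several times and vary sizeof(C) to find the crossover point for your platform.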
Let's look at the details of each example before drawing any conclusions.
Vector of Objects
A vector of objects incurs an initial performance hit: when an object is added to the vector, a copy is made. The vector will also make copies when it needs to expand its reserved memory. Larger objects take more time to copy, as do complex or compound objects.
Accessing the objects is very efficient - only one dereference. If the vector fits inside the processor's data cache, access is very efficient.
Vector of Raw Pointers
This may have an initialization performance hit: if the objects live in dynamic memory, that memory must be allocated first.
Copying a pointer into a vector is not dependent on the object size. This may be a performance savings depending on the object size.
Accessing the objects takes a performance hit: there are two dereferences before you get to the object. Most processors don't follow pointers when loading their data cache, so the processor may have to reload the data cache when the pointer to the object is dereferenced.
Vector of Smart Pointers
A little bit more costly in performance than a raw pointer. However, the items will automatically be deleted when the vector is destructed. The raw pointers must be deleted before the vector can be destructed; or a memory leak is created.
Summary
The safest version is to have copies in the vector, but has performance hits depending on the size of the object and the frequency of reallocating the reserved memory area. A vector of pointers takes performance hits because of the double dereferencing, but doesn't incur extra performance hits when copying because pointers are a consistent size. A vector of smart pointers may take additional performance hits compared to a vector of raw pointers.
The real truth can be found by profiling the code. The performance savings of one data structure versus another may disappear when waiting for I/O operations, such as networking or file I/O.
Operations with the data structures may need to be performed a huge number of times for the savings to be significant. For example, if the difference between the worst performing data structure and the best is 10 nanoseconds, you will need to perform the operation at least 1E+6 times for the savings to become noticeable. If a second is significant, expect to access the data structures far more times (1E+9).
I suggest picking one data structure and moving on. Your time developing the code is worth more than the time that the program runs. Safety and Robustness are also more important. An unsafe program will consume more of your time fixing issues than a safe and robust version.
For a Plain Old Data (POD) type, a vector of that type is always more efficient than a vector of pointers to that type at least until sizeof(POD) > sizeof(POD*).
Almost always, the same is true for a POD type at least until sizeof(POD) > 2 * sizeof(POD*), due to superior memory locality and lower total memory usage compared to dynamically allocating the objects being pointed to.
This kind of analysis will hold true up until sizeof(POD) crosses some threshold for your architecture, compiler and usage that you would need to discover experimentally through benchmarking. The above only puts lower bounds on that size for POD types.
It is difficult to say anything definitive about all non-POD types as their operations (e.g. - default constructor, copy constructors, assignment, etc.) can be as inexpensive as a POD's or arbitrarily more expensive.

Why is 'unbounded_array' more efficient than 'vector'?

It says here that
The unbounded array is similar to a std::vector in that it can grow in size beyond any fixed bound. However unbounded_array is aimed at optimal performance. Therefore unbounded_array does not model a Sequence like std::vector does.
What does this mean?
As a Boost developer myself, I can tell you that it's perfectly fine to question the statements in the documentation ;-)
From reading those docs, and from reading the source code (see storage.hpp), I can say that it's somewhat correct, given some assumptions about the implementation of std::vector at the time that code was written. That code dates initially to 2000, and perhaps as late as 2002, which means that at the time many standard library implementations did not do a good job of optimizing destruction and construction of objects in containers. The claim about the non-resizing is easily refuted by using a vector with an initially large capacity. The claim about speed, I think, comes entirely from the fact that unbounded_array has special code for eliding dtors & ctors when the stored objects have trivial implementations of them; hence it can avoid calling them when it has to rearrange things, or when it's copying elements. Compared to really recent standard library implementations it's not going to be faster, as new implementations tend to take advantage of things like move semantics to do even more optimizations.
It appears to lack insert and erase methods. As these may be "slow", i.e. their performance depends on size() in the vector implementation, they were omitted to prevent the programmer from shooting himself in the foot.
insert and erase are required by the standard for a container to be called a Sequence, so unlike vector, unbounded_array is not a sequence.
No efficiency is gained by failing to be a sequence, per se.
However, it is more efficient in its memory allocation scheme, by avoiding a concept of vector::capacity and always having the allocated block exactly the size of the content. This makes the unbounded_array object smaller and makes the block on the heap exactly as big as it needs to be.
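A small sketch of that difference (assuming Boost.uBLAS's unbounded_array from storage.hpp; exact vector capacities are implementation-defined):

#include <boost/numeric/ublas/storage.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v(10);
    v.resize(15);                     // may leave spare capacity behind
    std::cout << "vector capacity:      " << v.capacity() << '\n';

    boost::numeric::ublas::unbounded_array<double> a(10);
    a.resize(15);                     // the block is exactly 15 elements, no spare capacity
    std::cout << "unbounded_array size: " << a.size() << '\n';   // there is no capacity()
}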
As I understood it from the linked documentation, it is all about allocation strategy. std::vector, AFAIK, postpones allocation until necessary and then might allocate some reasonable chunk of memory, whereas unbounded_array seems to allocate more memory early and therefore might allocate less often. But this is only a guess based on the statement in the documentation that it allocates more memory than might be needed and that the allocation is more expensive.

Determining maximum possible alignment in C++

Is there any portable way to determine what the maximum possible alignment for any type is?
For example on x86, SSE instructions require 16-byte alignment, but as far as I'm aware, no instructions require more than that, so any type can be safely stored into a 16-byte aligned buffer.
I need to create a buffer (such as a char array) where I can write objects of arbitrary types, and so I need to be able to rely on the beginning of the buffer to be aligned.
If all else fails, I know that allocating a char array with new is guaranteed to have maximum alignment, but with the TR1/C++0x templates alignment_of and aligned_storage, I am wondering if it would be possible to create the buffer in-place in my buffer class, rather than requiring the extra pointer indirection of a dynamically allocated array.
Ideas?
I realize there are plenty of options for determining the max alignment for a bounded set of types: A union, or just alignment_of from TR1, but my problem is that the set of types is unbounded. I don't know in advance which objects must be stored into the buffer.
In C++11, std::max_align_t, defined in header <cstddef>, is a POD type whose alignment requirement is at least as strict (as large) as that of every scalar type.
Using the new alignof operator, it would be as simple as alignof(std::max_align_t).
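For example, a minimal sketch (C++11) of an in-place buffer that is suitably aligned for any ordinary (non-over-aligned) type:

#include <cstddef>

struct arbitrary_buffer {
    // 256 is an arbitrary size; alignas(std::max_align_t) gives the buffer
    // the strictest alignment required by any scalar type.
    alignas(std::max_align_t) unsigned char storage[256];
};

static_assert(alignof(std::max_align_t) >= alignof(long double),
              "max_align_t is at least as strictly aligned as every scalar type");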
In C++0x, the Align template parameter of std::aligned_storage<Len, Align> has a default argument of "default-alignment," which is defined as (N3225 §20.7.6.6 Table 56):
The value of default-alignment shall be the most stringent alignment requirement for any C++ object type whose size is no greater than Len.
It isn't clear whether SSE types would be considered "C++ object types."
The default argument wasn't part of the TR1 aligned_storage; it was added for C++0x.
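For comparison, a small sketch of relying on that default (the Len of 64 here is arbitrary; over-aligned SSE types may or may not be covered, as noted above):

#include <type_traits>

// With Align defaulted, the storage gets the most stringent alignment
// required by any C++ object type whose size is no greater than Len.
typedef std::aligned_storage<64>::type slot_t;

static_assert(alignof(slot_t) >= alignof(long double),
              "default alignment covers ordinary scalar types");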
Unfortunately ensuring max alignment is a lot tougher than it should be, and there are no guaranteed solutions AFAIK. From the GotW blog (Fast Pimpl article):
union max_align {
    short       dummy0;
    long        dummy1;
    double      dummy2;
    long double dummy3;
    void*       dummy4;
    /* ...and pointers to functions, pointers to
       member functions, pointers to member data,
       pointers to classes, eye of newt, ... */
};

union {
    max_align m;
    char x_[sizeofx];
};
This isn't guaranteed to be fully portable, but in practice it's close enough because there are few or no systems on which this won't work as expected.
That's about the closest 'hack' I know for this.
There is another approach that I've used personally for super fast allocation. Note that it is evil, but I work in raytracing fields where speed is one of the greatest measures of quality and we profile code on a daily basis. It involves using a heap allocator with pre-allocated memory that works like the local stack (just increments a pointer on allocation and decrements one on deallocation).
I use it for Pimpls particularly. However, just having the allocator is not enough; for such an allocator to work, we have to assume that memory for a class, Foo, is allocated in a constructor, the same memory is likewise deallocated only in the destructor, and that Foo itself is created on the stack. To make it safe, I needed a function to see if the 'this' pointer of a class is on the local stack to determine if we can use our super fast heap-based stack allocator. For that we had to research OS-specific solutions: I used TIBs and TEBs for Win32/Win64, and my co-workers found solutions for Linux and Mac OS X.
The result, after a week of researching OS-specific methods to detect stack range, alignment requirements, and doing a lot of testing and profiling, was an allocator that could allocate memory in 4 clock cycles according to our tick counter benchmarks as opposed to about 400 cycles for malloc/operator new (our test involved thread contention so malloc is likely to be a bit faster than this in single-threaded cases, perhaps a couple of hundred cycles). We added a per-thread heap stack and detected which thread was being used which increased the time to about 12 cycles, though the client can keep track of the thread allocator to get the 4 cycle allocations. It wiped out memory allocation based hotspots off the map.
While you don't have to go through all that trouble, writing a fast allocator might be easier and more generally applicable (ex: allowing the amount of memory to allocate/deallocate to be determined at runtime) than something like max_align here. max_align is easy enough to use, but if you're after speed for memory allocations (and assuming you've already profiled your code and found hotspots in malloc/free/operator new/delete with major contributors being in code you have control over), writing your own allocator can really make the difference.
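Not the allocator described above, but a minimal sketch of the pointer-bump idea it builds on (a fixed arena with strictly LIFO deallocation):

#include <cstddef>

class bump_arena {
    alignas(std::max_align_t) unsigned char buf_[1 << 16];   // preallocated arena
    std::size_t top_ = 0;

    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) & ~(a - 1);   // keep every block maximally aligned
    }

public:
    void* allocate(std::size_t n) {
        n = round_up(n);
        if (top_ + n > sizeof(buf_)) return nullptr;   // arena exhausted
        void* p = buf_ + top_;
        top_ += n;                                     // "increment a pointer"
        return p;
    }
    void deallocate(std::size_t n) {                   // LIFO order only
        top_ -= round_up(n);                           // "decrement a pointer"
    }
};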
Short of some maximally_aligned_t type that all compilers promised faithfully to support for all architectures everywhere, I don't see how this could be solved at compile time. As you say, the set of potential types is unbounded. Is the extra pointer indirection really that big a deal?
Allocating aligned memory is trickier than it looks - see for example Implementation of aligned memory allocation
This is what I'm using. In addition, if you're allocating memory, then a new'd array of char with length greater than or equal to max_alignment will be aligned to max_alignment, so you can use indices into that array to get aligned addresses.
enum {
    max_alignment = boost::mpl::deref<
        boost::mpl::max_element<
            boost::mpl::vector<
                boost::mpl::int_<boost::alignment_of<signed char>::value>::type,
                boost::mpl::int_<boost::alignment_of<short int>::value>::type,
                boost::mpl::int_<boost::alignment_of<int>::value>::type,
                boost::mpl::int_<boost::alignment_of<long int>::value>::type,
                boost::mpl::int_<boost::alignment_of<float>::value>::type,
                boost::mpl::int_<boost::alignment_of<double>::value>::type,
                boost::mpl::int_<boost::alignment_of<long double>::value>::type,
                boost::mpl::int_<boost::alignment_of<void*>::value>::type
            >::type
        >::type
    >::type::value
};