Determining maximum possible alignment in C++ - c++

Is there any portable way to determine what the maximum possible alignment for any type is?
For example on x86, SSE instructions require 16-byte alignment, but as far as I'm aware, no instructions require more than that, so any type can be safely stored into a 16-byte aligned buffer.
I need to create a buffer (such as a char array) where I can write objects of arbitrary types, and so I need to be able to rely on the beginning of the buffer to be aligned.
If all else fails, I know that allocating a char array with new is guaranteed to have maximum alignment, but with the TR1/C++0x templates alignment_of and aligned_storage, I am wondering if it would be possible to create the buffer in-place in my buffer class, rather than requiring the extra pointer indirection of a dynamically allocated array.
Ideas?
I realize there are plenty of options for determining the max alignment for a bounded set of types: A union, or just alignment_of from TR1, but my problem is that the set of types is unbounded. I don't know in advance which objects must be stored into the buffer.

In C++11, std::max_align_t, defined in the header <cstddef>, is a POD type whose alignment requirement is at least as strict (as large) as that of every scalar type.
Using the new alignof operator, it would be as simple as alignof(std::max_align_t).
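As a rough illustration of how this could be used for the in-place buffer the question asks about, here is a minimal sketch. InplaceBuffer and is_aligned are hypothetical names invented for this example, and the buffer only covers types whose alignment does not exceed alignof(std::max_align_t), so over-aligned SIMD types are deliberately rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical in-place buffer: the storage lives inside the object, so no
// pointer indirection is needed. It is aligned for every scalar type, but
// not necessarily for over-aligned types such as SSE/AVX vectors.
template <std::size_t Capacity>
class InplaceBuffer {
public:
    // Copy-constructs a T at the start of the buffer and returns a pointer
    // to it. The caller is responsible for running the destructor.
    template <typename T>
    T* emplace(const T& value) {
        static_assert(sizeof(T) <= Capacity, "object does not fit");
        static_assert(alignof(T) <= alignof(std::max_align_t),
                      "over-aligned types need a stricter buffer");
        return ::new (static_cast<void*>(storage_)) T(value);
    }

    const void* data() const { return storage_; }

private:
    alignas(std::max_align_t) unsigned char storage_[Capacity];
};

// True if the numeric value of p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For example, InplaceBuffer<64> buf; double* d = buf.emplace(3.5); stores the double directly inside buf.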

In C++0x, the Align template parameter of std::aligned_storage<Len, Align> has a default argument of "default-alignment," which is defined as (N3225 §20.7.6.6 Table 56):
The value of default-alignment shall be the most stringent alignment requirement for any C++ object type whose size is no greater than Len.
It isn't clear whether SSE types would be considered "C++ object types."
The default argument wasn't part of the TR1 aligned_storage; it was added for C++0x.

Unfortunately ensuring max alignment is a lot tougher than it should be, and there are no guaranteed solutions AFAIK. From the GotW blog (Fast Pimpl article):
union max_align {
short dummy0;
long dummy1;
double dummy2;
long double dummy3;
void* dummy4;
/*...and pointers to functions, pointers to
member functions, pointers to member data,
pointers to classes, eye of newt, ...*/
};
union {
max_align m;
char x_[sizeofx];
};
This isn't guaranteed to be fully portable, but in practice it's close enough because there are few or no systems on which this won't work as expected.
That's about the closest 'hack' I know for this.
There is another approach that I've used personally for super fast allocation. Note that it is evil, but I work in raytracing fields where speed is one of the greatest measures of quality and we profile code on a daily basis. It involves using a heap allocator with pre-allocated memory that works like the local stack (just increments a pointer on allocation and decrements one on deallocation).
I use it for Pimpls particularly. However, just having the allocator is not enough; for such an allocator to work, we have to assume that memory for a class, Foo, is allocated in a constructor, the same memory is likewise deallocated only in the destructor, and that Foo itself is created on the stack. To make it safe, I needed a function to see if the 'this' pointer of a class is on the local stack to determine if we can use our super fast heap-based stack allocator. For that we had to research OS-specific solutions: I used TIBs and TEBs for Win32/Win64, and my co-workers found solutions for Linux and Mac OS X.
The result, after a week of researching OS-specific methods to detect stack range and alignment requirements, and doing a lot of testing and profiling, was an allocator that could allocate memory in 4 clock cycles according to our tick counter benchmarks, as opposed to about 400 cycles for malloc/operator new (our test involved thread contention, so malloc is likely to be a bit faster than this in single-threaded cases, perhaps a couple of hundred cycles). We added a per-thread heap stack and detected which thread was being used, which increased the time to about 12 cycles, though the client can keep track of the thread allocator to get the 4 cycle allocations. It wiped memory-allocation hotspots off the map.
While you don't have to go through all that trouble, writing a fast allocator might be easier and more generally applicable (e.g., allowing the amount of memory to allocate/deallocate to be determined at runtime) than something like max_align here. max_align is easy enough to use, but if you're after speed for memory allocations (and assuming you've already profiled your code and found hotspots in malloc/free/operator new/delete with major contributors being in code you have control over), writing your own allocator can really make the difference.
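The "increment a pointer on allocation, decrement on deallocation" idea can be sketched roughly as follows. This is not the allocator described above (no stack detection, no per-thread heaps); StackArena is a hypothetical name, and the sketch assumes strictly LIFO deallocation and single-threaded use.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical LIFO arena over a fixed, pre-allocated block. allocate() bumps
// an offset; deallocate() must be called in reverse order of allocation.
// Every block is rounded up to alignof(std::max_align_t). Not thread-safe.
template <std::size_t Capacity>
class StackArena {
public:
    void* allocate(std::size_t bytes) {
        const std::size_t rounded = round_up(bytes);
        // Reject on arithmetic overflow or when the arena is exhausted.
        if (rounded < bytes || rounded > Capacity - top_) return nullptr;
        void* p = storage_ + top_;
        top_ += rounded;
        return p;
    }

    void deallocate(void* p, std::size_t bytes) {
        const std::size_t rounded = round_up(bytes);
        // Only the most recent allocation may be released.
        assert(static_cast<unsigned char*>(p) + rounded == storage_ + top_);
        top_ -= rounded;
    }

    std::size_t used() const { return top_; }

private:
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }

    alignas(std::max_align_t) unsigned char storage_[Capacity];
    std::size_t top_ = 0;
};
```

Both operations are a handful of arithmetic instructions, which is where the speed comes from; the price is the LIFO restriction.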

Short of some maximally_aligned_t type that all compilers promised faithfully to support for all architectures everywhere, I don't see how this could be solved at compile time. As you say, the set of potential types is unbounded. Is the extra pointer indirection really that big a deal?

Allocating aligned memory is trickier than it looks - see for example Implementation of aligned memory allocation

This is what I'm using. In addition to this, if you're allocating memory then a new()'d array of char with length greater than or equal to max_alignment will be aligned to max_alignment so you can then use indexes into that array to get aligned addresses.
enum {
max_alignment = boost::mpl::deref<
boost::mpl::max_element<
boost::mpl::vector<
boost::mpl::int_<boost::alignment_of<signed char>::value>::type,
boost::mpl::int_<boost::alignment_of<short int>::value>::type,
boost::mpl::int_<boost::alignment_of<int>::value>::type,
boost::mpl::int_<boost::alignment_of<long int>::value>::type,
boost::mpl::int_<boost::alignment_of<float>::value>::type,
boost::mpl::int_<boost::alignment_of<double>::value>::type,
boost::mpl::int_<boost::alignment_of<long double>::value>::type,
boost::mpl::int_<boost::alignment_of<void*>::value>::type
>::type
>::type
>::type::value
};
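For comparison, with a C++11 compiler the same "largest alignment among a bounded list of types" computation can be written without Boost. This is only a sketch: max_alignment_of is a hypothetical helper name, and like the enum above it covers only the types listed, not an unbounded set.

```cpp
#include <cstddef>

// Largest alignof among a list of types, computed at compile time (C++11).
template <typename... Ts>
struct max_alignment_of;

// Base case: a single type.
template <typename T>
struct max_alignment_of<T> {
    static constexpr std::size_t value = alignof(T);
};

// Recursive case: compare the first type with the maximum of the rest.
template <typename T, typename... Rest>
struct max_alignment_of<T, Rest...> {
    static constexpr std::size_t value =
        alignof(T) > max_alignment_of<Rest...>::value
            ? alignof(T)
            : max_alignment_of<Rest...>::value;
};

// Same type list as the Boost.MPL version.
constexpr std::size_t max_alignment =
    max_alignment_of<signed char, short int, int, long int,
                     float, double, long double, void*>::value;
```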

Related

Utilize memory past the end of a std::vector using a custom overallocating allocator

Let's say I have an allocator my_allocator that will always allocate memory for n+x (instead of n) elements when allocate(n) is called.
Can I safely assume that memory in the range [data()+n, data()+n+x) (for a std::vector<T, my_allocator<T>>) is accessible/valid for further use (i.e. placement new, or SIMD loads/stores in the case of fundamentals), as long as there is no reallocation?
Note: I'm aware that everything past data()+n-1 is uninitialized storage. The use case would be a vector of fundamental types (which do not have a constructor anyway) using the custom allocator to avoid having special corner cases when throwing simd intrinsics at the vector. my_allocator shall allocate storage that is 1.) properly aligned and has 2.) a size that is a multiple of the used register size.
To make things a little bit more clear:
Let's say I have two vectors and I want to add them:
std::vector<double, my_allocator<double>> a(n), b(n);
// fill them ...
auto c = a + b;
assert(c.size() == n);
If the storage obtained from my_allocator now allocates aligned storage and if sizeof(double)*(n+x) is always a multiple of the used simd register size (and thus a multiple of the number of values per register) I assume that I can do something like
for(size_t i=0u; i<(n+x); i+=y)
{ // where y is the number of doubles per register and a divisor of (n+x)
auto ma = _aligned_load(a.data() + i);
auto mb = _aligned_load(b.data() + i);
_aligned_store(c.data() + i, _simd_add(ma, mb));
}
where I don't have to care about any special case like unaligned loads or leftover elements from some n that is not divisible by y.
But still the vectors only contain n values and can be handled like vectors of size n.
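To make the setup concrete, here is one way such an allocator might look. It is a sketch under assumptions, not the asker's my_allocator: PaddedAllocator and padded_bytes are hypothetical names, padding is expressed in bytes (a multiple of Align) rather than as a fixed x, there is no overflow check on the size computation, and it only shows how the storage is obtained. Whether the padding may actually be written to is the open question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical allocator: storage is aligned to Align bytes and its size is
// rounded up to a multiple of Align. Align must be a power of two.
template <typename T, std::size_t Align = 32>
struct PaddedAllocator {
    static_assert((Align & (Align - 1)) == 0 && Align >= alignof(void*),
                  "Align must be a power of two, at least pointer-aligned");
    using value_type = T;
    template <typename U>
    struct rebind { using other = PaddedAllocator<U, Align>; };

    PaddedAllocator() = default;
    template <typename U>
    PaddedAllocator(const PaddedAllocator<U, Align>&) noexcept {}

    // Bytes actually reserved for a request of n elements.
    static std::size_t padded_bytes(std::size_t n) {
        return (n * sizeof(T) + Align - 1) / Align * Align;
    }

    T* allocate(std::size_t n) {
        // Slack for alignment plus one pointer to remember malloc's result.
        const std::size_t total = padded_bytes(n) + Align - 1 + sizeof(void*);
        void* raw = std::malloc(total);
        if (raw == nullptr) throw std::bad_alloc();
        std::uintptr_t addr =
            reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        addr = (addr + Align - 1) & ~static_cast<std::uintptr_t>(Align - 1);
        void** aligned = reinterpret_cast<void**>(addr);
        aligned[-1] = raw;  // stored just before the block handed out
        return reinterpret_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t) noexcept {
        assert(p != nullptr);
        std::free(reinterpret_cast<void**>(p)[-1]);
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const PaddedAllocator<T, A>&, const PaddedAllocator<U, A>&) {
    return true;
}
template <typename T, typename U, std::size_t A>
bool operator!=(const PaddedAllocator<T, A>&, const PaddedAllocator<U, A>&) {
    return false;
}
```

With std::vector<double, PaddedAllocator<double, 32>> v(5);, a request for 40 bytes reserves 64, and v.data() is 32-byte aligned.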
Stepping back a moment, if the problem you are trying to solve is to allow the underlying memory to be processed effectively by SIMD intrinsics or unrolled loops, or both, you don't necessarily need to allocate memory beyond the used amount just to "round off" the allocation size to a multiple of vector width.
There are various approaches used to handle this situation, and you mentioned a couple, such as special lead-in and lead-out code to handle the leading and trailing portions. There are actually two distinct problems here - handling the fact the data isn't a multiple of the vector width, and handling (possibly) unaligned starting addresses. Your over-allocation method is tackling the first issue - but there's probably a better way...
Most SIMD code in practice can simply read beyond the end of the processed region. Some might argue that this is technically UB - but when using SIMD intrinsics you are already venturing beyond the walls of Standard C++. In fact, this technique is already widely used in the standard library and so it is implicitly endorsed by compiler and library maintainers. It is also a standard method for handling SIMD codes in general, so you can be pretty sure it's not going to suddenly break.
The key to making it work is the observation that if you can validly read even a single byte at some location N, then any naturally aligned read of any size [1] that contains N won't trigger a fault. Of course, you still need to ignore or otherwise handle the data you read beyond the end of the officially allocated area - but you'll need to do that anyway with your "allocate extra" approach, right? Depending on the algorithm, you may mask away the invalid data, or exclude invalid data after the SIMD portion is done (i.e., if you are searching for a byte, if you find a byte after the allocated area, it's the same as "not found").
To make this work, you need to be reading in an aligned fashion, but that's probably something you already want to do I think. You can either arrange to have your memory allocated aligned in the first place, or do an overlapping read at the start (i.e., one unaligned read first, then all aligned with the first aligned read overlapping the unaligned portion), or use the same trick as the tail to read before the array (with the same reasoning as to why this is safe). Furthermore, there are various tricks to request aligned memory without needing to write your own allocator.
Overall, my recommendation is to try to avoid writing a custom allocator. Unless the code is fairly tightly contained, you may run into various pitfalls, including other code making wrong assumptions about how your memory was allocated and the various other pitfalls Leon mentions in his answer. Furthermore, using a custom allocator disables a bunch of optimizations used by the standard container algorithms, unless you use it everywhere, since many of them apply only to containers using the same allocator.
Furthermore, when I was actually implementing custom allocators [2], I found that it was a nice idea in theory, but a bit too obscure to be well-supported in an identical fashion across all the compilers. Now the compilers have become a lot more compliant over time (I'm looking mostly at you, Visual Studio), and template support has also improved, so perhaps that's not an issue, but I feel it still falls into the category of "do it only if you must".
Keep in mind also that custom allocators don't compose well - you only get the one! If someone else on your project wants to use a custom allocator for your container for some other reason, they won't be able to do it (although you could coordinate and create a combined allocator).
This question I asked earlier - also motivated by SIMD - covers a lot of the ground about the safety of reading past the end (and, implicitly, before the beginning), and is probably a good place to start if you are considering this.
[1] Technically, the restriction is any aligned read up to the page size, which at 4K or larger is plenty for any of the current vector-oriented general purpose ISAs.
[2] In this case, I was doing it not for SIMD, but basically to avoid malloc() and to allow partially on-stack and contiguous fast allocations for containers with many small nodes.
For your use case you shouldn't have any doubts. However, if you decide to store anything useful in the extra space and will allow the size of your vector to change during its lifetime, you will probably run into problems dealing with the possibility of reallocation - how are you going to transfer the extra data from the old allocation to the new allocation given that reallocation happens as a result of separate calls to allocate() and deallocate() with no direct connection between them?
EDIT (addressing the code added to the question)
In my original answer I meant that you shouldn't have any problem accessing the extra bytes allocated by your allocator in excess of what was requested. However, writing data to memory that lies outside the range currently used by the vector object, but inside the range spanned by the unmodified allocation, is asking for trouble. An implementation of std::vector is free to request more memory from the allocator than it exposes through its size()/capacity() functions and to store auxiliary data in the unused area. Though this is highly theoretical, not accounting for that possibility means opening a door into undefined behavior.
Consider the following possible layout of the vector's allocation:
---====================++++++++++------.........
=== - used capacity of the vector
+++ - unused capacity of the vector
--- - overallocated by the vector (but not shown as part of its capacity)
... - overallocated by your allocator
You MUST NOT write anything in regions 2 (+++) and 3 (---). All your writes must be confined to region 4 (...); otherwise you may corrupt important bits.

Is std::make_unique<T[]> required to return aligned memory?

Is the memory owned by the unique pointer array_ptr:
auto array_ptr = std::make_unique<double[]>(size);
aligned to an alignof(double) boundary (i.e. is it required by the standard to be correctly aligned)?
Is the first element of the array the first element of a cache line?
Otherwise: what is the correct way of achieving this in C++14?
Motivation (update): I plan to use SIMD instructions on the array and since cache lines are the basic unit of memory on every single architecture that I know of I'd rather just allocate memory correctly such that the first element of the array is at the beginning of a cache line. Note that SIMD instructions work as long as the elements are correctly aligned (independently of the position of the elements between cache lines). However, I don't know if that has an influence at all but I can guess that yes, it does. Furthermore, I want to use these SIMD instructions on my raw memory inside a kernel. It is an optimization detail of a kernel so I don't want to allocate e.g. __int128 instead of int.
All objects that you obtain "normally" are suitably aligned, i.e. aligned at alignof(T) (which need not be the same as sizeof(T)). That includes dynamic arrays. (Typically, the allocator ::operator new will just return a maximally aligned address so as not to have to worry about how the memory is used.)
There are no cache lines in C++. This is a platform specific issue that you need to deal with yourself (but alignas may help).
Try alignas plus a static check if it works (since support for over-aligned types is platform dependent), otherwise just add manual padding. You don't really care whether your data is at the beginning of a cache line, only that no two data elements are on the same cache line.
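A minimal sketch of "alignas plus a static check" might look like this. The 64-byte line size is an assumption (common on x86-64, but not guaranteed), and kCacheLine, PaddedCounter and check_layout are hypothetical names. Note that dynamically allocating over-aligned types with plain new is only guaranteed to respect the alignment from C++17 onwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; adjust for the target platform.
constexpr std::size_t kCacheLine = 64;

// alignas raises the alignment and pads sizeof up to a multiple of it, so
// consecutive array elements never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    long value;
};

// Static checks: fail the build if the over-alignment is not honoured.
static_assert(alignof(PaddedCounter) == kCacheLine,
              "over-alignment not honoured");
static_assert(sizeof(PaddedCounter) % kCacheLine == 0,
              "elements would share a cache line");

// Runtime sanity check for an array of padded elements.
inline void check_layout(const PaddedCounter* first, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        assert(reinterpret_cast<std::uintptr_t>(first + i) % kCacheLine == 0);
    }
}
```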
It is worth stressing that alignment isn't actually a concept you can check directly in C++, since pointers are not numbers. They are convertible to numbers, but the conversion is not generally meaningful other than being reversible. You need something like std::align to actually say "I have aligned memory", or just use alignas on your types directly.
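As an illustration of std::align (declared in <memory> since C++11), the following sketch carves aligned blocks out of a raw buffer. carve is a hypothetical helper written for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Carves `size` bytes aligned to `alignment` out of [buffer, buffer + space).
// Returns nullptr if the block does not fit (buffer and space are then left
// unchanged); otherwise advances buffer past the block and shrinks space.
inline void* carve(void*& buffer, std::size_t& space,
                   std::size_t size, std::size_t alignment) {
    // std::align requires a power-of-two alignment.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void* p = std::align(alignment, size, buffer, space);
    if (p == nullptr) return nullptr;
    // std::align moved buffer to the aligned address and reduced space by
    // the bytes skipped; now account for the block itself.
    buffer = static_cast<unsigned char*>(p) + size;
    space -= size;
    return p;
}
```

Starting from a deliberately misaligned address such as raw + 1, carve(cur, space, sizeof(double), alignof(double)) returns the first suitably aligned address at or after it.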

Does an allocation hint get used?

I was reading Why is there no reallocation functionality in C++ allocators? and Is it possible to create an array on the heap at run-time, and then allocate more space whenever needed?, which clearly state that reallocation of a dynamic array of objects is impossible.
However, The C++ Standard Library by Josuttis states that the allocator class, allocator, has a member function allocate with the following signature
pointer allocator::allocate(size_type num, allocator<void>::pointer hint = 0)
where the hint has an implementation defined meaning, which may be used to help improve performance.
Are there any implementations that take advantage of this?
I have gained significant performance advantages for iteration times on small scalar types in my plf::colony c++ container using hints with std::allocator under Visual Studio 2010-2013 (iteration speed increased by ~21%), and much smaller speedups under GCC 5.1. So it's safe to say that with those compilers and std::allocator, it makes a difference. But the difference will be compiler-dependent. I am not aware of the ratio of hint-ignoring to hint-observing allocators.
I'm not sure about specific implementations, but note that the allocator isn't allowed to return the hint pointer value before it's been passed to deallocate. So that can't be used as a primitive operation to form a reallocate.
The Standard says the hint must have been returned by a previous call to allocate. It says "The use of [the hint] is unspecified, but it is intended as an aid to locality." So if you're allocating and releasing a sequence of similar-sized blocks on one thread, you might pass the previously-freed value to avoid cache contention between microprocessor caches.
Otherwise, when CPU B sees that you're using memory addresses still in CPU A's cache (even if that memory contains objects that were destroyed according to C++), it must forward the junk data over the bus. Better to let CPU A and B each reuse their own respective cached addresses.
C++11 states, in 20.6.9.1 allocator members:
4 - [ Note: In a container member function, the address of an adjacent element is often a good choice to pass for the hint argument. — end note ]
[...]
6 - [...] The use of hint is unspecified, but intended as an aid to locality if an implementation so desires.
Allocating new elements adjacent or close to existing elements in memory can aid performance by improving locality; because they are usually cached together, nearby elements will tend to travel together up the memory hierarchy and will not evict each other.
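A sketch of how a container might pass such a hint in a portable way is shown below. std::allocator_traits<Alloc>::allocate(a, n, hint) forwards the hint when the allocator accepts one and otherwise falls back to a.allocate(n), so the hint may be silently ignored. allocate_near is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Allocates storage for n objects, passing `neighbour` (a pointer previously
// obtained from the same allocator) as a locality hint. Whether the hint has
// any effect is up to the allocator.
template <typename Alloc>
typename std::allocator_traits<Alloc>::pointer
allocate_near(Alloc& alloc, std::size_t n,
              typename std::allocator_traits<Alloc>::const_void_pointer neighbour) {
    // The result of a zero-sized allocation is unspecified, so reject it here.
    assert(n != 0);
    return std::allocator_traits<Alloc>::allocate(alloc, n, neighbour);
}
```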

stl vector size on 64-bit machines

I have an application that will use millions of vectors.
It appears that most implementations of std::vector use 4 pointers (_First, _Last, _End, and _Alloc), which consumes 32 bytes on 64-bit machines. For most "practical" use cases of vector, one could probably get away with a single pointer and two 'unsigned int' fields to store the current size & allocated size, respectively. Ignoring the potential challenge of supporting customized allocation (instead of assuming that allocations must go through the global new & delete operator), it seems that it is possible to build an STL compliant vector class that uses only 16 bytes (or at worst 24 bytes to support the _Alloc pointer).
Before I start coding this up, 1) are there any pitfalls I should be aware of and 2) does an open source implementation exist?
You could accomplish something like this -- but it's not likely you'd gain all that much.
First, there's the performance aspect. You are trading time for memory consumption. Whatever memory you save is going to be offset by having to do an addition and a multiply on every call to end (okay, if it's a vector where sizeof(vector<T>::value_type) == 1, the multiply can be optimized out). Note that most handwritten looping code over vectors calls end on every loop iteration. On modern CPUs the smaller vector may actually be a major win, because it allows the processor to keep more things in cache, unless those couple of extra instructions in an inner loop force the processor to swap things in the instruction cache too often.
Moreover, the memory savings is likely to be small in terms of the overall memory use in the vector, for the following reasons:
Memory manager overhead. Each allocation from the memory manager (which vector of course needs) is going to add 16-24 bytes of overhead on its own in most memory manager implementations (assuming something like dlmalloc (UNIX/Linux/etc.) or RtlHeap (Windows)).
Overprovisioning load. In order to achieve amortized constant insertion and removal at the end, when vector resizes, it resizes to some multiple of the size of the data in the vector. This means that the capacity a vector allocates can be enough for up to 1.5 (MSVC++) or 2 (STLPort, libstdc++) times the number of elements actually stored in the vector.
Alignment restrictions. If you are putting those many vectors into an array (or another vector), then keep in mind the first member of that vector is still a pointer to the allocated memory block. This pointer generally needs to be 8 byte aligned anyway -- so the 4 bytes you save are lost to structure padding in arrays.
I'd use the plain implementation of vector for now. If you run your code through a memory profiler and find that a significant savings would be made by getting rid of these couple of pointers, then you're probably better off implementing your own optimized class which meets your performance characteristics rather than relying on the built-in vector implementation. (An example of one such optimized class is std::string on those platforms that implement the small string optimization.)
(Note: the only compiler of which I am aware that optimizes out the Alloc pointer is VC11, which isn't yet released. Though Nim says that the current prerelease version of libstdc++ does it as well...)
Unless these vectors are going to have contents that are extremely small, the difference between 16 vs. 32 bytes to hold the contents will be a small percentage of the total memory consumed by them. It will require lots of effort to reinvent this wheel, so be sure you're getting an adequate pay off for all that work.
BTW, there's value in education too, and you will learn a lot by doing this. If you choose to proceed, you might consider writing a test suite first, and exercise it on the current implementation and then on the one you invent.
To answer whether it is worth the effort, find or write a compatible implementation that fits your needs (maybe there are other things in std::vector that you do not need), and compare the performance with std::vector<your_type> on relevant platforms. Your suggestion can at least improve the performance of the move constructor, as well as the move assignment operator:
typedef int32_t v4si __attribute__ ((vector_size (16)));
union
{
v4si data;
struct
{
T* pointer;
uint32_t length;
uint32_t capacity;
} content;
} m_data;
This only covers "sane" T (noexcept move semantics). https://godbolt.org/g/d5yU3o
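To put numbers on the layouts being compared, here is a small sketch. CompactHeader, ThreePointerHeader and narrow_size are hypothetical names; on a typical 64-bit platform the compact layout should come to 16 bytes against 24 for three pointers, but that is an assumption about the ABI rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// One pointer plus two 32-bit counters, as proposed in the question.
template <typename T>
struct CompactHeader {
    T* data;
    std::uint32_t size;
    std::uint32_t capacity;
};

// The conventional begin / end / end-of-storage layout, without the allocator.
template <typename T>
struct ThreePointerHeader {
    T* first;
    T* last;
    T* end;
};

// Pitfall of the compact layout: element counts above 2^32 - 1 cannot be
// represented, so every size has to be range-checked before it is stored.
inline std::uint32_t narrow_size(std::size_t n) {
    assert(n <= std::numeric_limits<std::uint32_t>::max());
    return static_cast<std::uint32_t>(n);
}
```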

C++ Memory alignment in custom stack allocator

Usually data is aligned at power of two addresses depending on its size.
How should I align a struct or class with size of 20 bytes or another non-power-of-two size?
I'm creating a custom stack allocator, so I guess that the compiler won't align data for me since I'm working with a contiguous block of memory.
Some more context:
I have an Allocator class that uses malloc() to allocate a large amount of data.
Then I use a void* allocate(U32 size_of_object) method to return a pointer to where I can store whatever objects I need to store.
This way all objects are stored in the same region of memory and it will hopefully fit in the cache reducing cache misses.
C++11 has the alignof operator specifically for this purpose. Don't use any of the tricks mentioned in other posts, as they all have edge cases or may fail for certain compiler optimisations. The alignof operator is implemented by the compiler and knows the exact alignment being used.
See this description of c++11's new alignof operator
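For the 20-byte case from the question, alignof shows that the required alignment comes from the members, not from the total size. The sketch below uses a hypothetical Vec5 type and padding_before helper, and assumes the usual layout in which an array of five ints has no trailing padding.

```cpp
#include <cstddef>

// Hypothetical 20-byte type (on platforms where int is 4 bytes).
struct Vec5 {
    int v[5];
};

static_assert(sizeof(Vec5) == 5 * sizeof(int), "no padding expected");
static_assert(alignof(Vec5) == alignof(int),
              "alignment is that of the strictest member, not the size");

// Bytes to skip so that an object of type T can start at or after `offset`
// within a suitably aligned block.
template <typename T>
constexpr std::size_t padding_before(std::size_t offset) {
    return (alignof(T) - offset % alignof(T)) % alignof(T);
}
```

A stack allocator can therefore place a Vec5 at any offset that is a multiple of alignof(Vec5); it does not need a multiple of 20 or of the next power of two.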
Although the compiler (or interpreter) normally allocates individual data items on aligned boundaries, data structures often have members with different alignment requirements. To maintain proper alignment the translator normally inserts additional unnamed data members so that each member is properly aligned. In addition the data structure as a whole may be padded with a final unnamed member. This allows each member of an array of structures to be properly aligned. http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86
This says that the compiler takes care of it for you, 99.9% of the time. As for how to force an object to align a specific way, that is compiler specific, and only works in certain circumstances.
MSVC: http://msdn.microsoft.com/en-us/library/83ythb65.aspx
// The alignment must be a power of two, so 20 is rejected; 32 is the next valid value.
__declspec(align(32))
struct S{ int a, b, c, d; };
GCC: http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Type-Attributes.html
struct S{ int a, b, c, d; }
__attribute__ ((aligned (32))); // must be a power of two, so 20 is not accepted
I don't know of a cross-platform way (including macros!) to do this, but there's probably a neat macro somewhere.
Unless you want to access memory directly, or squeeze the maximum amount of data into a block of memory, you don't need to worry about alignment -- the compiler takes care of that for you.
Due to the way processor data buses work, what you want to avoid is 'mis-aligned' access. Usually you can read a 32 bit value in a single access from addresses which are multiples of four; if you try to read it from an address that's not such a multiple, the CPU may have to grab it in two or more pieces. So if you're really worrying about things at this level of detail, what you need to be concerned about is not so much the overall struct, as the pieces within it. You'll find that compilers will frequently pad out structures with dummy bytes to ensure aligned access, unless you specifically force them not to with a pragma.
Since you've now added that you actually want to write your own allocator, the answer is straight-forward: Simply ensure that your allocator returns a pointer whose value is a multiple of the requested size. The object's size itself will already come suitably adjusted (via internal padding) so that all member objects themselves are properly aligned, so if you request sizeof(T) bytes, all your allocator needs to do is to return a pointer whose value is divisible by sizeof(T).
If your object does indeed have size 20 (as reported by sizeof), then you have nothing further to worry about. (On a 64-bit platform, the object would probably be padded to 24 bytes.)
Update: In fact, as I only now came to realize, strictly speaking you only need to ensure that the pointer is aligned, recursively, for the largest member of your type. That may be more efficient, but aligning to the size of the entire type is definitely not getting it wrong.
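One way to apply this in an allocator is to have the caller pass the alignment explicitly and round the current position up to it. The following is a sketch under assumptions, not the asker's Allocator class: align_up and LinearAllocator are hypothetical names, alignments are assumed to be powers of two, and there is no deallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds addr up to the next multiple of alignment (a power of two).
inline std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~static_cast<std::uintptr_t>(alignment - 1);
}

// Hypothetical allocator handing out pieces of a caller-supplied block.
class LinearAllocator {
public:
    LinearAllocator(void* block, std::size_t size)
        : cur_(reinterpret_cast<std::uintptr_t>(block)), end_(cur_ + size) {}

    // Returns `size` bytes aligned to `alignment`, or nullptr if they don't fit.
    void* allocate(std::size_t size, std::size_t alignment) {
        const std::uintptr_t start = align_up(cur_, alignment);
        if (start < cur_ || start > end_ || size > end_ - start) return nullptr;
        cur_ = start + size;
        return reinterpret_cast<void*>(start);
    }

    // Convenience wrapper: lets the compiler supply sizeof and alignof.
    template <typename T>
    void* allocate_for() { return allocate(sizeof(T), alignof(T)); }

private:
    std::uintptr_t cur_;  // next free address
    std::uintptr_t end_;  // one past the end of the block
};
```

With this shape, a 20-byte object of five ints is placed on a 4-byte boundary, while a double placed after it is moved forward to a boundary of alignof(double).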
How should I align a struct or class with size of 20 bytes or another non-power-of-two size?
Alignment is CPU-specific, so there is no answer to this question without, at least, knowing the target CPU.
Generally speaking, alignment isn't something that you have to worry about; your compiler will have the rules implemented for you. It does come up once in a while, like when writing an allocator. The classic solution is discussed in The C Programming Language (K&R): use the worst possible alignment. malloc does this, although it's phrased as, "the pointer returned if the allocation succeeds shall be suitably aligned so that it may be assigned to a pointer to any type of object."
The way to do that is to use a union (the elements of a union are all allocated at the union's base address, and the union must therefore be aligned in such a way that each element could exist at that address; i.e., the union's alignment will be the same as the alignment of the element with the strictest rules):
typedef long Align;
union header {
// the inner struct has the important bookkeeping info
struct {
unsigned size;
header* next;
} s;
// the align member only exists to make sure headers are always allocated
// using the alignment of a long, which is probably the worst alignment
// for the target architecture ("worst" == "strictest"; something that meets
// the worst alignment will also meet all better alignment requirements)
Align align;
};
Memory is allocated by creating an array (using something like sbrk()) of headers large enough to satisfy the request, plus one additional header element that actually contains the bookkeeping information. If the array is called arry, the bookkeeping information is at arry[0], while the pointer returned points at arry[1] (the next element is meant for walking the free list).
This works, but can lead to wasted space ("In Sun's HotSpot JVM, object storage is aligned to the nearest 64-bit boundary"). I'm aware of a better approach that tries to get a type-specific alignment instead of "the alignment that will work for anything."
Compilers also often have compiler-specific commands. They aren't standard, and they require that you know the correct alignment requirements for the types in question. I would avoid them.