Variable Length Array overhead in C++?

Variable Length Array overhead in C++? - c++

Looking at this question: Why does a C/C++ compiler need know the size of an array at compile time ? it came to me that compiler implementers should have had some times to get their feet wet now (it's part of C99 standard, that's 10 years ago) and provide efficient implementations.
However it still seems (from the answers) to be considered costly.
This somehow surprises me.
Of course, I understand that a static offset is much better than a dynamic one in terms of performance, and unlike one suggestion I would not actually have the compiler perform a heap allocation of the array since this would probably cost even more [this has not been measured ;)]
But I am still surprised at the supposed cost:
if there is no VLA in a function, then there would not be any cost, as far I can see.
if there is one single VLA, then one can either put it before or after all the variables, and therefore get a static offset for most of the stack frame (or so it seems to me, but I am not well-versed in stack management)
The question arise of multiple VLAs of course, and I was wondering if having a dedicated VLA stack would work. This means than a VLA would be represented by a count and a pointer (of known sizes therefore) and the actual memory taken in an secondary stack only used for this purpose (and thus really a stack too).
[rephrasing]
How VLAs are implemented in gcc / VC++ ?
Is the cost really that impressive ?
[end rephrasing]
It seems to me it can only be better than using, say, a vector, even with present implementations, since you do not incur the cost of a dynamic allocation (at the cost of not being resizable).
EDIT:
There is a partial response here, however comparing VLAs to traditional arrays seem unfair. If we knew the size beforehand, then we would not need a VLA. In the same question AndreyT gave some pointers regarding the implementation, but it's not as precise as I would like.

How VLAs are implemented in gcc / VC++ ?
AFAIK VC++ doesn't implement VLA. It's a C++ compiler and it supports only C89 (no VLA, no restrict). I don't know how gcc implements VLAs but the fastest possible way is to store the pointer to the VLA and its size in the static portion of the stack-frame. This way you can access one of the VLAs with performance of a constant-sized array (it's the last VLA if the stack grows downwards like in x86 (dereference [stack pointer + index*element size + the size of last temporary pushes]), and the first VLA if it grows upwards (dereference [stackframe pointer + offset from stackframe + index*element size])). All the other VLAs will need one more indirection to get their base address from the static portion of the stack.
[ Edit: Also when using VLA the compiler can't omit stack-frame-base pointer, which is redundant otherwise, because all the offsets from the stack pointer can be calculated during compile time. So you have one less free register. — end edit ]
Is the cost really that impressive ?
Not really. Moreover, if you don't use it, you don't pay for it.
[ Edit: Probably a more correct answer would be: Compared to what? Compared to a heap allocated vector, the access time will be the same but the allocation and deallocation will be faster. — end edit ]

If it were to be implemented in VC++, I would assume the compiler team would use some variant of _alloca(size). And I think the cost is equivalent to using variables with greater than 8-byte alignment on the stack (such as __m128); the compiler has to store the original stack pointer somewhere, and aligning the stack requires an extra register to store the unaligned stack.
So the overhead is basically an extra indirection (you have to store the address of VLA somewhere) and register pressure due to storing the original stack range somewhere as well.

Related

Why is it impossible to allocate an array of an arbitrary size on the stack?

Why can't I write the following?
char acBuf[nSize];
Only to prevent the stack from overgrowing?
Or is there a possibility to do something similar, if I can ensure that I always take just a few hundred kilobytes?
As far as I know, the std::string uses the memory of its members to store the assigned strings, as long as they are 15 characters or less. Only if the strings are longer, it uses this memory to store the address of some heap-allocated memory, which then takes the data.
It seems like it has to be 100%ly determined, during compile-time, how the stack will be aligned during runtime. Is that true? Why is that?

It has nothing to do with preventing stack overflow, you can overflow the stack just fine with char a[SOME_LARGE_CONSTANT]. In C++ the array size has to be known at compile time, this is among other things needed to compute the size of structures containing arrays.
C on the other hand had Variable Length Arrays since C99, which adds an exception and allow runtime dependant size for arrays within function scope. As to why C++ does not have this? It was never adopted by a C++ standard.

Why can't I write the following?
char acBuf[nSize];
Those are called Variable Length Arrays (VLA's) and aren't supported by C++. The reason being that the stack is very fast but tiny compared to the free store (the heap in your words). Which means that at any moment that you add lots of elements to a VLA your stack might just overflow and you get a vague runtime exception. This can also happen with compile-time sized stack arrays but these are way easier to catch because the behaviour of the program doesn't influence their size. Which means that x doesn't have to happen after y to create a stack overflow, it's just there right off the bat. This covers it in more detail and rage.
Containers like std::vector use the free store which is way bigger and has a way to deal with over-allocation (throws bad_alloc).

Unlike C, C++ doesn't support variable length arrays. If you want them, you can use non-standard extensions such as alloca or GNU extensions (supported by clang and GCC). They have their caveats, so be sure to read the manual to make sure you use them safely.
The reason the stack layout is mostly determined statically is so that the generated code has to perform fewer computations (additions, multiplications, and pointer dereferencing) to figure out where the data is on the stack. The offsets can instead be hardcoded into the generated machine code.

My advice is to take a look at alloca.h
void *alloca(size_t size);
The alloca() function allocates size bytes of space in the stack
frame of the caller. This temporary space is automatically freed
when the function that called alloca() returns to its caller.

One possible problem I see with VLA in C++ is the type.
What is the type of acBuf in char acBuf[nSize] or even worse in char acBuf[nSize][nSize] ?
template <typename T> void foo(const T&);
void foo(int n)
{
char mat[n][n];
foo(mat);
}
You cannot pass that array by reference to
template <typename T, std::size_t N>
void foo_on_array(const T (&a)[N]);

You should be happy that the C++ standard discourages a dangerous practice (variable-length arrays on the stack) and instead encourages a less dangerous practice (variable-length arrays and std::vector with heap allocation).
Variable-length arrays on the stack are more dangerous because:
The available stack space is typically 8 MB, much smaller than the 2 GB (or more) of available heap space.
When the stack space is exhausted, the program crashes with SIGSEGV, and it requires special software such as GNU libsigsegv to recover from such a situation.
In typical C++ programs, the programmer does not know whether the array length will definitely stay under a limit such as 4 MB.

Why can't I write the following?
char acBuf[nSize];
You can't do that because in C++ the lenght of the array has to be known at compile time, that's because the compiler reserves the specified memory for the array and it can not be modified during runtime. it's not about prevent a stack overflow, it's about memory layout.
If you want to make a dynamic array you should use the new operator so it will be stored in heap.
char *acBuf = new char[nsize];

strict aliasing and memory alignment

I have performance critical code and there is a huge function that allocates like 40 arrays of different size on the stack at the beginning of the function. Most of these arrays have to have certain alignment (because these arrays are accessed somewhere else down the chain using cpu instructions that require memory alignment (for Intel and arm CPUs).
Since some versions of gcc simply fail to align stack variables properly (notably for arm code), or even sometimes it says that maximum alignment for the target architecture is less than what my code actually requests, I simply have no choice but to allocate these arrays on the stack and align them manually.
So, for each array I need to do something like that to get it aligned properly:
short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));
This way, history is now aligned on 32-byte boundary. Doing the same is tedious for all 40 arrays, plus this part of code is really cpu intensive and I simply cannot do the same alignment technique for each of the arrays (this alignment mess confuses the optimizer and different register allocation slows down the function big time, for better explanation see explanation at the end of the question).
So... obviously, I want to do that manual alignment only once and assume that these arrays are located one right after the other. I also added extra padding to these arrays so that they are always multiple of 32 bytes. So, then I simply create a jumbo char array on the stack and cast it to a struct that has all these aligned arrays:
struct tmp
{
short history[HIST_SIZE];
short history2[2*HIST_SIZE];
...
int energy[320];
...
};
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
Something like that. Maybe not the most elegant, but it produced really good result and manual inspection of generated assembly proves that generated code is more or less adequate and acceptable. Build system was updated to use newer GCC and suddenly we started to have some artifacts in generated data (e.g. output from validation test suite is not bit exact anymore even in pure C build with disabled asm code). It took long time to debug the issue and it appeared to be related to aliasing rules and newer versions of GCC.
So, how can I get it done? Please, don't waste time trying to explain that it's not standard, not portable, undefined etc (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider modifying GCC as well to fix the issue, but not refactoring the code)... basically, all I want is to apply some black magic spell so that newer GCC produces the functionally same code for this type of code without disabling optimizations?
Edit:
I used this code on multiple OSes/compilers, but started to have issues when I switched to newer NDK which is based on GCC 4.6. I get the same bad result with GCC 4.7 (from NDK r8d)
I mention 32 byte alignment. If it hurts your eyes, substitute it with any other number that you like, for example 666 if it helps. There is absolutely no point to even mention that most architectures don't need that alignment. If I align 8KB of local arrays on stack, I loose 15 bytes for 16 byte alignment and I loose 31 for 32 byte alignment. I hope it's clear what I mean.
I say that there are like 40 arrays on the stack in performance critical code. I probably also need to say that it's a third party old code that has been working well and I don't want to mess with it. No need to say if it's good or bad, no point for that.
This code/function has well tested and defined behavior. We have exact numbers of the requirements of that code e.g. it allocates Xkb or RAM, uses Y kb of static tables, and consumes up to Z kb of stack space and it cannot change, since the code won't be changed.
By saying that "alignment mess confuses the optimizer" I mean that if I try to align each array separately code optimizer allocates extra registers for the alignment code and performance critical parts of code suddenly don't have enough registers and start trashing to stack instead which results in a slowdown of the code. This behavior was observed on ARM CPUs (I'm not worried about intel at all by the way).
By artifacts I meant that the output becomes non-bitexact, there is some noise added. Either because of this type aliasing issue or there is some bug in the compiler that results eventually in wrong output from the function.
In short, the point of the question... how can I allocate random amount of stack space (using char arrays or alloca, and then align pointer to that stack space and reinterpret this chunk of memory as some structure that has some well defined layout that guarantees alignment of certain variables as long as the structure itself is aligned properly. I'm trying to cast the memory using all kinds of approaches, I move the big stack allocation to a separate function, still I get bad output and stack corruption, I'm really starting to think more and more that this huge function hits some kind of bug in gcc. It's quite strange, that by doing this cast I can't get this thing done no matter what I try. By the way, I disabled all optimizations that require any alignment, it's pure C-style code now, still I get bad results (non-bitexact output and occasional stack corruptions crashes). The simple fix that fixes it all, I write instead of:
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
this code:
tmp buf;
tmp * X = &buf;
then all bugs disappear! The only problem is that this code doesn't do proper alignment for the arrays and will crash with optimizations enabled.
Interesting observation:
I mentioned that this approach works well and produces expected output:
tmp buf;
tmp * X = &buf;
In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:
struct tmp * to_struct_tmp(void * buffer32)
{
return (struct tmp *)buffer32;
}
Initially, I thought that if I cast alloc'ed memory using to_struct_tmp it will trick gcc to produce results that I expected to get, yet, it still produces invalid output. If I try to modify working code this way:
tmp buf;
tmp * X = to_struct_tmp(&buf);
then i get the same bad result! WOW, what else can I say? Perhaps, based on strict-aliasing rule gcc assumes that tmp * X isn't related to tmp buf and removed tmp buf as unused variable right after return from to_struct_tmp? Or does something strange that produces unexpected result. I also tried to inspect generated assembly, however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so, somehow that aliasing rule affects code generation big time.
Conclusion:
After all kinds of testing, I have an idea why possibly I can't get it to work no matter what I try. Based on strict type aliasing, GCC thinks that the static array is unused and therefore doesn't allocate stack for it. Then, local variables that also use stack are written to the same location where my tmp struct is stored; in other words, my jumbo struct shares the same stack memory as other variables of the function. Only this could explain why it always results in the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.

First I'd like to say I'm definitely with you when you ask not to buzz about "standard violation", "implementation-dependent" and etc. Your question is absolutely legitimate IMHO.
Your approach to pack all the arrays within one struct also makes sense, that's what I'd do.
It's unclear from the question formulation which "artifacts" do you observe. Is there any unneeded code generated? Or data misalignment? If the latter is the case - you may (hopefully) use things like STATIC_ASSERT to ensure at compile-time that things are aligned properly. Or at least have some run-time ASSERT at debug build.
As Eric Postpischil proposed, you may consider declaring this structure as global (if this is applicable for the case, I mean multi-threading and recursion are not an option).
One more point that I'd like to notice is so-called stack probes. When you allocate a lot of memory from the stack in a single function (more than 1 page to be exact) - on some platforms (such as Win32) the compiler adds an extra initialization code, known as stack probes. This may also have some performance impact (though likely to be minor).
Also, if you don't need all the 40 arrays simultaneously you may arrange some of them in a union. That is, you'll have one big struct, inside which some sub-structs will be grouped into union.

There are a number of issues here.
Alignment: There is little that requires 32-byte alignment. 16-byte alignment is beneficial for SIMD types on current Intel and ARM processors. With AVX on current Intel processors, the performance cost of using addresses that are 16-byte aligned but not 32-byte aligned is generally mild. There may be a large penalty for 32-byte stores that cross a cache line, so 32-byte alignment can be helpful there. Otherwise, 16-byte alignment may be fine. (On OS X and iOS, malloc returns 16-byte aligned memory.)
Allocation in critical code: You should avoid allocating memory in performance critical code. Generally, memory should be allocated at the start of the program, or before performance critical work begins, and reused during the performance critical code. If you allocate memory before performance critical code begins, then the time it takes to allocate and prepare the memory is essentially irrelevant.
Large, numerous arrays on the stack: The stack is not intended for large memory allocations, and there are limits to its use. Even if you are not encountering problems now, apparently unrelated changes in your code in the future could interact with using a lot of memory on the stack and cause stack overflows.
Numerous arrays: 40 arrays is a lot. Unless these are all in use for different data at the same time, and necessarily so, you should seek to reuse some of the same space for different data and purposes. Using different arrays unnecessarily can cause more cache thrashing than necessary.
Optimization: It is not clear what you mean by saying that the “alignment mess confuses the optimizer and different register allocation slows down the function big time”. If you have multiple automatic arrays inside a function, I would generally expect the optimizer to know they are different, even if you derive pointers from the arrays by address arithmetic. E.g., given code such as a[i] = 3; b[i] = c[i]; a[i] = 4;, I would expect the optimizer to know that a, b, and c are different arrays, and therefore c[i] cannot be the same as a[i], so it is okay to eliminate a[i] = 3;. Perhaps an issue you have is that, with 40 arrays, you have 40 pointers to arrays, so the compiler ends up moving pointers into and out of registers?
In that case, reusing fewer arrays for multiple purposes might help reduce that. If you have an algorithm that is actually using 40 arrays at one time, then you might look at restructuring the algorithm so it uses fewer arrays at a time. If an algorithm has to point to 40 different places in memory, then you essentially need 40 pointers, regardless of where or how they are allocated, and 40 pointers is more than there are registers available.
If you have other concerns about optimization and register use, you should be more specific about them.
Aliasing and artifacts: You report there are some aliasing and artifact problems, but you do not give sufficient details to understand them. If you have one large char array that you reinterpret as a struct containing all your arrays, then there is no aliasing within the struct. So it is not clear what issues you are encountering.

Just disable alias-based optimization and call it a day
If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing will solve the problem. Additionally, in that case, you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.
Good point by Praetorian. I recall one developer's hysteria prompted by the introduction of alias analysis in gcc. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification but it seems like -fno-strict-aliasing would solve the problem, not cost much, and they all must have had other fish to fry.)

32 byte alignment sounds as if you are pushing the button too far. No CPU instruction should require an alignement as large as that. Basically an alignement as wide as the largest data type of your architecture should suffice.
C11 has the concept fo maxalign_t, which is a dummy type of maximum alignment for the architecture. If your compiler doesn't have it, yet, you can easily simulate it by something like
union maxalign0 {
long double a;
long long b;
... perhaps a 128 integer type here ...
};
typedef union maxalign1 maxalign1;
union maxalign1 {
unsigned char bytes[sizeof(union maxalign0)];
union maxalign0;
}
Now you have a data type that has the maximal alignment of your platform and that is default initialized with all bytes set to 0.
maxalign1 history_[someSize];
short * history = history_.bytes;
This avoids the awful address computations that you do currently, you'd only have to do some adoption of someSize to take into account that you always allocate multiples of sizeof(maxalign1).
Also be asured that this has no aliasing problems. First of all unions in C made for this, and then character pointers (of any version) are always allowed to alias any other pointer.

Is the size of a dynamically-allocated array stored somewhere?

It seems to me that delete[] knows the size of a dynamic allocated array. My question is: Is there any way to get it out so that we don't need to provide the size explicitly when coding.

The method used by delete[] to figure out how many items it has to deal with is implementation dependent. You can't get to it or use it in any way.

Read C++ FAQ [16.14] After p = new Fred[n], how does the compiler know there are n objects to be destructed during delete[] p? (and the whole section for a general idea on free store management.)

My question is: Is there any way to get it out so that we don't need to provide the size explicitly when coding.
You don't need to, you just call delete [], with no size.
The way the compiler stores the size is an implementation detail and no specified. Most store it in some memory right before the array starts (not after, as others mentioned).
See this related question : How does delete[] "know" the size of the operand array?

Edited:
Since delete [] needs to call destructors for all elements of the array, the length must definitely be stored somewhere. As of why this memory is not accessible to prevent errors such as walking outside of the array due to its unknown size - I am not really sure. Strictly speaking, the length of statically allocated arrays must be known during compile time and the length of dynamically allocated arrays must be stored by the runtime, so in both cases buffer overflow errors are theoretically 100% preventable, and yet both static and dynamic arrays are unsafe. My guess is this is for performance purposes, bounds checking will make it slower and raw (C style) arrays offer best performance at zero safety.
The implementation of this varies with compiler and runtime vendors, there might be some implementations it may be accessible and usable, but it wouldn't be considered standard and recommended practice. The logical place for the length to be stored is somewhere in the header of the allocated memory fragment before the actual address you will get for the first element of the array.

The C++ compiler has the size of dynamically allocated arrays buried deep somewhere; however, this is not accessible in any way while coding in C++ - so you'll have to store the size somewhere after allocation.
[Edit]: Though some versions of the Visual Studio compiler suite stores the size at the index -1, this is not to be trusted across compilers, or to be used at all when coding.

I think it is compiler dependent and you cannot get it for your application to use. Following link shows 2 methods compiler uses.
http://www.parashift.com/c++-faq-lite/compiler-dependencies.html#faq-38.7
http://www.parashift.com/c++-faq-lite/compiler-dependencies.html#faq-38.8

Compilers follow different approaches to store the memory allocated on new.
This is one of the approach, I read somewhere.
When compiler allocates memory based on a new call, it sets apart one extra byte, maybe in the beginning, where it will store how much memory was allocated.
So when it encounters a delete call, it will use this stored value to decide how much memory has to be de-allocated.

Why does a C/C++ compiler need know the size of an array at compile time?

I know C standards preceding C99 (as well as C++) says that the size of an array on stack must be known at compile time. But why is that? The array on stack is allocated at run-time. So why does the size matter in compile time? Hope someone explain to me what a compiler will do with size at compile time. Thanks.
The example of such an array is:
void func()
{
/*Here "array" is a local variable on stack, its space is allocated
*at run-time. Why does the compiler need know its size at compile-time?
*/
int array[10];
}

To understand why variably-sized arrays are more complicated to implement, you need to know a little about how automatic storage duration ("local") variables are usually implemented.
Local variables tend to be stored on the runtime stack. The stack is basically a large array of memory, which is sequentially allocated to local variables and with a single index pointing to the current "high water mark". This index is the stack pointer.
When a function is entered, the stack pointer is moved in one direction to allocate memory on the stack for local variables; when the function exits, the stack pointer is moved back in the other direction, to deallocate them.
This means that the actual location of local variables in memory is defined only with reference to the value of the stack pointer at function entry1. The code in a function must access local variables via an offset from the stack pointer. The exact offsets to be used depend upon the size of the local variables.
Now, when all the local variables have a size that is fixed at compile-time, these offsets from the stack pointer are also fixed - so they can be coded directly into the instructions that the compiler emits. For example, in this function:
void foo(void)
{
int a;
char b[10];
int c;
a might be accessed as STACK_POINTER + 0, b might be accessed as STACK_POINTER + 4, and c might be accessed as STACK_POINTER + 14.
However, when you introduce a variably-sized array, these offsets can no longer be computed at compile-time; some of them will vary depending upon the size that the array has on this invocation of the function. This makes things significantly more complicated for compiler writers, because they must now write code that accesses STACK_POINTER + N - and since N itself varies, it must also be stored somewhere. Often this means doing two accesses - one to STACK_POINTER + <constant> to load N, then another to load or store the actual local variable of interest.
1. In fact, "the value of the stack pointer at function entry" is such a useful value to have around, that it has a name of its own - the frame pointer - and many CPUs provide a separate register dedicated to storing the frame pointer. In practice, it is usually the frame pointer from which the location of local variables is calculated, rather than the stack pointer itself.

It is not an extremely complicated thing to support, so the reason C89 doesn't allow this is not because it was impossible back then.
There are however two important reasons why it is not in C89:
The runtime code will get less efficient if the array size is not known at compile-time.
Supporting this makes life harder for compiler writers.
Historically, it has been very important that C compilers should be (relatively) easy to write. Also, it should be possible to make compilers simple and small enough to run on modest computer systems (by 80s standards). Another important feature of C is that the generated code should consistently be extremely efficient, without any surprises,
I think it is unfortunate that these values no longer hold for C99.

The compiler has to generate the code to create the space for the frame on the stack to hold the array and other local local variables. For this it needs the size of the array.

Depends on how you allocate the array.
If you create it as a local variable, and specify a length, then it matters because the compiler needs to know how much space to allocate on the stack for the elements of the array. If you don't specify a size of the array, then it doesn't know how much space to set aside for the array elements.
If you create just a pointer to an array, then all you need to do is allocate the space for the pointer itself, and then you can dynamically create array elements during run time. But in this form of array creation, you're allocating space for the array elements in the heap, not on the stack.

In C++ this becomes even more difficult to implement, because variables stored on the stack must have their destructors called in the event of an exception, or upon returning from a given function or scope. Keeping track of the exact number/size of variables to be destroyed adds additional overhead and complexity. While in C it's possible to use something like a frame pointer to make freeing of the VLA implicit, in C++ that doesn't help you, because those destructors need to be called.
Also, VLAs can cause Denial of Service security vulnerabilities. If the user is able to supply any value which is eventually used as the size for a VLA, then they could use a sufficiently large value to cause a stack overflow (and therefore failure) in your process.
Finally, C++ already has a safe and effective variable length array (std::vector<t>), so there's little reason to implement this feature for C++ code.

Let's say you create a variable sized array on the stack. The size of the stack frame needed by the function won't be known at compile time. So, C assumed that some run time environments require this to be known up front. Hence the limitation. C goes back to the early 1970's. Many languages at the time had a "static" look and feel (like Fortran)

Determining maximum possible alignment in C++

Is there any portable way to determine what the maximum possible alignment for any type is?
For example on x86, SSE instructions require 16-byte alignment, but as far as I'm aware, no instructions require more than that, so any type can be safely stored into a 16-byte aligned buffer.
I need to create a buffer (such as a char array) where I can write objects of arbitrary types, and so I need to be able to rely on the beginning of the buffer to be aligned.
If all else fails, I know that allocating a char array with new is guaranteed to have maximum alignment, but with the TR1/C++0x templates alignment_of and aligned_storage, I am wondering if it would be possible to create the buffer in-place in my buffer class, rather than requiring the extra pointer indirection of a dynamically allocated array.
Ideas?
I realize there are plenty of options for determining the max alignment for a bounded set of types: A union, or just alignment_of from TR1, but my problem is that the set of types is unbounded. I don't know in advance which objects must be stored into the buffer.

In C++11 std::max_align_t defined in header cstddef is a POD type whose alignment requirement is at least as strict (as large) as that of every scalar type.
Using the new alignof operator it would be as simple as alignof(std::max_align_t)

In C++0x, the Align template parameter of std::aligned_storage<Len, Align> has a default argument of "default-alignment," which is defined as (N3225 §20.7.6.6 Table 56):
The value of default-alignment shall be the most stringent alignment requirement for any C++ object type whose size is no greater than Len.
It isn't clear whether SSE types would be considered "C++ object types."
The default argument wasn't part of the TR1 aligned_storage; it was added for C++0x.

Unfortunately ensuring max alignment is a lot tougher than it should be, and there are no guaranteed solutions AFAIK. From the GotW blog (Fast Pimpl article):
union max_align {
short dummy0;
long dummy1;
double dummy2;
long double dummy3;
void* dummy4;
/*...and pointers to functions, pointers to
member functions, pointers to member data,
pointers to classes, eye of newt, ...*/
};
union {
max_align m;
char x_[sizeofx];
};
This isn't guaranteed to be fully
portable, but in practice it's close
enough because there are few or no
systems on which this won't work as
expected.
That's about the closest 'hack' I know for this.
There is another approach that I've used personally for super fast allocation. Note that it is evil, but I work in raytracing fields where speed is one of the greatest measures of quality and we profile code on a daily basis. It involves using a heap allocator with pre-allocated memory that works like the local stack (just increments a pointer on allocation and decrements one on deallocation).
I use it for Pimpls particularly. However, just having the allocator is not enough; for such an allocator to work, we have to assume that memory for a class, Foo, is allocated in a constructor, the same memory is likewise deallocated only in the destructor, and that Foo itself is created on the stack. To make it safe, I needed a function to see if the 'this' pointer of a class is on the local stack to determine if we can use our super fast heap-based stack allocator. For that we had to research OS-specific solutions: I used TIBs and TEBs for Win32/Win64, and my co-workers found solutions for Linux and Mac OS X.
The result, after a week of researching OS-specific methods to detect stack range, alignment requirements, and doing a lot of testing and profiling, was an allocator that could allocate memory in 4 clock cycles according to our tick counter benchmarks as opposed to about 400 cycles for malloc/operator new (our test involved thread contention so malloc is likely to be a bit faster than this in single-threaded cases, perhaps a couple of hundred cycles). We added a per-thread heap stack and detected which thread was being used which increased the time to about 12 cycles, though the client can keep track of the thread allocator to get the 4 cycle allocations. It wiped out memory allocation based hotspots off the map.
While you don't have to go through all that trouble, writing a fast allocator might be easier and more generally applicable (ex: allowing the amount of memory to allocate/deallocate to be determined at runtime) than something like max_align here. max_align is easy enough to use, but if you're after speed for memory allocations (and assuming you've already profiled your code and found hotspots in malloc/free/operator new/delete with major contributors being in code you have control over), writing your own allocator can really make the difference.

Short of some maximally_aligned_t type that all compilers promised faithfully to support for all architectures everywhere, I don't see how this could be solved at compile time. As you say, the set of potential types is unbounded. Is the extra pointer indirection really that big a deal?

Allocating aligned memory is trickier than it looks - see for example Implementation of aligned memory allocation

This is what I'm using. In addition to this, if you're allocating memory then a new()'d array of char with length greater than or equal to max_alignment will be aligned to max_alignment so you can then use indexes into that array to get aligned addresses.
enum {
max_alignment = boost::mpl::deref<
boost::mpl::max_element<
boost::mpl::vector<
boost::mpl::int_<boost::alignment_of<signed char>::value>::type,
boost::mpl::int_<boost::alignment_of<short int>::value>::type,
boost::mpl::int_<boost::alignment_of<int>::value>::type, boost::mpl::int_<boost::alignment_of<long int>::value>::type,
boost::mpl::int_<boost::alignment_of<float>::value>::type,
boost::mpl::int_<boost::alignment_of<double>::value>::type,
boost::mpl::int_<boost::alignment_of<long double>::value>::type,
boost::mpl::int_<boost::alignment_of<void*>::value>::type
>::type
>::type
>::type::value
};
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js