What is the recommended way to align memory in C++11

What is the recommended way to align memory in C++11 - c++

I am working on a single producer single consumer ring buffer implementation.I have two requirements:
Align a single heap allocated instance of a ring buffer to a cache line.
Align a field within a ring buffer to a cache line (to prevent false sharing).
My class looks something like:
#define CACHE_LINE_SIZE 64 // To be used later.
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
....
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_; // This needs to be aligned to a cache line.
};
Let me first tackle point 1 i.e. aligning a single heap allocated instance of the class. There are a few ways:
Use the c++ 11 alignas(..) specifier:
template<typename T, uint64_t num_events>
class alignas(CACHE_LINE_SIZE) RingBuffer {
public:
....
private:
// All the private fields.
};
Use posix_memalign(..) + placement new(..) without altering the class definition. This suffers from not being platform independent:
void* buffer;
if (posix_memalign(&buffer, 64, sizeof(processor::RingBuffer<int, kRingBufferSize>)) != 0) {
perror("posix_memalign did not work!");
abort();
}
// Use placement new on a cache aligned buffer.
auto ring_buffer = new(buffer) processor::RingBuffer<int, kRingBufferSize>();
Use the GCC/Clang extension __attribute__ ((aligned(#)))
template<typename T, uint64_t num_events>
class RingBuffer {
public:
....
private:
// All the private fields.
} __attribute__ ((aligned(CACHE_LINE_SIZE)));
I tried to use the C++ 11 standardized aligned_alloc(..) function instead of posix_memalign(..) but GCC 4.8.1 on Ubuntu 12.04 could not find the definition in stdlib.h
Are all of these guaranteed to do the same thing? My goal is cache-line alignment so any method that has some limits on alignment (say double word) will not do. Platform independence which would point to using the standardized alignas(..) is a secondary goal.
I am not clear on whether alignas(..) and __attribute__((aligned(#))) have some limit which could be below the cache line on the machine. I can't reproduce this any more but while printing addresses I think I did not always get 64 byte aligned addresses with alignas(..). On the contrary posix_memalign(..) seemed to always work. Again I cannot reproduce this any more so maybe I was making a mistake.
The second aim is to align a field within a class/struct to a cache line. I am doing this to prevent false sharing. I have tried the following ways:
Use the C++ 11 alignas(..) specifier:
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
...
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_ alignas(CACHE_LINE_SIZE);
};
Use the GCC/Clang extension __attribute__ ((aligned(#)))
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
...
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_ __attribute__ ((aligned (CACHE_LINE_SIZE)));
};
Both these methods seem to align consumer_sequence to an address 64 bytes after the beginning of the object so whether consumer_sequence is cache aligned depends on whether the object itself is cache aligned. Here my question is - are there any better ways to do the same?
EDIT:
The reason aligned_alloc did not work on my machine was that I was on eglibc 2.15 (Ubuntu 12.04). It worked on a later version of eglibc.
From the man page: The function aligned_alloc() was added to glibc in version 2.16.
This makes it pretty useless for me since I cannot require such a recent version of eglibc/glibc.

Unfortunately the best I have found is allocating extra space and then using the "aligned" part. So the RingBuffer new can request an extra 64 bytes and then return the first 64 byte aligned part of that. It wastes space but will give the alignment you need. You will likely need to set the memory before what is returned to the actual alloc address to unallocate it.
[Memory returned][ptr to start of memory][aligned memory][extra memory]
(assuming no inheritence from RingBuffer) something like:
void * RingBuffer::operator new(size_t request)
{
static const size_t ptr_alloc = sizeof(void *);
static const size_t align_size = 64;
static const size_t request_size = sizeof(RingBuffer)+align_size;
static const size_t needed = ptr_alloc+request_size;
void * alloc = ::operator new(needed);
void *ptr = std::align(align_size, sizeof(RingBuffer),
alloc+ptr_alloc, request_size);
((void **)ptr)[-1] = alloc; // save for delete calls to use
return ptr;
}
void RingBuffer::operator delete(void * ptr)
{
if (ptr) // 0 is valid, but a noop, so prevent passing negative memory
{
void * alloc = ((void **)ptr)[-1];
::operator delete (alloc);
}
}
For the second requirement of having a data member of RingBuffer also 64 byte aligned, for that if you know that the start of this is aligned, you can pad to force the alignment for data members.

The answer to your problem is std::aligned_storage. It can be used top level and for individual members of a class.

After some more research my thoughts are:
Like #TemplateRex pointed out there does not seem to be a standard way to align to more than 16 bytes. So even if we use the standardized alignas(..)there is no guarantee unless the alignment boundary is less than or equal to 16 bytes. I'll have to verify that it works as expected on a target platform.
__attribute ((aligned(#))) or alignas(..) cannot be used to align a heap allocated object as I suspected i.e. new() doesn't do anything with these annotations. They seem to work for static objects or stack allocations with the caveats from (1).
Either posix_memalign(..) (non standard) or aligned_alloc(..) (standardized but couldn't get it to work on GCC 4.8.1) + placement new(..) seems to be the solution. My solution for when I need platform independent code is compiler specific macros :)
Alignment for struct/class fields seems to work with both __attribute ((aligned(#))) and alignas() as noted in the answer. Again I think the caveats from (1) about guarantees on alignment stand.
So my current solution is to use posix_memalign(..) + placement new(..) for aligning a heap allocated instance of my class since my target platform right now is Linux only. I am also using alignas(..) for aligning fields since it's standardized and at least works on Clang and GCC. I'll be happy to change it if a better answer comes along.

I don't know if it is the best way to align memory allocated with a new operator, but it is certainly very simple !
This is the way it is done in thread sanitizer pass in GCC 6.1.0
#define ALIGNED(x) __attribute__((aligned(x)))
static char myarray[sizeof(myClass)] ALIGNED(64) ;
var = new(myarray) myClass;
Well, in sanitizer_common/sanitizer_internal_defs.h, it is also written
// Please only use the ALIGNED macro before the type.
// Using ALIGNED after the variable declaration is not portable!
So I do not know why the ALIGNED here is used after the variable declaration. But it is an other story.

Related

Is there any portable way to ensure a struct is defined without padding bytes using c++11? [duplicate]

This question already has answers here:
Compile-time check to make sure that there is no padding anywhere in a struct
(4 answers)
Closed 3 years ago.
Lets consider the following task:
My C++ module as part of an embedded system receives 8 bytes of data, like: uint8_t data[8].
The value of the first byte determines the layout of the rest (20-30 different). In order to get the data effectively, I would create different structs for each layout and put each to a union and read the data directly from the address of my input through a pointer like this:
struct Interpretation_1 {
uint8_t multiplexer;
uint8_t timestamp;
uint32_t position;
uint16_t speed;
};
// and a lot of other struct like this (with bitfields, etc..., layout is not defined by me :( )
union DataInterpreter {
Interpretation_1 movement;
//Interpretation_2 temperatures;
//etc...
};
...
uint8_t exampleData[8] {1u, 10u, 20u,0u,0u,0u, 5u,0u};
DataInterpreter* interpreter = reinterpret_cast<DataInterpreter*>(&exampleData);
std::cout << "position: " << +interpreter->movement.position << "\n";
The problem I have is, the compiler can insert padding bytes to the interpretation structs and this kills my idea. I know I can use
with gcc: struct MyStruct{} __attribute__((__packed__));
with MSVC: I can use #pragma pack(push, 1) MyStruct{}; #pragma pack(pop)
with clang: ? (I could check it)
But is there any portable way to achieve this? I know c++11 has e.g. alignas for alignment control, but can I use it for this? I have to use c++11 but I would be just interested if there is a better solution with later version of c++.

But is there any portable way to achieve this?
No, there is no (standard) way to "make" a type that would have padding to not have padding in C++. All objects are aligned at least as much as their type requires and if that alignment doesn't match with the previous sub objects, then there will be padding and that is unavoidable.
Furthermore, there is another problem: You're accessing through a reinterpreted pointed that doesn't point to an object of compatible type. The behaviour of the program is undefined.
We can conclude that classes are not generally useful for representing arbitrary binary data. The packed structures are non-standard, and they also aren't compatible across different systems with different representations for integers (byte endianness).
There is a way to check whether a type contains padding: Compare the size of the sub objects to the size of the complete object, and do this recursively to each member. If the sizes don't match, then there is padding. This is quite tricky however because C++ has minimal reflection capabilities, so you need to resort either hard coding or meta programming.
Given such check, you can make the compilation fail on systems where the assumption doesn't hold.
Another handy tool is std::has_unique_object_representations (since C++17) which will always be false for all types that have padding. But note that it will also be false for types that contain floats for example. Only types that return true can be meaningfully compared for equality with std::memcmp.

Reading from unaligned memory is undefined behavior in C++. In other words, the compiler is allowed to assume that every uint32_t is located at a alignof(uint32_t)-byte boundary and every uint16_t is located at a alignof(uint16_t)-byte boundary. This means that if you somehow manage to pack your bytes portably, doing interpreter->movement.position will still trigger undefined behaviour.
(In practice, on most architectures, unaligned memory access will still work, but albeit incur a performance penalty.)
You could, however, write a wrapper, like how std::vector<bool>::operator[] works:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <type_traits>
template <typename T>
struct unaligned_wrapper {
static_assert(std::is_trivial<T>::value);
std::aligned_storage_t<sizeof(T), 1> buf;
operator T() const noexcept {
T ret;
memcpy(&ret, &buf, sizeof(T));
return ret;
}
unaligned_wrapper& operator=(T t) noexcept {
memcpy(&buf, &t, sizeof(T));
return *this;
}
};
struct Interpretation_1 {
unaligned_wrapper<uint8_t> multiplexer;
unaligned_wrapper<uint8_t> timestamp;
unaligned_wrapper<uint32_t> position;
unaligned_wrapper<uint16_t> speed;
};
// and a lot of other struct like this (with bitfields, etc..., layout is not defined by me :( )
union DataInterpreter {
Interpretation_1 movement;
//Interpretation_2 temperatures;
//etc...
};
int main(){
uint8_t exampleData[8] {1u, 10u, 20u,0u,0u,0u, 5u,0u};
DataInterpreter* interpreter = reinterpret_cast<DataInterpreter*>(&exampleData);
std::cout << "position: " << interpreter->movement.position << "\n";
}
This would ensure that every read or write to the unaligned integer is transformed to a bytewise memcpy, which does not have any alignment requirement. There might be a performance penalty for this on architectures with the ability to access unaligned memory quickly, but it would work on any conforming compiler.

C++ casting a struct to std::vector<char> memory alignment

I'm trying to cast a struct into a char vector.
I wanna send my struct casted in std::vector throw a UDP socket and cast it back on the other side. Here is my struct whith the PACK attribute.
#define PACK( __Declaration__ ) __pragma( pack(push, 1) ) __Declaration__ __pragma( pack(pop) )
PACK(struct Inputs
{
uint8_t structureHeader;
int16_t x;
int16_t y;
Key inputs[8];
});
Here is test code:
auto const ptr = reinterpret_cast<char*>(&in);
std::vector<char> buffer(ptr, ptr + sizeof in);
//send and receive via udp
Inputs* my_struct = reinterpret_cast<Inputs*>(&buffer[0]);
The issue is:
All works fine except my uint8_t or int8_t.
I don't know why but whenever and wherever I put a 1Bytes value in the struct,
when I cast it back the value is not readable (but the others are)
I tried to put only 16bits values and it works just fine even with the
maximum values so all bits are ok.
I think this is something with the alignment of the bytes in the memory but i can't figure out how to make it work.
Thank you.

I'm trying to cast a struct into a char vector.
You cannot cast an arbitrary object to a vector. You can cast your object to an array of char and then copy that array into a vector (which is actually what your code is doing).
auto const ptr = reinterpret_cast<char*>(&in);
std::vector<char> buffer(ptr, ptr + sizeof in);
That second line defines a new vector and initializes it by copying the bytes that represent your object into it. This is reasonable, but it's distinct from what you said you were trying to do.
I think this is something with the alignment of the bytes in the memory
This is good intuition. If you hadn't told the compiler to pack the struct, it would have inserted padding bytes to ensure each field starts at its natural alignment. The fact that the operation isn't reversible suggests that somehow the receiving end isn't packed exactly the same way. Are you sure the receiving program has exactly the same packing directive and struct layout?
On x86, you can get by with unaligned data, but you may pay a large performance cost whenever you access an unaligned member variable. With the packing set to one, and the first field being odd-sized, you've guaranteed that the next fields will be unaligned. I'd urge you to reconsider this. Design the struct so that all the fields fall at their natural alignment boundaries and that you don't need to adjust the packing. This may make your struct a little bigger, but it will avoid all the alignment and performance problems.
If you want to omit the padding bytes in your wire format, you'll have to copy the relevant fields byte by byte into the wire format and then copy them back out on the receiving end.
An aside regarding:
#define PACK( __Declaration__ ) __pragma( pack(push, 1) ) __Declaration__ __pragma( pack(pop) )
Identifiers that begin with underscore and a capital letter or with two underscores are reserved for "the implementation," so you probably shouldn't use __Declaration__ as the macro's parameter name. ("The implementation" refers to the compiler, the standard library, and any other runtime bits the compiler requires.)

1
vector class has dynamically allocated memory and uses pointers inside. So you can't send the vector (but you can send the underlying array)
2
SFML has a great class for doing this called sf::packet. It's free, open source, and cross-platform.
I was recently working on a personal cross platform socket library for use in other personal projects and I eventually quit it for SFML. There's just TOO much to test, I was spending all my time testing to make sure stuff worked and not getting any work done on the actual projects I wanted to do.
3
memcpy is your best friend. It is designed to be portable, and you can use that to your advantage.
You can use it to debug. memcpy the thing you want to see into a char array and check that it matches what you expect.
4
To save yourself from having to do tons of robustness testing, limit yourself to only chars, 32-bit integers, and 64-bit doubles. You're using different compilers? struct packing is compiler and architecture dependent. If you have to use a packed struct, you need to guarantee that the packing is working as expected on all platforms you will be using, and that all platforms have the same endianness. Obviously, that's what you're having trouble with and I'm sorry I can't help you more with that. I would I would recommend regular serializing and would definitely avoid struct packing if I was trying to make portable sockets.
If you can make those guarantees that I mentioned, sending is really easy on LINUX.
// POSIX
void send(int fd, Inputs& input)
{
int error = sendto(fd, &input, sizeof(input), ..., ..., ...);
...
}
winsock2 uses a char* instead of a void* :(
void send(int fd, Inputs& input)
{
char buf[sizeof(input)];
memcpy(buf, &input, sizeof(input));
int error = sendto(fd, buf, sizeof(input), ..., ..., ...);
...
}

Did you tried the most simple approach of:
unsigned char *pBuff = (unsigned char*)&in;
for (unsigned int i = 0; i < sizeof(Inputs); i++) {
vecBuffer.push_back(*pBuff);
pBuff++;
}
This would work for both, pack and non pack, since you will iterate the sizeof.

C++11 : Does new return contiguous memory?

float* tempBuf = new float[maxVoices]();
Will the above result in
1) memory that is 16-byte aligned?
2) memory that is confirmed to be contiguous?
What I want is the following:
float tempBuf[maxVoices] __attribute__ ((aligned));
but as heap memory, that will be effective for Apple Accelerate framework.
Thanks.

The memory will be aligned for float, but not necessarily for CPU-specific SIMD instructions. I strongly suspect on your system sizeof(float) < 16, though, which means it's not as aligned as you want. The memory will be contiguous: &A[i] == &A[0] + i.
If you need something more specific, new std::aligned_storage<Length, Alignment> will return a suitable region of memory, presuming of course that you did in fact pass a more specific alignment.
Another alternative is struct FourFloats alignas(16) {float[4] floats;}; - this may map more naturally to the framework. You'd now need to do new FourFloats[(maxVoices+3)/4].

Yes, new returns contiguous memory.
As for alignment, no such alignment guarantee is provided. Try this:
template<class T, size_t A>
T* over_aligned(size_t N){
static_assert(A <= alignof(std::max_align_t),
"Over-alignment is implementation-defined."
);
static_assert( std::is_trivially_destructible<T>{},
"Function does not store number of elements to destroy"
);
using Helper=std::aligned_storage_t<sizeof(T), A>;
auto* ptr = new Helper[(N+sizeof(Helper)-1)/sizeof(Helper)];
return new(ptr) T[N];
}
Use:
float* f = over_aligned<float,16>(37);
Makes an array of 37 floats, with the buffer aligned to 16 bytes. Or it fails to compile.
If the assert fails, it can still work. Test and consult your compiler documentation. Once convinced, put compiler-specific version guards around the static assert, so when you change compilers you can test all over again (yay).
If you want real portability, you have to have a fall back to std::align and manage resources separately from data pointers and count the number of T if and only if T has a non-trivial destructor, then store the number of T "before" the start of your buffer. It gets pretty silly.

It's guaranteed to be aligned properly with respect to the type you're allocating. So if it's an array of 4 floats (each supposedly 4 bytes), it's guaranteed to provide a usable sequence of floats. It's not guaranteed to be aligned to 16 bytes.
Yes, it's guaranteed to be contiguous (otherwise what would be the meaning of a single pointer?)
If you want it to be aligned to some K bytes, you can do it manually with std::align. See MSalter's answer for a more efficient way of doing this.

If tempBuf is not nullptr then the C++ standard guarantees that tempBuf points to the zeroth element of least maxVoices contiguous floats.
(Don't forget to call delete[] tempBuf once you're done with it.)

WinAPI _Interlocked* intrinsic functions for char, short

I need to use _Interlocked*** function on char or short, but it takes long pointer as input. It seems that there is function _InterlockedExchange8, I don't see any documentation on that. Looks like this is undocumented feature. Also compiler wasn't able to find _InterlockedAdd8 function.
I would appreciate any information on that functions, recommendations to use/not to use and other solutions as well.
update 1
I'll try to simplify the question.
How can I make this work?
struct X
{
char data;
};
X atomic_exchange(X another)
{
return _InterlockedExchange( ??? );
}
I see two possible solutions
Use _InterlockedExchange8
Cast another to long, do exchange and cast result back to X
First one is obviously bad solution.
Second one looks better, but how to implement it?
update 2
What do you think about something like this?
template <typename T, typename U>
class padded_variable
{
public:
padded_variable(T v): var(v) {}
padded_variable(U v): var(*static_cast<T*>(static_cast<void*>(&v))) {}
U& cast()
{
return *static_cast<U*>(static_cast<void*>(&var));
}
T& get()
{
return var;
}
private:
T var;
char padding[sizeof(U) - sizeof(T)];
};
struct X
{
char data;
};
template <typename T, int S = sizeof(T)> class var;
template <typename T> class var<T, 1>
{
public:
var(): data(T()) {}
T atomic_exchange(T another)
{
padded_variable<T, long> xch(another);
padded_variable<T, long> res(_InterlockedExchange(&data.cast(), xch.cast()));
return res.get();
}
private:
padded_variable<T, long> data;
};
Thanks.

It's pretty easy to make 8-bit and 16-bit interlocked functions but the reason they're not included in WinAPI is due to IA64 portability. If you want to support Win64 the assembler cannot be inline as MSVC no longer supports it. As external function units, using MASM64, they will not be as fast as inline code or intrinsics so you are wiser to investigate promoting algorithms to use 32-bit and 64-bit atomic operations instead.
Example interlocked API implementation: intrin.asm

Why do you want to use smaller data types? So you can fit a bunch of them in a small memory space? That's just going to lead to false sharing and cache line contention.
Whether you use locking or lockless algorithms, it's ideal to have your data in blocks of at least 128 bytes (or whatever the cache line size is on your CPU) that are only used by a single thread at a time.

Well, you have to make do with the functions available. _InterlockedIncrement and `_InterlockedCompareExchange are available in 16 and 32-bit variants (the latter in a 64-bit variant as well), and maybe a few other interlocked intrinsics are available in 16-bit versions as well, but InterlockedAdd doesn't seem to be, and there seem to be no byte-sized Interlocked intrinsics/functions at all.
So... You need to take a step back and figure out how to solve your problem without an IntrinsicAdd8.
Why are you working with individual bytes in any case? Stick to int-sized objects unless you have a really good reason to use something smaller.

Creating a new answer because your edit changed things a bit:
Use _InterlockedExchange8
Cast another to long, do exchange and cast result back to X
The first simply won't work. Even if the function existed, it would allow you to atomically update a byte at a time. Which means that the object as a whole would be updated in a series of steps which wouldn't be atomic.
The second doesn't work either, unless X is a long-sized POD type. (and unless it is aligned on a sizeof(long) boundary, and unless it is of the same size as a long)
In order to solve this problem you need to narrow down what types X might be. First, of course, is it guaranteed to be a POD type? If not, you have an entirely different problem, as you can't safely treat non-POD types as raw memory bytes.
Second, what sizes may X have? The Interlocked functions can handle 16, 32 and, depending on circumstances, maybe 64 or even 128 bit widths.
Does that cover all the cases you can encounter?
If not, you may have to abandon these atomic operations, and settle for plain old locks. Lock a Mutex to ensure that only one thread touches these objects at a time.

G++ SSE memory alignment on the stack

I am attempting to re-write a raytracer using Streaming SIMD Extensions. My original raytracer used inline assembly and movups instructions to load data into the xmm registers. I have read that compiler intrinsics are not significantly slower than inline assembly (I suspect I may even gain speed by avoiding unaligned memory accesses), and much more portable, so I am attempting to migrate my SSE code to use the intrinsics in xmmintrin.h. The primary class affected is vector, which looks something like this:
#include "xmmintrin.h"
union vector {
__m128 simd;
float raw[4];
//some constructors
//a bunch of functions and operators
} __attribute__ ((aligned (16)));
I have read previously that the g++ compiler will automatically allocate structs along memory boundaries equal to that of the size of the largest member variable, but this does not seem to be occurring, and the aligned attribute isn't helping. My research indicates that this is likely because I am allocating a whole bunch of function-local vectors on the stack, and that alignment on the stack is not guaranteed in x86. Is there any way to force this alignment? I should mention that this is running under native x86 Linux on a 32-bit machine, not Cygwin. I intend to implement multithreading in this application further down the line, so declaring the offending vector instances to be static isn't an option. I'm willing to increase the size of my vector data structure, if needed.

The simplest way is std::aligned_storage, which takes alignment as a second parameter.
If you don't have it yet, you might want to check Boost's version.
Then you can build your union:
union vector {
__m128 simd;
std::aligned_storage<16, 16> alignment_only;
}
Finally, if it does not work, you can always create your own little class:
template <typename Type, intptr_t Align> // Align must be a power of 2
class RawStorage
{
public:
Type* operator->() {
return reinterpret_cast<Type const*>(aligned());
}
Type const* operator->() const {
return reinterpret_cast<Type const*>(aligned());
}
Type& operator*() { return *(operator->()); }
Type const& operator*() const { return *(operator->()); }
private:
unsigned char* aligned() {
if (data & ~(Align-1) == data) { return data; }
return (data + Align) & ~(Align-1);
}
unsigned char data[sizeof(Type) + Align - 1];
};
It will allocate a bit more storage than necessary, but this way alignment is guaranteed.
int main(int argc, char* argv[])
{
RawStorage<__m128, 16> simd;
*simd = /* ... */;
return 0;
}
With luck, the compiler might be able to optimize away the pointer alignment stuff if it detects the alignment is necessary right.

A few weeks ago, I had re-written an old ray tracing assignment from my university days, updating it to run it on 64-bit linux and to make use of the SIMD instructions. (The old version incidentally ran under DOS on a 486, to give you an idea of when I last did anything with it).
There very well may be better ways of doing it, but here is what I did ...
typedef float v4f_t __attribute__((vector_size (16)));
class Vector {
...
union {
v4f_t simd;
float f[4];
} __attribute__ ((aligned (16)));
...
};
Disassembling my compiled binary showed that it was indeed making use of the movaps instruction.
Hope this helps.

Normally all you should need is:
union vector {
__m128 simd;
float raw[4];
};
i.e. no additional __attribute__ ((aligned (16))) required for the union itself.
This works as expected on pretty much every compiler I've ever used, with the notable exception of gcc 2.95.2 back in the day, which used to screw up stack alignment in some cases.

I use this union trick all the time with __m128 and it works with GCC on Mac and Visual C++ on Windows, so this must be a bug in the compiler that you use.
The other answers contain good workarounds though.

If you need an array of N of these objects, allocate vector raw[N+1], and use vector* const array = reinterpret_cast<vector*>(reinterpret_cast<intptr_t>(raw+1) & ~15) as the base address of your array. This will always be aligned.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js