Using std::atomic with aligned classes - c++

I have a mat4 class, a 4x4 matrix that uses sse intrinsics. This class is aligned using _MM_ALIGN16, because it stores the matrix as a set of __m128's. The problem is, when I declare an atomic<mat4>, my compiler yells at me:
f:\program files (x86)\microsoft visual studio 12.0\vc\include\atomic(504): error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned
This is the same error I get when I try to pass any class aligned with _MM_ALIGN16 as an argument for a function (without using const &).
How can I declare an atomic version of my mat4 class?

The MSC compiler has never supported more than 4 bytes of alignment for parameters on the x86 stack, and there is no workaround.
You can verify this yourself by compiling,
struct A { __declspec(align(4)) int x; };
void foo(A a) {}
versus,
// won't compile, alignment guarantee can't be fulfilled
struct A { __declspec(align(8)) int x; };
versus,
// __m128d is naturally aligned, again - won't compile
struct A { __m128d x; };
Generally MSC is absolved by the following,
You cannot specify alignment for function parameters.
align (C++)
And you cannot specify the alignment, because MSC writers wanted to reserve the freedom to decide on the alignment,
The x86 compiler uses a different method for aligning the stack. By
default, the stack is 4-byte aligned. Although this is space
efficient, you can see that there are some data types that need to be
8-byte aligned, and that, in order to get good performance, 16-byte
alignment is sometimes needed. The compiler can determine, on some
occasions, that dynamic 8-byte stack alignment would be
beneficial—notably when there are double values on the stack.
The compiler does this in two ways. First, the compiler can use
link-time code generation (LTCG), when specified by the user at
compile and link time, to generate the call-tree for the complete
program. With this, it can determine regions of the call-tree where
8-byte stack alignment would be beneficial, and it determines
call-sites where the dynamic stack alignment gets the best payoff. The
second way is used when the function has doubles on the stack, but,
for whatever reason, has not yet been 8-byte aligned. The compiler
applies a heuristic (which improves with each iteration of the
compiler) to determine whether the function should be dynamically
8-byte aligned.
Windows Data Alignment on IPF, x86, and x64
Thus as long as you use MSC with the 32-bit platform toolset, this issue is unavoidable.
The x64 ABI has been explicit about the alignment, defining that non-trivial structures or structures over certain sizes are passed as a pointer parameter. This is elaborated in Section 3.2.3 of the ABI, and MSC had to implement this to be compatible with the ABI.
Path 1: Use another Windows compiler toolchain: GCC or ICC.
Path 2: Move to a 64-bit platform MSC toolset
Path 3: Reduce your use cases to std::atomic<T> with T=__m128d, because it will be possible to skip the stack and pass the variable in an XMM register directly.

The atomic<T> probably has a constructor which is passed a copy of T as a (formal) parameter. For example in the atomic header packaged with GCC 4.5 :
atomic(_Tp __i) : _M_i(__i) { }
This is problematic for exactly the same reason as any other function which has a memory aligned type as a parameter: It would be very complicated and slow for functions to keep track of memory aligned data on the stack.
Even if the compiler allowed it, this approach would incur a significant performance penalty. Assuming you are trying to optimise for speed I would implement a less fine grained memory access approach. Either locking access to a chunk of memory whilst performing a series of calculations, or explicitly designing your program so that threads never try and access the same piece of memory.
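The coarser-grained locking suggested above could look roughly like the following minimal sketch. It is plain C++11 rather than anything MSVC-specific; `Mat4` and `GuardedMat4` are hypothetical stand-ins for the question's mat4 class, and values cross the interface by reference so that no aligned object is ever a by-value parameter.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the question's SSE-backed mat4.
struct alignas(16) Mat4 {
    float m[4][4];
};

// Guards a Mat4 with a mutex instead of wrapping it in std::atomic.
class GuardedMat4 {
public:
    // Copies the new value in under the lock; the argument is a const reference.
    void store(const Mat4& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = value;
    }

    // Copies the current value out into caller-provided storage.
    void load(Mat4& out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        out = value_;
    }

private:
    mutable std::mutex mutex_;
    Mat4 value_{};  // zero-initialized
};
```

A thread that needs several matrix operations in a row would take one copy with load(), work on the copy, and publish the result with a single store().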

I faced a similar problem using Agner Fog's vectorclass in MSVC. The problem happens in 32-bit mode. If you compile in 64-bit mode, I don't think you will have this problem. On both Windows and Unix the ABI keeps the stack 16-byte aligned in 64-bit mode, but not necessarily in 32-bit mode. In his manual, under compile-time errors, he writes
"error C2719: formal parameter with __declspec(align('16')) won't be aligned".
The Microsoft compiler cannot handle vectors as function parameters. The
easiest solution is to change the parameter to a const reference, e.g.:
Vec4f my_function(Vec4f const & x) {
... }
So if you use a const reference (as you mentioned) when you pass your class to a function it should work in 32-bit mode as well.
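As a minimal, portable sketch of the const-reference approach (using alignas(16) instead of __declspec(align(16)); `Vec4`, `is_aligned16` and `sum` are hypothetical names, not part of vectorclass):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-byte-aligned vector type standing in for Vec4f or mat4.
struct alignas(16) Vec4 {
    float v[4];
};

// Reports whether an address is a multiple of 16.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Takes the vector by const reference: only an address is passed, so the
// callee sees the caller's already-aligned object rather than a stack copy.
inline float sum(const Vec4& x) {
    assert(is_aligned16(&x));
    return x.v[0] + x.v[1] + x.v[2] + x.v[3];
}
```

Because the parameter is a reference, the compiler never has to place an aligned copy in the argument area of the stack, which is the situation C2719 complains about.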
Edit: Based on this Self-contained, STL-compatible implementation of std::vector I think you can use a "thin wrapper". Something like.
template <typename T>
struct wrapper : public T
{
wrapper() {}
wrapper(const T& rhs) : T(rhs) {}
};
struct __declspec(align(64)) mat4
{
//float x, y, z, w;
};
int main()
{
atomic< wrapper<mat4> > m; // OK, no C2719 error
return 0;
}

I don't profess to understand how __declspec(align(foo)) is supposed to work, but this standard C++ program compiles and runs fine in gcc & clang using alignas(16):
struct alignas(16) mat4 {
float some_floats[4][4];
};
std::atomic<mat4> am4;
static_assert(alignof(decltype(am4)) == 16,
"Jabberwocky is killing user.");
int main() {
static const mat4 foo = {{
{ 1, 2, 3, 4 },
{ 1, 2, 3, 4 },
{ 1, 2, 3, 4 },
{ 1, 2, 3, 4 }
}};
am4 = foo;
}

Related

Isn't __m128d aligned natively?

I've this code:
double a[bufferSize];
double b[voiceSize][bufferSize];
double c[voiceSize][bufferSize];
...
inline void AddIntrinsics(int voiceIndex, int blockSize) {
// assuming blockSize % 2 == 0 and voiceIndex is within the range
int iters = blockSize / 2;
__m128d *pA = (__m128d*)a;
__m128d *pB = (__m128d*)b[voiceIndex];
double *pC = c[voiceIndex];
for (int i = 0; i < iters; i++, pA++, pB++, pC += 2) {
_mm_store_pd(pC, _mm_add_pd(*pA, *pB));
}
}
But "sometimes" it raises an access violation, which I think is due to the lack of memory alignment of my 3 arrays a, b and c.
But since I operate on __m128d (which uses __declspec(align(16))), isn't the alignment guaranteed when I cast to those pointers?
Or, since it would use __m128d as a "register", could it mov directly into a register from unaligned memory (hence the exception)?
If so, how would you align arrays in C++ for this kind of stuff? std::align?
I'm on Win x64, MSVC, Compiling in Release mode 32 and 64 bit.
__m128d is a type that assumes / requires / guarantees (to the compiler) 16-byte alignment (see footnote 1).
Casting a misaligned pointer to __m128d* and dereferencing it is undefined behaviour, and this is the expected result. Use _mm_loadu_pd if your data might not be aligned. (Or preferably, align your data with alignas(16) double a[bufferSize]; see footnote 2.) ISO C++11 and later have portable syntax for aligning static and automatic storage (but not as easy for dynamic storage).
Casting a pointer to __m128d* and dereferencing it is like promising the compiler that it is aligned. C++ lets you lie to the compiler, with potentially disastrous results. Doing an alignment-required operation doesn't retroactively align your data; that wouldn't make sense or even be possible when you compile multiple files separately or when you operate through pointers.
Footnote 1: Fun fact: GCC's implementation of Intel's intrinsics API adds a __m128d_u type: unaligned vectors that imply 1-byte alignment if you dereference a pointer.
typedef double __m128d_u
__attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
Don't use in portable code; I don't think MSVC supports this, and Intel doesn't define it.
Footnote 2: In your case, you also need every row of your 2D arrays to be aligned by 16. Since a double is 8 bytes, the row length must be a multiple of 2 elements, so you need the array dimension to be [voiceSize][round_up_to_next_multiple_of_2(bufferSize)] if bufferSize can be odd. Leaving unused padding element(s) at the end of every row is a common technique, e.g. in graphics programming for 2d images with potentially-odd widths.
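The row padding could be sketched as follows. Because a double is 8 bytes, a row length that is a multiple of 2 elements keeps every row a multiple of 16 bytes long. The sizes and the names `round_up`, `voices` and `row_is_aligned` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds n up to the next multiple of m (m > 0).
constexpr std::size_t round_up(std::size_t n, std::size_t m) {
    return (n + m - 1) / m * m;
}

constexpr std::size_t kVoices = 3;      // hypothetical sizes
constexpr std::size_t kBufferSize = 5;  // odd on purpose
constexpr std::size_t kRowSize = round_up(kBufferSize, 2);  // 6 doubles = 48 bytes

// The array starts on a 16-byte boundary and every row is a multiple of
// 16 bytes long, so every row also starts on a 16-byte boundary.
alignas(16) double voices[kVoices][kRowSize];

static_assert(sizeof(voices[0]) % 16 == 0, "row size must be a multiple of 16 bytes");

// Reports whether the given row starts on a 16-byte boundary.
inline bool row_is_aligned(std::size_t row) {
    return reinterpret_cast<std::uintptr_t>(voices[row]) % 16 == 0;
}
```

Only the first kBufferSize elements of each row carry data; the final element is padding.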
BTW, this is not "special" or specific to intrinsics: casting a void* or char* to int* (and dereferencing it) is only safe if it's sufficiently aligned. In x86-64 System V and Windows x64, alignof(int) = 4.
(Fun fact: even creating a misaligned pointer is undefined behaviour in ISO C++. But compilers that support Intel's intrinsics API must support stuff like _mm_loadu_si128( (__m128i*)char_ptr ), so we can consider creating without dereference of unaligned pointers as part of the extension.)
It usually happens to work on x86 because only 16-byte loads have an alignment-required version. But on SPARC for example, you'd potentially have the same problem. It is possible to run into trouble with misaligned pointers to int or short even on x86, though. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? is a good example: auto-vectorization by gcc assumes that some whole number of uint16_t elements will reach a 16-byte alignment boundary.
It's also easier to run into problems with intrinsics because alignof(__m128d) is greater than the alignment of most primitive types. On 32-bit x86 C++ implementations, alignof(max_align_t) is only 8, so malloc and new typically only return 8-byte aligned memory.
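A minimal sketch of both suggestions, aligned arrays via alignas(16) and unaligned-safe access via _mm_loadu_pd / _mm_storeu_pd, might look like this. It assumes an x86 target with SSE2 and falls back to a scalar loop elsewhere; the buffer names, `kBufferSize` and `add_unaligned` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

constexpr std::size_t kBufferSize = 8;  // hypothetical size

// alignas(16) makes aligned loads legal on these arrays.
alignas(16) double buf_a[kBufferSize];
alignas(16) double buf_b[kBufferSize];
alignas(16) double buf_c[kBufferSize];

// Reports whether an address is a multiple of 16.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Adds n doubles element-wise. Unaligned loads and stores are used, so the
// pointers may have any alignment that is valid for double.
inline void add_unaligned(const double* x, const double* y, double* out, std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(out + i, _mm_add_pd(vx, vy));
    }
#endif
    for (; i < n; ++i) {  // scalar tail, or the whole loop without SSE2
        out[i] = x[i] + y[i];
    }
}
```

Passing buf_a + 1 (8 bytes past a 16-byte boundary) is fine here, whereas dereferencing it through a __m128d* would not be.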

How to solve error C2719 in visual studio 2010 c++ when no code line exist in code

I am building a project I have received from a colleague. I am receiving the following error:
..\HelperFunctions\disp.cpp(130): error C2719: 'viewpoint': formal parameter with __declspec(align('16')) won't be aligned
I am following clues to disp.cpp line 130 only to find this is the end of a function and all I have in this line is:
}
Following this link it is my understanding this might be an issue with the function definition, but I could not fully understand if there is anything wrong. I have commented all unnecessary elements of the function and reduced it to:
std::vector< int > HPR (typename pcl::PointCloud<PointT>::ConstPtr source,pcl::PointXYZ viewpoint, double param)
{
//commented section
std::vector< int > indices;
//commented section
return indices;
}
Still getting the same error.
What am I missing?
How do I address this?
P.S.
I am new to C++ and working on visual studio 2010 with PCL API.
After googling the pcl::PointXYZ, I found out that it's actually based on the Eigen library. (I saw a lot of EIGEN macros in the source code.)
The EIGEN library tries to obtain best performance using special SSE instructions. AFAIK, these SSE instructions require that the data has to be aligned appropriately (e.g. that addresses are multiples of 16).
This may interfere with the passing of function arguments for
std::vector<int> HPR(
typename pcl::PointCloud<PointT>::ConstPtr source,
pcl::PointXYZ viewpoint, double param);
When a function is called, the arguments may be handed over in CPU registers, but usually (especially on Intel's x86 CPUs) they are pushed to the stack, where the function accesses them relative to a base pointer: on Intel CPUs, the BP register (16 bit), EBP (32 bit), or RBP (64 bit).
More about this in Eli Bendersky's Stack frame layout on x86-64.
However, pushing data to the stack may not allow aligning the data as required (without breaking the "binary signature" of the called function). Thus, the compiler throws the error C2719.
If the 2nd parameter of the function is changed from value to reference, the reference to the original variable is handed over. (Although this may not be fully correct technically, I imagine it as handing over the address of the original variable instead of a copy on the stack.) To prevent accidentally overwriting the contents of the referenced variable, a const reference can be used:
std::vector<int> HPR(
typename pcl::PointCloud<PointT>::ConstPtr source,
const pcl::PointXYZ &viewpoint, double param);
Due to the reference, the original variable is used, which is either correctly aligned (or would cause another error at another line of source code). For the reference itself, no special alignment is required anymore.
Using a const (or non-const) reference instead of a value may additionally have a positive performance effect. If the parameter type is significantly larger than a "machine word" (i.e. something which fits into a register), then it is worth passing the reference instead of copying the value. This is probably the case for pcl::PointXYZ considering:
#define PCL_ADD_POINT4D \
EIGEN_ALIGN16 \
union { \
float data[4]; \
struct { \
float x; \
float y; \
float z; \
}; \
} ;
...
struct _PointXYZ
{
PCL_ADD_POINT4D; // This adds the members x,y,z which can also be accessed using the point (which is float[4])
EIGEN_MAKE_ALIGNED_OPERATOR_NEW;
};
and
struct EIGEN_ALIGN16 PointXYZ : public _PointXYZ
(According to float[4], it should consume 16 bytes.)
In contrast, it's not worth considering a reference for primitive types like bool, int, and any pointer (which usually fit into "machine word" width).
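The rule of thumb above (aligned or large types by const reference, primitives by value) can be sketched without PCL. `Point3` and `indices_near` below are hypothetical stand-ins for pcl::PointXYZ and HPR, and the distance filter is only there to give the function something to do; it is not what HPR computes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for pcl::PointXYZ: three coordinates padded to
// four floats and aligned to 16 bytes.
struct alignas(16) Point3 {
    float data[4];
};

static_assert(sizeof(Point3) == 16, "expected four floats");

// Returns the indices of all points whose squared distance to viewpoint is
// at most max_dist_sq. The aligned point is passed by const reference,
// the primitive parameters by value.
inline std::vector<int> indices_near(const Point3* cloud, std::size_t count,
                                     const Point3& viewpoint, double max_dist_sq) {
    std::vector<int> indices;
    for (std::size_t i = 0; i < count; ++i) {
        double dist_sq = 0.0;
        for (int k = 0; k < 3; ++k) {
            double diff = static_cast<double>(cloud[i].data[k]) -
                          static_cast<double>(viewpoint.data[k]);
            dist_sq += diff * diff;
        }
        if (dist_sq <= max_dist_sq) {
            indices.push_back(static_cast<int>(i));
        }
    }
    return indices;
}
```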

What is the recommended way to align memory in C++11

I am working on a single producer single consumer ring buffer implementation. I have two requirements:
Align a single heap allocated instance of a ring buffer to a cache line.
Align a field within a ring buffer to a cache line (to prevent false sharing).
My class looks something like:
#define CACHE_LINE_SIZE 64 // To be used later.
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
....
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_; // This needs to be aligned to a cache line.
};
Let me first tackle point 1 i.e. aligning a single heap allocated instance of the class. There are a few ways:
Use the c++ 11 alignas(..) specifier:
template<typename T, uint64_t num_events>
class alignas(CACHE_LINE_SIZE) RingBuffer {
public:
....
private:
// All the private fields.
};
Use posix_memalign(..) + placement new(..) without altering the class definition. This suffers from not being platform independent:
void* buffer;
if (posix_memalign(&buffer, 64, sizeof(processor::RingBuffer<int, kRingBufferSize>)) != 0) {
perror("posix_memalign did not work!");
abort();
}
// Use placement new on a cache aligned buffer.
auto ring_buffer = new(buffer) processor::RingBuffer<int, kRingBufferSize>();
Use the GCC/Clang extension __attribute__ ((aligned(#)))
template<typename T, uint64_t num_events>
class RingBuffer {
public:
....
private:
// All the private fields.
} __attribute__ ((aligned(CACHE_LINE_SIZE)));
I tried to use the standardized aligned_alloc(..) function (added in C11) instead of posix_memalign(..) but GCC 4.8.1 on Ubuntu 12.04 could not find the definition in stdlib.h
Are all of these guaranteed to do the same thing? My goal is cache-line alignment so any method that has some limits on alignment (say double word) will not do. Platform independence which would point to using the standardized alignas(..) is a secondary goal.
I am not clear on whether alignas(..) and __attribute__((aligned(#))) have some limit which could be below the cache line on the machine. I can't reproduce this any more but while printing addresses I think I did not always get 64 byte aligned addresses with alignas(..). On the contrary posix_memalign(..) seemed to always work. Again I cannot reproduce this any more so maybe I was making a mistake.
The second aim is to align a field within a class/struct to a cache line. I am doing this to prevent false sharing. I have tried the following ways:
Use the C++ 11 alignas(..) specifier:
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
...
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_ alignas(CACHE_LINE_SIZE);
};
Use the GCC/Clang extension __attribute__ ((aligned(#)))
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
...
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_ __attribute__ ((aligned (CACHE_LINE_SIZE)));
};
Both these methods seem to align consumer_sequence to an address 64 bytes after the beginning of the object so whether consumer_sequence is cache aligned depends on whether the object itself is cache aligned. Here my question is - are there any better ways to do the same?
EDIT:
The reason aligned_alloc did not work on my machine was that I was on eglibc 2.15 (Ubuntu 12.04). It worked on a later version of eglibc.
From the man page: The function aligned_alloc() was added to glibc in version 2.16.
This makes it pretty useless for me since I cannot require such a recent version of eglibc/glibc.
Unfortunately the best I have found is allocating extra space and then using the "aligned" part. So the RingBuffer new can request an extra 64 bytes and then return the first 64-byte-aligned part of that. It wastes space but will give the alignment you need. You will likely need to store the address actually returned by the allocator just before the memory you hand out, so that delete can free it.
[Memory returned][ptr to start of memory][aligned memory][extra memory]
(assuming no inheritance from RingBuffer) something like:
// std::align is declared in <memory>.
void * RingBuffer::operator new(size_t request)
{
static const size_t ptr_alloc = sizeof(void *);
static const size_t align_size = 64;
// Room for the object plus worst-case padding up to the next 64-byte boundary.
size_t space = request + align_size;
void * alloc = ::operator new(ptr_alloc + space);
// Skip the slot for the saved pointer; std::align takes ptr and space by reference.
void * ptr = static_cast<char *>(alloc) + ptr_alloc;
std::align(align_size, request, ptr, space);
((void **)ptr)[-1] = alloc; // save for delete calls to use
return ptr;
}
void RingBuffer::operator delete(void * ptr)
{
if (ptr) // 0 is valid, but a noop, so prevent passing negative memory
{
void * alloc = ((void **)ptr)[-1];
::operator delete (alloc);
}
}
For the second requirement, having a data member of RingBuffer also 64-byte aligned: if you know that the start of the object is aligned, you can add padding to force the alignment of individual data members.
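One way to get that padding without counting bytes by hand is to put alignas on both the type and the member and let the compiler insert it. The following is a minimal sketch; `Sequences` is a hypothetical, simplified version of the ring buffer's bookkeeping fields, and the 64-byte cache line is an assumption about the target CPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64;  // assumed; check your target CPU

// Hypothetical, simplified layout of the ring buffer's sequence fields.
struct alignas(kCacheLineSize) Sequences {
    std::atomic<std::int64_t> publisher_sequence{0};
    std::int64_t cached_consumer_sequence = 0;
    // alignas on the member makes the compiler insert the padding before it.
    alignas(kCacheLineSize) std::atomic<std::int64_t> consumer_sequence{0};
};

static_assert(alignof(Sequences) == kCacheLineSize, "type must be cache-line aligned");
static_assert(sizeof(Sequences) == 2 * kCacheLineSize, "expected exactly two cache lines");

// Reports whether the two sequence counters live on different cache lines.
inline bool on_different_cache_lines(const Sequences& s) {
    auto p = reinterpret_cast<std::uintptr_t>(&s.publisher_sequence);
    auto c = reinterpret_cast<std::uintptr_t>(&s.consumer_sequence);
    return p / kCacheLineSize != c / kCacheLineSize;
}
```

The member alignment only holds if the object itself starts on a cache-line boundary, which is automatic for static and stack objects but needs an aligned allocation on the heap.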
The answer to your problem is std::aligned_storage. It can be used top level and for individual members of a class.
After some more research my thoughts are:
1. Like @TemplateRex pointed out, there does not seem to be a standard way to align to more than 16 bytes. So even if we use the standardized alignas(..) there is no guarantee unless the alignment boundary is less than or equal to 16 bytes. I'll have to verify that it works as expected on a target platform.
2. __attribute__ ((aligned(#))) or alignas(..) cannot be used to align a heap allocated object as I suspected, i.e. new() doesn't do anything with these annotations. They seem to work for static objects or stack allocations with the caveats from (1).
3. Either posix_memalign(..) (non standard) or aligned_alloc(..) (standardized but couldn't get it to work on GCC 4.8.1) + placement new(..) seems to be the solution. My solution for when I need platform independent code is compiler specific macros :)
4. Alignment for struct/class fields seems to work with both __attribute__ ((aligned(#))) and alignas() as noted in the answer. Again I think the caveats from (1) about guarantees on alignment stand.
So my current solution is to use posix_memalign(..) + placement new(..) for aligning a heap allocated instance of my class since my target platform right now is Linux only. I am also using alignas(..) for aligning fields since it's standardized and at least works on Clang and GCC. I'll be happy to change it if a better answer comes along.
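The posix_memalign(..) + placement new(..) combination can be wrapped so that destruction and deallocation are not forgotten. This is a sketch for POSIX systems only; `AlignedDeleter`, `AlignedPtr` and `make_aligned` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>  // posix_memalign (POSIX, not ISO C++), free
#include <memory>
#include <new>
#include <utility>

// Destroys the object and releases memory obtained from posix_memalign.
template <typename T>
struct AlignedDeleter {
    void operator()(T* p) const {
        p->~T();
        std::free(p);
    }
};

template <typename T>
using AlignedPtr = std::unique_ptr<T, AlignedDeleter<T>>;

// Allocates storage with the requested alignment and constructs a T in it.
// alignment must be a power of two and a multiple of sizeof(void*).
template <typename T, typename... Args>
AlignedPtr<T> make_aligned(std::size_t alignment, Args&&... args) {
    void* raw = nullptr;
    if (posix_memalign(&raw, alignment, sizeof(T)) != 0) {
        throw std::bad_alloc();
    }
    try {
        return AlignedPtr<T>(new (raw) T(std::forward<Args>(args)...));
    } catch (...) {
        std::free(raw);  // the constructor threw: release the raw storage
        throw;
    }
}
```

With this, a cache-line-aligned ring buffer would be created as make_aligned<RingBuffer<int, kRingBufferSize>>(64) and freed automatically when the pointer goes out of scope.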
I don't know if it is the best way to align memory allocated with a new operator, but it is certainly very simple!
This is the way it is done in the thread sanitizer pass in GCC 6.1.0
#define ALIGNED(x) __attribute__((aligned(x)))
static char myarray[sizeof(myClass)] ALIGNED(64) ;
var = new(myarray) myClass;
Well, in sanitizer_common/sanitizer_internal_defs.h, it is also written
// Please only use the ALIGNED macro before the type.
// Using ALIGNED after the variable declaration is not portable!
So I do not know why ALIGNED is used after the variable declaration here. But that is another story.
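The same idea can be written with the standard alignas specifier instead of the ALIGNED macro. A minimal sketch, where `Widget`, `widget_storage` and `make_widget` are hypothetical names standing in for myClass, myarray and the construction site:

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Hypothetical class standing in for myClass.
struct Widget {
    int value;
    explicit Widget(int v) : value(v) {}
};

// Static storage aligned to 64 bytes; alignas comes first, which also
// sidesteps the portability concern about attributes after the declarator.
alignas(64) static unsigned char widget_storage[sizeof(Widget)];

// Constructs a Widget in the static buffer. The buffer holds one object,
// so destroy the previous Widget before calling this again.
inline Widget* make_widget(int v) {
    return new (widget_storage) Widget(v);
}
```

Since the object is never passed to delete, its destructor has to be called explicitly (w->~Widget()) if it does anything that matters.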

G++ SSE memory alignment on the stack

I am attempting to re-write a raytracer using Streaming SIMD Extensions. My original raytracer used inline assembly and movups instructions to load data into the xmm registers. I have read that compiler intrinsics are not significantly slower than inline assembly (I suspect I may even gain speed by avoiding unaligned memory accesses), and much more portable, so I am attempting to migrate my SSE code to use the intrinsics in xmmintrin.h. The primary class affected is vector, which looks something like this:
#include "xmmintrin.h"
union vector {
__m128 simd;
float raw[4];
//some constructors
//a bunch of functions and operators
} __attribute__ ((aligned (16)));
I have read previously that the g++ compiler will automatically allocate structs along memory boundaries equal to that of the size of the largest member variable, but this does not seem to be occurring, and the aligned attribute isn't helping. My research indicates that this is likely because I am allocating a whole bunch of function-local vectors on the stack, and that alignment on the stack is not guaranteed in x86. Is there any way to force this alignment? I should mention that this is running under native x86 Linux on a 32-bit machine, not Cygwin. I intend to implement multithreading in this application further down the line, so declaring the offending vector instances to be static isn't an option. I'm willing to increase the size of my vector data structure, if needed.
The simplest way is std::aligned_storage, which takes alignment as a second parameter.
If you don't have it yet, you might want to check Boost's version.
Then you can build your union:
union vector {
__m128 simd;
std::aligned_storage<16, 16>::type alignment_only;
};
Finally, if it does not work, you can always create your own little class:
// Needs <stdint.h> for intptr_t and uintptr_t.
template <typename Type, intptr_t Align> // Align must be a power of 2
class RawStorage
{
public:
Type* operator->() {
return reinterpret_cast<Type*>(aligned());
}
Type const* operator->() const {
return reinterpret_cast<Type const*>(aligned());
}
Type& operator*() { return *(operator->()); }
Type const& operator*() const { return *(operator->()); }
private:
// Round the buffer's address up to the next multiple of Align.
unsigned char* aligned() const {
uintptr_t const p = reinterpret_cast<uintptr_t>(data);
uintptr_t const a = (p + Align - 1) & ~static_cast<uintptr_t>(Align - 1);
return const_cast<unsigned char*>(data) + (a - p);
}
unsigned char data[sizeof(Type) + Align - 1];
};
It will allocate a bit more storage than necessary, but this way alignment is guaranteed.
int main(int argc, char* argv[])
{
RawStorage<__m128, 16> simd;
*simd = /* ... */;
return 0;
}
With luck, the compiler might be able to optimize away the pointer alignment stuff if it detects the alignment is already right.
A few weeks ago, I had re-written an old ray tracing assignment from my university days, updating it to run it on 64-bit linux and to make use of the SIMD instructions. (The old version incidentally ran under DOS on a 486, to give you an idea of when I last did anything with it).
There very well may be better ways of doing it, but here is what I did ...
typedef float v4f_t __attribute__((vector_size (16)));
class Vector {
...
union {
v4f_t simd;
float f[4];
} __attribute__ ((aligned (16)));
...
};
Disassembling my compiled binary showed that it was indeed making use of the movaps instruction.
Hope this helps.
Normally all you should need is:
union vector {
__m128 simd;
float raw[4];
};
i.e. no additional __attribute__ ((aligned (16))) required for the union itself.
This works as expected on pretty much every compiler I've ever used, with the notable exception of gcc 2.95.2 back in the day, which used to screw up stack alignment in some cases.
I use this union trick all the time with __m128 and it works with GCC on Mac and Visual C++ on Windows, so this must be a bug in the compiler that you use.
The other answers contain good workarounds though.
If you need an array of N of these objects, allocate vector raw[N+1], and use vector* const array = reinterpret_cast<vector*>(reinterpret_cast<intptr_t>(raw+1) & ~15) as the base address of your array. This will always be aligned.

an optimized memcpy for small, or fixed size data in gcc

I use memcpy to copy both variable sizes of data and fixed sized data. In some cases I copy small amounts of memory (only a handful of bytes). In GCC I recall that memcpy used to be an intrinsic/builtin. Profiling my code however (with valgrind) I see thousands of calls to the actual "memcpy" function in glibc.
What conditions have to be met to use the builtin function? I can roll my own memcpy quickly, but I'm sure the builtin is more efficient than what I can do.
NOTE: In most cases the amount of data to be copied is available as a compile-time constant.
CXXFLAGS: -O3 -DNDEBUG
This is the code I'm using now, which forces the builtin; if you take off the __builtin_ prefix, the builtin is not used. It is called from various other templates/functions using T=sizeof(type). The sizes that get used are 1, 2, multiples of 4, a few 50-100 byte sizes, and some larger structures.
template<int T>
inline void load_binary_fixm(void *address)
{
if( (at + T) > len )
stream_error();
__builtin_memcpy( address, data + at, T );
at += T;
}
For the cases where T is small, I'd specialise and use a native assignment.
For example, where T is 1, just assign a single char.
If you know the addresses are aligned, use an appropriately sized int type for your platform.
If the addresses are not aligned, you might be better off doing the appropriate number of char assignments.
The point of this is to avoid a branch and keeping a counter.
Where T is big, I'd be surprised if you do better than the library memcpy(), and the function call overhead is probably going to be lost in the noise. If you do want to optimise, look around at the memcpy() implementations around. There are variants that use extended instructions, etc.
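As a sketch of the small, fixed-size case: instead of hand-written assignments, the version below keeps std::memcpy but makes the size a compile-time constant through sizeof(T), which optimizing compilers typically expand inline for small sizes. `Reader` and `load` are hypothetical names modelled on the question's data / at / len members, and the bounds check returns false rather than calling stream_error().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical minimal reader over a byte buffer.
struct Reader {
    const unsigned char* data;
    std::size_t len;
    std::size_t at;

    // Reads one trivially copyable value of type T from the current position.
    // sizeof(T) is a compile-time constant and the source may be unaligned.
    // Returns false, leaving out untouched, if fewer than sizeof(T) bytes remain.
    template <typename T>
    bool load(T& out) {
        if (at + sizeof(T) > len) {
            return false;
        }
        std::memcpy(&out, data + at, sizeof(T));
        at += sizeof(T);
        return true;
    }
};
```

Whether the call is actually inlined depends on the compiler, its version and the optimization flags, so it is worth checking the generated assembly.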
Update:
Looking at your actual(!) question about inlining memcpy, questions like compiler versions and platform become relevant. Out of curiosity, have you tried using std::copy, something like this:
template<int T>
inline void load_binary_fixm(void *address)
{
if( (at + T) > len )
stream_error();
std::copy(data + at, data + at + T, static_cast<char*>(address));
at += T;
}