I've been trying to re-implement some existing vector and matrix classes to use SSE3 commands, and I seem to be running into these "memory access violation" errors whenever I perform a series of operations on an array of vectors. I'm relatively new to SSE, so I've been starting off simple. Here's the entirety of my vector class:
class SSEVector3D
{
public:
SSEVector3D();
SSEVector3D(float x, float y, float z);
SSEVector3D& operator+=(const SSEVector3D& rhs); //< Elementwise Addition
float x() const;
float y() const;
float z() const;
private:
float m_coords[3] __attribute__ ((aligned (16))); //< The x, y and z coordinates
};
So, not a whole lot going on yet, just some constructors, accessors, and one operation. Using my (admittedly limited) knowledge of SSE, I implemented the addition operation as follows:
SSEVector3D& SSEVector3D::operator+=(const SSEVector3D& rhs)
{
__m128 * pLhs = (__m128 *) m_coords;
__m128 * pRhs = (__m128 *) rhs.m_coords;
*pLhs = _mm_add_ps(*pLhs, *pRhs);
return (*this);
}
To speed-test my new vector class against the old one (to see if it's worth re-implementing the whole thing), I created a simple program that generates a random array of SSEVector3D objects and adds them together. Nothing too complicated:
SSEVector3D sseSum(0, 0, 0);
for(i=0; i<sseVectors.size(); i++)
{
sseSum += sseVectors[i];
}
printf("Total: %f %f %f\n", sseSum.x(), sseSum.y(), sseSum.z());
The sseVectors variable is an std::vector containing elements of type SSEVector3D, whose components are all initialized to random numbers between -1 and 1.
Here's the issue I'm having. If the size of sseVectors is 8,191 or less (a number I arrived at through a lot of trial and error), this runs fine. If the size is 8,192 or more, I get this error when I try to run it:
signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
However, if I comment out that print statement at the end, I get no error even if sseVectors has a size of 8,192 or more.
Is there something wrong with the way I've written this vector class? I'm running Ubuntu 12.04.1 with GCC version 4.6
First, and foremost, don't do this
__m128 * pLhs = (__m128 *) m_coords;
__m128 * pRhs = (__m128 *) rhs.m_coords;
*pLhs = _mm_add_ps(*pLhs, *pRhs);
With SSE, always do your loads and stores explicitly via the appropriate intrinsics, never by just dereferencing. Instead of storing an array of 3 floats in your class, store a value of type _m128. That should make the compiler align instances of your class correctly, without any need for align attributes.
Note, however, that this won't work very well with MSVC. MSVC seems to generally be unable to cope with alignment requirements stronger than 8-byte aligned for by-value arguments :-(. The last time I needed to port SSE code to windows, my solution was to use Intel's C++ compiler for the SSE parts instead of MSVC...
The trick is to notice that __m128 is 16 byte aligned. Use _malloc_aligned() to assure that your float array is correctly aligned, then you can go ahead and cast your float to an array of __m128. Make sure also that the number of floats you allocate is divisible by four.
Related
I'm trying to follow the B.12 section of https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html for atomic add, which works with floats. Simply copying and pasting the code from there and changing the types to floats does not work because I can't perform the casting pointer casting from GLOBAL to PRIVATE, required for the atomicCAS operation. To overcome this I decided to use atomic_xchg() because it works with floats, with additional if statement to achieve same functionality as atomicCAS. However, this returns me varying answer when I perform addition on large float vector every time i run the program.
I've tried figuring out how to overcome the explicit conversion from GLOBAL to PRIVATE, but I honestly don't know how to do it so that when I perform addition, the address argument is changed instead of some temp variable.
kernel void atomicAdd_2(volatile global float* address, float value)
{
float old = *address, assumed;
do {
assumed = old;
if (*address == assumed) {
old = atomic_xchg(address, value + assumed);
}
else{
old = *address;
}
// Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
} while (assumed != old);
}
This is my implementation of atomicAdd for floats.
kernel void reduce_add(global const float* input, global float* output) {
float temp = 242.23f;
atomicAdd_floats(&output[0], temp);
printf(" %f ", output[0]);
}
This is the function where I supply the arguments to the atomicAdd_floats. Note that my input argument contains a vector of floats and output argument is simply where I want to store the result, specifically in the first element of the output vector output[0]; but instead when i printf(" %f ", output[0]); it shows my default initialisation value 0.
First of all, i'd suggest to remove the "kernel" keyword on the atomicAdd_2 function. "kernel" should be used only on functions you intend to enqueue from the host. 2nd, there are OpenCL implementations of atomic-add-float on the net.
Then, i'm a bit confused as to what you're trying to do. Are you trying to sum a vector and while having the sum in a private variable ? if so, it makes no sense to use atomicAdd. Private memory is always atomic, because it's private. Atomicity is only required for global and local memories because they're shared.
Otherwise i'm not sure why you mention changing the address or the GLOBAL to PRIVATE.
Anyway, the code from the link should work, though it's relatively slow. If the vector to sum is large, you might be better off using a different algorithm (wtth partial sums). Try googling "opencl parallel sum" or such.
I've this code:
double a[bufferSize];
double b[voiceSize][bufferSize];
double c[voiceSize][bufferSize];
...
inline void AddIntrinsics(int voiceIndex, int blockSize) {
// assuming blockSize / 2 == 0 and voiceIndex is within the range
int iters = blockSize / 2;
__m128d *pA = (__m128d*)a;
__m128d *pB = (__m128d*)b[voiceIndex];
double *pC = c[voiceIndex];
for (int i = 0; i < iters; i++, pA++, pB++, pC += 2) {
_mm_store_pd(pC, _mm_add_pd(*pA, *pB));
}
}
But "sometimes" it raise Access memory violation, which I think its due to the lacks of memory alignment of my 3 arrays a, b and c.
But since I operate on __m128d (which use __declspec(align(16))), isn't the alignment guaranteed when I cast to those pointer?
Or since it would use __m128d as "register", it could mov directly on register from an unaligned memory (hence, the exception)?
If so, how would you align arrays in C++ for this kind of stuff? std::align?
I'm on Win x64, MSVC, Compiling in Release mode 32 and 64 bit.
__m128d is a type that assumes / requires / guarantees (to the compiler) 16-byte alignment1.
Casting a misaligned pointer to __m128d* and dereferencing it is undefined behaviour, and this is the expected result. Use _mm_loadu_pd if your data might not be aligned. (Or preferably, align your data with alignas(16) double a[bufferSize]; 2). ISO C++11 and later have portable syntax for aligning static and automatic storage (but not as easy for dynamic storage).
Casting a pointer to __m128d* and dereferencing it is like promising the compiler that it is aligned. C++ lets you lie to the compiler, with potentially disastrous results. Doing an alignment-required operation doesn't retroactively align your data; that wouldn't make sense or even be possible when you compile multiple files separately or when you operate through pointers.
Footnote 1: Fun fact: GCC's implementation of Intel's intrinsics API adds a __m128d_u type: unaligned vectors that imply 1-byte alignment if you dereference a pointer.
typedef double __m128d_u
__attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
Don't use in portable code; I don't think MSVC supports this, and Intel doesn't define it.
Footnote 2: In your case, you also need every row of your 2D arrays to be aligned by 16. So you need the array dimension to be [voiceSize][round_up_to_next_power_of_2(bufferSize)] if bufferSize can be odd. Leaving unused padding element(s) at the end of every row is a common technique, e.g. in graphics programming for 2d images with potentially-odd widths.
BTW, this is not "special" or specific to intrinsics: casting a void* or char* to int* (and dereferencing it) is only safe if its sufficiently aligned. In x86-64 System V and Windows x64, alignof(int) = 4.
(Fun fact: even creating a misaligned pointer is undefined behaviour in ISO C++. But compilers that support Intel's intrinsics API must support stuff like _mm_loadu_si128( (__m128i*)char_ptr ), so we can consider creating without dereference of unaligned pointers as part of the extension.)
It usually happens to work on x86 because only 16-byte loads have an alignment-required version. But on SPARC for example, you'd potentially have the same problem. It is possible to run into trouble with misaligned pointers to int or short even on x86, though. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? is a good example: auto-vectorization by gcc assumes that some whole number of uint16_t elements will reach a 16-byte alignment boundary.
It's also easier to run into problems with intrinsics because alignof(__m128d) is greater than the alignment of most primitive types. On 32-bit x86 C++ implementations, alignof(maxalign_t) is only 8, so malloc and new typically only return 8-byte aligned memory.
Consider a code fragment using Intel SSE intrinsics like this:
void foo(double* in1ptr, double* in2ptr)
{
double result[8];
/* .. stuff .. */
__m128d in1 = _mm_loadu_pd(in1ptr);
__m128d in2 = _mm_loadu_pd(in2ptr);
__m128d* resptr = (__m128d*)(&result[4]); <----------
*resptr = __mm_add_pd(in1,in2);
/* .. stuff .. */
}
In the indicated line - when declaring resptr to point to the location at index 4 inside result array -
1) This works in gcc, but is this the correct way of doing things?
2) What are the alignment expectations here, can I create the resptr pointer to point to any arbitrary memory location and subsequently store the result of a SSE operation at that memory location?
load/store intrinsics exist to communicate alignment guarantees or lack thereof to the compiler. If your data is 16B-aligned or 32B-aligned, you don't need them.
Just casting to (__m128d*) follows the usual C semantics of implying that the __m128d has sufficient alignment. (Compilers use movapd rather than movupd, and will fault at run-time if the address isn't aligned).
In this case, you didn't do anything to ensure alignment. It's just by luck that your local array is 16B-aligned. If you use alignas(16) double result[8];, that code will be safe.
For unaligned stores, use _mm_storeu_pd. See also the x86 tag wiki.
Is it safe/possible/advisable to cast floats directly to __m128 if they are 16 byte aligned?
I noticed using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds a significant overhead.
What are potential pitfalls I should be aware of?
EDIT :
There is actually no overhead in using the load and store instructions, I got some numbers mixed and that is why I got better performance. Even thou I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably falling back to some fail safe code path.
What makes you think that _mm_load_ps and _mm_store_ps "add a significant overhead" ? This is the normal way to load/store float data to/from SSE registers assuming source/destination is memory (and any other method eventually boils down to this anyway).
There are several ways to put float values into SSE registers; the following intrinsics can be used:
__m128 sseval;
float a, b, c, d;
sseval = _mm_set_ps(a, b, c, d); // make vector from [ a, b, c, d ]
sseval = _mm_setr_ps(a, b, c, d); // make vector from [ d, c, b, a ]
sseval = _mm_load_ps(&a); // ill-specified here - "a" not float[] ...
// same as _mm_set_ps(a[0], a[1], a[2], a[3])
// if you have an actual array
sseval = _mm_set1_ps(a); // make vector from [ a, a, a, a ]
sseval = _mm_load1_ps(&a); // load from &a, replicate - same as previous
sseval = _mm_set_ss(a); // make vector from [ a, 0, 0, 0 ]
sseval = _mm_load_ss(&a); // load from &a, zero others - same as prev
The compiler will often create the same instructions no matter whether you state _mm_set_ss(val) or _mm_load_ss(&val) - try it and disassemble your code.
It can, in some cases, be advantageous to write _mm_set_ss(*valptr) instead of _mm_load_ss(valptr) ... depends on (the structure of) your code.
Going by http://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx, it's possible but not safe or recommended.
You should not access the __m128 fields directly.
And here's the reason why:
http://social.msdn.microsoft.com/Forums/en-US/vclanguage/thread/766c8ddc-2e83-46f0-b5a1-31acbb6ac2c5/
Casting float* to __m128 will not work. C++ compiler converts assignment to __m128 type to SSE instruction loading 4 float numbers to SSE register. Assuming that this casting is compiled, it doesn't create working code, because SEE loading instruction is not generated.
__m128 variable is not actually variable or array. This is placeholder for SSE register, replaced by C++ compiler to SSE Assembly instruction. To understand this better, read Intel Assembly Programming Reference.
A few years have passed since the question was asked. To answer the question my experience shows:
YES
reinterpret_cast-casting a float* into a __m128* and vice versa is good as long as that float* is 16-byte-aligned - example (in MSVC 2012):
__declspec( align( 16 ) ) float f[4];
return _mm_mul_ps( _mm_set_ps1( 1.f ), *reinterpret_cast<__m128*>( f ) );
The obvious issue I can see is that you're than aliasing (referring to a memory location by more than one pointer type), which can confuse the optimiser. Typical issues with aliasing is that since the optimiser doesn't observe that you're modifying a memory location through the original pointer, it considers it to be unchanged.
Since you're obviously not using the optimiser to its full extent (or you'd be willing to rely on it to emit the correct SSE instructions) you'll probably be OK.
The problem with using the intrinsics yourself is that they're designed to operate on SSE registers, and can't use the instruction variants that load from a memory location and process it in a single instruction.
I am attempting to re-write a raytracer using Streaming SIMD Extensions. My original raytracer used inline assembly and movups instructions to load data into the xmm registers. I have read that compiler intrinsics are not significantly slower than inline assembly (I suspect I may even gain speed by avoiding unaligned memory accesses), and much more portable, so I am attempting to migrate my SSE code to use the intrinsics in xmmintrin.h. The primary class affected is vector, which looks something like this:
#include "xmmintrin.h"
union vector {
__m128 simd;
float raw[4];
//some constructors
//a bunch of functions and operators
} __attribute__ ((aligned (16)));
I have read previously that the g++ compiler will automatically allocate structs along memory boundaries equal to that of the size of the largest member variable, but this does not seem to be occurring, and the aligned attribute isn't helping. My research indicates that this is likely because I am allocating a whole bunch of function-local vectors on the stack, and that alignment on the stack is not guaranteed in x86. Is there any way to force this alignment? I should mention that this is running under native x86 Linux on a 32-bit machine, not Cygwin. I intend to implement multithreading in this application further down the line, so declaring the offending vector instances to be static isn't an option. I'm willing to increase the size of my vector data structure, if needed.
The simplest way is std::aligned_storage, which takes alignment as a second parameter.
If you don't have it yet, you might want to check Boost's version.
Then you can build your union:
union vector {
__m128 simd;
std::aligned_storage<16, 16> alignment_only;
}
Finally, if it does not work, you can always create your own little class:
template <typename Type, intptr_t Align> // Align must be a power of 2
class RawStorage
{
public:
Type* operator->() {
return reinterpret_cast<Type const*>(aligned());
}
Type const* operator->() const {
return reinterpret_cast<Type const*>(aligned());
}
Type& operator*() { return *(operator->()); }
Type const& operator*() const { return *(operator->()); }
private:
unsigned char* aligned() {
if (data & ~(Align-1) == data) { return data; }
return (data + Align) & ~(Align-1);
}
unsigned char data[sizeof(Type) + Align - 1];
};
It will allocate a bit more storage than necessary, but this way alignment is guaranteed.
int main(int argc, char* argv[])
{
RawStorage<__m128, 16> simd;
*simd = /* ... */;
return 0;
}
With luck, the compiler might be able to optimize away the pointer alignment stuff if it detects the alignment is necessary right.
A few weeks ago, I had re-written an old ray tracing assignment from my university days, updating it to run it on 64-bit linux and to make use of the SIMD instructions. (The old version incidentally ran under DOS on a 486, to give you an idea of when I last did anything with it).
There very well may be better ways of doing it, but here is what I did ...
typedef float v4f_t __attribute__((vector_size (16)));
class Vector {
...
union {
v4f_t simd;
float f[4];
} __attribute__ ((aligned (16)));
...
};
Disassembling my compiled binary showed that it was indeed making use of the movaps instruction.
Hope this helps.
Normally all you should need is:
union vector {
__m128 simd;
float raw[4];
};
i.e. no additional __attribute__ ((aligned (16))) required for the union itself.
This works as expected on pretty much every compiler I've ever used, with the notable exception of gcc 2.95.2 back in the day, which used to screw up stack alignment in some cases.
I use this union trick all the time with __m128 and it works with GCC on Mac and Visual C++ on Windows, so this must be a bug in the compiler that you use.
The other answers contain good workarounds though.
If you need an array of N of these objects, allocate vector raw[N+1], and use vector* const array = reinterpret_cast<vector*>(reinterpret_cast<intptr_t>(raw+1) & ~15) as the base address of your array. This will always be aligned.