When I try to load data with AVX, I get a runtime error: segmentation fault.
#include <immintrin.h>

int i = 0;
const int sz = 9;
size_t *src1 = (size_t *)_mm_malloc(sz * sizeof(size_t), 32);
size_t *src2 = (size_t *)_mm_malloc(sz * sizeof(size_t), 32);
size_t *dst  = (size_t *)_mm_malloc(sz * sizeof(size_t), 32);
__m256i buffer  = _mm256_load_si256((const __m256i *)&src1[i]);
__m256i buffer2 = _mm256_load_si256((const __m256i *)&src2[i+1]); // segmentation fault on this line
// Something...
_mm256_store_si256((__m256i *)&dst[i], buffer);
_mm_free(src1);
_mm_free(src2);
_mm_free(dst);
I solved the problem by using the _mm256_loadu_si256 intrinsic instead. Does anyone know why this happens?
The _mm*_load_* intrinsics work only with aligned data, whereas the _mm*_loadu_* intrinsics allow you to work with unaligned data (at a performance penalty).
The segmentation fault is telling you that the values you're trying to load from memory into the AVX register are not aligned on the proper boundary. For the 256-bit version, the values must be aligned on a 32-byte boundary.
If you don't want to pay the performance penalty of loading unaligned values, then you need to make sure that the values are properly aligned on 32-byte boundaries. You can do this either by inserting padding or using an annotation that forces alignment. The annotations are compiler-specific—on GCC, you would use something like __attribute__((aligned(32))), whereas on MSVC, you'd use something like __declspec(align(32)).
The problem here, though, is that your array indexing on the second load forces a load from an unaligned memory location: src2 itself is 32-byte aligned, but &src2[i+1] is offset from it by a single 8-byte size_t, so it can never sit on a 32-byte boundary. That can't be solved by an attribute/annotation. Since a 256-bit load consumes four size_t elements at a time, you either have to pad the data so that every load starts at a multiple of 4 elements, or fall back to unaligned loads.
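To make that concrete, here is a minimal sketch of both options, reusing the buffers from the question (the offsets are mine, for illustration):
__m256i ok_unaligned = _mm256_loadu_si256((const __m256i *)&src2[i + 1]); // any address is fine
__m256i ok_aligned   = _mm256_load_si256((const __m256i *)&src2[i + 4]);  // 4 * sizeof(size_t) = 32 bytes past an aligned base, still aligned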
Related
I have this code:
double a[bufferSize];
double b[voiceSize][bufferSize];
double c[voiceSize][bufferSize];
...
inline void AddIntrinsics(int voiceIndex, int blockSize) {
    // assuming blockSize % 2 == 0 and voiceIndex is within range
    int iters = blockSize / 2;
    __m128d *pA = (__m128d*)a;
    __m128d *pB = (__m128d*)b[voiceIndex];
    double *pC = c[voiceIndex];
    for (int i = 0; i < iters; i++, pA++, pB++, pC += 2) {
        _mm_store_pd(pC, _mm_add_pd(*pA, *pB));
    }
}
But "sometimes" it raises an access violation, which I think is due to the lack of memory alignment of my 3 arrays a, b and c.
But since I operate on __m128d (which uses __declspec(align(16))), isn't the alignment guaranteed when I cast to those pointers?
Or, since it would use __m128d as a "register", could it mov directly into a register from unaligned memory (hence the exception)?
If so, how would you align arrays in C++ for this kind of thing? std::align?
I'm on Win x64, MSVC, compiling in Release mode, 32- and 64-bit.
__m128d is a type that assumes / requires / guarantees (to the compiler) 16-byte alignment¹.
Casting a misaligned pointer to __m128d* and dereferencing it is undefined behaviour, and this is the expected result. Use _mm_loadu_pd if your data might not be aligned. (Or, preferably, align your data with alignas(16) double a[bufferSize];².) ISO C++11 and later have portable syntax for aligning static and automatic storage (though it's not as easy for dynamic storage).
Casting a pointer to __m128d* and dereferencing it is like promising the compiler that it is aligned. C++ lets you lie to the compiler, with potentially disastrous results. Doing an alignment-required operation doesn't retroactively align your data; that wouldn't make sense or even be possible when you compile multiple files separately or when you operate through pointers.
Footnote 1: Fun fact: GCC's implementation of Intel's intrinsics API adds a __m128d_u type: unaligned vectors that imply 1-byte alignment if you dereference a pointer.
typedef double __m128d_u
__attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
Don't use it in portable code; I don't think MSVC supports it, and Intel doesn't define it.
Footnote 2: In your case, you also need every row of your 2D arrays to be aligned by 16, so the array dimension needs to be [voiceSize][round_up_to_next_multiple_of_2(bufferSize)] if bufferSize can be odd (with 8-byte doubles, a 16-byte boundary recurs every 2 elements). Leaving unused padding element(s) at the end of every row is a common technique, e.g. in graphics programming for 2D images with potentially odd widths.
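A minimal sketch of that padding, assuming voiceSize and bufferSize are compile-time constants:
constexpr int paddedBufferSize = (bufferSize + 1) & ~1; // round up to an even element count
alignas(16) double a[paddedBufferSize];
alignas(16) double b[voiceSize][paddedBufferSize];
alignas(16) double c[voiceSize][paddedBufferSize];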
BTW, this is not "special" or specific to intrinsics: casting a void* or char* to int* (and dereferencing it) is only safe if it's sufficiently aligned. In x86-64 System V and Windows x64, alignof(int) = 4.
(Fun fact: even creating a misaligned pointer is undefined behaviour in ISO C++. But compilers that support Intel's intrinsics API must support stuff like _mm_loadu_si128( (__m128i*)char_ptr ), so we can consider creating without dereference of unaligned pointers as part of the extension.)
It usually happens to work on x86 because only 16-byte loads have an alignment-required version. But on SPARC for example, you'd potentially have the same problem. It is possible to run into trouble with misaligned pointers to int or short even on x86, though. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? is a good example: auto-vectorization by gcc assumes that some whole number of uint16_t elements will reach a 16-byte alignment boundary.
It's also easier to run into problems with intrinsics because alignof(__m128d) is greater than the alignment of most primitive types. On 32-bit x86 C++ implementations, alignof(max_align_t) is only 8, so malloc and new typically return only 8-byte-aligned memory.
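For dynamic storage, one portable option is C++17 std::aligned_alloc; a minimal sketch (note the requested size must be a multiple of the alignment):
#include <cstdlib>
// n * sizeof(double) must be a multiple of 16 here, i.e. n must be even
double *buf = static_cast<double*>(std::aligned_alloc(16, n * sizeof(double)));
// ... safe to use with _mm_load_pd / _mm_store_pd ...
std::free(buf);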
I have some code that interprets multi-byte-width integers from an array of bytes at an arbitrary address.
std::vector<uint8_t> m_data; // filled with data
uint32_t pos = 13; // a position that might not be suitably aligned for 8-byte integers
...
uint64_t *ptr = reinterpret_cast<uint64_t*>(m_data.data() + pos);
*ptr = swap64(*ptr); // (swaps endianness)
Would alignment be an issue for this code? And if it is, is it a severe issue, or one that can safely be ignored because the penalty is trivial?
Use memcpy instead:
uint64_t x;
memcpy(&x, m_data.data() + pos, sizeof(uint64_t));
x = swap64(x);
memcpy(m_data.data() + pos, &x, sizeof(uint64_t));
It has two benefits:
you avoid a strict-aliasing violation (caused by reading a uint8_t buffer as uint64_t)
you don't have to worry about misalignment at all (and misalignment does matter without memcpy: even on x86, an unaligned access can crash if the compiler auto-vectorizes your code)
Current compilers are good enough to do the right thing here: a fixed-size memcpy is recognized and compiled to a plain load and store, so your code will not be slow.
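If you do this in more than one place, a reusable sketch (the helper names are mine):
#include <cstdint>
#include <cstring>
inline uint64_t load_u64(const void *p) {
    uint64_t v;
    std::memcpy(&v, p, sizeof v); // compiles to a single 8-byte load on x86-64
    return v;
}
inline void store_u64(void *p, uint64_t v) {
    std::memcpy(p, &v, sizeof v); // likewise a single 8-byte store
}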
Some architectures require reads to be aligned, and raise a processor exception if the alignment is incorrect.
Depending on the platform, an unaligned access can:
Crash the program
Trap, with the OS or hardware fixing up the unaligned read (a performance hit)
Just work correctly
Performing a performance measurement is a good start, and checking the OS specifications for your target platform would be prudent.
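Before committing to either path, a quick alignment check is cheap; a sketch (the helper name is mine):
#include <cstdint>
inline bool is_aligned8(const void *p)
{
    // true if p sits on an 8-byte boundary
    return (reinterpret_cast<std::uintptr_t>(p) & 7u) == 0;
}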
I am modifying RNNLM, a neural net for studying language models. Given the size of my corpus it runs really slowly, so I tried to optimize the matrix*vector routine, which accounts for 63% of the total time on a small data set (I'd expect it to be worse on larger sets). Right now I am stuck on intrinsics.
// a, b are ints; from, to, from2, to2, matrix_width, srcvec, srcmatrix and dest are declared elsewhere
__m256 val, t1, t2, t3, t4;
for (b = 0; b < (to - from) / 8; b++)
{
    val = _mm256_setzero_ps();
    for (a = from2; a < to2; a++)
    {
        t1 = _mm256_set1_ps(srcvec.ac[a]);
        t2 = _mm256_load_ps(&(srcmatrix[a + (b*8 + from + 0) * matrix_width].weight));
        //val = _mm256_fmadd_ps(t1, t2, val);
        t3 = _mm256_mul_ps(t1, t2);
        val = _mm256_add_ps(val, t3);
    }
    t4 = _mm256_load_ps(&(dest.ac[b*8 + from + 0]));
    t4 = _mm256_add_ps(t4, val);
    _mm256_store_ps(&(dest.ac[b*8 + from + 0]), t4);
}
This example crashes on:
_mm256_store_ps (&(dest.ac[b*8+from+0]), t4);
However, if I change it to
_mm256_storeu_ps (&(dest.ac[b*8+from+0]), t4);
(with u for unaligned, I suppose), everything works as intended. My question is: why does the load work (when it supposedly shouldn't, if the data is unaligned) while the store doesn't, even though both operate on the same address?
dest.ac has been allocated using
void *_aligned_calloc(size_t nelem, size_t elsize, size_t alignment = 64)
{
    // Watch out for overflow
    size_t max_size = (size_t)-1;
    if (elsize == 0 || nelem >= max_size / elsize)
        return NULL;
    size_t size = nelem * elsize;
    void *memory = _mm_malloc(size + 64, alignment);
    if (memory != NULL)
        memset(memory, 0, size);
    return memory;
}
and it's at least 50 elements long.
(BTW, with VS2012 I got an illegal instruction on some seemingly random assignment, so I use Linux.)
TL;DR: in optimized code, loads will fold into memory operands for other operations, and those memory operands don't have alignment requirements in AVX. Stores won't fold.
Your sample code doesn't compile by itself, so I can't easily check what instruction _mm256_load_ps compiles to.
I tried a small experiment with gcc 4.9, and it doesn't generate a vmovaps at all for _mm256_load_ps, since I only used the result of the load as an input to one other instruction. It generates that instruction with a memory operand. AVX instructions have no alignment requirements for their memory operands. (There is a performance hit for crossing a cache line, and a bigger hit for crossing a page boundary, but your code still works.)
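For example, a minimal sketch of what I mean (the function is mine; assuming gcc or clang with -O2 -mavx):
#include <immintrin.h>
// The aligned load folds into the vaddps memory operand; VEX-encoded memory
// operands have no alignment requirement, so this never faults even if p is
// misaligned:
__m256 add_from_mem(const float *p, __m256 v)
{
    __m256 t = _mm256_load_ps(p); // no standalone vmovaps is emitted
    return _mm256_add_ps(t, v);   // compiles to vaddps ymm0, ymm0, [mem]
}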
The store, on the other hand, does generate a vmov... instruction. Since you used the alignment-required version, it faults on unaligned addresses. Simply use the unaligned version; it'll be just as fast when the address is aligned, and still work when it isn't.
I didn't check your code carefully to see whether all the accesses SHOULD be aligned. I assume not, from the way you phrased the question as just asking why you weren't also getting faults for unaligned loads. As I said, your code probably just didn't compile to any standalone vmovaps load instructions, so misaligned loads never had a chance to fault.
Are you running AVX (without AVX2 or FMA) on a Sandy Bridge/Ivy Bridge CPU? I assume that's why your FMA intrinsics are commented out.
Do compilers like gcc, Visual C++, the Intel C++ compiler, clang, etc. vectorize code like the following?
std::vector<unsigned char> img( height * width * 3 );
unsigned char channelMultiplier[3];
// ... initialize img and channelMultiplier ...
for ( int y = 0; y < height; ++y )
    for ( int x = 0; x < width; ++x )
        for ( int b = 0; b < 3; ++b )
            img[ b + 3*(x + width*y) ] = img[ b + 3*(x + width*y) ] *
                channelMultiplier[b] / 0x100;
How about the same for 32-bit image processing?
I do not think your triple loop will auto-vectorize. IMO the problems are:
Memory is accessed through an object of type std::vector. AFAIK no compiler will auto-vectorize std::vector code unless the access operator [] is inlined, and even then it is not clear to me that it will be auto-vectorized.
Your code suffers from memory aliasing, i.e. the compiler doesn't know whether the memory referred to by img is also accessed through another pointer, and this will most likely block vectorization. Basically you need to use a plain array and hint to the compiler that no other pointer refers to the same location. You can do that with __restrict, which tells the compiler that this pointer is the only one pointing to that memory, so there is no risk of side effects (see the sketch after this list).
The memory is not aligned by default, and even if the compiler manages to auto-vectorize, vectorization of unaligned memory is a lot slower than that of aligned memory. You need to ensure your memory is aligned to a 32-byte address to exploit auto-vectorization and AVX to the maximum, and to a 16-byte address for SSE; i.e. always align to a 32-byte address. Dynamically you can do this via:
double* buffer = NULL;
posix_memalign((void**) &buffer, 32, size*sizeof(double));
...
free(buffer);
In MSVC you can do this with __declspec(align(32)) double array[size], but check with the specific compiler you are using to make sure you are using the correct alignment directives.
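Putting the aliasing and layout points together, here is a hedged sketch of the loop rewritten over plain __restrict pointers (the function name and signature are mine):
void scaleChannels( unsigned char *__restrict img,
                    const unsigned char *__restrict channelMultiplier,
                    int height, int width )
{
    // one flat pass over all pixels: no std::vector accessors, no aliasing
    for ( int i = 0; i < height * width * 3; i += 3 )
        for ( int b = 0; b < 3; ++b )
            img[i + b] = (unsigned char)( img[i + b] *
                channelMultiplier[b] / 0x100 );
}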
Another important thing: if you use the GNU compiler, use the flag -ftree-vectorizer-verbose=6 to check whether your loop is being auto-vectorized; if you use the Intel compiler, use -vec-report5. Note that there are several levels of verbosity and information output (hence the 6 and the 5), so check the compiler documentation. The higher the verbosity level, the more vectorization information you will get for every loop in your code, but the slower the compiler will compile in Release mode.
In general, I have always been surprised by how NOT easy it is to get the compiler to auto-vectorize; it is a common mistake to assume that because a loop looks canonical, the compiler will auto-vectorize it auto-magically.
UPDATE: one more thing, make sure your img is actually page-aligned, posix_memalign((void**) &buffer, sysconf(_SC_PAGESIZE), size*sizeof(double)); (which implies AVX and SSE alignment). The problem is that if you have a big image, this loop will most likely end up switching pages during execution, and that is also very expensive; I think these are the so-called TLB misses.
What would be the most efficient way to read a UInt32 value from an arbitrary memory address in C++? (Assuming Windows x86 or Windows x64 architecture.)
For example, consider having a byte pointer that points somewhere in memory to a block that contains a combination of ints, string data, etc., all mixed together. The following sample shows reading the various fields from this block in a loop.
typedef unsigned char* BytePtr;
typedef unsigned int UInt32;
...
BytePtr pCurrent = ...;
while ( *pCurrent != 0 )
{
...
if ( *pCurrent == ... )
{
UInt32 nValue = *( (UInt32*) ( pCurrent + 1 ) ); // line A
...
}
pCurrent += ...;
}
If at line A pCurrent happens to contain a 4-byte-aligned address, reading the UInt32 should be a single memory read. If pCurrent contains a non-aligned address, more than one memory cycle may be needed, which slows the code down. Is there a faster way to read the value from non-aligned addresses?
I'd recommend memcpy into a temporary of type UInt32 within your loop.
This takes advantage of the fact that a four-byte memcpy will be inlined by the compiler when building with optimization enabled, and it has a few other benefits:
If you are on a platform where alignment matters (HP-UX, Solaris SPARC, ...), your code isn't going to trap.
On a platform where alignment matters, it may be worthwhile to check the address for alignment and then do either one regular aligned load or a set of four byte loads and bit-ors. Your compiler's memcpy very likely will do this the optimal way.
If you are on a platform where unaligned access is allowed and doesn't hurt performance (x86, x64, PowerPC, ...), you are pretty much guaranteed that such a memcpy is the cheapest way to do this access.
If your memory was initially a pointer to some other data structure, your code may be undefined because of aliasing problems, since you are casting to another type and dereferencing that cast. Run-time problems due to aliasing-related optimization issues are very hard to track down! Presuming you can figure them out, fixing them can also be very hard in established code, and you may have to use obscure compilation options like -fno-strict-aliasing or -qansialias, which can limit the compiler's optimization ability significantly.
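Concretely, at line A the memcpy version would be a one-liner; a sketch using the question's typedefs:
UInt32 nValue;
memcpy( &nValue, pCurrent + 1, sizeof(nValue) ); // inlined to a single 4-byte load on x86/x64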
Your code is undefined behaviour.
Pretty much the only "correct" solution is to only read something as a type T if it is a type T, as follows:
#include <algorithm>
#include <cstdint>
#include <iostream>

uint32_t n;
char * p = point_me_to_random_memory();
std::copy(p, p + sizeof(n), reinterpret_cast<char*>(&n));
std::cout << "The value is: " << n << std::endl;
In this example, you want to read an integer, and the only way to do that correctly is to have an integer. If you want it to hold a certain binary representation, you copy that data into the bytes of the variable, starting at its beginning.
Let the compiler do the optimizing!
UInt32 ReadU32(const unsigned char *ptr)
{
    // assemble the value byte by byte (little-endian layout in memory);
    // works at any address regardless of alignment
    return static_cast<UInt32>(ptr[0]) |
           (static_cast<UInt32>(ptr[1]) << 8) |
           (static_cast<UInt32>(ptr[2]) << 16) |
           (static_cast<UInt32>(ptr[3]) << 24);
}
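For example, at line A from the question this becomes:
UInt32 nValue = ReadU32( pCurrent + 1 ); // alignment-safe, with the byte order made explicit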