Isn't __m128d aligned natively? - c++

I've this code:
double a[bufferSize];
double b[voiceSize][bufferSize];
double c[voiceSize][bufferSize];
...
inline void AddIntrinsics(int voiceIndex, int blockSize) {
// assuming blockSize / 2 == 0 and voiceIndex is within the range
int iters = blockSize / 2;
__m128d *pA = (__m128d*)a;
__m128d *pB = (__m128d*)b[voiceIndex];
double *pC = c[voiceIndex];
for (int i = 0; i < iters; i++, pA++, pB++, pC += 2) {
_mm_store_pd(pC, _mm_add_pd(*pA, *pB));
}
}
But "sometimes" it raise Access memory violation, which I think its due to the lacks of memory alignment of my 3 arrays a, b and c.
But since I operate on __m128d (which use __declspec(align(16))), isn't the alignment guaranteed when I cast to those pointer?
Or since it would use __m128d as "register", it could mov directly on register from an unaligned memory (hence, the exception)?
If so, how would you align arrays in C++ for this kind of stuff? std::align?
I'm on Win x64, MSVC, Compiling in Release mode 32 and 64 bit.

__m128d is a type that assumes / requires / guarantees (to the compiler) 16-byte alignment1.
Casting a misaligned pointer to __m128d* and dereferencing it is undefined behaviour, and this is the expected result. Use _mm_loadu_pd if your data might not be aligned. (Or preferably, align your data with alignas(16) double a[bufferSize]; 2). ISO C++11 and later have portable syntax for aligning static and automatic storage (but not as easy for dynamic storage).
Casting a pointer to __m128d* and dereferencing it is like promising the compiler that it is aligned. C++ lets you lie to the compiler, with potentially disastrous results. Doing an alignment-required operation doesn't retroactively align your data; that wouldn't make sense or even be possible when you compile multiple files separately or when you operate through pointers.
Footnote 1: Fun fact: GCC's implementation of Intel's intrinsics API adds a __m128d_u type: unaligned vectors that imply 1-byte alignment if you dereference a pointer.
typedef double __m128d_u
__attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
Don't use in portable code; I don't think MSVC supports this, and Intel doesn't define it.
Footnote 2: In your case, you also need every row of your 2D arrays to be aligned by 16. So you need the array dimension to be [voiceSize][round_up_to_next_power_of_2(bufferSize)] if bufferSize can be odd. Leaving unused padding element(s) at the end of every row is a common technique, e.g. in graphics programming for 2d images with potentially-odd widths.
BTW, this is not "special" or specific to intrinsics: casting a void* or char* to int* (and dereferencing it) is only safe if its sufficiently aligned. In x86-64 System V and Windows x64, alignof(int) = 4.
(Fun fact: even creating a misaligned pointer is undefined behaviour in ISO C++. But compilers that support Intel's intrinsics API must support stuff like _mm_loadu_si128( (__m128i*)char_ptr ), so we can consider creating without dereference of unaligned pointers as part of the extension.)
It usually happens to work on x86 because only 16-byte loads have an alignment-required version. But on SPARC for example, you'd potentially have the same problem. It is possible to run into trouble with misaligned pointers to int or short even on x86, though. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? is a good example: auto-vectorization by gcc assumes that some whole number of uint16_t elements will reach a 16-byte alignment boundary.
It's also easier to run into problems with intrinsics because alignof(__m128d) is greater than the alignment of most primitive types. On 32-bit x86 C++ implementations, alignof(maxalign_t) is only 8, so malloc and new typically only return 8-byte aligned memory.

Related

Test of 8 subsequent bytes isn't translated into a single compare instruction

Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
bool f1(const char *ptr)
{
for (int i = 0; i < 8; i++)
if (ptr[i])
return false;
return true;
}
bool f2(const char *ptr)
{
bool res = true;
for (int i = 0; i < 8; i++)
res &= (ptr[i] == 0);
return res;
}
bool f3(const char *ptr)
{
static const char tmp[8]{};
return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly outcome with enabled optimizations, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a winded or unwinded loop. Moreover, this holds for all GCC, Clang, and Intel compilers with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seem to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with additional unaligned load)? It seem to be a pretty straightforward optimization to me.
In f1 the loop stops when ptr[i] is true, so it is not always equivalent of to consider 8 elements as it is the case with the two other functions or directly comparing a 8 bytes word if the size of the array is less than 8 (the compiler does not know the size of the array) :
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2 I agree that can be replaced by a 8 bytes comparison under the condition the CPU allows to read a word of 8 bytes from any address alignment which is the case of the x64 but that can introduce unusual situation as explained in Unusual situations where this wouldn't be safe in x86 asm
First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as #bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2, thanks #Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
Or in GNU C++, use a typedef to express unaligned may-alias loads:
bool f4(const char *ptr) {
typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
auto val = *(const aliasing_unaligned_u64*)ptr;
return val != 0;
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
cmp QWORD PTR [rdi], 0
setne al
ret
Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk that contains at least 1 valid string byte doesn't cross a page boundary.) Also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long* instead:
#include <cstdint>
inline bool EightBytesNull(const char *ptr)
{
return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but will not on ARM, which requires strict integer memory alignment.

Are these integers misaligned, and should I even care?

I have some code that interprets multibyte width integers from an array of bytes at an arbitrary address.
std::vector<uint8> m_data; // filled with data
uint32 pos = 13; // position that might not be aligned well for 8 byte integers
...
uint64 * ptr = reinterpret_cast<uint64*>(m_data.data() + pos);
*ptr = swap64(*ptr); // (swaps endianness)
Would alignment be an issue for this code? And if it is, is it a severe issue, or one that can safely be ignored because the penalty is trivial?
Use memcpy instead:
uint64_t x;
memcpy(&x, m_data.data()+pos, sizeof(uint64_t));
x = swap(x);
memcpy(m_data.data()+pos, &x, sizeof(uint64_t));
It has two benefits:
you avoid strict aliasing violation (caused by reading uint8_t buffer as uint64_t)
you don't have to worry about misalignment at all (you do need to care about misalignment, because even on x86, it can crash if the compiler autovectorizes your code)
Current compilers are good enough to do the right thing (i.e., your code will not be slow, memcpy is recognized, and will be handled well).
Some architectures require the read to be aligned to work. They throw a processor signal if the alignment is incorrect.
Depending on the platform it can
Crash the program
Cause a re-run with an unaligned read. (Performance hit)
Just work correctly
Performing a performance measure is a good start, and checking the OS specifications for your target platform would be prudent.

alignment requirements when storing the result of SSE operations

Consider a code fragment using Intel SSE intrinsics like this:
void foo(double* in1ptr, double* in2ptr)
{
double result[8];
/* .. stuff .. */
__m128d in1 = _mm_loadu_pd(in1ptr);
__m128d in2 = _mm_loadu_pd(in2ptr);
__m128d* resptr = (__m128d*)(&result[4]); <----------
*resptr = __mm_add_pd(in1,in2);
/* .. stuff .. */
}
In the indicated line - when declaring resptr to point to the location at index 4 inside result array -
1) This works in gcc, but is this the correct way of doing things?
2) What are the alignment expectations here, can I create the resptr pointer to point to any arbitrary memory location and subsequently store the result of a SSE operation at that memory location?
load/store intrinsics exist to communicate alignment guarantees or lack thereof to the compiler. If your data is 16B-aligned or 32B-aligned, you don't need them.
Just casting to (__m128d*) follows the usual C semantics of implying that the __m128d has sufficient alignment. (Compilers use movapd rather than movupd, and will fault at run-time if the address isn't aligned).
In this case, you didn't do anything to ensure alignment. It's just by luck that your local array is 16B-aligned. If you use alignas(16) double result[8];, that code will be safe.
For unaligned stores, use _mm_storeu_pd. See also the x86 tag wiki.

C++11 : Does new return contiguous memory?

float* tempBuf = new float[maxVoices]();
Will the above result in
1) memory that is 16-byte aligned?
2) memory that is confirmed to be contiguous?
What I want is the following:
float tempBuf[maxVoices] __attribute__ ((aligned));
but as heap memory, that will be effective for Apple Accelerate framework.
Thanks.
The memory will be aligned for float, but not necessarily for CPU-specific SIMD instructions. I strongly suspect on your system sizeof(float) < 16, though, which means it's not as aligned as you want. The memory will be contiguous: &A[i] == &A[0] + i.
If you need something more specific, new std::aligned_storage<Length, Alignment> will return a suitable region of memory, presuming of course that you did in fact pass a more specific alignment.
Another alternative is struct FourFloats alignas(16) {float[4] floats;}; - this may map more naturally to the framework. You'd now need to do new FourFloats[(maxVoices+3)/4].
Yes, new returns contiguous memory.
As for alignment, no such alignment guarantee is provided. Try this:
template<class T, size_t A>
T* over_aligned(size_t N){
static_assert(A <= alignof(std::max_align_t),
"Over-alignment is implementation-defined."
);
static_assert( std::is_trivially_destructible<T>{},
"Function does not store number of elements to destroy"
);
using Helper=std::aligned_storage_t<sizeof(T), A>;
auto* ptr = new Helper[(N+sizeof(Helper)-1)/sizeof(Helper)];
return new(ptr) T[N];
}
Use:
float* f = over_aligned<float,16>(37);
Makes an array of 37 floats, with the buffer aligned to 16 bytes. Or it fails to compile.
If the assert fails, it can still work. Test and consult your compiler documentation. Once convinced, put compiler-specific version guards around the static assert, so when you change compilers you can test all over again (yay).
If you want real portability, you have to have a fall back to std::align and manage resources separately from data pointers and count the number of T if and only if T has a non-trivial destructor, then store the number of T "before" the start of your buffer. It gets pretty silly.
It's guaranteed to be aligned properly with respect to the type you're allocating. So if it's an array of 4 floats (each supposedly 4 bytes), it's guaranteed to provide a usable sequence of floats. It's not guaranteed to be aligned to 16 bytes.
Yes, it's guaranteed to be contiguous (otherwise what would be the meaning of a single pointer?)
If you want it to be aligned to some K bytes, you can do it manually with std::align. See MSalter's answer for a more efficient way of doing this.
If tempBuf is not nullptr then the C++ standard guarantees that tempBuf points to the zeroth element of least maxVoices contiguous floats.
(Don't forget to call delete[] tempBuf once you're done with it.)

Do modern c++ compilers autovectorize code for 24bit image processing?

Do compilers like gcc, visual studio c++, the intel c++ compiler, clang, etc. vectorize code like the following?
std::vector<unsigned char> img( height * width * 3 );
unsigned char channelMultiplier[3];
// ... initialize img and channelMultiplier ...
for ( int y = 0; y < height; ++y )
for ( int x = 0; x < width; ++x )
for ( b = 0; b < 3; ++b )
img[ b+3*(x+width*y) ] = img[ b+3*(x+width*y) ] *
channelMultiplier[b] / 0x100;
How about the same for 32 bit image processing?
I do not think your tripple loop will auto-vectorize. IMO the problems are:
Memory is accessed through an object type std::vector. AFAIK I don't think any compiler will auto-vectorize std::vector code unless the access operators [] or () are inlined but still, it is not clear to me that it will be auto-vectorized.
Your code suffers from memory aliasing, i.e. the compiler doesn't know if the memory you refer to img is accessed from another memory pointer and this will most likely block the vectorization. Basically you need to define a plain double array and hint the compiler that no other pointer is referring to that same location. I think you can do that using __restrict. __restrict tells the compiler that this pointer is the only pointer pointing to that memory location and that there are no other pointers, and thus there is no risk of side effects.
The memory is not aligned by default and even if the compiler manages to auto-vectorize, the vectorization of unaligned memory is a lot slower than that of aligned memory. You need to ensure your memory is 32 memory bit address aligned to exploit auto-vectorization and AVX to the maximum and 16 bit address aligned to exploit SSE to the maximum i.e. always align to 32 memory bit address. This you can do dynamically via:
double* buffer = NULL;
posix_memalign((void**) &buffer, 32, size*sizeof(double));
...
free(buffer);
in MSVC you can do this with __declspec(align(32)) double array[size] but you have to check with the specific compiler you are using to make sure you are using the correct alignment directives.
Another important thing, if you use GNU compiler use the flag -ftree-vectorizer-verbose=6 to check whether your loop is being auto-vectorized. If you use the Intel compiler then use -vec-report5. Note that there are several levels of verbosity and information output i.e. the 6 and 5 numbers so checkout the compiler documentation. The higher the verbosity level the more vectorization information you will get for every loop in your code but the slower the compiler will compile in Release mode.
In general, I have been always surprised how NOT easy is to get the compiler to auto-vectorize, it is a common mistake to assume that because a loop looks canonical then the compiler will auto-vectorize it auto-magically.
UPDATE: and one more thing, make sure your img is actually page-aligned posix_memalign((void**) &buffer, sysconf(_SC_PAGESIZE), size*sizeof(double)); (which implies AVX and SSE aligned). The problem is that if you have a big image, this loop will most likely end-up page-switching during execution and that's also very expensive. I think this is what is so-called TLB misses.