Is _mm_load_ps a requirement for 128bit aligned structure? [duplicate] - c++

Is it safe/possible/advisable to cast floats directly to __m128 if they are 16 byte aligned?
I noticed using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds a significant overhead.
What are potential pitfalls I should be aware of?
EDIT :
There is actually no overhead in using the load and store instructions; I had some numbers mixed up, which is why I thought I was getting better performance. Even though I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably falling back to some fail-safe code path.

What makes you think that _mm_load_ps and _mm_store_ps "add a significant overhead"? This is the normal way to load/store float data to/from SSE registers, assuming the source/destination is memory (and any other method eventually boils down to this anyway).

There are several ways to put float values into SSE registers; the following intrinsics can be used:
__m128 sseval;
float a, b, c, d;

sseval = _mm_set_ps(a, b, c, d);  // vector [ d, c, b, a ] in memory order; "a" lands in the highest element
sseval = _mm_setr_ps(a, b, c, d); // reversed argument order: vector [ a, b, c, d ] in memory order
sseval = _mm_load_ps(&a);         // ill-specified here - "a" is not a float[4] ...
                                  // same as _mm_setr_ps(a[0], a[1], a[2], a[3])
                                  // if you have an actual array
sseval = _mm_set1_ps(a);          // make vector [ a, a, a, a ]
sseval = _mm_load1_ps(&a);        // load from &a, replicate - same as previous
sseval = _mm_set_ss(a);           // make vector [ a, 0, 0, 0 ] ("a" in the lowest element)
sseval = _mm_load_ss(&a);         // load from &a, zero others - same as previous
The compiler will often create the same instructions no matter whether you state _mm_set_ss(val) or _mm_load_ss(&val) - try it and disassemble your code.
It can, in some cases, be advantageous to write _mm_set_ss(*valptr) instead of _mm_load_ss(valptr) ... depends on (the structure of) your code.
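For instance, the equivalence of the last pair can be checked directly (a minimal sketch; valptr is a hypothetical const float*):

__m128 v1 = _mm_load_ss(valptr);  // movss xmm, [mem]
__m128 v2 = _mm_set_ss(*valptr);  // usually compiles to the identical movss

Disassembling both, as suggested above, shows whether your compiler treats them the same in your context.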

Going by http://msdn.microsoft.com/en-us/library/ayeb3ayc.aspx, it's possible but not safe or recommended.
You should not access the __m128 fields directly.
And here's the reason why:
http://social.msdn.microsoft.com/Forums/en-US/vclanguage/thread/766c8ddc-2e83-46f0-b5a1-31acbb6ac2c5/
Casting float* to __m128 will not work. The C++ compiler converts an assignment to a __m128 type into an SSE instruction loading 4 float numbers into an SSE register. Even if such a cast compiled, it wouldn't produce working code, because no SSE loading instruction would be generated.
A __m128 variable is not actually a variable or an array. It is a placeholder for an SSE register, which the C++ compiler replaces with SSE assembly instructions. To understand this better, read the Intel Assembly Programming Reference.

A few years have passed since the question was asked. To answer it, my experience shows:
YES
reinterpret_casting a float* to a __m128* (and vice versa) is fine as long as that float* is 16-byte aligned - for example (in MSVC 2012):
__declspec( align( 16 ) ) float f[4];
return _mm_mul_ps( _mm_set_ps1( 1.f ), *reinterpret_cast<__m128*>( f ) );
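For reference, the same thing can be written with portable C++11 alignment syntax instead of the MSVC-specific __declspec (a sketch under the same assumptions):

alignas( 16 ) float f[4];
return _mm_mul_ps( _mm_set_ps1( 1.f ), *reinterpret_cast<__m128*>( f ) );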

The obvious issue I can see is that you're then aliasing (referring to a memory location through more than one pointer type), which can confuse the optimiser. The typical issue with aliasing is that, since the optimiser doesn't observe you modifying the memory location through the original pointer, it considers the location unchanged.
Since you're obviously not using the optimiser to its full extent (or you'd be willing to rely on it to emit the correct SSE instructions) you'll probably be OK.
The problem with using the intrinsics yourself is that they're designed to operate on SSE registers, and can't use the instruction variants that load from a memory location and process it in a single instruction.

Related

Why can't GCC generate an optimal operator== for a struct of two int32s?

A colleague showed me code that I thought wouldn't be necessary, but sure enough, it was. I would expect most compilers would see all three of these attempts at equality tests as equivalent:
#include <cstdint>
#include <cstring>
struct Point {
    std::int32_t x, y;
};

[[nodiscard]]
bool naiveEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

[[nodiscard]]
bool optimizedEqual(const Point &a, const Point &b) {
    // Why can't the compiler produce the same assembly in naiveEqual as it does here?
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai));
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;
}

[[nodiscard]]
bool optimizedEqual2(const Point &a, const Point &b) {
    return std::memcmp(&a, &b, sizeof(a)) == 0;
}

[[nodiscard]]
bool naiveEqual1(const Point &a, const Point &b) {
    // Let's try avoiding any jumps by using bitwise and:
    return (a.x == b.x) & (a.y == b.y);
}
But to my surprise, only the ones with memcpy or memcmp get turned into a single 64-bit compare by GCC. Why? (https://godbolt.org/z/aP1ocs)
Isn't it obvious to the optimizer that checking equality on two contiguous four-byte pairs is the same as comparing all eight bytes?
An attempt to avoid separately booleanizing the two parts compiles somewhat more efficiently (one fewer instruction and no false dependency on EDX), but still performs two separate 32-bit operations.
bool bithackEqual(const Point &a, const Point &b) {
    // a^b == 0 only if they're equal
    return ((a.x ^ b.x) | (a.y ^ b.y)) == 0;
}
GCC and Clang both have the same missed optimizations when passing the structs by value (so a is in RDI and b is in RSI because that's how x86-64 System V's calling convention packs structs into registers): https://godbolt.org/z/v88a6s. The memcpy / memcmp versions both compile to cmp rdi, rsi / sete al, but the others do separate 32-bit operations.
struct alignas(uint64_t) Point surprisingly still helps in the by-value case where arguments are in registers, optimizing both naiveEqual versions for GCC, but not the bithack XOR/OR. (https://godbolt.org/z/ofGa1f). Does this give us any hints about GCC's internals? Clang isn't helped by alignment.
If you "fix" the alignment, all give the same assembly language output (with GCC):
struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};
Demo
As a note, one of the correct/legal ways to do things like type punning is to use memcpy, so it seems logical for compilers to have specific (or more aggressive) optimizations around that function.
There's a performance cliff you risk falling off of when implementing this as a single 64-bit comparison:
You break store to load forwarding.
If the 32-bit numbers in the structs are written to memory by separate store instructions, and then loaded back with a 64-bit load soon afterwards (before the stores hit L1$), your execution will stall until the stores commit to the globally visible, cache-coherent L1$. If the loads are 32-bit loads that match the previous 32-bit stores, modern CPUs avoid the store-load stall by forwarding the stored value to the load instruction before the store reaches cache. This violates sequential consistency if multiple CPUs access the memory (a CPU sees its own stores in a different order than other CPUs do), but is allowed by most modern CPU architectures, even x86. The forwarding also allows much more code to be executed completely speculatively: if the execution has to be rolled back, no other CPU can have seen the store, so the code that used the loaded value on this CPU was safe to execute speculatively.
If you want this to use 64-bit operations and you don't want this perf cliff, you may want to ensure the struct is also always written as a single 64-bit number.
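A minimal sketch of that idea (storeAsU64 is an illustrative helper, not from the original code): type-punning through memcpy lets the compiler emit one 64-bit store instead of two 32-bit ones, so a later 64-bit load can be store-forwarded.

#include <cstdint>
#include <cstring>

struct Point { std::int32_t x, y; };

void storeAsU64(Point *dst, std::int32_t x, std::int32_t y) {
    Point tmp{x, y};
    std::uint64_t bits;
    std::memcpy(&bits, &tmp, sizeof bits); // well-defined type pun
    std::memcpy(dst, &bits, sizeof bits);  // compilers typically emit a single 64-bit mov
}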
Why can't the compiler generate [same assembly as memcpy version]?
The compiler "could" in the sense that it would be allowed to.
The compiler simply doesn't. Why it doesn't is beyond my knowledge, as that requires deep knowledge of how the optimiser has been implemented. But the answer may range from "there is no logic covering such a transformation" to "the rules aren't tuned to assume one output is faster than the other on all target CPUs".
If you use Clang instead of GCC, you'll notice that it produces the same output for naiveEqual and naiveEqual1, and that the assembly has no jump. It is the same as the "optimised" version except for using two 32-bit instructions in place of one 64-bit instruction. Furthermore, restricting the alignment of Point as shown in Jarod42's answer has no effect on the optimiser.
MSVC behaves like Clang in the sense that it is unaffected by the alignment, but differently in the sense that it doesn't get rid of the jump in naiveEqual.
For what it's worth, the compilers (I checked GCC and Clang) produce essentially the same output for the C++20 defaulted comparison as they do for naiveEqual. For whatever reason, GCC opted to use jne instead of je for the jump.
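For reference, the defaulted comparison mentioned above is just (assuming C++20):

struct Point {
    std::int32_t x, y;
    bool operator==(const Point &) const = default;
};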
is this a missing compiler optimization
With the assumption that one is always faster than the other on the target CPUs, that would be a fair conclusion.

Test of 8 subsequent bytes isn't translated into a single compare instruction

Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
bool f1(const char *ptr)
{
    for (int i = 0; i < 8; i++)
        if (ptr[i])
            return false;
    return true;
}

bool f2(const char *ptr)
{
    bool res = true;
    for (int i = 0; i < 8; i++)
        res &= (ptr[i] == 0);
    return res;
}

bool f3(const char *ptr)
{
    static const char tmp[8]{};
    return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly outcome with optimizations enabled, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a rolled or an unrolled loop. Moreover, this holds for GCC, Clang, and the Intel compiler alike with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seems to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with an additional unaligned load)? It seems to be a pretty straightforward optimization to me.
In f1 the loop stops as soon as ptr[i] is non-zero, so it is not always equivalent to considering all 8 elements, as the two other functions do (or as directly comparing an 8-byte word would), when the size of the array is less than 8 (the compiler does not know the size of the array):
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2, I agree it can be replaced by an 8-byte comparison, on the condition that the CPU allows reading an 8-byte word from any address alignment, which is the case on x64; but that can introduce unusual situations, as explained in Unusual situations where this wouldn't be safe in x86 asm.
First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as #bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if you made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2 could have been optimized the same way as f3, to a qword cmp, without overcoming that major compiler-internals limitation, because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2; thanks @Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need to trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
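A minimal sketch of that memcpy approach (f5 is a hypothetical name following the f1..f4 scheme):

#include <cstdint>
#include <cstring>

bool f5(const char *ptr) {
    uint64_t val;
    std::memcpy(&val, ptr, sizeof(val)); // aliasing-safe unaligned load; compiles to a single 8-byte mov
    return val == 0;
}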
Or in GNU C++, use a typedef to express unaligned may-alias loads:
bool f4(const char *ptr) {
    typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val == 0;   // true if all 8 bytes are zero, matching f1..f3
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
    cmp QWORD PTR [rdi], 0
    sete al
    ret
Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case, because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk containing at least 1 valid string byte doesn't cross a page boundary).
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long* instead:
#include <cstdint>

inline bool EightBytesNull(const char *ptr)
{
    return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but will not on ARM, which requires strict integer memory alignment.

Isn't __m128d aligned natively?

I've this code:
double a[bufferSize];
double b[voiceSize][bufferSize];
double c[voiceSize][bufferSize];
...
inline void AddIntrinsics(int voiceIndex, int blockSize) {
    // assuming blockSize % 2 == 0 and voiceIndex is within range
    int iters = blockSize / 2;
    __m128d *pA = (__m128d*)a;
    __m128d *pB = (__m128d*)b[voiceIndex];
    double *pC = c[voiceIndex];
    for (int i = 0; i < iters; i++, pA++, pB++, pC += 2) {
        _mm_store_pd(pC, _mm_add_pd(*pA, *pB));
    }
}
But "sometimes" it raise Access memory violation, which I think its due to the lacks of memory alignment of my 3 arrays a, b and c.
But since I operate on __m128d (which use __declspec(align(16))), isn't the alignment guaranteed when I cast to those pointer?
Or since it would use __m128d as "register", it could mov directly on register from an unaligned memory (hence, the exception)?
If so, how would you align arrays in C++ for this kind of stuff? std::align?
I'm on Win x64, MSVC, Compiling in Release mode 32 and 64 bit.
__m128d is a type that assumes / requires / guarantees (to the compiler) 16-byte alignment (footnote 1).
Casting a misaligned pointer to __m128d* and dereferencing it is undefined behaviour, and this is the expected result. Use _mm_loadu_pd if your data might not be aligned. (Or preferably, align your data with alignas(16) double a[bufferSize]; see footnote 2.) ISO C++11 and later have portable syntax for aligning static and automatic storage (but not as easy for dynamic storage).
Casting a pointer to __m128d* and dereferencing it is like promising the compiler that it is aligned. C++ lets you lie to the compiler, with potentially disastrous results. Doing an alignment-required operation doesn't retroactively align your data; that wouldn't make sense or even be possible when you compile multiple files separately or when you operate through pointers.
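For illustration, here is the question's loop rewritten with explicit unaligned loads and stores (a sketch using the question's globals a, b and c; correct however those arrays happen to be aligned):

#include <emmintrin.h>

inline void AddIntrinsics(int voiceIndex, int blockSize) {
    int iters = blockSize / 2;
    const double *pA = a;
    const double *pB = b[voiceIndex];
    double *pC = c[voiceIndex];
    for (int i = 0; i < iters; i++, pA += 2, pB += 2, pC += 2) {
        // unaligned load / add / unaligned store, no alignment assumption
        _mm_storeu_pd(pC, _mm_add_pd(_mm_loadu_pd(pA), _mm_loadu_pd(pB)));
    }
}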
Footnote 1: Fun fact: GCC's implementation of Intel's intrinsics API adds a __m128d_u type: unaligned vectors that imply 1-byte alignment if you dereference a pointer.
typedef double __m128d_u
__attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
Don't use in portable code; I don't think MSVC supports this, and Intel doesn't define it.
Footnote 2: In your case, you also need every row of your 2D arrays to be aligned by 16. So you need the array dimension to be [voiceSize][round_up_to_multiple_of_2(bufferSize)] if bufferSize can be odd: each row needs an even number of doubles so the next row also starts on a 16-byte boundary. Leaving unused padding element(s) at the end of every row is a common technique, e.g. in graphics programming for 2D images with potentially-odd widths.
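A sketch of that padding idea (roundedBufferSize is a hypothetical name; bufferSize and voiceSize are assumed to be compile-time constants as in the question):

constexpr int roundedBufferSize = (bufferSize + 1) & ~1; // round up to a multiple of 2
alignas(16) double a[roundedBufferSize];
alignas(16) double b[voiceSize][roundedBufferSize]; // every row starts 16-byte aligned
alignas(16) double c[voiceSize][roundedBufferSize];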
BTW, this is not "special" or specific to intrinsics: casting a void* or char* to int* (and dereferencing it) is only safe if it's sufficiently aligned. In x86-64 System V and Windows x64, alignof(int) == 4.
(Fun fact: even creating a misaligned pointer is undefined behaviour in ISO C++. But compilers that support Intel's intrinsics API must support stuff like _mm_loadu_si128( (__m128i*)char_ptr ), so we can consider creating without dereference of unaligned pointers as part of the extension.)
It usually happens to work on x86 because only 16-byte loads have an alignment-required version. But on SPARC for example, you'd potentially have the same problem. It is possible to run into trouble with misaligned pointers to int or short even on x86, though. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? is a good example: auto-vectorization by gcc assumes that some whole number of uint16_t elements will reach a 16-byte alignment boundary.
It's also easier to run into problems with intrinsics because alignof(__m128d) is greater than the alignment of most primitive types. On 32-bit x86 C++ implementations, alignof(maxalign_t) is only 8, so malloc and new typically only return 8-byte aligned memory.
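For dynamic storage, one portable option is C++17's std::aligned_alloc (a sketch; note MSVC doesn't ship std::aligned_alloc and offers _aligned_malloc instead, and the intrinsics API provides _mm_malloc):

#include <cstdlib>

// n must be even here: aligned_alloc requires the size to be a multiple of the alignment
double *p = static_cast<double*>(std::aligned_alloc(16, n * sizeof(double)));
// ... use p with aligned loads/stores ...
std::free(p);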

alignment requirements when storing the result of SSE operations

Consider a code fragment using Intel SSE intrinsics like this:
void foo(double* in1ptr, double* in2ptr)
{
    double result[8];
    /* .. stuff .. */
    __m128d in1 = _mm_loadu_pd(in1ptr);
    __m128d in2 = _mm_loadu_pd(in2ptr);
    __m128d* resptr = (__m128d*)(&result[4]);   <----------
    *resptr = _mm_add_pd(in1, in2);
    /* .. stuff .. */
}
In the indicated line - when declaring resptr to point to the location at index 4 inside result array -
1) This works in gcc, but is this the correct way of doing things?
2) What are the alignment expectations here? Can I make resptr point to any arbitrary memory location and subsequently store the result of an SSE operation at that memory location?
load/store intrinsics exist to communicate alignment guarantees or lack thereof to the compiler. If your data is 16B-aligned or 32B-aligned, you don't need them.
Just casting to (__m128d*) follows the usual C semantics of implying that the __m128d has sufficient alignment. (Compilers use movapd rather than movupd, and will fault at run-time if the address isn't aligned).
In this case, you didn't do anything to ensure alignment. It's just by luck that your local array is 16B-aligned. If you use alignas(16) double result[8];, that code will be safe.
For unaligned stores, use _mm_storeu_pd. See also the x86 tag wiki.
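For instance, the indicated line could be replaced with an unaligned store that makes no assumption about where the destination points (a minimal sketch of the same operation):

_mm_storeu_pd(&result[4], _mm_add_pd(in1, in2)); // safe for any alignment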

Memory Access Violations When Using SSE Operations

I've been trying to re-implement some existing vector and matrix classes to use SSE3 commands, and I seem to be running into these "memory access violation" errors whenever I perform a series of operations on an array of vectors. I'm relatively new to SSE, so I've been starting off simple. Here's the entirety of my vector class:
class SSEVector3D
{
public:
    SSEVector3D();
    SSEVector3D(float x, float y, float z);

    SSEVector3D& operator+=(const SSEVector3D& rhs); //< Elementwise Addition

    float x() const;
    float y() const;
    float z() const;

private:
    float m_coords[3] __attribute__ ((aligned (16))); //< The x, y and z coordinates
};
So, not a whole lot going on yet, just some constructors, accessors, and one operation. Using my (admittedly limited) knowledge of SSE, I implemented the addition operation as follows:
SSEVector3D& SSEVector3D::operator+=(const SSEVector3D& rhs)
{
    __m128 * pLhs = (__m128 *) m_coords;
    __m128 * pRhs = (__m128 *) rhs.m_coords;
    *pLhs = _mm_add_ps(*pLhs, *pRhs);
    return (*this);
}
To speed-test my new vector class against the old one (to see if it's worth re-implementing the whole thing), I created a simple program that generates a random array of SSEVector3D objects and adds them together. Nothing too complicated:
SSEVector3D sseSum(0, 0, 0);

for(i = 0; i < sseVectors.size(); i++)
{
    sseSum += sseVectors[i];
}

printf("Total: %f %f %f\n", sseSum.x(), sseSum.y(), sseSum.z());
The sseVectors variable is an std::vector containing elements of type SSEVector3D, whose components are all initialized to random numbers between -1 and 1.
Here's the issue I'm having. If the size of sseVectors is 8,191 or less (a number I arrived at through a lot of trial and error), this runs fine. If the size is 8,192 or more, I get this error when I try to run it:
signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
However, if I comment out that print statement at the end, I get no error even if sseVectors has a size of 8,192 or more.
Is there something wrong with the way I've written this vector class? I'm running Ubuntu 12.04.1 with GCC version 4.6.
First, and foremost, don't do this
__m128 * pLhs = (__m128 *) m_coords;
__m128 * pRhs = (__m128 *) rhs.m_coords;
*pLhs = _mm_add_ps(*pLhs, *pRhs);
With SSE, always do your loads and stores explicitly via the appropriate intrinsics, never by just dereferencing. Instead of storing an array of 3 floats in your class, store a value of type __m128. That should make the compiler align instances of your class correctly, without any need for align attributes.
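A sketch of what that suggestion might look like (member name and accessors are illustrative, not from the original class):

#include <xmmintrin.h>

class SSEVector3D
{
public:
    SSEVector3D() : m_data(_mm_setzero_ps()) {}
    SSEVector3D(float x, float y, float z) : m_data(_mm_set_ps(0.f, z, y, x)) {}

    SSEVector3D& operator+=(const SSEVector3D& rhs)
    {
        m_data = _mm_add_ps(m_data, rhs.m_data); // value operations, no pointer casts
        return *this;
    }

    float x() const { return _mm_cvtss_f32(m_data); } // y()/z() via shuffles, omitted

private:
    __m128 m_data; // alignof(__m128) is 16, so instances are aligned automatically
};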
Note, however, that this won't work very well with MSVC. MSVC seems to generally be unable to cope with alignment requirements stronger than 8-byte aligned for by-value arguments :-(. The last time I needed to port SSE code to windows, my solution was to use Intel's C++ compiler for the SSE parts instead of MSVC...
The trick is to notice that __m128 is 16-byte aligned. Use _mm_malloc() (or MSVC's _aligned_malloc()) to ensure that your float array is correctly aligned; then you can go ahead and cast your float array to an array of __m128. Make sure also that the number of floats you allocate is divisible by four.