Eigen::Map alignment of raw buffer - c++

Using the Eigen library, I have a templated C++ class that contains a raw buffer and an Eigen::Map instance as members. In the constructor of the class, I initialize the map as follows:
template<int size>
class TestEigenMapClass
{
public:
    TestEigenMapClass():
        vec_(vec_raw_, size)
    {
        vec_.setZero();
    }

    Eigen::Map<Eigen::VectorXf> vec_;

private:
    int size_ = size;
    float vec_raw_[size];
};
The raw buffer is allocated by the system. Do I have to worry about alignment and performance when declaring or initializing the map?
It does work as it is, but I am wondering about alignment-related performance differences when compiling this code on different platforms. The documentation for the Eigen::Map class just says "MapOptions specifies whether the pointer is Aligned, or Unaligned. The default is Unaligned.", but nothing else.

Since the default is unaligned, Eigen will assume no particular alignment. So the code will work fine. The cost of doing so will vary depending on your platform:
On old SSE2 hardware, unaligned memory accesses were very slow
On AVX or AVX2 hardware, they are generally fast
On AVX hardware when compiling only with SSE2-4 instructions for compatibility, you cannot fold the memory operation into the computation, which may have a slight effect on performance, especially on the front-end (more instructions for the same number of micro-ops)
On AVX-512 hardware using the full 64 byte vector size, aligned accesses become more important again since it can fetch a single cache-line in one instruction, if properly aligned
See for example this answer for a more complete discussion: Alignment and SSE strange behaviour and here for AVX-512: Why is transforming an array using AVX-512 instructions significantly slower when transforming it in batches of 8 compared to 7 or 9?
If you want to provide proper alignment, you can follow Eigen's guide on that. However, you have to make some adjustments, since your array needs the alignment and is the last member while the alignment of the Map object itself does not matter.
Here is a version that should work in C++17 and up:
template<int size>
class TestEigenMapClass
{
public:
    TestEigenMapClass():
        vec_(vec_raw_, size)
    {
        vec_.setZero();
    }

    Eigen::VectorXf::AlignedMapType vec_;

private:
    int size_ = size;
    struct alignas(EIGEN_DEFAULT_ALIGN_BYTES) { // anonymous struct (a common compiler extension) carrying the alignment
        float vec_raw_[size];
    };
};
Side note: I am not sure why you store the size as an integer member when both the Map and the template parameter already know it. Also, I would create the map on demand to save memory, like this:
template<int size_>
class TestEigenMapClass
{
public:
    using map_type = Eigen::VectorXf::AlignedMapType;
    using const_map_type = Eigen::VectorXf::ConstAlignedMapType;

    TestEigenMapClass() = default;

    map_type vec() noexcept
    { return map_type(vec_raw_, size_); }

    const_map_type vec() const noexcept
    { return const_map_type(vec_raw_, size_); }

    int size() const noexcept
    { return size_; }

private:
    struct alignas(EIGEN_DEFAULT_ALIGN_BYTES) {
        float vec_raw_[size_];
    };
};
Also note that you can simply put the alignas on the whole object if the array is the first member. That would also save space on padding bytes within the object.
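That could look roughly like this (my sketch, reusing the class from the question; the buffer is the first member and the alignas sits on the whole type):
#include <Eigen/Dense>

template<int size>
class alignas(EIGEN_DEFAULT_ALIGN_BYTES) TestEigenMapClass
{
    float vec_raw_[size]; // first member, so it starts at the (aligned) start of the object

public:
    TestEigenMapClass():
        vec_(vec_raw_, size)
    {
        vec_.setZero();
    }

    Eigen::VectorXf::AlignedMapType vec_;
};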
I also assume you have a good reason not to simply use a fixed-size Eigen type: Eigen::Matrix<float, size, 1>
Q&A
The issue with that is (I think) that creating the map will invoke memory allocation (and some extra instructions) each time the array is accessed via vec()
No. The Map is just a struct with a pointer and a size. Its construction is inlined. This will have zero overhead. Consider this sample code:
void foo(TestEigenMapClass<16>& out,
         const TestEigenMapClass<16>& a,
         const TestEigenMapClass<16>& b)
{
    out.vec() = a.vec() + b.vec();
    out.vec() += b.vec() * 2.f;
}
Compiled with GCC-11.3, -std=c++20 -O2 -DNDEBUG it results in this assembly:
foo(TestEigenMapClass<16>&, TestEigenMapClass<16> const&, TestEigenMapClass<16>&):
xor eax, eax
.L2:
movaps xmm0, XMMWORD PTR [rdx+rax*4]
addps xmm0, XMMWORD PTR [rsi+rax*4]
movaps XMMWORD PTR [rdi+rax*4], xmm0
add rax, 4
cmp rax, 16
jne .L2
xor eax, eax
.L3:
movaps xmm0, XMMWORD PTR [rdx+rax*4]
addps xmm0, xmm0
addps xmm0, XMMWORD PTR [rdi+rax*4]
movaps XMMWORD PTR [rdi+rax*4], xmm0
add rax, 4
cmp rax, 16
jne .L3
ret
As you see, zero overhead. Just loading, computing, and storing of float vectors in two loops. Note that for this to work, you have to compile with -DNDEBUG. Otherwise Eigen will insert an assertion that checks the alignment at runtime when you use aligned maps. That is the only time an aligned Map may have overhead compared to an unaligned Map. But even then it should not matter for performance. Compilers and CPUs are good at jumping over a few simple checks.
If anything, storing the map has higher overhead since you have to materialize the object in memory between function calls and have to read one more pointer indirection (first read the Map through its reference, then the floats through the Map). It will also make aliasing analysis harder for the compiler.
Will EIGEN_DEFAULT_ALIGN_BYTES lead the compiler to automatically choose the best alignment?
That is a macro set by Eigen. It is chosen depending on the architecture. If you compile for SSE2-4, it is 16, for AVX it is 32. Not sure if AVX-512 will bump this to 64. It's the alignment for that particular architecture.
Be careful with your struct layout. Something like struct { int size; struct alignas(32) { float arr[]; }; }; would waste 28 bytes for padding between the int and the floats. As usual, put the element with the largest alignment first or otherwise take care to not waste space on padding.
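For example (my illustration, with alignas placed directly on the member and sizes that assume a 32-byte alignment requirement such as AVX):
struct Padded {
    int size;                   // 4 bytes, then 28 bytes of padding before arr
    alignas(32) float arr[8];   // must start on a 32-byte boundary
    int extra[7];
};                              // typically 96 bytes
struct Compact {
    alignas(32) float arr[8];   // largest alignment requirement first
    int size;
    int extra[7];               // fills space that would otherwise be tail padding
};                              // typically 64 bytes
static_assert(sizeof(Compact) < sizeof(Padded), "member order affects object size");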
I could perhaps just have a fixed size type and use head() each time. I wonder about the differences in performance (this is for a real-time, high-performance prototype).
head, tail, segment etc. are all basically implemented the same as a Map so they have the same non-existent overhead. head() also still carries the compile time information that the vector is properly aligned. If you compile without -DNDEBUG, there will be a range check. But again, even if you keep this activated, it is normally nothing to worry about.
Make sure to use the fixed template parameter instead of the runtime size parameter for these functions, if you can. vector.head<3>() is more efficient than vector.head(3).
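A small illustration of the difference (my example, assuming the vector has at least 3 elements):
#include <Eigen/Dense>

void zero_first_three(Eigen::VectorXf& v)
{
    v.head<3>().setZero(); // size is a compile-time constant: fixed-size, fully unrollable code
    v.head(3).setZero();   // size is a runtime value: generic dynamic-size loop
}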
If you plan to resize the vector, you can also use the MaxRows template parameter to create a vector that never allocates memory but can change its size within a specific range:
template<int size>
using VariableVector = Eigen::Matrix<
    float, Eigen::Dynamic /*rows*/, 1 /*cols*/,
    Eigen::ColMajor | Eigen::AutoAlign,
    size /*max rows*/, 1 /*max cols*/>;
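A possible usage sketch for this alias (my example, assuming the definition above):
void example()
{
    VariableVector<16> v(4);   // 4 elements in use, storage for up to 16, no heap allocation
    v.setZero();
    v.conservativeResize(9);   // still no allocation; the first 4 elements are preserved
    v(8) = 1.0f;
}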

Related

Returning Vs. Pointer

How much would performance differ between these two situations?
int func(int a, int b) { return a + b; }
And
void func(int a, int b, int * c) { *c = a + b; }
Now, what if it's a struct?
typedef struct { int a; int b; char c; } my;
my func(int a, int b, char c) { my x; x.a = a; x.b = b; x.c = c; return x; }
And
void func(int a, int b, int c, my * x) { x->a = a; x->b = b; x->c = c; }
One thing I can think of is that a register cannot be used for this purpose, correct? Other than that, I am unaware of how this function would turn out after going through a compiler.
Which would be more efficient and speedy?
If the function can inline, often no difference between the first 2.
Otherwise (no inlining because of no link-time optimization) returning an int by value is more efficient because it's just a value in a register that can be used right away. Also, the caller didn't have to pass as many args, or find/make space to point at. If the caller does want to use the output value, it will have to reload it, introducing latency in the total dependency chain from inputs ready to output ready. (Store-forwarding latency is ~5 cycles on modern x86 CPUs, vs. 1 cycle latency for the lea eax, [rdi + rsi] that would implement that function for x86-64 System V.)
The exception is maybe for rare cases where the caller isn't going to use the value, just wants it in memory at some address. Passing that address to the callee (in a register) so it can be used there means the caller doesn't have to keep that address anywhere that will survive across the function call.
For the struct version:
a register cannot be used for this purpose, correct?
No, for some calling conventions, small structs can be returned in registers.
x86-64 System V will return your my struct by value in the RDX:RAX register pair because it's less than 16 bytes and all integer. (And trivially copyable.) Try it on https://godbolt.org/z/x73cEh -
# clang11.0 -O3 for x86-64 SysV
func_val:
shl rsi, 32
mov eax, edi
or rax, rsi # (uint64_t)b<<32 | a; the low 64 bits of the struct
# c was already in EDX, the low half of RDX; clang leaves it there.
ret
func_out:
mov dword ptr [rcx], edi
mov dword ptr [rcx + 4], esi # just store the struct members
mov byte ptr [rcx + 8], dl # to memory pointed-to by 4th arg
ret
GCC doesn't assume that char c is correctly sign-extended to EDX the way clang does (unofficial ABI feature). GCC does a really dumb byte store / dword reload that creates a store-forwarding stall, to get uninitialized garbage from memory instead of from high bytes of EDX. Purely a missed optimization, but see it in https://godbolt.org/z/WGcqKc. It also insanely uses SSE2 to merge the two integers into a 64-bit value before doing a movq rax, xmm0, or to memory for the output-arg.
You definitely want the struct version to inline if the caller uses the values, so this packing into return-value registers can be optimized away.
How does function ACTUALLY return struct variable in C? has an ARM example for a larger struct: return by value passes a hidden pointer to the caller's return-value object. From there, it may need to be copied by the caller if assigning to something that escape analysis can't prove is private. (e.g. through some pointer). What prevents the usage of a function argument as hidden pointer?
Also related: Why is tailcall optimization not performed for types of class MEMORY?
How do C compilers implement functions that return large structures? points out that code-gen may differ between C and C++.
I don't know how to explain any general rule of thumb that one could apply without understanding asm and the calling convention you care about. Usually pass/return large structs by reference, but for small structs it's very much "it depends".

Why does p1007r0 std::assume_aligned remove the need for epilogue?

My understanding is that vectorization of code works something like this:
For data in the array below the first address that is a multiple of 128 (or 256, or whatever the SIMD instructions require), do slow element-by-element processing. Let's call this the prologue.
For data between the first address that is a multiple of 128 and the last address that is a multiple of 128, use SIMD instructions.
For the data between the last address that is a multiple of 128 and the end of the array, use slow element-by-element processing. Let's call this the epilogue.
Now I understand why std::assume_aligned helps with the prologue, but I do not get why it enables the compiler to remove the epilogue as well.
Quote from proposal:
If we could make this property visible to the compiler, it could skip the loop prologue and epilogue
You can see the effect on code-gen from using GNU C / C++ __builtin_assume_aligned.
gcc 7 and earlier targeting x86 (and ICC18) prefer to use a scalar prologue to reach an alignment boundary, then an aligned vector loop, then a scalar epilogue to clean up any leftover elements that weren't a multiple of a full vector.
Consider the case where the total number of elements is known at compile time to be a multiple of the vector width, but the alignment isn't known. If you knew the alignment, you wouldn't need either a prologue or an epilogue. But if you don't, you need both: the number of left-over elements after the last aligned vector is not known.
This Godbolt compiler explorer link shows these functions compiled for x86-64 with ICC18, gcc7.3, and clang6.0. clang unrolls very aggressively, but still uses unaligned stores. This seems like a weird way to spend that much code-size for a loop that just stores.
// aligned, and size a multiple of vector width
void set42_aligned(int *p) {
p = (int*)__builtin_assume_aligned(p, 64);
for (int i=0 ; i<1024 ; i++ ) {
*p++ = 0x42;
}
}
# gcc7.3 -O3 (arch=tune=generic for x86-64 System V: p in RDI)
lea rax, [rdi+4096] # end pointer
movdqa xmm0, XMMWORD PTR .LC0[rip] # set1_epi32(0x42)
.L2: # do {
add rdi, 16
movaps XMMWORD PTR [rdi-16], xmm0
cmp rax, rdi
jne .L2 # }while(p != endp);
rep ret
This is pretty much exactly what I'd do by hand, except maybe unrolling by 2 so OoO exec could discover the loop exit branch being not-taken while still chewing on the stores.
Thus the unaligned version includes a prologue and epilogue:
// without any alignment guarantee
void set42(int *p) {
for (int i=0 ; i<1024 ; i++ ) {
*p++ = 0x42;
}
}
~26 instructions of setup, vs. 2 from the aligned version
.L8: # then a bloated loop with 4 uops instead of 3
add eax, 1
add rdx, 16
movaps XMMWORD PTR [rdx-16], xmm0
cmp ecx, eax
ja .L8 # end of main vector loop
# epilogue:
mov eax, esi # then destroy the counter we spent an extra uop on inside the loop. /facepalm
and eax, -4
mov edx, eax
sub r8d, eax
cmp esi, eax
lea rdx, [r9+rdx*4] # recalc a pointer to the last element, maybe to avoid a data dependency on the pointer from the loop.
je .L5
cmp r8d, 1
mov DWORD PTR [rdx], 66 # fully-unrolled final up-to-3 stores
je .L5
cmp r8d, 2
mov DWORD PTR [rdx+4], 66
je .L5
mov DWORD PTR [rdx+8], 66
.L5:
rep ret
Even for a more complex loop which would benefit from a little bit of unrolling, gcc leaves the main vectorized loop not unrolled at all, but spends boatloads of code-size on fully-unrolled scalar prologue/epilogue. It's really bad for AVX2 256-bit vectorization with uint16_t elements or something. (up to 15 elements in the prologue/epilogue, rather than 3). This is not a smart tradeoff, so it helps gcc7 and earlier significantly to tell it when pointers are aligned. (The execution speed doesn't change much, but it makes a big difference for reducing code-bloat.)
BTW, gcc8 favours using unaligned loads/stores, on the assumption that data often is aligned. Modern hardware has cheap unaligned 16 and 32-byte loads/stores, so letting the hardware handle the cost of loads/stores that are split across a cache-line boundary is often good. (AVX512 64-byte stores are often worth aligning, because any misalignment means a cache-line split on every access, not every other or every 4th.)
Another factor is that earlier gcc's fully-unrolled scalar prologues/epilogues are crap compared to smart handling where you do one unaligned potentially-overlapping vector at the start/end. (See the epilogue in this hand-written version of set42). If gcc knew how to do that, it would be worth aligning more often.
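For reference, here is a rough sketch of that technique (my own illustration with SSE2 intrinsics, not the code from the linked answer): one unaligned vector covers the start, an aligned loop does the bulk, and one final unaligned vector covers the end, with overlapping stores instead of scalar prologue/epilogue loops.
#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

void set42_any(int *p, std::size_t n)   // n >= 4 assumed, to keep the sketch short
{
    __m128i v = _mm_set1_epi32(0x42);
    _mm_storeu_si128((__m128i*)p, v);              // first vector: unaligned, covers p[0..3]

    // round up to the next 16-byte boundary for the aligned main loop
    int *aligned = (int*)(((std::uintptr_t)p + 16) & ~(std::uintptr_t)15);
    int *end = p + n;
    for (; aligned + 4 <= end; aligned += 4)
        _mm_store_si128((__m128i*)aligned, v);     // aligned stores; may overlap the first store

    _mm_storeu_si128((__m128i*)(end - 4), v);      // last vector: unaligned, may overlap previous stores
}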
This is discussed in the document itself in Section 5:
A function that returns a pointer T* , and guarantees that it will
point to over-aligned memory, could return like this:
T* get_overaligned_ptr()
{
// code...
return std::assume_aligned<N>(_data);
}
This technique can be used e.g. in the begin() and end()
implementations of a class wrapping an over-aligned range of data. As
long as such functions are inline, the over-alignment will be
transparent to the compiler at the call-site, enabling it to perform
the appropriate optimisations without any extra work by the caller.
The begin() and end() methods are data accessors for the over-aligned buffer _data. That is, begin() returns a pointer to the first element of the buffer and end() returns a pointer to one past the last element of the buffer.
Suppose they are defined as follows:
T* begin()
{
// code...
return std::assume_aligned<N>(_data);
}
T* end()
{
// code...
return _data + size; // No alignment hint!
}
In this case, the compiler may not be able to eliminate the epilogue. But if they were defined as follows:
T* begin()
{
// code...
return std::assume_aligned<N>(_data);
}
T* end()
{
// code...
return std::assume_aligned<N>(_data + size);
}
Then the compiler would be able to eliminate the epilogue. For example, if N is 128 bits, then every single 128-bit chunk of the buffer is guaranteed to be 128-bit aligned. Note that this is only possible when the size of the buffer is a multiple of the alignment.
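A short C++20 sketch of that pattern (my illustration; the names are not from the proposal): when both begin() and end() carry the hint and the size is a multiple of the alignment, a loop over [begin, end) needs neither prologue nor epilogue.
#include <cstddef>
#include <memory>

template<std::size_t N, std::size_t Count>   // Count * sizeof(float) must be a multiple of N
struct aligned_buffer {
    alignas(N) float data[Count];

    float* begin() { return std::assume_aligned<N>(data); }
    float* end()   { return std::assume_aligned<N>(data + Count); }
};

void fill(aligned_buffer<64, 1024>& buf) {
    for (float* p = buf.begin(); p != buf.end(); ++p)
        *p = 42.0f;   // the compiler may use only aligned, full-width vector stores here
}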

Will std::array template instances occupy more code memory?

I have a micro-controller that does not have an MMU, but we are using C and C++.
We are avoiding all dynamic memory usage (i.e. no new SomeClass() or malloc()) and most of the standard library.
Semi-Question 0:
From what I understand std::array does not use any dynamic memory so its usage should be OK (It is on the stack only). Looking at std::array source code, it looks fine since it creates a c-style array and then wraps functionality around that array.
The chip we are using has 1MB of flash memory for storing code.
Question 1:
I am worried that the use of templates in std::array will cause the binary to be larger, which will then potentially cause the binary to exceed the 1MB code memory limit.
I think if you create an instance of a std::array< int, 5 >, then all calls to functions on that std::array will occupy a certain amount of code memory, let's say X bytes of memory.
If you create another instance of std::array< SomeObject, 5 >, then call functions to that std::array, will each of those functions now be duplicated in the binary, thus taking up more code memory? X bytes of memory + Y bytes of memory.
If so, do you think the amount of code generated given the limited code memory capacity will be a concern?
Question 2:
In the above example, if you created a second std::array< int, 10 > instance, would the calls to functions also duplicate the function calls in the generated code? Even though both instances are of the same type, int?
std::array is considered a zero-cost abstraction, which means it should be readily optimizable by the compiler.
As with any zero-cost abstraction, it may induce a small compile-time penalty, and if the optimizations required to make it truly zero cost are not available, it may incur a small size or runtime penalty.
However, note that compilers are free to add padding at the end of a struct. Since std::array is a struct, you should check how your platform handles it, but I highly doubt this is a problem in your case.
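If you want to check that assumption on your own toolchain, a one-line test (my addition) is enough; it fails to compile if padding were added:
#include <array>
static_assert(sizeof(std::array<int, 5>) == 5 * sizeof(int),
              "std::array<int, 5> has padding on this platform");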
Take this array and std::array case:
#include <numeric>
#include <iterator>
template<std::size_t n>
int stuff(const int(&arr)[n]) {
return std::accumulate(std::begin(arr), std::end(arr), 0);
}
int main() {
int arr[] = {1, 2, 3, 4, 5, 6};
return stuff(arr);
}
#include <numeric>
#include <iterator>
#include <array>
template<std::size_t n>
int stuff(const std::array<int, n>& arr) {
return std::accumulate(std::begin(arr), std::end(arr), 0);
}
int main() {
std::array arr = {1, 2, 3, 4, 5, 6};
return stuff(arr);
}
Clang supports this case very well: all cases with std::array or raw arrays are handled the same way.
-O2 / -O3 both array and std::array with clang:
main: # #main
mov eax, 21
ret
However, GCC seems to have a problem optimizing it, for both the std::array and the raw array case:
-O3 with GCC for array and std::array:
main:
movdqa xmm0, XMMWORD PTR .LC0[rip]
movaps XMMWORD PTR [rsp-40], xmm0
mov edx, DWORD PTR [rsp-32]
mov eax, DWORD PTR [rsp-28]
lea eax, [rdx+14+rax]
ret
.LC0:
.long 1
.long 2
.long 3
.long 4
Then, it seems to optimize better with -O2 in the case of the raw array and fails with std::array:
-O2 GCC std::array:
main:
movabs rax, 8589934593
lea rdx, [rsp-40]
mov ecx, 1
mov QWORD PTR [rsp-40], rax
movabs rax, 17179869187
mov QWORD PTR [rsp-32], rax
movabs rax, 25769803781
lea rsi, [rdx+24]
mov QWORD PTR [rsp-24], rax
xor eax, eax
jmp .L3
.L5:
mov ecx, DWORD PTR [rdx]
.L3:
add rdx, 4
add eax, ecx
cmp rdx, rsi
jne .L5
rep ret
-O2 GCC raw array:
main:
mov eax, 21
ret
It seems that the GCC bug (failing to optimize at -O3 but succeeding at -O2) is fixed in the most recent build.
Here's a compiler explorer link with all the -O2 and -O3 cases.
With all these cases stated, you can see a common pattern: no information about the std::array is emitted into the binary. There are no constructors, no operator[], not even iterators, nor algorithms. Everything is inlined. Compilers are good at inlining simple functions, and std::array member functions are usually very, very simple.
If you create another instance of std::array< SomeObject, 5 >, then call functions to that std::array, will each of those functions now be duplicated in the binary, thus taking up more flash memory? X bytes of memory + Y bytes of memory.
Well, you changed the data type your array is containing. If you manually add overloads of all your functions to handle this additional case, then yes, all those new functions may take up some space. If your functions are small, there is a good chance they will be inlined and take less space. As you can see with the example above, inlining and constant folding may greatly reduce your binary size.
In the above example, if you created a second std::array instance, would the calls to functions also duplicate the function calls in flash memory? Even though both instances are of the same type, int?
Again, it depends. If you have many functions templated on the size of the array, both std::array and raw arrays may "create" different functions. But again, if they are inlined, there is no duplication to worry about.
With both a raw array and std::array, you can pass a pointer to the start of the array along with the size. If you find this more suitable for your case, use that; both raw arrays and std::array support it. A raw array implicitly decays to a pointer, while with std::array you must use arr.data() to get the pointer, as sketched below.
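A minimal sketch of that pointer-plus-size interface (my example): a single non-templated function serves every array length, so only one copy of it can end up in flash.
#include <array>
#include <cstddef>

int sum(const int* data, std::size_t n)
{
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

int caller()
{
    int raw[] = {1, 2, 3, 4, 5, 6};
    std::array<int, 6> arr = {1, 2, 3, 4, 5, 6};
    return sum(raw, 6)                    // the raw array decays to a pointer
         + sum(arr.data(), arr.size());   // std::array needs .data()
}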

How to optimize function return values in C and C++ on x86-64?

The x86-64 ABI specifies two return registers: rax and rdx, both 64-bits (8 bytes) in size.
Assuming that x86-64 is the only targeted platform, which of these two functions:
uint64_t f(uint64_t * const secondReturnValue) {
/* Calculate a and b. */
*secondReturnValue = b;
return a;
}
std::pair<uint64_t, uint64_t> g() {
/* Calculate a and b, same as in f() above. */
return { a, b };
}
would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair to be returned in rax and rdx?
UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair methods (examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f() is not inlined, the compiler must generate code to write a value to memory, e.g:
movq b, (%rdi)
movq a, %rax
retq
But in case of g() it suffices for the compiler to do:
movq a, %rax
movq b, %rdx
retq
Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.
Since the ABI specifies that in some particular cases two registers have to be used for the 2-word result, any conforming compiler has to obey that rule.
However, for such tiny functions I guess that most of the performance will come from inlining.
You may want to compile and link with g++ -flto -O2 using link-time optimizations.
I guess that the second function (returning a pair thru 2 registers) might be slightly faster, and that perhaps in some situations the GCC compiler could inline and optimize the first into the second.
But you really should benchmark if you care that much.
Note that the ABI specifies packing any small struct into registers for passing/returning (if it contains only integer types), so returning a std::pair<uint32_t, uint32_t> means the values have to be shifted and ORed into rax.
This is probably still better than a round trip through memory, because setting up space for a pointer, and passing that pointer as an extra arg, has some overhead. (Other than that, though, a round-trip through L1 cache is pretty cheap, like ~5c latency. The store/load are almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the ROB fills because the store can't retire. See Agner Fog's microarch guide and other stuff at the x86 tag wiki.)
Anyway, here's the kind of code you get from gcc 5.3 -O2, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...):
#include <cstdint>
#include <utility>
#define type_t uint32_t
type_t f(type_t * const secondReturnValue, type_t x) {
*secondReturnValue = x+4;
return x+2;
}
lea eax, [rsi+4] # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
mov DWORD PTR [rdi], eax
lea eax, [rsi+2]
ret
std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
lea eax, [rdi+4]
lea edx, [rdi+2]
sal rax, 32
or rax, rdx
ret
type_t use_pair(std::pair<type_t, type_t> pair) {
return pair.second + pair.first;
}
mov rax, rdi
shr rax, 32
add eax, edi
ret
So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t values. Nowhere near as good as returning a pair of uint64_t values, though.
If you're specifically optimizing for x86-64, and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t> (or int64_t, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32), pointers are only 32bits. Don't assume pointers are 64bit when optimizing for x86-64, if you care about that ABI.
If either member of the pair is 64bit, they use separate registers. It doesn't do anything stupid like splitting one value between the high half of one reg and the low half of another.

Should I use SIMD or vector extensions or something else?

I'm currently developing an open-source 3D application framework in C++ (with C++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but more about that in a different question.
Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate highly optimized code when optimization is turned on. I can also use the "vector extension" of GCC. But none of this is really portable.
I know that I have more control when I use my own SSE code, but often this control is unnecessary.
One big problem with SSE is the use of dynamic memory, which I limit as much as possible with the help of memory pools and data-oriented design.
Now to my question:
Should I use naked SSE? Perhaps encapsulated.
__m128 v1 = _mm_set_ps(0.5f, 2, 4, 0.25f);
__m128 v2 = _mm_set_ps(2, 0.5f, 0.25f, 4);
__m128 res = _mm_mul_ps(v1, v2);
Or should the compiler do the dirty work?
float v1[4] = {0.5f, 2, 4, 0.25f};
float v2[4] = {2, 0.5f, 0.25f, 4};
float res[4];
res[0] = v1[0]*v2[0];
res[1] = v1[1]*v2[1];
res[2] = v1[2]*v2[2];
res[3] = v1[3]*v2[3];
Or should I use SIMD with additional code? Like a dynamic container class with SIMD operations, which needs additional load and store instructions.
Pear3D::Vector4f* v1 = new Pear3D::Vector4f(0.5f, 2, 4, 0.25f);
Pear3D::Vector4f* v2 = new Pear3D::Vector4f(2, 0.5f, 0.25f, 4);
Pear3D::Vector4f res = Pear3D::Vector::multiplyElements(*v1, *v2);
The above example uses an imaginary class that stores a float[4] internally and performs a load and store in each method, like multiplyElements(...). The methods use SSE internally.
I don't want to use another library, because I want to learn more about SIMD and large scale software design. But library examples are welcome.
PS: This is not a real problem more a design question.
Well, if you want to use SIMD extensions, a good approach is to use SSE intrinsics (by all means stay away from inline assembly, but fortunately you didn't list it as an alternative anyway). But for cleanliness you should encapsulate them in a nice vector class with overloaded operators:
struct aligned_storage
{
    // overload new and delete for 16-byte alignment
};

class vec4 : public aligned_storage
{
public:
    vec4(float x, float y, float z, float w)
    {
        data_[0] = x; ... data_[3] = w; // don't use _mm_set_ps, it will do the same, followed by a _mm_load_ps, which is unnecessary
    }
    vec4(float *data)
    {
        data_[0] = data[0]; ... data_[3] = data[3]; // don't use _mm_loadu_ps, unaligned just doesn't pay
    }
    vec4(const vec4 &rhs)
        : xmm_(rhs.xmm_)
    {
    }
    ...
    vec4& operator*=(const vec4 v)
    {
        xmm_ = _mm_mul_ps(xmm_, v.xmm_);
        return *this;
    }
    ...
private:
    union
    {
        __m128 xmm_;
        float data_[4];
    };
};
Now the nice thing is, due to the anonymous union (UB, I know, but show me a platform with SSE where this doesn't work) you can use the standard float array whenever necessary (like operator[] or initialization (don't use _mm_set_ps)) and only use SSE when appropriate. With a modern inlining compiler the encapsulation comes at probably no cost (I was rather surprised how well VC10 optimized the SSE instructions for a bunch of computations with this vector class, no fear of unnecessary moves into temporary memory variables, as VC8 seemed to like even without encapsulation).
The only disadvantage is that you need to take care of proper alignment, as unaligned vectors don't buy you anything and may even be slower than non-SSE. But fortunately the alignment requirement of the __m128 will propagate into the vec4 (and any surrounding class) and you just need to take care of dynamic allocation, which C++ has good means for. You just need to make a base class whose operator new and operator delete functions (in all flavours of course) are overloaded properly and from which your vector class will derive. To use your type with standard containers you of course also need to specialize std::allocator (and maybe std::get_temporary_buffer and std::return_temporary_buffer for the sake of completeness), as it will use the global operator new otherwise.
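One possible way to flesh out the aligned_storage base from the snippet above (my sketch, not the author's code), using _mm_malloc/_mm_free for 16-byte-aligned heap allocation:
#include <cstddef>
#include <new>
#include <xmmintrin.h>   // _mm_malloc/_mm_free on GCC/Clang; MSVC declares them in <malloc.h>

struct aligned_storage
{
    static void* operator new(std::size_t n)        { return alloc(n); }
    static void* operator new[](std::size_t n)      { return alloc(n); }
    static void operator delete(void* p) noexcept   { _mm_free(p); }
    static void operator delete[](void* p) noexcept { _mm_free(p); }

private:
    static void* alloc(std::size_t n)
    {
        void* p = _mm_malloc(n, 16);   // 16-byte alignment for __m128 members
        if (!p)
            throw std::bad_alloc();
        return p;
    }
};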
But the real disadvantage is that you also need to take care of the dynamic allocation of any class that has your SSE vector as a member, which may be tedious, but can again be automated a bit by also deriving those classes from aligned_storage and putting the whole std::allocator specialization mess into a handy macro.
JamesWynn has a point that those operations often come together in some special heavy computation blocks (like texture filtering or vertex transformation), but on the other hand using those SSE vector encapsulations doesn't introduce any overhead over a standard float[4] implementation of a vector class. You need to get those values from memory into registers anyway (be it the x87 stack or a scalar SSE register) in order to do any computations, so why not take them all at once (which should IMHO not be any slower than moving a single value if properly aligned) and compute in parallel. Thus you can freely switch out an SSE implementation for a non-SSE one without inducing any overhead (correct me if my reasoning is wrong).
But if ensuring alignment for all classes that have vec4 as a member is too tedious for you (which is IMHO the only disadvantage of this approach), you can also define a specialized SSE vector type which you use for computations and use a standard non-SSE vector for storage.
EDIT: Ok, to address the overhead argument that is going around here (and looks quite reasonable at first), let's take a bunch of computations, which look very clean thanks to the overloaded operators:
#include "vec.h"
#include <iostream>
int main(int argc, char *argv[])
{
math::vec<float,4> u, v, w = u + v;
u = v + dot(v, w) * w;
v = abs(u-w);
u = 3.0f * w + v;
w = -w * (u+v);
v = min(u, w) + length(u) * w;
std::cout << v << std::endl;
return 0;
}
and see what VC10 thinks about it:
...
; 6 : math::vec<float,4> u, v, w = u + v;
movaps xmm4, XMMWORD PTR _v$[esp+32]
; 7 : u = v + dot(v, w) * w;
; 8 : v = abs(u-w);
movaps xmm3, XMMWORD PTR __xmm@0
movaps xmm1, xmm4
addps xmm1, XMMWORD PTR _u$[esp+32]
movaps xmm0, xmm4
mulps xmm0, xmm1
haddps xmm0, xmm0
haddps xmm0, xmm0
shufps xmm0, xmm0, 0
mulps xmm0, xmm1
addps xmm0, xmm4
subps xmm0, xmm1
movaps xmm2, xmm3
; 9 : u = 3.0f * w + v;
; 10 : w = -w * (u+v);
xorps xmm3, xmm1
andnps xmm2, xmm0
movaps xmm0, XMMWORD PTR __xmm@1
mulps xmm0, xmm1
addps xmm0, xmm2
; 11 : v = min(u, w) + length(u) * w;
movaps xmm1, xmm0
mulps xmm1, xmm0
haddps xmm1, xmm1
haddps xmm1, xmm1
sqrtss xmm1, xmm1
addps xmm2, xmm0
mulps xmm3, xmm2
shufps xmm1, xmm1, 0
; 12 : std::cout << v << std::endl;
mov edi, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
mulps xmm1, xmm3
minps xmm0, xmm3
addps xmm1, xmm0
movaps XMMWORD PTR _v$[esp+32], xmm1
...
Even without thoroughly analyzing every single instruction and its use, I'm pretty confident to say that there aren't any unnecessary loads or stores, except the ones at the beginning (Ok, I left them uninitialized), which are necessary anyway to get them from memory into computing registers, and at the end, which is necessary as in the following expression v is going to be output. It didn't even store anything back into u and w, since they are only temporary variables which I don't use any further. Everything is perfectly inlined and optimized out. It even managed to seamlessly shuffle the result of the dot product for the following multiplication, without it leaving the XMM register, although the dot function returns a float using an actual _mm_store_ss after the haddps.
So even I, being usually a bit oversuspicious of the compiler's abilities, have to say that handcrafting your own intrinsics into special functions doesn't really pay compared to the clean and expressive code you gain by encapsulation. Though you may be able to create killer examples where handcrafting the intrinsics may indeed save you a few instructions, but then again you first have to outsmart the optimizer.
EDIT: Ok, Ben Voigt pointed out another problem of the union besides the (most probably not problematic) memory layout incompatibility, which is that it violates strict aliasing rules and the compiler may optimize instructions accessing different union members in a way that makes the code invalid. I haven't thought about that yet. I don't know if it causes any problems in practice; it certainly needs investigation.
If it really is a problem, we unfortunately need to drop the data_[4] member and use the __m128 alone. For initialization we now have to resort to _mm_set_ps and _mm_loadu_ps again. The operator[] gets a bit more complicated and might need some combination of _mm_shuffle_ps and _mm_store_ss. For the non-const version you have to use some kind of proxy object delegating an assignment to the corresponding SSE instructions, sketched below. It then has to be investigated how well the compiler can optimize this additional overhead in the specific situations.
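For illustration, such a proxy could look roughly like this (my sketch, going through _mm_store_ps/_mm_load_ps to stay clear of aliasing issues); vec4::operator[] would then return a vec4_elem_proxy(xmm_, i):
#include <xmmintrin.h>

class vec4_elem_proxy
{
public:
    vec4_elem_proxy(__m128& v, int i) : v_(v), i_(i) {}

    // read: spill the vector to a temporary array and extract lane i
    operator float() const
    {
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, v_);
        return tmp[i_];
    }

    // write: spill, patch one lane, reload
    vec4_elem_proxy& operator=(float x)
    {
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, v_);
        tmp[i_] = x;
        v_ = _mm_load_ps(tmp);
        return *this;
    }

private:
    __m128& v_;
    int i_;
};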
Or you only use the SSE vector for computations and just make an interface for conversion to and from non-SSE vectors as a whole, which is then used at the boundaries of computations (as you often don't need to access individual components inside lengthy computations). This seems to be the way glm handles this issue. But I'm not sure how Eigen handles it.
But however you tackle it, there is still no need to handcraft SSE intrinsics without using the benefits of operator overloading.
I suggest that you learn about expression templates (custom operator implementations that use proxy objects). In this way, you can avoid doing performance-killing load/store around each individual operation, and do them only once for the entire computation.
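To give a flavour of the idea, here is a very condensed expression-template sketch (my illustration, scalar for clarity; a real implementation would evaluate with SSE and support more operators than just +):
#include <cstddef>

// CRTP base so operator+ only matches vector expressions
template<class E>
struct VecExpr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// node representing "l + r"; stores references, computes nothing yet
template<class L, class R>
struct AddExpr : VecExpr<AddExpr<L, R>> {
    const L& l;
    const R& r;
    AddExpr(const L& l, const R& r) : l(l), r(r) {}
    float operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec4 : VecExpr<Vec4> {
    float v[4];
    float operator[](std::size_t i) const { return v[i]; }

    // one loop evaluates the whole expression tree lane by lane
    template<class E>
    Vec4& operator=(const VecExpr<E>& e) {
        for (std::size_t i = 0; i < 4; ++i)
            v[i] = e.self()[i];
        return *this;
    }
};

template<class L, class R>
AddExpr<L, R> operator+(const VecExpr<L>& l, const VecExpr<R>& r) {
    return AddExpr<L, R>(l.self(), r.self());
}

// usage: d = a + b + c; builds a lightweight AddExpr tree and evaluates it in a
// single pass inside Vec4::operator=, with no temporary Vec4 objects; the
// intermediate expression objects only live for the duration of the statement.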
I would suggest using the naked simd code in a tightly controlled function. Since you won't be using it for your primary vector multiplication because of the overhead, this function should probably take the list of Vector3 objects that need to be manipulated, as per DOD. Where there's one, there is many.