How to write portable simd code for complex multiplicative reduction - c++

I want to write fast simd code to compute the multiplicative reduction of a complex array. In standard C this is:
#include <complex.h>
complex float f(complex float x[], int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}
n will be at most 50.
GCC can't auto-vectorize complex multiplication, but I am happy to assume the GCC compiler, and if I knew I wanted to target SSE3 I could follow How to enable sse3 autovectorization in gcc and write:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x[], int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}
This indeed produces fast vectorized assembly code using gcc, although you still need to pad your input to a multiple of 4. The assembly you get is:
.L3:
vmovaps xmm0, XMMWORD PTR 16[rsi]
add rsi, 32
vmulps xmm1, xmm0, xmm2
vmulps xmm0, xmm0, xmm3
vfmsubps xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
vmovaps xmm3, xmm1
vfmaddps xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
cmp rdx, rsi
jne .L3
However, it is written for one specific SIMD instruction set and is not optimal for AVX2 or AVX-512, for example, for which you need to change the code.
How can you write C or C++ code for which gcc will produce optimal
code when compiled for any of sse, avx2 or avx512? That is, do you always have to write separate functions by hand for each different width of SIMD register?
Are there any open source libraries that make this easier?
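For completeness, here is a minimal sketch (not part of my code above) of the final lane reduction: unused tail lanes can be padded with 1 + 0i so they are neutral for the product, and the four partial products held in a complex4 are multiplied together at the end.
static complex float complex4_reduce(complex4 p) {
    complex float r = 1.0f;
    for (int k = 0; k < 4; k++)
        r *= p.x.e[k] + p.y.e[k] * I;   /* lane k holds x.e[k] + i*y.e[k] */
    return r;
}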

Here is an example using the Eigen library:
#include <Eigen/Core>
std::complex<float> f(const std::complex<float> *x, int n)
{
return Eigen::VectorXcf::Map(x, n).prod();
}
If you compile this with clang or g++ with SSE or AVX enabled (and -O2), you should get fairly decent machine code. It also works for some other architectures like AltiVec or NEON. If you know that the first entry of x is aligned, you can use MapAligned instead of Map.
You get even better code if you happen to know the size of your vector at compile time, using this:
template<int n>
std::complex<float> f(const std::complex<float> *x)
{
return Eigen::Matrix<std::complex<float>, n, 1>::MapAligned(x).prod();
}
Note: The functions above directly correspond to the function f of the OP.
However, as @PeterCordes pointed out, it is generally bad to store complex numbers interleaved, since this requires lots of shuffling for multiplication. Instead, one should store the real and imaginary parts in a way that lets a whole packet be loaded at once.
Edit/Addendum: To implement a structure-of-arrays like complex multiplication, you can actually write something like:
typedef Eigen::Array<float, 8, 1> v8sf; // Eigen::Array allows element-wise standard operations
typedef std::complex<v8sf> complex8;
complex8 prod(const complex8& a, const complex8& b)
{
return a*b;
}
Or more generic (using C++11):
template<int size, typename Scalar = float> using complexX = std::complex<Eigen::Array<Scalar, size, 1> >;
template<int size>
complexX<size> prod(const complexX<size>& a, const complexX<size>& b)
{
return a*b;
}
When compiled with -mavx -O2, this compiles to something like this (using g++-5.4):
vmovaps 32(%rsi), %ymm1
movq %rdi, %rax
vmovaps (%rsi), %ymm0
vmovaps 32(%rdi), %ymm3
vmovaps (%rdi), %ymm4
vmulps %ymm0, %ymm3, %ymm2
vmulps %ymm4, %ymm1, %ymm5
vmulps %ymm4, %ymm0, %ymm0
vmulps %ymm3, %ymm1, %ymm1
vaddps %ymm5, %ymm2, %ymm2
vsubps %ymm1, %ymm0, %ymm0
vmovaps %ymm2, 32(%rdi)
vmovaps %ymm0, (%rdi)
vzeroupper
ret
For reasons not obvious to me, this is actually hidden in a method which is called by the actual method, which just moves around some memory -- I don't know why Eigen/gcc does not assume that the arguments are already properly aligned. If I compile the same with clang 3.8.0 (and the same arguments), it is compiled to just:
vmovaps (%rsi), %ymm0
vmovaps %ymm0, (%rdi)
vmovaps 32(%rsi), %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps (%rdi), %ymm1
vmovaps (%rdx), %ymm2
vmovaps 32(%rdx), %ymm3
vmulps %ymm2, %ymm1, %ymm4
vmulps %ymm3, %ymm0, %ymm5
vsubps %ymm5, %ymm4, %ymm4
vmulps %ymm3, %ymm1, %ymm1
vmulps %ymm0, %ymm2, %ymm0
vaddps %ymm1, %ymm0, %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps %ymm4, (%rdi)
movq %rdi, %rax
vzeroupper
retq
Again, the memory movement at the beginning is weird, but at least it is vectorized. For both gcc and clang this gets optimized away when called in a loop, however:
complex8 f8(complex8 x[], int n) {
if(n==0)
return complex8(v8sf::Ones(),v8sf::Zero()); // I guess you want p = 1 + 0*i at the beginning?
complex8 p = x[0];
for (int i = 1; i < n; i++) p = prod(p, x[i]);
return p;
}
The difference here is that clang will unroll that outer loop to 2 multiplications per iteration. On the other hand, gcc will use fused multiply-add instructions when compiled with -mfma.
The f8 function can of course also be generalized to arbitrary dimensions:
template<int size>
complexX<size> fX(complexX<size> x[], int n) {
using S= typename complexX<size>::value_type;
if(n==0)
return complexX<size>(S::Ones(),S::Zero());
complexX<size> p = x[0];
for (int i = 1; i < n; i++) p *=x[i];
return p;
}
And for reducing the complexX<N> to a single std::complex the following function can be used:
// only works for powers of two
template<int size> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<size>& var) {
complexX<size/2> a(var.real().template head<size/2>(), var.imag().template head<size/2>());
complexX<size/2> b(var.real().template tail<size/2>(), var.imag().template tail<size/2>());
return redux(a*b);
}
template<> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<1>& var) {
return std::complex<float>(var.real()[0], var.imag()[0]);
}
However, depending on whether I use clang or g++, I get quite different assembler output. Overall, g++ has a tendency to fail to inline loading the input arguments, and clang fails to use FMA operations (YMMV ...)
Essentially, you need to inspect the generated assembler code anyway. And more importantly, you should benchmark the code (I'm not sure how much impact this routine has on your overall problem).
Also, I wanted to note that Eigen actually is a linear algebra library. Exploiting it for pure portable SIMD code generation is not really what it is designed for.

If portability is your main concern, there are several libraries that provide SIMD instructions in their own syntax. Most of them make explicit vectorization simpler and more portable than intrinsics. This library (UME::SIMD) was published recently and has great performance:
In this paper (UME::SIMD) an interface based on Vc has been established which
is named UME::SIMD. It allows the programmer to access the SIMD
capabilities without the need for extensive knowledge of SIMD ISAs.
UME::SIMD provides a simple, flexible and portable abstraction for
explicit vectorization without performance losses compared to
intrinsics

I don't think there is a fully general solution for this. You can increase your "vector_size" to 32:
typedef float v4sf __attribute__ ((vector_size (32)));
Also increase all arrays to have 8 elements:
typedef float v8sf __attribute__ ((vector_size (32)));
typedef union {
v8sf v;
float e[8];
} float8;
typedef struct {
float8 x;
float8 y;
} complex8;
static complex8 complex8_mul(complex8 a, complex8 b) {
return (complex8){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
This will allow the compiler to generate AVX-512 code (don't forget to add -mavx512f), but will make your code slightly worse for SSE by making memory transfers sub-optimal. However, it will certainly not disable SSE vectorization.
You could keep both versions (with 4 and with 8 array elements), switching between them by some flag, but it might be too tedious for little benefit.
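One way to reduce that hand-duplication is to make the width a template parameter over the same GCC vector extension. The following is only an illustrative C++ sketch (names are mine, and the width still has to be chosen per target, e.g. by a compile-time flag); the width-to-vector-type map uses explicit specializations so it also works on compilers that reject a dependent vector_size:
// Sketch: one source for SSE (W=4), AVX (W=8) and AVX-512 (W=16).
template <int W> struct vec_of;
template <> struct vec_of<4>  { typedef float type __attribute__((vector_size(16))); };
template <> struct vec_of<8>  { typedef float type __attribute__((vector_size(32))); };
template <> struct vec_of<16> { typedef float type __attribute__((vector_size(64))); };
template <int W>
struct complexW {
    typename vec_of<W>::type re, im;   // separate real/imag planes (SoA)
};
template <int W>
static complexW<W> complexW_mul(complexW<W> a, complexW<W> b) {
    return { a.re * b.re - a.im * b.im,
             a.im * b.re + a.re * b.im };
}
template <int W>
complexW<W> fW(const complexW<W> *x, int n) {
    complexW<W> p;
    for (int k = 0; k < W; ++k) { p.re[k] = 1.0f; p.im[k] = 0.0f; }  // 1 + 0i in every lane
    for (int i = 0; i < n; ++i) p = complexW_mul(p, x[i]);
    return p;
}
Instantiating fW<4>, fW<8> and fW<16> and dispatching on a flag keeps the duplication down to a single switch, but the body is written only once.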

Related

Combining __restrict__ and __attribute__((aligned(32)))

I want to ensure that gcc knows:
The pointers refer to non-overlapping chunks of memory
The pointers have 32 byte alignments
Is the following correct?
template<typename T, typename T2>
void f(const T* __restrict__ __attribute__((aligned(32))) x,
T2* __restrict__ __attribute__((aligned(32))) out) {}
Thanks.
Update:
I tried using one read and lots of writes to saturate the CPU's store ports. I hoped that would make the performance gain from aligned moves more significant.
But the assembly still uses unaligned moves instead of aligned moves.
Code (also at godbolt.org)
void square(const float* __restrict__ __attribute__((aligned(32))) x,
const int size,
float* __restrict__ __attribute__((aligned(32))) out0,
float* __restrict__ __attribute__((aligned(32))) out1,
float* __restrict__ __attribute__((aligned(32))) out2,
float* __restrict__ __attribute__((aligned(32))) out3,
float* __restrict__ __attribute__((aligned(32))) out4) {
for (int i = 0; i < size; ++i) {
out0[i] = x[i];
out1[i] = x[i] * x[i];
out2[i] = x[i] * x[i] * x[i];
out3[i] = x[i] * x[i] * x[i] * x[i];
out4[i] = x[i] * x[i] * x[i] * x[i] * x[i];
}
}
Assembly compiled with gcc 8.2 and "-march=haswell -O3"
It is full of vmovups, which are unaligned moves.
.L3:
vmovups ymm1, YMMWORD PTR [rbx+rax]
vmulps ymm0, ymm1, ymm1
vmovups YMMWORD PTR [r14+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r15+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r12+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [rbp+0+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L3
and r13d, -8
vzeroupper
Same behavior even for sandybridge:
.L3:
vmovups xmm2, XMMWORD PTR [rbx+rax]
vinsertf128 ymm1, ymm2, XMMWORD PTR [rbx+16+rax], 0x1
vmulps ymm0, ymm1, ymm1
vmovups XMMWORD PTR [r14+rax], xmm0
vextractf128 XMMWORD PTR [r14+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r13+0+rax], xmm0
vextractf128 XMMWORD PTR [r13+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r12+rax], xmm0
vextractf128 XMMWORD PTR [r12+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [rbp+0+rax], xmm0
vextractf128 XMMWORD PTR [rbp+16+rax], ymm0, 0x1
add rax, 32
cmp rax, rdx
jne .L3
and r15d, -8
vzeroupper
Using addition instead of multiplication (godbolt).
Still unaligned moves.
No, using float *__attribute__((aligned(32))) x means that the pointer itself is stored in aligned memory, not pointing to aligned memory (see footnote 1).
There is a way to do this, but it only helps for gcc, not clang or ICC.
See How to tell GCC that a pointer argument is always double-word-aligned? for __builtin_assume_aligned which works on all GNU C compatible compilers, and How can I apply __attribute__(( aligned(32))) to an int *? for more details about __attribute__((aligned(32))), which does work for GCC.
I used __restrict instead of __restrict__ because that C++ extension name for C99 restrict is portable to all the mainstream x86 C++ compilers, including MSVC.
typedef float aligned32_float __attribute__((aligned(32)));
void prod(const aligned32_float * __restrict x,
const aligned32_float * __restrict y,
int size,
aligned32_float* __restrict out0)
{
size &= -16ULL;
#if 0 // this works for clang, ICC, and GCC
x = (const float*)__builtin_assume_aligned(x, 32); // have to cast the result in C++
y = (const float*)__builtin_assume_aligned(y, 32);
out0 = (float*)__builtin_assume_aligned(out0, 32);
#endif
for (int i = 0; i < size; ++i) {
out0[i] = x[i] * y[i]; // auto-vectorized with a memory operand for mulps
// note clang using two separate movups loads
// instead of a memory operand for mulps
}
}
(gcc, clang, and ICC output on the Godbolt compiler explorer).
GCC and clang will use movaps / vmovaps instead of the unaligned (ups) forms any time there is a compile-time alignment guarantee. (Unlike MSVC and ICC, which never use movaps for loads/stores, a missed optimization for anything that runs on Core2 / K10 or older.) And as you noticed, it's applying the -mavx256-split-unaligned-load/store effects for tunings other than Haswell (Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?), another clue that your syntax didn't work.
vmovups is not a performance problem when used on aligned memory; it performs identically to vmovaps on all AVX-supporting CPUs when the address is aligned at runtime. So in practice there's no real problem with your -march=haswell output. Only older CPUs, before Nehalem and Bulldozer, always decoded movups to multiple uops.
The real benefit (these days) to telling the compiler about alignment guarantees is that compilers sometimes emit extra code for startup/cleanup loops to reach an alignment boundary. Or without AVX, compilers can't fold a load into a memory operand for mulps unless it's aligned.
A good test case for this is out0[i] = x[i] * y[i], where the load result is only needed once. Or out0[i] *= x[i]. Knowing alignment enables movaps/mulps xmm0, [rsi], otherwise it's 2x movups + mulps. You can check for this optimization even on compilers like ICC or MSVC, which use movups even when they do know they have an alignment guarantee, but they will still make alignment-required code when they can fold a load into an ALU operation.
It seems __builtin_assume_aligned is the only really portable (to GNU C compilers) way to do this. You can do hacks like passing pointers to struct aligned_floats { alignas(32) float f[8]; };, but that's just cumbersome to use, and unless you actually access memory through objects of that type, it doesn't get compilers to assume alignment (e.g. casting a pointer to that type back to float * doesn't preserve the alignment information).
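A minimal sketch of that struct-based hack (names are illustrative, not from the original code): the alignment is only assumed because the loads and stores actually go through the struct members.
#include <cstddef>
struct aligned_floats { alignas(32) float f[8]; };
// Accessing memory through aligned_floats objects lets the compiler assume
// 32-byte alignment for these loads/stores, so it can use vmovaps or fold
// the loads into ALU memory operands.
void scale(aligned_floats *__restrict dst,
           const aligned_floats *__restrict src,
           std::size_t nblocks, float k)
{
    for (std::size_t b = 0; b < nblocks; ++b)
        for (int i = 0; i < 8; ++i)
            dst[b].f[i] = k * src[b].f[i];
}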
I tried using one read and lots of writes to saturate the CPU's store ports.
Using more than 4 output streams can hurt by resulting in more conflict misses in the cache. Skylake's L2 cache is only 4-way, for example. But L1d is 8-way so you're probably ok for small buffers.
If you want to saturate the store port uop throughput, use narrower stores (e.g. scalar), not wide SIMD stores that need more bandwidth per uop. Back-to-back stores to the same cache line may be able to merge in the store buffer before committing to L1d, so it depends what you want to test.
Semi-related: a 2x load + 1x store memory access pattern like c[i] = a[i]+b[i] or STREAM triad will come closest to maxing out total L1d cache load+store bandwidth on Intel Sandybridge-family CPUs. On SnB/IvB, 256-bit vectors take 2 cycles per load/store, leaving time for store-address uops to use the AGUs on ports 2 or 3 during the 2nd cycle of a load. On Haswell and later (256-bit wide load/store ports), the stores need to use a non-indexed addressing mode so they can use the simple-addressing-mode store AGU on port 7.
But AMD CPUs can do up-to-2 memory ops per clock, with at most one being a store, so they'd max out with a copy-and-operate stores = loads pattern.
BTW, Intel recently announced Sunny Cove (successor to Ice Lake), which will have 2x load + 2x store throughput per clock, a 2nd vector shuffle ALU, and 5-wide issue/rename. So that's fun! Compilers will need to unroll loops by at least 2 to not bottleneck on 1-per-clock loop branches.
Footnote 1: That's why (if you compile without AVX), you get a warning, and gcc omits an and rsp,-32 because it assumes RSP is already aligned. (It doesn't actually spill any YMM regs, so it should have optimized this out anyway, but gcc has had this missed-optimization bug for a while with locals or auto-vectorization-created objects with extra alignment.)
<source>:4:6: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6

The performance of multidimensional arrays and arrays of arrays

I have always thought (and been taught) that a multidimensional array, which is indexed with a single multiplication, is faster than an array of arrays, which is indexed through two pointer dereferences, due to better locality and the space savings.
I ran a small test a while ago, and the result was quite surprising. At least, my callgrind profiler reported that the same function using an array of arrays runs slightly faster.
I wonder whether I should change the definition of my matrix class to use an array of arrays internally. This class is used virtually everywhere in my simulation engine (not sure what else to call it), and I do want to find the best way to save a few seconds.
test_matrix has a cost of 350 200 020 and test_array_array has a cost of 325 200 016. The code was compiled with -O3 by clang++. All member functions are inlined according to the profiler.
#include <iostream>
#include <memory>
template<class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
BasicArray() = default;
BasicArray(std::size_t);
};
template<class T>
BasicArray<T>::BasicArray(std::size_t size)
: std::unique_ptr<T[]>(new T[size]) {}
template<class T>
class Matrix : public BasicArray<T> {
public:
Matrix() = default;
Matrix(std::size_t, std::size_t);
T &operator()(std::size_t, std::size_t) const;
std::size_t get_index(std::size_t, std::size_t) const;
std::size_t get_size(std::size_t) const;
private:
std::size_t sizes[2];
};
template<class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
: BasicArray<T>(i * j)
, sizes {i, j} {}
template<class T>
T &Matrix<T>::operator()(std::size_t i, std::size_t j) const {
return (*this)[get_index(i, j)];
}
template<class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const {
return i * get_size(2) + j;
}
template<class T>
std::size_t Matrix<T>::get_size(std::size_t d) const {
return sizes[d - 1];
}
template<class T>
class Array : public BasicArray<T> {
public:
Array() = default;
Array(std::size_t);
std::size_t get_size() const;
private:
std::size_t size;
};
template<class T>
Array<T>::Array(std::size_t size)
: BasicArray<T>(size)
, size(size) {}
template<class T>
std::size_t Array<T>::get_size() const {
return size;
}
static void __attribute__((noinline)) test_matrix(const Matrix<int> &m) {
for (std::size_t i = 0; i < m.get_size(1); ++i) {
for (std::size_t j = 0; j < m.get_size(2); ++j) {
static_cast<volatile void>(m(i, j) = i + j);
}
}
}
static void __attribute__((noinline))
test_array_array(const Array<Array<int>> &aa) {
for (std::size_t i = 0; i < aa.get_size(); ++i) {
for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
static_cast<volatile void>(aa[i][j] = i + j);
}
}
}
int main() {
constexpr int N = 1000;
Matrix<int> m(N, N);
Array<Array<int>> aa(N);
for (std::size_t i = 0; i < aa.get_size(); ++i) {
aa[i] = Array<int>(N);
}
test_matrix(m);
test_array_array(aa);
}
The performance of the two approaches is nearly the same because the inner-most loop can be optimized the same way in both cases and the computation is likely memory-bound. This means the overhead of the indirection is diluted in the rest of the computation, which takes most of the time and is subject to variations that can actually be bigger than the overhead. Thus the benchmark is not very sensitive to the difference between the two methods. Here is the assembly code of the inner-most loop (left side: matrix, right side: array of arrays):
.LBB0_17: .LBB1_30:
movdqa xmm5, xmm1 movdqa xmm5, xmm1
paddq xmm5, xmm4 paddq xmm5, xmm4
movdqa xmm6, xmm0 movdqa xmm6, xmm0
paddq xmm6, xmm4 paddq xmm6, xmm4
shufps xmm5, xmm6, 136 shufps xmm5, xmm6, 136
movdqa xmm6, xmm3 movdqa xmm6, xmm3
paddq xmm6, xmm1 paddq xmm6, xmm1
movdqa xmm7, xmm3 movdqa xmm7, xmm3
paddq xmm7, xmm0 paddq xmm7, xmm0
shufps xmm6, xmm7, 136 shufps xmm6, xmm7, 136
movups xmmword ptr [rdi + 4*rbx - 48], xmm5 movups xmmword ptr [rsi + 4*rcx], xmm5
movups xmmword ptr [rdi + 4*rbx - 32], xmm6 movups xmmword ptr [rsi + 4*rcx + 16], xmm6
movdqa xmm5, xmm0 movdqa xmm5, xmm0
paddq xmm5, xmm10 paddq xmm5, xmm10
movdqa xmm6, xmm1 movdqa xmm6, xmm1
paddq xmm6, xmm10 paddq xmm6, xmm10
movdqa xmm7, xmm3 movdqa xmm7, xmm3
paddq xmm7, xmm6 paddq xmm7, xmm6
paddq xmm6, xmm4 paddq xmm6, xmm4
movdqa xmm2, xmm3 movdqa xmm2, xmm3
paddq xmm2, xmm5 paddq xmm2, xmm5
paddq xmm5, xmm4 paddq xmm5, xmm4
shufps xmm6, xmm5, 136 shufps xmm6, xmm5, 136
shufps xmm7, xmm2, 136 shufps xmm7, xmm2, 136
movups xmmword ptr [rdi + 4*rbx - 16], xmm6 movups xmmword ptr [rsi + 4*rcx + 32], xmm6
movups xmmword ptr [rdi + 4*rbx], xmm7 movups xmmword ptr [rsi + 4*rcx + 48], xmm7
add rbx, 16 add rcx, 16
paddq xmm1, xmm11 paddq xmm1, xmm11
paddq xmm0, xmm11 paddq xmm0, xmm11
add rbp, 2 add rax, 2
jne .LBB0_17 jne .LBB1_30
As we can see, the loop contains basically the same instructions for the two methods. The order of the stores (movups) is not the same, but this should not impact the execution time (especially if the array is aligned in memory). The same applies to the different register names. The loop is vectorized using SIMD instructions (SSE) and unrolled 4 times, so it can be pretty fast (4 items can be computed per SIMD unit and 16 items per iteration). About 62 iterations (1000 items / 16 items per iteration) are needed for the inner-most loop to complete.
That being said, in both cases the loops write 4*1000*1000 = 3.81 MiB of data. This typically fits in the L3 cache on relatively recent processors (or in RAM on old processors). The throughput of the L3/RAM as seen from one core is limited (far lower than that of the L1 or even the L2 cache), so one core will likely stall waiting for the memory hierarchy. As a result, the loops are not so fast, since they spend most of their time waiting for the memory hierarchy. Hardware prefetchers are pretty efficient on modern x86-64 processors, so they can prefetch data before a core actually requests it, especially for stores and when the written data is contiguous.
The array-of-arrays method is generally less efficient because each sub-array is not guaranteed to be allocated contiguously. Modern memory allocators typically use a bucket-based strategy to find memory blocks fitting the requested size. In a program like this benchmark, the requested memory can be contiguous (or very close to it) since all the arrays are allocated in a row and the bucket memory is generally not fragmented when a program starts. However, when the memory is fragmented, the arrays tend to be located in non-contiguous regions, causing an effect called memory diffusion. Memory diffusion makes it harder for prefetchers to be efficient, resulting in less efficient loads/stores. This is generally especially true for loads, but stores also cause loads here on most x86-64 processors (Intel processors and recent AMD ones) due to the write-allocate cache policy. Put shortly, this is one main reason why the array-of-arrays method is generally less efficient in applications. But it is not the only one: the other comes from the indirections.
The overhead of the additional indirections is pretty small in this benchmark, mainly because of the memory-bound inner loop. The pointers of the sub-arrays are stored contiguously, so they can fit in the L1 cache and be efficiently prefetched. This means the indirections can be fast, because they are unlikely to cause a cache miss. The indirections cause additional load instructions, but since most of the time is spent waiting for the L3 cache or RAM, their overhead is very small, if not negligible. Indeed, modern processors execute instructions in parallel and out of order, so L1 accesses can be overlapped with L3/RAM loads/stores. For example, Intel processors have dedicated units for that: the line fill buffers (between the L1 and L2 caches), the super-queue (between the L2 and L3 caches) and the integrated memory controller (between the L3 and RAM). Most operations are done somewhat asynchronously. That being said, things start to become synchronous when cores stall waiting on incoming data or when buffers/queues are saturated.
The indirection overhead can become visible with a smaller inner-most loop, or if the 2D array is traversed non-contiguously. Indeed, if the inner-most loop only computes a few items, or if it is even replaced with a single statement, then the overhead of the indirections is much more visible. The processor cannot (easily) overlap the overhead, and the array-of-arrays method becomes slower than the matrix-based approach: here is the result of this new benchmark. The gap between the two methods seems small, but one should keep in mind that the cache is hot during the benchmark, while it may not be in a real-world application. Having a cold cache benefits the matrix-based method, which needs less data to be loaded from the cache (no need to load the array of pointers).
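The modified benchmark itself is only linked above, but the idea can be sketched like this (illustrative only, not the exact linked code): the inner body shrinks to a single store per row, so every access pays the indirection.
// Sketch: write one item per row; the array-of-arrays version must chase a
// row pointer for every single store, while the matrix version just advances
// by a constant stride.
static void test_matrix_col(Matrix<int> &m, std::size_t j) {
    for (std::size_t i = 0; i < m.get_size(1); ++i)
        m(i, j) = static_cast<int>(i);
}
static void test_array_array_col(Array<Array<int>> &aa, std::size_t j) {
    for (std::size_t i = 0; i < aa.get_size(); ++i)
        aa[i][j] = static_cast<int>(i);   // extra load: the row pointer aa[i]
}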
To understand why the gap is not so huge, we need to analyse the assembly code again. The full assembly code can be seen on Godbolt. Clang uses 3 different strategies to speed up the loop (SIMD, scalar+unrolling and plain scalar), but the unrolled one is the one that should actually be used in this case. Here is the hot loop for the matrix-based method:
.LBB0_27:
mov dword ptr [r12 + rdi], esi
lea ebx, [rsi + 1]
mov dword ptr [r12 + rdx], ebx
lea ebx, [rsi + 2]
mov dword ptr [r12 + rcx], ebx
lea ebx, [rsi + 3]
mov dword ptr [r12 + rax], ebx
add rsi, 4
add r12, r8
cmp rsi, r9
jne .LBB0_27
Here is the one for the array of array:
.LBB1_28:
mov rbp, qword ptr [rdi - 48]
mov dword ptr [rbp], edx
mov rbp, qword ptr [rdi - 32]
lea ebx, [rdx + 1]
mov dword ptr [rbp], ebx
mov rbp, qword ptr [rdi - 16]
lea ebx, [rdx + 2]
mov dword ptr [rbp], ebx
mov rbp, qword ptr [rdi]
lea ebx, [rdx + 3]
mov dword ptr [rbp], ebx
add rdx, 4
add rdi, 64
cmp rsi, rdx
jne .LBB1_28
At first glance, the second one seems clearly less efficient because there are far more instructions to execute. But as said previously, modern processors execute instructions in parallel. Thus, the instruction dependencies, and especially the critical path, play a significant role in the resulting performance (e.g. dependency chains), not to mention the saturation of processor units, more specifically the saturation of the back-end ports of the cores. Since the performance of this loop strongly depends on the target architecture, we should consider a specific processor architecture in order to analyse how fast each method is in this case. Let's choose a relatively recent mainstream architecture: Intel Coffee Lake.
The first loop is clearly bound by the store instructions (mov dword ptr [...], ...), since there is only 1 store port on this architecture, while the lea and add instructions can be executed on multiple ports (and the cmp+jne is cheap because it can be macro-fused and predicted). The loop should take 4 cycles per iteration unless it is bound by the memory hierarchy.
The second loop is more complex, but it is also bound by the store instructions mov dword ptr [rbp], edx. Indeed, Coffee Lake has two load ports, so 2 mov rbp, qword ptr [...] instructions can be executed per cycle; the same is true for the lea, which can also be executed on 2 ports; the add and cmp+jne are still cheap. The number of instructions is not big enough to saturate the front-end, so the ports are the bottleneck here. In the end, this loop also takes 4 cycles per iteration, assuming the memory hierarchy is not a problem. The thing is, the scheduling of the instructions is not always perfect in practice, so the dependencies on load instructions can introduce significant latency if something goes wrong. Since there is higher pressure on the memory hierarchy, a cache miss would cause the second loop to stall for many cycles, as opposed to the first loop (which only does writes). Not to mention that a cache miss is more likely to happen in the second case, since there is an 8 KB buffer of pointers to keep in the L1 cache for this computation to be fast: loading items from the L2 takes a dozen cycles, and loading data from the L3 can cause some cache lines to be evicted. This is why the second loop is slightly slower in this new benchmark.
What if we use another processor? The result can be significantly different, especially since Ice Lake (Intel) and Zen 2 (AMD), as they have 2 store ports. Things are pretty difficult to analyse on such processors (since no single port, nor, actually, the back-end at all, may be the bottleneck). This is especially true for Zen 2/Zen 3, which have 2 shared load/store ports and one dedicated only to stores (meaning 2 loads + 1 store scheduled in 1 cycle, or 1 load + 2 stores, or no load + 3 stores). Thus, the best approach is certainly to run practical benchmarks on such platforms while taking care to avoid benchmarking biases.
Note that the memory alignment of the sub-arrays is pretty critical too. With N=1024, the matrix-based method can be significantly slower. This is because the memory layout of the matrix-based method is likely to cause cache thrashing in this case, while the array-of-arrays method typically adds some padding preventing this issue in this very specific case. The thing is, the added padding is typically sizeof(size_t) for mainstream bucket-based allocators, so the issue just happens for another value of N and is not really prevented. In fact, for N=1022, the array-of-arrays method is significantly slower. This matches the above explanation perfectly, since sizeof(size_t) = 2*sizeof(int) = 8 on the target machine (64-bit). Thus, both methods suffer from this issue, but it can easily be controlled with the matrix-based method by adding some padding, while it cannot easily be controlled with the array-of-arrays method because the implementation of the allocator is platform-dependent by default.
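To make the padding remark concrete, here is a hedged sketch of how the matrix-based method can control it (the pad value is a tuning knob, not a recommendation, and the class reuses BasicArray from the question):
// Sketch: allocate the matrix with a padded row stride so rows of a
// power-of-two-sized matrix do not alias in the cache. Indexing uses the
// stride, while the loops still use the logical sizes.
template <class T>
class PaddedMatrix : public BasicArray<T> {
public:
    PaddedMatrix(std::size_t rows, std::size_t cols, std::size_t pad = 16)
        : BasicArray<T>(rows * (cols + pad)),
          rows_(rows), cols_(cols), stride_(cols + pad) {}
    T &operator()(std::size_t i, std::size_t j) const {
        return (*this)[i * stride_ + j];   // stride, not the logical width
    }
    std::size_t get_rows() const { return rows_; }
    std::size_t get_cols() const { return cols_; }
private:
    std::size_t rows_, cols_, stride_;
};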
I haven't looked through your code in a lot of detail. Instead, I tested your implementations against a really simple wrapper around an std::vector, then added a little bit of timing code so I didn't have to run under a profiler to get a meaningful result. Oh, and I really didn't like the code taking a reference to const, then using a cast to void to allow the code to modify the matrix. I certainly can't imagine expecting people to do that in normal use.
The result looked like this:
#include <chrono>
#include <iomanip>
#include <iostream>
#include <memory>
#include <vector>
template <class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
BasicArray() = default;
BasicArray(std::size_t);
};
template <class T>
BasicArray<T>::BasicArray(std::size_t size)
: std::unique_ptr<T[]>(new T[size])
{
}
template <class T>
class Matrix : public BasicArray<T> {
public:
Matrix() = default;
Matrix(std::size_t, std::size_t);
T& operator()(std::size_t, std::size_t) const;
std::size_t get_index(std::size_t, std::size_t) const;
std::size_t get_size(std::size_t) const;
private:
std::size_t sizes[2];
};
template <class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
: BasicArray<T>(i * j)
, sizes { i, j }
{
}
template <class T>
T& Matrix<T>::operator()(std::size_t i, std::size_t j) const
{
return (*this)[get_index(i, j)];
}
template <class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const
{
return i * get_size(2) + j;
}
template <class T>
std::size_t Matrix<T>::get_size(std::size_t d) const
{
return sizes[d - 1];
}
template <class T>
class Array : public BasicArray<T> {
public:
Array() = default;
Array(std::size_t);
std::size_t get_size() const;
private:
std::size_t size;
};
template <class T>
Array<T>::Array(std::size_t size)
: BasicArray<T>(size)
, size(size)
{
}
template <class T>
std::size_t Array<T>::get_size() const
{
return size;
}
static void test_matrix(Matrix<int>& m)
{
for (std::size_t i = 0; i < m.get_size(1); ++i) {
for (std::size_t j = 0; j < m.get_size(2); ++j) {
m(i, j) = i + j;
}
}
}
static void
test_array_array(Array<Array<int>>& aa)
{
for (std::size_t i = 0; i < aa.get_size(); ++i) {
for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
aa[i][j] = i + j;
}
}
}
namespace JVC {
template <class T>
class matrix {
std::vector<T> data;
size_t cols;
size_t rows;
public:
matrix(size_t y, size_t x)
: cols(x)
, rows(y)
, data(x * y)
{
}
T& operator()(size_t y, size_t x)
{
return data[y * cols + x];
}
T operator()(size_t y, size_t x) const
{
return data[y * cols + x];
}
std::size_t get_rows() const { return rows; }
std::size_t get_cols() const { return cols; }
};
static void test_matrix(matrix<int>& m)
{
for (std::size_t i = 0; i < m.get_rows(); ++i) {
for (std::size_t j = 0; j < m.get_cols(); ++j) {
m(i, j) = i + j;
}
}
}
}
template <class F, class C>
void do_test(F f, C &c, std::string const &label) {
using namespace std::chrono;
auto start = high_resolution_clock::now();
f(c);
auto stop = high_resolution_clock::now();
std::cout << std::setw(20) << label << " time: ";
std::cout << duration_cast<milliseconds>(stop - start).count() << " ms\n";
}
int main()
{
std::cout.imbue(std::locale(""));
constexpr int N = 20000;
Matrix<int> m(N, N);
Array<Array<int>> aa(N);
JVC::matrix<int> m2 { N, N };
for (std::size_t i = 0; i < aa.get_size(); ++i) {
aa[i] = Array<int>(N);
}
using namespace std::chrono;
do_test(test_matrix, m, "Matrix");
do_test(test_array_array, aa, "array of arrays");
do_test(JVC::test_matrix, m2, "JVC Matrix");
}
And the result looked like this:
Matrix time: 1,893 ms
array of arrays time: 1,812 ms
JVC Matrix time: 620 ms
So, a trivial wrapper around std::vector is faster than either of your implementations by a factor of about 3.
I would suggest that with this much overhead, it's difficult to be at all certain the timing difference you're seeing stems from storage layout.
To my surprise, your tests are basically correct.
They go against historical knowledge, too (see Dynamic Arrays in C—The Wrong Way).
I corroborated the result with Quickbench and the two timings are almost the same.
https://quick-bench.com/q/FhhJTV8IdIym0rUMkbUxvgnXPeA
The only explanation I can offer is that, since your code is so regular, the compiler figures out that you are asking for consecutive equal-sized allocations, which can effectively behave like a single block, and in turn the hardware can later predict the access pattern.
However, I tried making N volatile and inserting a bunch of randomly interleaved allocations at initialization, and still got the same result.
I even lowered the optimization to -Og, raised it up to -Ofast, and increased N, and I am still getting the same result.
It was only when I used benchmark::ClobberMemory that I saw a very small but appreciable difference (with clang, but not with GCC).
So it could have to do with the memory access pattern.
https://quick-bench.com/q/FhhJTV8IdIym0rUMkbUxvgnXPeA
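For reference, the harness has roughly this shape (a sketch of a Google Benchmark / Quick Bench loop, not the exact linked code; it assumes the Matrix class and test_matrix from the question are in scope):
#include <benchmark/benchmark.h>
static void BM_matrix(benchmark::State &state) {
    const int N = 1000;
    Matrix<int> m(N, N);
    for (auto _ : state) {
        test_matrix(m);                 // fill the matrix
        benchmark::ClobberMemory();     // make the stores observable to the optimizer
    }
}
BENCHMARK(BM_matrix);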
Another thing that made a (small) difference, and which is important in real applications, was to include the initialization step inside the timing; but, still surprisingly, it was only between 5 and 10% (in favor of the single-block array).
Conclusion: The compiler, or most likely the hardware, must be doing something really amazing.
The fact that the pointer-indirection version is never really faster than the block array makes me think that something is effectively reducing one case to the other.
This deserves more research.
Here is the machine code, if someone is interested: https://godbolt.org/z/ssGj7aq7j
Afterthought: Before abandoning contiguous arrays I would at least remain suspicious that this result could be an oddity for 2 dimensions and it is not valid for structures of 3 or 4 dimensions.
Disclaimer: This is interesting to me because I am implementing a multidimensional array library and I care about performance.
The library is a generalization of your Matrix class to arbitrary dimensions: https://gitlab.com/correaa/boost-multi.

Performance difference between member function and global function in release version

I have implemented two functions to compute the outer product of two Vectors (not std::vector); one is a member function and the other is a global one. Here are the key parts (additional code is omitted):
//for member function
template <typename Scalar>
SquareMatrix<Scalar,3> Vector<Scalar,3>::outerProduct(const Vector<Scalar,3> &vec3) const
{
SquareMatrix<Scalar,3> result;
for(unsigned int i = 0; i < 3; ++i)
for(unsigned int j = 0; j < 3; ++j)
result(i,j) = (*this)[i]*vec3[j];
return result;
}
//for global function: Dim = 3
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m)
{
for (unsigned int i=0; i<Dim; i++)
for (unsigned int j=0; j<Dim; j++)
{
m(i,j) = v1[i]*v2[j];
}
}
They are almost the same, except that one is a member function with a return value and the other is a global function where the computed values are assigned directly to a square matrix, thus requiring no return value.
Actually, I meant to replace the member function with the global one to improve performance, since the former involves copy operations. The strange thing, however, is that the global function takes almost twice as long as the member one. Furthermore, I find that the execution of
m(i,j) = v1[i]*v2[j]; // in global function
requires much more time than that of
result(i,j) = (*this)[i]*vec3[j]; // in member function
So the question is, how does this performance difference between the member and the global function arise?
Can anyone tell me the reason?
I hope I have presented my question clearly; sorry for my poor English!
//----------------------------------------------------------------------------------------
More information added:
The following is the code I use to test the performance:
//the codes below is in a loop
Vector<double, 3> vec1;
Vector<double, 3> vec2;
Timer timer;
timer.startTimer();
for (unsigned int i=0; i<100000; i++)
{
SquareMatrix<double,3> m = vec1.outerProduct(vec2);
}
timer.stopTimer();
std::cout<<"time cost for member function: "<< timer.getElapsedTime()<<std::endl;
timer.startTimer();
SquareMatrix<double,3> m;
for (unsigned int i=0; i<100000; i++)
{
outerProduct(vec1, vec2, m);
}
timer.stopTimer();
std::cout<<"time cost for global function: "<< timer.getElapsedTime()<<std::endl;
std::system("pause");
and the result captured:
You can see that the member function is almost twice as fast as the global one.
Additionally, my project is built on a 64-bit Windows system, and the code is in fact used to generate static lib files with the Scons build tool, which also produces VS2010 project files.
I should add that the strange performance difference only occurs in a release build; in a debug build, the global function is almost five times faster than the member one (about 0.10 s vs 0.02 s).
One possible explanation:
With inlining, in the first case, the compiler may know that result(i, j) (from a local variable) doesn't alias this[i] or vec3[j], and so neither the Scalar array of this nor that of vec3 is modified.
In the second case, from the function's point of view, the variables may alias, so each write into m might modify the Scalars of v1 or v2, so neither v1[i] nor v2[j] can be kept cached in a register.
You may try the restrict keyword extension to check if my hypothesis is correct.
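For instance, a sketch of that suggestion applied to the global function (assuming the compiler accepts __restrict on references, as GCC and MSVC do; untested against the OP's actual classes):
// If the aliasing hypothesis is right, promising the compiler that m does not
// overlap v1/v2 should let it keep v1[i] and v2[j] cached in registers again.
template <typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> &__restrict v1,
                  const Vector<Scalar, Dim> &__restrict v2,
                  SquareMatrix<Scalar, Dim> &__restrict m)
{
    for (unsigned int i = 0; i < Dim; i++)
        for (unsigned int j = 0; j < Dim; j++)
            m(i, j) = v1[i] * v2[j];
}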
EDIT: loop elision in the original assembly has been corrected
[paraphrased] Why is the performance different between the member function and the static function?
I'll start with the simplest things mentioned in your question, and progress to the more nuanced points of performance testing / analysis.
It is a bad idea to measure performance of debug builds. Compilers take liberties in many places, such as zeroing arrays that are uninitialized, generating extra code that isn't strictly necessary, and (obviously) not performing any optimization past the trivial ones such as constant propagation. This leads to the next point...
Always look at the assembly. C and C++ are high level languages when it comes to the subtleties of performance. Many people even consider x86 assembly a high level language since each instruction is decomposed into possibly several micro-ops during decoding. You cannot tell what the computer is doing just by looking at C++ code. For example, depending on how you implemented SquareMatrix, the compiler may or may not be able to perform copy elision during optimization.
Entering the somewhat more nuanced topics when testing for performance...
Make sure the compiler is actually generating loops. Using your example test code, g++ 4.7.2 doesn't actually generate loops with my implementation of SquareMatrix and Vector. I implemented them to initialize all components to 0.0, so the compiler can statically determine that the values never change, and so only generates a single set of mov instructions instead of a loop. In my example code, I use COMPILER_NOP which (with gcc) is __asm__ __volatile__("":::) inside the loop to prevent this (as compilers cannot predict side-effects from manual assembly, and so cannot elide the loop). Edit: I DO use COMPILER_NOP but since the output values from the functions are never used, the compiler is still able to remove the bulk of the work from the loop, and reduce the loop to this:
.L7
subl $1, %eax
jne .L7
I have corrected this by performing additional operations inside the loop. The loop now assigns a value from the output to the inputs, preventing this optimization and forcing the loop to cover what was originally intended.
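That pattern looks roughly like this (illustrative only; it assumes Vector exposes a writable operator[], which is not shown in the question):
// Feed part of each result back into the next iteration's inputs so the
// compiler cannot prove the loop body is dead and delete it.
Vector<double, 3> v1, v2;
SquareMatrix<double, 3> m;
for (unsigned int i = 0; i < 100000; ++i) {
    m = v1.outerProduct(v2);
    v1[0] = m(0, 0);   // result depends on the previous iteration
    v2[1] = m(1, 1);
}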
To (finally) get around to answering your question: When I implemented the rest of what is needed to get your code to run, and verified by checking the assembly that loops are actually generated, the two functions execute in the same amount of time. They even have nearly identical implementations in assembly.
Here's the assembly for the member function:
movsd 32(%rsp), %xmm7
movl $100000, %eax
movsd 24(%rsp), %xmm5
movsd 8(%rsp), %xmm6
movapd %xmm7, %xmm12
movsd (%rsp), %xmm4
movapd %xmm7, %xmm11
movapd %xmm5, %xmm10
movapd %xmm5, %xmm9
mulsd %xmm6, %xmm12
mulsd %xmm4, %xmm11
mulsd %xmm6, %xmm10
mulsd %xmm4, %xmm9
movsd 40(%rsp), %xmm1
movsd 16(%rsp), %xmm0
jmp .L7
.p2align 4,,10
.p2align 3
.L12:
movapd %xmm3, %xmm1
movapd %xmm2, %xmm0
.L7:
movapd %xmm0, %xmm8
movapd %xmm1, %xmm3
movapd %xmm1, %xmm2
mulsd %xmm1, %xmm8
movapd %xmm0, %xmm1
mulsd %xmm6, %xmm3
mulsd %xmm4, %xmm2
mulsd %xmm7, %xmm1
mulsd %xmm5, %xmm0
subl $1, %eax
jne .L12
and the assembly for the static function:
movsd 32(%rsp), %xmm7
movl $100000, %eax
movsd 24(%rsp), %xmm5
movsd 8(%rsp), %xmm6
movapd %xmm7, %xmm12
movsd (%rsp), %xmm4
movapd %xmm7, %xmm11
movapd %xmm5, %xmm10
movapd %xmm5, %xmm9
mulsd %xmm6, %xmm12
mulsd %xmm4, %xmm11
mulsd %xmm6, %xmm10
mulsd %xmm4, %xmm9
movsd 40(%rsp), %xmm1
movsd 16(%rsp), %xmm0
jmp .L9
.p2align 4,,10
.p2align 3
.L13:
movapd %xmm3, %xmm1
movapd %xmm2, %xmm0
.L9:
movapd %xmm0, %xmm8
movapd %xmm1, %xmm3
movapd %xmm1, %xmm2
mulsd %xmm1, %xmm8
movapd %xmm0, %xmm1
mulsd %xmm6, %xmm3
mulsd %xmm4, %xmm2
mulsd %xmm7, %xmm1
mulsd %xmm5, %xmm0
subl $1, %eax
jne .L13
In conclusion: You probably need to tighten your code up a bit before you can tell whether the implementations differ on your system. Make sure your loops are actually being generated (look at the assembly) and see whether the compiler was able to elide the return value from the member function.
If those things are true and you still see differences, can you post the implementations here for SquareMatrix and Vector so we can give you some more info?
Full code, a makefile, and the generated assembly for my working example is available as a GitHub gist.
Explicit instantiations of template function produce performance difference?
Some experiments I have done to look for the performance difference:
1.
Firstly, I suspected that the performance difference might be caused by the implementation itself. In fact, we have two sets of implementations: one written by ourselves (quite similar to the code by @black), and another that is a wrapper around Eigen::Matrix, selected by a macro switch. But switching between these two implementations does not change anything; the global one is still slower than the member one.
2.
Since this code (the classes Vector<Scalar, Dim> and SquareMatrix<Scalar, Dim>) lives in a large project, I guessed that the performance difference might be influenced by other code (though I think it unlikely, it was still worth a try). So I extracted all the necessary code (using our own implementation) and put it in a manually generated VS2010 project. Surprisingly, I found that the global one is slightly faster than the member one, which is the same result as @black and @Myles Hathcock got, even though I left the implementation of the code unchanged.
3.
Because in our project outerProduct is put into a release lib file, while my manually generated project straightforwardly produces .obj files linked into the .exe, I wanted to exclude this as the cause: I used the extracted code to produce a lib file through VS2010 and applied this lib file in another VS project to test the performance difference, but still the global one is slightly faster than the member one. So both have the same implementation and both are put into lib files; one is produced by Scons and the other by a VS project, yet they have different performance. Is Scons causing this problem?
4.
For the code shown in my question, the global function outerProduct is declared and defined in a .h file, then #included by a .cpp file. So when compiling this .cpp file, outerProduct will be instantiated. But if I change this to another approach (note that this code is now compiled by Scons to produce the lib file, not the manually generated VS2010 project):
First, I declare the global function outerProduct in .h file:
// outerProduct.h
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m);
then in .cpp file,
// outerProduct.cpp
template<typename Scalar, int Dim>
void outerProduct(const Vector<Scalar, Dim> & v1 , const Vector<Scalar, Dim> & v2, SquareMatrix<Scalar, Dim> & m)
{
for (unsigned int i=0; i<Dim; i++)
for (unsigned int j=0; j<Dim; j++)
{
m(i,j) = v1[i]*v2[j];
}
}
Since it is a template function, it requires some explicit instantiations:
// outerProduct.cpp
template void outerProduct<double, 3>(const Vector<double, 3> &, const Vector<double, 3> &, SquareMatrix<double, 3> &);
template void outerProduct<float, 3>(const Vector<float, 3> &, const Vector<float, 3> &, SquareMatrix<float, 3> &);
Finally, in .cpp file calling this function:
// use_outerProduct.cpp
#include "outerProduct.h" //note: outerProduct.cpp is not needful.
...
outerProduct(v1, v2, m)
...
The strange thing, now, is that the global one finally becomes slightly faster than the member one, as shown in the following picture:
But this only happens in the Scons environment. In the manually generated VS2010 project, the global one is always slightly faster than the member one. So does this performance difference only arise in the Scons environment, and does it become normal once the template function is explicitly instantiated?
Things are still strange! It seems Scons does something I didn't expect.
//------------------------------------------------------------------------
Additionally, the test code has now been changed to the following to avoid loop elision:
Vector<double, 3> vec1(0.0);
Vector<double, 3> vec2(1.0);
Timer timer;
while(true)
{
timer.startTimer();
for (unsigned int i=0; i<100000; i++)
{
vec1 = Vector<double, 3>(i);
SquareMatrix<double,3> m = vec1.outerProduct(vec2);
}
timer.stopTimer();
cout<<"time cost for member function: "<< timer.getElapsedTime()<<endl;
timer.startTimer();
SquareMatrix<double,3> m;
for (unsigned int i=0; i<100000; i++)
{
vec1 = Vector<double, 3>(i);
outerProduct(vec1, vec2, m);
}
timer.stopTimer();
cout<<"time cost for global function: "<< timer.getElapsedTime()<<endl;
system("pause");
}
@black, @Myles Hathcock: great thanks to you warm-hearted people!
@Myles Hathcock, your explanation is really a subtle and deep one, but I think I will benefit a lot from it.
Finally, the entire implementation is on
https://github.com/FeiZhu/Physika
which is a physics engine we are developing, and from which you can find more info, including the whole source code. Vector and SquareMatrix are defined in the Physika_Src/Physika_Core folder! The global function outerProduct is not uploaded, though; you can add it somewhere appropriate.

sse C++ memory commands

SSE asm has the SQRTPS instruction.
SQRTPS has 2 forms:
SQRTPS xmm1, xmm2
SQRTPS xmm1, m128
gcc/clang/VS compilers all have the helper intrinsic _mm_sqrt_ps.
But _mm_sqrt_ps can work only with a preloaded xmm variable (set with _mm_set_ps / _mm_load_ps).
From Visual Studio, for example:
http://msdn.microsoft.com/en-us/library/vstudio/8z67bwwk%28v=vs.100%29.aspx
What I expect:
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
asm{
sqrtps xmm0, data // DIRECTLY FROM MEMORY
movaps result, xmm0
}
What I have (in C):
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
auto xmm = _mm_load_ps(data); // or _mm_set_ps
xmm = _mm_sqrt_ps(xmm);
_mm_store_ps(&result[0], xmm);
(in asm):
movaps xmm1, data
sqrtps xmm0, xmm1 // FROM REGISTER
movaps result, xmm0
In other words, I would like to see something like this:
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
auto xmm = _mm_sqrt_ps(data); // DIRECTLY FROM MEMORY, no need to load (because there is such instruction)
_mm_store_ps(&result[0], xmm);
Quick research: I made the following file, called mysqrt.cpp:
#include <pmmintrin.h>
extern "C" __m128 MySqrt(__m128* a) {
return _mm_sqrt_ps(a[1]);
}
Trying gcc, namely g++4.8 -msse3 -O3 -S mysqrt.cpp && cat mysqrt.s:
_MySqrt:
LFB526:
sqrtps 16(%rdi), %xmm0
ret
Clang (clang++3.6 -msse3 -O3 -S mysqrt.cpp && cat mysqrt.s):
_MySqrt: ## #MySqrt
.cfi_startproc
## BB#0: ## %entry
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
sqrtps 16(%rdi), %xmm0
popq %rbp
retq
Don't know about VS, but at least both gcc and clang seem to produce memory version of sqrtps if needed.
UPDATE Example of function usage:
#include <iostream>
#include <pmmintrin.h>
extern "C" __m128 MySqrt(__m128* a);
int main() {
__m128 x[2];
x[1] = _mm_set_ps1(4);
__m128 y = MySqrt(x);
std::cout << y[0] << std::endl;
}
// output:
2
UPDATE 2: Regarding your code, you should just do:
auto xmm = _mm_sqrt_ps(*reinterpret_cast<__m128*>(data));
And of course it will be at your own risk, you should guarantee that data contains valid __m128 and is properly aligned.
I think you misunderstood the interface provided by the intrinsic _mm_sqrt_ps(__m128). The argument here can be a variable held in memory or in a register. The extension type __m128 acts like any normal builtin type, e.g. double, and is not bound to an xmm register but can also be stored in memory.
EDIT Unless you use asm, the compiler determines if and when a variable is loaded into a register or left in memory. So, in the following code snippet
__m128 foo(const __m128 x, const __m128*y, std::size_t n)
{
__m128 result = _mm_set1_ps(1.0f);
while(n--)
result = _mm_mul_ps(result,_mm_add_ps(x,_mm_sqrt_ps(*y++)));
return result;
}
it's up to the compiler which variables are stored in registers. I would think that the compiler puts x and result into xmm registers, but fetches *y directly from memory.
The answer to your question is that you can't control this, at least for aligned loads, with intrinsics. It's up to the compiler to decide whether it uses SQRTPS xmm1, xmm2 or SQRTPS xmm1, m128. If you want to be 100% certain, then you have to write it in assembly. This is one of the deficiencies of intrinsics (at least as they are currently implemented), in my opinion.
Some code can help explain this.
We can get GCC (64-bit with -O3) to generate both versions using aligned and unaligned loads
float x[4], y[4];
__m128 x4 = _mm_loadu_ps(x);
__m128 y4 = _mm_sqrt_ps(x4);
_mm_storeu_ps(y,y4);
This gives (with Intel syntax)
movups xmm0, XMMWORD PTR [rdx]
sqrtps xmm0, xmm0
However, if we do an aligned load we get the other form
float x[4], y[4];
__m128 x4 = _mm_load_ps(x);
__m128 y4 = _mm_sqrt_ps(x4);
_mm_storeu_ps(y,y4);
This combines the load and square root into one instruction
sqrtps xmm0, XMMWORD PTR [rax]
Most people would say "trust the compiler." I disagree. If you're using intrinsics then it should be assumed that YOU know what you're doing and NOT the compiler. Here is an example difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp where GCC chose one form and MSVC chose the other form (for multiplication instead of the sqrt) and it made a difference in performance.
So once again, if you're using aligned loads, you can only pray that the compiler does what you want. And then maybe on the next version of the compiler it does something different...
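If you truly need the memory-operand form no matter what the optimizer decides, GNU inline asm can force it. This is only a sketch for GCC/clang (AT&T syntax), and the pointer must be 16-byte aligned:
#include <xmmintrin.h>
__m128 sqrt_from_mem(const float *p16 /* must be 16-byte aligned */)
{
    __m128 r;
    /* The "m" constraint forces a memory operand, so this always assembles
       to sqrtps mem, %xmm (i.e. SQRTPS xmm1, m128 in Intel syntax). */
    __asm__("sqrtps %1, %0" : "=x"(r) : "m"(*(const __m128 *)p16));
    return r;
}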

Will the compiler unroll this loop?

I am creating a multi-dimensional vector (mathematical vector) that allows the basic mathematical operations +, -, /, *, =. The template takes two parameters: one is the type (int, float, etc.) and the other is the size of the vector. Currently I apply the operations via a for loop. Now, considering the size is known at compile time, will the compiler unroll the loop? If not, is there a way to unroll it with no (or minimal) performance penalty?
template <typename T, u32 size>
class Vector
{
public:
// Various functions for mathematical operations.
// The functions take in a Vector<T, size>.
// Example:
void add(const Vector<T, size>& vec)
{
for (u32 i = 0; i < size; ++i)
{
values[i] += vec[i];
}
}
private:
T values[size];
};
Before somebody comments "profile, then optimize", please note that this is the basis for my 3D graphics engine and it must be fast. Second, I want to know for the sake of educating myself.
You can do the following trick with disassembly to see how the particular code is compiled.
Vector<int, 16> a, b;
Vector<int, 65536> c, d;
asm("xxx"); // marker
a.add(b);
asm("yyy"); // marker
c.add(d);
asm("zzz"); // marker
Now compile
gcc -O3 1.cc -S -o 1.s
And see the disasm
xxx
# 0 "" 2
#NO_APP
movdqa 524248(%rsp), %xmm0
leaq 524248(%rsp), %rsi
paddd 524184(%rsp), %xmm0
movdqa %xmm0, 524248(%rsp)
movdqa 524264(%rsp), %xmm0
paddd 524200(%rsp), %xmm0
movdqa %xmm0, 524264(%rsp)
movdqa 524280(%rsp), %xmm0
paddd 524216(%rsp), %xmm0
movdqa %xmm0, 524280(%rsp)
movdqa 524296(%rsp), %xmm0
paddd 524232(%rsp), %xmm0
movdqa %xmm0, 524296(%rsp)
#APP
# 36 "1.cc" 1
yyy
# 0 "" 2
#NO_APP
leaq 262040(%rsp), %rdx
leaq -104(%rsp), %rcx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movdqa (%rcx,%rax), %xmm0
paddd (%rdx,%rax), %xmm0
movdqa %xmm0, (%rdx,%rax)
addq $16, %rax
cmpq $262144, %rax
jne .L2
#APP
# 38 "1.cc" 1
zzz
As you can see, the first loop was small enough to get unrolled. The second one remains a loop.
First: Modern CPUs are pretty smart about predicting branches, so unrolling the loop might not help (and could even hurt).
Second: Yes, modern compilers know how to unroll a loop like this, if it is a good idea for your target CPU.
Third: Modern compilers can even auto-vectorize the loop, which is even better than unrolling.
Bottom line: Do not think you are smarter than your compiler unless you know a lot about CPU architecture. Write your code in a simple, straightforward way, and do not worry about micro-optimizations until your profiler tells you to.
The loop can be unrolled using recursive template instantiation. This may or may not be faster on your C++ implementation.
I adjusted your example slightly, so that it would compile.
typedef unsigned u32; // or something similar
template <typename T, u32 size>
class Vector
{
// need to use an inner class, because member templates of an
// unspecialized template cannot be explicitly specialized.
template<typename Vec, u32 index>
struct Inner
{
static void add(Vec& a, const Vec& b)
{
a.values[index] += b.values[index];
// triggers recursive instantiation of Inner
Inner<Vec, index-1>::add(a,b);
}
};
// this specialization terminates the recursion
template<typename Vec>
struct Inner<Vec, 0>
{
static void add(Vec& a, const Vec& b)
{
a.values[0] += b.values[0];
}
};
public:
// PS! this function should probably take a
// _const_ Vector, because the argument is not modified
// Various functions for mathematical operations.
// The functions take in a Vector<T, size>.
// Example:
void add(Vector<T, size>& vec)
{
Inner<Vector, size-1>::add(*this, vec);
}
T values[size];
};
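A minimal usage sketch (assuming the u32 typedef above; the loop filling the values is only illustrative):
int main() {
    Vector<float, 4> a, b;
    for (u32 i = 0; i < 4; ++i) { a.values[i] = float(i); b.values[i] = 1.0f; }
    a.add(b);   // expands into four unrolled statements at compile time, no runtime loop
    return 0;
}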
The only way to figure this out is to try it on your own compiler with your own optimization parameters. Make one test file with your "does it unroll" code, test.cpp:
#include "myclass.hpp"
void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
a.add( b );
}
then a reference code snippet reference.cpp:
#include "myclass.hpp"
void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
a[0] += b[0];
a[1] += b[1];
a[2] += b[2];
}
and now use GCC to compile them and spit out only the assembly:
for x in *.cpp; do g++ -c "$x" -Wall -Wextra -O2 -S -o "out/$x.s"; done
In my experience, GCC will unroll loops of 3 or fewer iterations by default when the trip count is known at compile time; using -funroll-loops will cause it to unroll even more.
First of all, it is not at all certain that unrolling the loop would be beneficial.
The only possible answer to your question is "it depends" (on the compiler flags, on the value of size, etc).
If you really want to know, ask your compiler: compile into assembly code with typical values of size and with the optimization flags you'd use for real, and examine the result.
Many compilers will unroll this loop, no idea if "the compiler" you are referring to will. There isn't just one compiler in the world.
If you want to guarantee that it's unrolled, then TMP (with inlining) can do that. (This is actually one of the more trivial applications of TMP, often used as an example of metaprogramming).