Good evening.
I know C-style arrays and std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
return m + f;
}
And the benchmark functions:
void assemble_vec()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(v1[0],v2[0]);
glob[i+1] += comb(v1[1],v2[1]);
glob[i+2] += comb(v1[2],v2[2]);
}
}
void assemble_arr()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(a1[0],a2[0]);
glob[i+1] += comb(a1[1],a2[1]);
glob[i+2] += comb(a1[2],a2[2]);
}
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively: for the vector version, the numbers are added together through less efficient opcodes, whereas the array version uses (more) SSE instructions. The vector version also involves more memory lookups than the array version. These factors in combination are going to result in code that executes faster for the std::array version than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed to not be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
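You can see the same problem with plain pointers, no std::vector needed. A minimal sketch (hypothetical function, not part of the question):
#include <cstddef>
// Because dst and src may alias, the store to dst[i] could change *src,
// so the compiler reloads src[0] and src[1] on every iteration instead of
// hoisting the sum out of the loop.
void accumulate(double* dst, const double* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[0] + src[1];
}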
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In @Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
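Hand-written, that check-and-hoist version could look roughly like this (just a sketch of the transformation, not what any compiler actually emits; std::less keeps the cross-allocation pointer comparisons well-defined):
#include <functional>
void assemble_vec_versioned()
{
    std::less<const double*> lt;
    auto overlaps = [&](const vec& v) {
        return lt(v.data(), glob.data() + N) && lt(glob.data(), v.data() + v.size());
    };
    if (!overlaps(v1) && !overlaps(v2)) {
        // No aliasing possible: hoist the three sums out of the loop.
        const double c0 = comb(v1[0], v2[0]);
        const double c1 = comb(v1[1], v2[1]);
        const double c2 = comb(v1[2], v2[2]);
        for (size_t i = 0; i < N-2; ++i) {
            glob[i]   += c0;
            glob[i+1] += c1;
            glob[i+2] += c2;
        }
    } else {
        assemble_vec();   // conservative fallback: the original reload-every-iteration loop
    }
}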
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but maybe it used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
double *__restrict g = &glob[0]; // Helps MSVC, but not gcc/clang/ICC
// std::vector<double> &g = glob; // actually hurts ICC it seems?
// #define g glob // so use this as the alternative to __restrict
for (size_t i=0; i<N-2; ++i)
{
g[i] += comb(v1[0],v2[0]);
g[i+1] += comb(v1[1],v2[1]);
g[i+2] += comb(v1[2],v2[2]);
}
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
I think the point is that you use a very small storage size (six doubles). This allows the compiler, in the std::array case, to eliminate the in-RAM storage completely by placing the values in registers. The compiler can keep stack variables in registers if that is more optimal. This cuts memory accesses roughly in half (only the writes to glob remain). In the case of std::vector, the compiler cannot perform such an optimization since dynamic memory is used. Try using significantly larger sizes for a1, a2, v1, v2.
Related
I am trying to understand how Eigen::Ref works to see if I can take some advantage of it in my code.
I have designed a benchmark like this
static void value(benchmark::State &state) {
for (auto _ : state) {
const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
Eigen::Matrix<double, 9, 1>::Random();
auto start = std::chrono::high_resolution_clock::now();
const Eigen::Vector3d v0 = vs.segment<3>(0);
const Eigen::Vector3d v1 = vs.segment<3>(3);
const Eigen::Vector3d v2 = vs.segment<3>(6);
const Eigen::Vector3d vt = v0 + v1 + v2;
const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
benchmark::DoNotOptimize(v);
auto end = std::chrono::high_resolution_clock::now();
auto elapsed_seconds =
std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
state.SetIterationTime(elapsed_seconds.count());
}
}
I have two more tests like these, one using const Eigen::Ref<const Eigen::Vector3d> and one using auto for v0, v1, v2, vt.
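The ref variant, for example, is structured like this (a sketch reconstructing it from the description; vt is kept as a plain Vector3d here):
static void ref(benchmark::State &state) {
  for (auto _ : state) {
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    auto start = std::chrono::high_resolution_clock::now();
    // The segments are wrapped in Ref instead of being copied into Vector3d.
    const Eigen::Ref<const Eigen::Vector3d> v0 = vs.segment<3>(0);
    const Eigen::Ref<const Eigen::Vector3d> v1 = vs.segment<3>(3);
    const Eigen::Ref<const Eigen::Vector3d> v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
    benchmark::DoNotOptimize(v);
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed_seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
    state.SetIterationTime(elapsed_seconds.count());
  }
}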
The results of these benchmarks are
Benchmark Time CPU Iterations
--------------------------------------------------------------------
value/manual_time 23.4 ns 113 ns 29974946
ref/manual_time 23.0 ns 111 ns 29934053
with_auto/manual_time 23.6 ns 112 ns 29891056
As you can see, all the tests behave exactly the same. So I thought that maybe the compiler was doing its magic and decided to test with -O0. These are the results:
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
value/manual_time 2475 ns 3070 ns 291032
ref/manual_time 2482 ns 3077 ns 289258
with_auto/manual_time 2436 ns 3012 ns 263170
Again, the three cases behave the same.
If I understand correctly, the first case, using Eigen::Vector3d, should be slower, as it has to make the copies, perform the v0+v1+v2 operation and save it, and then perform another operation and save again.
The auto case should be the fastest, as it should skip all the writes.
The ref case, I think, should be as fast as auto. If I understand correctly, all my operations can be stored in a reference to a const Eigen::Vector3d, so the operations should be skipped, right?
Why are the results all the same? Am I misunderstanding something, or is the benchmark just badly designed?
One big issue with the benchmark is that you measure the time inside the hot benchmarking loop. The thing is, measuring the time takes some time itself, and it can be far more expensive than the actual computation. In fact, I think this is what is happening in your case. Indeed, on Clang 13 with -O3, here is the assembly code actually benchmarked (available on GodBolt):
mov rbx, rax
mov rax, qword ptr [rsp + 24]
cmp rax, 2
jle .LBB0_17
cmp rax, 5
jle .LBB0_17
cmp rax, 8
jle .LBB0_17
mov rax, qword ptr [rsp + 16]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16] # xmm1 = mem[0],zero
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
movapd xmm2, xmm0
mulpd xmm2, xmm0
movapd xmm3, xmm2
unpckhpd xmm3, xmm2 # xmm3 = xmm3[1],xmm2[1]
addsd xmm3, xmm2
movapd xmm2, xmm1
mulsd xmm2, xmm1
addsd xmm2, xmm3
movapd xmm3, xmm1
mulsd xmm3, xmm2
unpcklpd xmm2, xmm2 # xmm2 = xmm2[0,0]
mulpd xmm2, xmm0
addpd xmm2, xmm0
movapd xmmword ptr [rsp + 32], xmm2
addsd xmm3, xmm1
movsd qword ptr [rsp + 48], xmm3
This code can be executed in a few dozen cycles, so probably less than 10-15 ns on a 4~5 GHz modern x86 processor. Meanwhile, high_resolution_clock::now() should use a RDTSC/RDTSCP instruction that also takes dozens of cycles to complete. For example, on a Skylake processor, it should take about 25 cycles (similar on newer Intel processors). On an AMD Zen processor, it takes about 35-38 cycles. Additionally, it adds a synchronization that may not be representative of the actual application. Please consider measuring the time of a benchmarking loop with many iterations.
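A sketch of what that could look like with google benchmark (the name is made up): drop the manual clock calls and let the framework time whole iterations, with the random setup hoisted out so it still stays unmeasured.
static void value_loop_timed(benchmark::State &state) {
  // Setup done once, outside the measured iterations.
  const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
      Eigen::Matrix<double, 9, 1>::Random();
  for (auto _ : state) {
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
    benchmark::DoNotOptimize(v);   // no clock calls inside the loop
  }
}
BENCHMARK(value_loop_timed);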
Because everything happens inside a function, the compiler can do escape analysis and optimize away the copies into the vectors.
To check this, I put the code in a function, to look at the assembler:
Eigen::Vector3d foo(const Eigen::VectorXd& vs)
{
const Eigen::Vector3d v0 = vs.segment<3>(0);
const Eigen::Vector3d v1 = vs.segment<3>(3);
const Eigen::Vector3d v2 = vs.segment<3>(6);
const Eigen::Vector3d vt = v0 + v1 + v2;
return vt.transpose() * vt * vt + vt;
}
which turns into this assembler
push rax
mov rax, qword ptr [rsi + 8]
...
mov rax, qword ptr [rsi]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16]
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
...
movupd xmmword ptr [rdi], xmm2
addsd xmm3, xmm1
movsd qword ptr [rdi + 16], xmm3
mov rax, rdi
pop rcx
ret
Notice how the only memory operations are two GP register loads to get the start pointer and length, then a couple of memory loads to get the vector content into registers, before we write the result to memory at the end.
This only works since we deal with fixed-sized vectors. With VectorXd copies would definitely take place.
Alternative benchmarks
Ref is typically used on function calls. Why not try it with a function that cannot be inlined? Or come up with an example where escape analysis cannot work and the objects really have to be materialized. Something like this:
struct Foo
{
public:
Eigen::Vector3d v0;
Eigen::Vector3d v1;
Eigen::Vector3d v2;
Foo(const Eigen::VectorXd& vs) __attribute__((noinline));
Eigen::Vector3d operator()() const __attribute__((noinline));
};
Foo::Foo(const Eigen::VectorXd& vs)
: v0(vs.segment<3>(0)),
v1(vs.segment<3>(3)),
v2(vs.segment<3>(6))
{}
Eigen::Vector3d Foo::operator()() const
{
const Eigen::Vector3d vt = v0 + v1 + v2;
return vt.transpose() * vt * vt + vt;
}
Eigen::Vector3d bar(const Eigen::VectorXd& vs)
{
Foo f(vs);
return f();
}
By splitting initialization and usage into non-inline functions, the copies really have to be done. Of course we now change the entire use case. You have to decide if this is still relevant to you.
Purpose of Ref
Ref exists for the sole purpose of providing a function interface that can accept both a full matrix/vector and a slice of one. Consider this:
Eigen::VectorXd foo(const Eigen::VectorXd&)
This interface can only accept a full vector as its input. If you want to call foo(vector.head(10)) you have to allocate a new vector to hold the vector segment. Likewise it always returns a newly allocated vector which is wasteful if you want to call it as output.head(10) = foo(input). So we can instead write
void foo(Eigen::Ref<Eigen::VectorXd> out, const Eigen::Ref<const Eigen::VectorXd>& in);
and use it as foo(output.head(10), input.head(10)) without any copies being created. This is only ever useful across compilation units. If you have one cpp file declaring a function that is used in another, Ref allows this to happen. Within a cpp file, you can simply use a template.
template<class Derived1, class Derived2>
void foo(const Eigen::MatrixBase<Derived1>& out,
const Eigen::MatrixBase<Derived2>& in)
{
Eigen::MatrixBase<Derived1>& mutable_out =
const_cast<Eigen::MatrixBase<Derived1>&>(out);
mutable_out = ...;
}
A template will always be faster because it can make use of the concrete data type. For example if you pass an entire vector, Eigen knows that the array is properly aligned. And in a full matrix it knows that there is no stride between columns. It doesn't know either of these with Ref. In this regard, Ref is just a fancy wrapper around Eigen::Map<Type, Eigen::Unaligned, Eigen::OuterStride<>>.
Likewise there are cases where Ref has to create temporary copies. The most common case is if the inner stride is not 1. This happens for example if you pass a row of a matrix (but not a column. Eigen is column-major by default) or the real part of a complex valued matrix. You will not even receive a warning for this, your code will simply run slower than expected.
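For instance (a sketch with a hypothetical consumer baz; Eigen is column-major by default, so a row has a non-unit inner stride):
void baz(const Eigen::Ref<const Eigen::VectorXd>& in);   // hypothetical consumer
void demo()
{
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(4, 4);
    baz(m.col(0));   // inner stride 1: Ref wraps the column in place, no copy
    baz(m.row(0));   // inner stride 4: Ref silently copies into a temporary
}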
The only reasons to use Ref inside a single cpp file are
To make the code more readable. That template pattern shown above certainly doesn't tell you much about the expected types
To reduce code size, which may have a performance benefit, but usually doesn't. It does help with compile time, though
Use with fixed-size types
Since your use case seems to involve fixed-sized vectors, let's consider this case in particular and look at the internals.
void foo(const Eigen::Vector3d&);
void bar(const Eigen::Ref<const Eigen::Vector3d>&);
int main()
{
Eigen::VectorXd in = ...;
foo(in.segment<3>(6));
bar(in.segment<3>(6));
}
The following things will happen when you call foo:
We copy 3 doubles from in[6] to the stack. This takes 4 instructions (2 movapd, 2 movsd).
A pointer to these values is passed to foo. (Even fixed-size Eigen vectors declare a destructor, therefore they are always passed on the stack, even if we declare them pass-by-value)
foo then loads the values via that pointer, taking 2 instructions (movapd + movsd)
The following happens when we call bar:
We create a Ref<Vector> object. For this, we put a pointer to in.data() + 6 on the stack
A pointer to this pointer is passed to bar
bar loads the pointer from the stack, then loads the values
Notice how there is barely any difference. Maybe Ref saves a few instructions, but it also introduces one more indirection. Compared to everything else, this is hardly significant. It is certainly too little to measure.
We are also entering the realm of micro-optimizations here. This can lead to situations where just the arrangement of code results in measurably different optimizations; see, for example, Eigen: Why is Map slower than Vector3d for this template expression?
The below code (needs google benchmark) fills up two vectors and adds them up, storing the result in the first one. For the vector types I've used Eigen::VectorXd and std::vector for performance comparison:
#include <Eigen/Core>
#include <benchmark/benchmark.h>
#include <vector>
auto constexpr N = 1024u;
template <typename TVector>
TVector generate(unsigned min) {
TVector v(N);
for (unsigned i = 0; i < N; ++i)
v[i] = static_cast<double>(min + i);
return v;
}
auto ev1 = generate<Eigen::VectorXd>(0);
auto ev2 = generate<Eigen::VectorXd>(N);
auto sv1 = generate<std::vector<double>>(0);
auto sv2 = generate<std::vector<double>>(N);
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
v1 += v2;
}
void add_vectors(std::vector<double>& v1, std::vector<double> const& v2) {
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
static void eigen(benchmark::State& state) {
for (auto _ : state) {
add_vectors(ev1, ev2);
benchmark::DoNotOptimize(ev1);
}
}
static void standard(benchmark::State& state) {
for (auto _ : state) {
add_vectors(sv1, sv2);
benchmark::DoNotOptimize(sv1);
}
}
BENCHMARK(standard);
BENCHMARK(eigen);
I'm running it on an Intel Xeon E-2286M @ 2.40GHz, using Eigen 3.3.9 and MSVC 16.11.2 with (among others) these relevant compiler switches: /GL, /Gy, /O2, /D "NDEBUG", /Oi, and /arch:AVX. A typical output looks like this:
Run on (16 X 2400 MHz CPU s)
CPU Caches:
L1 Data 32K (x8)
L1 Instruction 32K (x8)
L2 Unified 262K (x8)
L3 Unified 16777K (x1)
--------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------
standard 99 ns 100 ns 7466667
eigen 169 ns 169 ns 4072727
which seems to show that operating on std::vector is ~69% faster than on Eigen::VectorXd. In the disassembly, the tight loops look like these:
// For Eigen::VectorXd
00007FF672221A11 vmovupd ymm0,ymmword ptr [rcx+rax*8]
00007FF672221A16 vaddpd ymm1,ymm0,ymmword ptr [r8+rax*8]
00007FF672221A1C vmovupd ymmword ptr [r8+rax*8],ymm1
00007FF672221A22 add rax,4
00007FF672221A26 cmp rax,rdx
00007FF672221A29 jge eigen+0C7h (07FF672221A37h)
00007FF672221A2B mov rcx,qword ptr [rsp+48h]
00007FF672221A30 mov r8,qword ptr [rsp+58h]
00007FF672221A35 jmp eigen+0A1h (07FF672221A11h)
// For std::vector
00007FF672221B40 vmovups ymm1,ymmword ptr [rax+rdx-20h]
00007FF672221B46 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx-20h]
00007FF672221B4C vmovups ymmword ptr [rax+rcx-20h],ymm1
00007FF672221B52 vmovups ymm1,ymmword ptr [rax+rdx]
00007FF672221B57 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx]
00007FF672221B5C vmovups ymmword ptr [rax+rcx],ymm1
00007FF672221B61 lea rax,[rax+40h]
00007FF672221B65 sub r8,1
00007FF672221B69 jne standard+0C0h (07FF672221B40h)
One can notice that both use vaddpd to add 4 doubles at a time. However, for std::vector the compiler unrolled the loop to perform 2 vaddpd per iteration, but it didn't do the same for Eigen::VectorXd. Another potentially important difference is that the loop for std::vector is aligned to 32 bytes (the address ends in 0x40 = 64 = 2*32).
FWIW: I've added /Qvec-report:2 and the compiler reports:
[...]\Core\AssignEvaluator.h(415) : info C5002: loop not vectorized due to reason '1305'
and reason 1305 means "Not enough type information."
My educated guess is that Eigen's effort to use intrinsics (here _mm256_add_pd) is counterproductive and confuses the compiler. Just letting the compiler do its business (auto-vectorisation) seems to be a better idea. Am I missing something, or could this be considered an Eigen bug (missed optimisation opportunity)?
TL;DR: The problem mainly comes from the constant loop bound and not directly from Eigen. Indeed, in the first case, Eigen stores the size of the vectors in vector attributes, while in the second case, you explicitly use the constant N.
Clever compilers can use this information to unroll loops more aggressively because they know that N is quite big. Unrolling a loop with a small N is a bad idea since the code will be bigger and has to be read by the processor. If the code is not already loaded in the L1 cache, it must be loaded from the other caches, the RAM or even the storage device in the worst case. The added latency is often bigger than executing a sequential loop with a small unroll factor. This is why compilers do not always unroll loops (at least not with a big unroll factor).
Inlining also plays an important role in this code. Indeed, if the functions are inlined, the compiler can propagate constants and know the size of the vector, enabling it to further optimize the code by unrolling the loop more aggressively. However, if the functions are not inlined, then there is no way the compiler can know the loop bounds. Clever compilers can still generate conditional algorithms to optimize both small loops and big ones, but this makes the program bigger and introduces a small overhead. Compilers like ICC and Clang do generate different code alternatives when the code can be vectorized but the loop bounds are unknown, or when aliasing is not known at compile time (the number of generated variants can quickly become huge, and so can the code size).
Note that inlining functions may not be enough, since the constant propagation can be blocked by complex conditionals dealing with runtime-defined variables or by non-inlined function calls. Alternatively, the quality of the constant propagation may not be sufficient for the target example.
Finally, aliasing also plays a critical role in the ability of compilers to generate SIMD instructions (and possibly better unroll the loop) in this code. Indeed, aliasing often prevents the use of SIMD instructions, and it is not always easy for compilers to check aliasing and generate fast implementations accordingly.
Testing the hypotheses
If the vector-based implementation uses a loop bound stored in the vector objects, then the code generated by MSVC is not vectorized in the benchmark: the constant is not propagated correctly despite the inlining of the function. The resulting code should be much slower. Here is the generated code:
$LL24#standard:
vmovsd xmm0, QWORD PTR [r9+rcx*8]
vaddsd xmm1, xmm0, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r8+rcx*8], xmm1
mov rax, QWORD PTR std::vector<double,std::allocator<double> > sv1+8
inc edx
sub rax, QWORD PTR std::vector<double,std::allocator<double> > sv1
sar rax, 3
mov ecx, edx
cmp rcx, rax
jb SHORT $LL24#standard
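For reference, the tested vector-based variant presumably looks like this (a sketch; the exact source is not shown here):
void add_vectors(std::vector<double>& v1, std::vector<double> const& v2) {
    // The bound comes from the object at runtime instead of the compile-time constant N.
    for (unsigned i = 0; i < v1.size(); ++i)
        v1[i] += v2[i];
}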
If the Eigen-based implementation uses a constant loop bound, then the code generated by MSVC is well vectorized and unrolled correctly in the benchmark: the compile-time constant helps the compiler to generate a loop unrolled 2 times. It does that by mixing SSE and AVX instructions, which is very surprising (this point is discussed below). The resulting code should be significantly faster than the original Eigen implementation. However, it may not be as fast as the initial vector implementation due to the unexpected use of SSE instructions. Here is the generated code:
$LL24#eigen:
vmovupd xmm1, XMMWORD PTR [rdx+rcx-16]
vaddpd xmm1, xmm1, XMMWORD PTR [rcx-16]
vmovupd xmm2, XMMWORD PTR [rcx+rdx]
vmovupd XMMWORD PTR [rcx-16], xmm1
vaddpd xmm1, xmm2, XMMWORD PTR [rcx]
vmovupd XMMWORD PTR [rcx], xmm1
vmovups ymm1, YMMWORD PTR [rdx+rcx+16]
vaddpd ymm1, ymm1, YMMWORD PTR [rcx+16]
vmovups YMMWORD PTR [rcx+16], ymm1
lea rcx, QWORD PTR [rcx+64]
sub rax, 1
jne SHORT $LL24#eigen
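And the Eigen-based variant with a constant loop bound is presumably the element-wise loop below (a sketch; it is the same loop that appears with the OpenMP pragma further down):
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
    // Explicit element-wise loop; the bound is the compile-time constant N.
    for (unsigned i = 0; i < N; ++i)
        v1[i] += v2[i];
}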
Additional notes
It is worth noting that the generated code for the non-inlined version uses very inefficient scalar code (typically due to N being unknown and pointer aliasing being considered possible).
Mixing SSE and AVX instructions in such a loop in your case is clearly a sub-optimal strategy and likely a compiler issue/bug. Indeed, the execution speed of the resulting code is certainly bounded by the store instructions on Intel processors like yours. Your processor can execute 1 store instruction per cycle, 2 load instructions per cycle, and 2 vectorized additions per cycle. It can execute up to 6 micro-instructions per cycle (coming from 5 independent instructions and possibly 4 cached additional instructions). As a result, the generated code mixing SSE and AVX will take at least 3 cycles per iteration. Meanwhile, the original vector-based implementation could execute 4 loads, 2 stores, 2 additions and 3 other instructions like lea/sub/branch in only 2 cycles (possibly 3 in practice due to complex hardware behaviour like the actual micro-instruction port scheduling and the micro-instruction cache). However, note that the compiler arguments do not specify to optimize the code for your specific processor architecture (i.e. Intel Coffee Lake). Still, I highly doubt mixing SSE and AVX code would result in any significant boost in performance on AMD processors either (or on any mainstream x86 processor). Alternatively, it might be because MSVC fails to fully detect that there is no aliasing in this case.
To get rid of most of the aliasing problems preventing code vectorization and loop unrolling, OpenMP SIMD directives (eg. #pragma omp simd) can be used. MSVC supports this experimentally using the flag /openmp:experimental. Here is the modified code:
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
#pragma omp simd
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
MSVC surprisingly generates assembly code with only SSE instructions, but if you enable AVX2, then it generates relatively good code:
$LL26#eigen:
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
lea rdx, QWORD PTR [rdx+128]
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-192]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-192]
vmovupd YMMWORD PTR [rdx+rcx-192], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-160]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-160]
vmovupd YMMWORD PTR [rdx+rcx-160], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-128]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-128]
vmovupd YMMWORD PTR [rdx+rcx-128], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-96]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-96]
vmovupd YMMWORD PTR [rdx+rcx-96], ymm0
sub r8, 1
jne $LL26#eigen
This code is still not perfect due to the unexpected useless mov instructions.
Alternatively, it may be possible to use fixed-size Eigen vectors for better performance.
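A minimal sketch of that fixed-size alternative, assuming the size really is a compile-time constant (1024 doubles is still below Eigen's default stack-allocation limit):
using FixedVec = Eigen::Matrix<double, N, 1>;   // N = 1024 from the benchmark above
void add_vectors(FixedVec& v1, FixedVec const& v2) {
    v1 += v2;   // the size is a compile-time constant, so no runtime bound has to be loaded
}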
Finally, note that other compilers (like Clang, ICC and GCC) behave very differently on this benchmark.
I have this code:
constexpr size_t S = 4;
void add(std::array<float, S>& a, std::array<float, S> b)
{
for (size_t i = 0; i < S; ++i)
a[i] += b[i];
}
Both clang and gcc realize that instead of doing 4 single additions they can do one packed addition, using the addps instruction. E.g. clang generates this:
movups xmm2, xmmword ptr [rdi]
movlhps xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
addps xmm0, xmm2
movups xmmword ptr [rdi], xmm0
ret
As you can see on godbolt, gcc is a bit behind clang as it needs more moves. But that's fine. My problem is msvc which is way worse as you can see:
mov eax, DWORD PTR _a$[esp-4]
movups xmm2, XMMWORD PTR _b$[esp-4]
movss xmm1, DWORD PTR [eax+4]
movaps xmm0, xmm2
addss xmm0, DWORD PTR [eax]
movss DWORD PTR [eax], xmm0
movaps xmm0, xmm2
shufps xmm0, xmm2, 85 ; 00000055H
addss xmm1, xmm0
movaps xmm0, xmm2
shufps xmm0, xmm2, 170 ; 000000aaH
shufps xmm2, xmm2, 255 ; 000000ffH
movss DWORD PTR [eax+4], xmm1
movss xmm1, DWORD PTR [eax+8]
addss xmm1, xmm0
movss xmm0, DWORD PTR [eax+12]
addss xmm0, xmm2
movss DWORD PTR [eax+8], xmm1
movss DWORD PTR [eax+12], xmm0
ret 0
I tried different optimization levels, but /O2 seems to be the best. I also tried manually unrolling the loop, but no change for msvc.
So, is there a way to make msvc do the same optimization, using one addps instead of four addss? Or is there maybe a good reason why msvc doesn't do it?
Edit
By adding the /Qvec-report:2 flag as suggested by Shawn in the comments (thanks!) I found out that msvc thinks the loop is too small to have any benefit from vectorizing it. Clang and gcc have different opinions, but OK.
And indeed, if I change S to 16, msvc comes up with a vectorized version, even though it still provides a non-vectorized branch (completely unnecessary in my opinion, as S is known at compile time). In general, msvc's code looks like a mess compared to gcc and clang, see here.
I have tested the code you posted in Microsoft Visual Studio 2017 and it works for me. When I call your function add with aligned and non-aliased parameters, it compiles to the addps instruction, not addss. Maybe you are using an older version of Visual Studio?
However, I was able to reproduce your problem by deliberately giving the function non-aligned or aliased parameters. In order to accomplish this, I replaced the function parameters with C-style array pointers (because I don't know how exactly std::array is implemented) and deliberately called the function with aliased pointers, by making the two arrays overlap. In that case, the generated code calls addss four times instead of addps once. Deliberately passing an unaligned pointer had the same effect.
This behavior also makes sense. For vectorization to be meaningful, the compiler must be sure that the arrays do not overlap and that they are properly aligned. I believe alignment is less of an issue with AVX than with SSE.
Of course, the compiler must be able to determine whether there are possible aliasing or alignment issues at compile-time, not at run-time. Therefore, maybe the problem is that you are calling the function in such a way that the compiler can't be sure at compile-time whether the parameters are aliased and whether they are aligned. Compilers are sometimes not very smart at determining these things. However, as you have pointed out in the comments section, since you are passing one parameter by value, the compiler should be able to determine that there is no danger of overlap. Therefore, my guess is that it is an alignment issue, as the compiler is unsure at compile-time how the contents of std::array are aligned. As I am unable to reproduce your problem using std::array, you may want to post the code showing how you are calling the function.
You can also enforce vectorization by explicitly calling the corresponding compiler intrinsic _mm_add_ps for the instruction addps.
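A minimal sketch of that intrinsic-based version (unaligned loads are used since std::array<float, 4> only guarantees float alignment):
#include <array>
#include <cstddef>
#include <xmmintrin.h>
constexpr std::size_t S = 4;
void add(std::array<float, S>& a, std::array<float, S> b)
{
    __m128 va = _mm_loadu_ps(a.data());              // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b.data());              // load 4 floats from b
    _mm_storeu_ps(a.data(), _mm_add_ps(va, vb));     // one packed add, store back into a
}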
The norm member function in the C++ vector class declared below is marked as const and (as far as I can tell) does not contain any side effects.
template <unsigned int N>
struct vector {
double v[N];
double norm() const {
double ret = 0;
for (int i=0; i<N; ++i) {
ret += v[i]*v[i];
}
return ret;
}
};
double test(const vector<100>& x) {
return x.norm() + x.norm();
}
If I call norm multiple times on a const instantiation of vector (see test function above) with the gcc compiler (version 5.4) and optimizations turned on (i.e. -O3) then the compiler inlines norm, but still calculates the result of norm multiple times, even though the result should not change. Why doesn't the compiler optimize out the second call to norm and only calculate this result once? This answer seems to indicate that the compiler should perform this optimization if the compiler determines that the norm function does not have any side-effects. Why doesn't this happen in this case?
Note that I'm determining what the compiler produces using the Compiler Explorer, and that the assembly output for gcc version 5.4 is given below. The clang compiler gives similar results. Note also that if I use gcc's compiler attributes to manually mark norm as a const function using __attribute__((const)), then the second call is optimized out, as I wanted, but my question is why gcc (and clang) do not do this automatically since the norm definition is available?
test(vector<100u>&):
pxor xmm2, xmm2
lea rdx, [rdi+800]
mov rax, rdi
.L2:
movsd xmm1, QWORD PTR [rax]
add rax, 8
cmp rdx, rax
mulsd xmm1, xmm1
addsd xmm2, xmm1
jne .L2
pxor xmm0, xmm0
.L3:
movsd xmm1, QWORD PTR [rdi]
add rdi, 8
cmp rdx, rdi
mulsd xmm1, xmm1
addsd xmm0, xmm1
jne .L3
addsd xmm0, xmm2
ret
The compiler can calculate the result of norm and reuse it multiple times. E.g. with the -Os switch:
test(vector<100u> const&):
xorps xmm0, xmm0
xor eax, eax
.L2:
movsd xmm1, QWORD PTR [rdi+rax]
add rax, 8
cmp rax, 800
mulsd xmm1, xmm1
addsd xmm0, xmm1
jne .L2
addsd xmm0, xmm0
ret
The missing optimization isn't due to not-associative-floating-point-math or some observable-behavior-issue.
In a not properly mutexed environment another function might change the contents in the array in between the calls of norm
It may happen, but it's not a concern for the compiler (e.g. https://stackoverflow.com/a/25472679/3235496).
Compiling the example with the -O2 -fdump-tree-all switch you can see that:
g++ correctly detects vector<N>::norm() as a pure function (output file .local-pure-const1);
inlining happens at early stages (output file .einline).
Also note that when norm is marked with __attribute__ ((noinline)), the compiler performs CSE:
test(vector<100u> const&):
sub rsp, 8
call vector<100u>::norm() const
add rsp, 8
addsd xmm0, xmm0
ret
Marc Glisse is (probably) right.
A more advanced form of Common Subexpression Elimination is required to un-inline the recurrent expression.
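Until compilers do this automatically, a trivial workaround is to hoist the common subexpression by hand, e.g. rewriting test as:
double test(const vector<100>& x) {
    const double n = x.norm();   // evaluated once
    return n + n;                // reuse the result instead of calling norm() twice
}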
I'm trying to figure out how to best pre-calculate some sin and cosine values, store them in aligned blocks, and then use them later for SSE calculations:
At the beginning of my program, I create an object with member:
static __m128 *m_sincos;
then I initialize that member in the constructor:
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++)
m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));
When I go to use m_sincos, I run into three problems:
-The data does not seem to be aligned
movaps xmm0, m_sincos[t] //crashes
movups xmm0, m_sincos[t] //does not crash
-The variables do not seem to be correct
movaps result, xmm0 // returns values that are not what is in m_sincos[t]
//Although, putting a watch on m_sincos[t] displays the correct values
-What really confuses me is that this makes everything work (but is too slow):
__m128 _sincos = m_sincos[t];
movaps xmm0, _sincos
movaps result, xmm0
m_sincos[t] is a C expression. In an assembly instruction, however, (__asm?), it's interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:
movaps xmm0, m_sincos[t]
into: (see the disassembly window when the app crashes in debug mode)
movaps xmm0, xmmword ptr [t]
That interpretation attempts to copy a 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit value at a likely unaligned address. Executing the instruction is likely to cause an alignment failure, and would get you incorrect results in the odd case where t's address happens to be aligned.
You could fix this by using an appropriate x86 addressing mode. Here's the slow but clear version:
__asm mov eax, m_sincos ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4 ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
Sidenote:
When I put this in a complete program, something odd occurs:
#include <math.h>
#include <tchar.h>
#include <xmmintrin.h>
int main()
{
static __m128 *m_sincos;
int Bins = 4;
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++) {
m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
__asm movaps xmm0, m_sincos[t];
__asm mov eax, m_sincos
__asm mov ebx, t
__asm shl ebx, 4
__asm movaps xmm0, [eax+ebx];
}
return 0;
}
When you run this, if you keep an eye on the registers window, you might notice something odd. Although the results are correct, xmm0 is getting the correct value before the movaps instruction is executed. How does that happen?
A look at the generated assembly code shows that _mm_set_ps() loads the sin/cos results into xmm0, then saves it to the memory address of m_sincos[t]. But the value remains there in xmm0 too. _mm_set_ps is an 'intrinsic', not a function call; it does not attempt to restore the values of registers it uses after it's done.
If there's a lesson to take from this, it might be that when using the SSE intrinsic functions, use them throughout, so the compiler can optimize things for you. Otherwise, if you're using inline assembly, use that throughout too.
You should always use the intrinsics rather than explicitly coding this in assembly, because __asm is not portable to 64-bit code.
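A minimal sketch of the all-intrinsics approach (use_sincos and the multiply are placeholders for whatever math follows the table lookup):
#include <xmmintrin.h>
__m128 use_sincos(const __m128* m_sincos, int t, __m128 x)
{
    __m128 sc = m_sincos[t];      // plain C++ load; the compiler emits movaps for the aligned data
    return _mm_mul_ps(sc, x);     // stay in intrinsics for the follow-up computation
}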