How to make msvc vectorize float addition? - c++

I have this code:
constexpr size_t S = 4;
void add(std::array<float, S>& a, std::array<float, S> b)
{
for (size_t i = 0; i < S; ++i)
a[i] += b[i];
}
Both, clang and gcc, realize that instead of doing 4 single additions they can do one packed addition, using the addps instruction. E.g. clang generates this:
movups xmm2, xmmword ptr [rdi]
movlhps xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
addps xmm0, xmm2
movups xmmword ptr [rdi], xmm0
ret
As you can see on godbolt, gcc is a bit behind clang as it needs more moves. But that's fine. My problem is msvc which is way worse as you can see:
mov eax, DWORD PTR _a$[esp-4]
movups xmm2, XMMWORD PTR _b$[esp-4]
movss xmm1, DWORD PTR [eax+4]
movaps xmm0, xmm2
addss xmm0, DWORD PTR [eax]
movss DWORD PTR [eax], xmm0
movaps xmm0, xmm2
shufps xmm0, xmm2, 85 ; 00000055H
addss xmm1, xmm0
movaps xmm0, xmm2
shufps xmm0, xmm2, 170 ; 000000aaH
shufps xmm2, xmm2, 255 ; 000000ffH
movss DWORD PTR [eax+4], xmm1
movss xmm1, DWORD PTR [eax+8]
addss xmm1, xmm0
movss xmm0, DWORD PTR [eax+12]
addss xmm0, xmm2
movss DWORD PTR [eax+8], xmm1
movss DWORD PTR [eax+12], xmm0
ret 0
I tried different optimization levels, but /O2 seems to be the best. I also tried manually unrolling the loop, but no change for msvc.
So, is there a way to make msvc do the same optimization, using one addps instead of four addss? Or is there maybe a good reason why msvc doesn't do it?
Edit
By adding the /Qvec-report:2 flag as suggested by Shawn in the comments (thanks!) I found out that msvc thinks the loop is to small to have any benefit from vectorizing it. Clang and gcc have different opinions, but OK.
And indeed, if I change S to 16, msvc comes up with a vectorized version, even though it still provides a non vectorized branch (completely unnecessary in my opinion as S is known at compile time). In general, msvc's code looks like a mess compared to gcc and clang, see here.

I have tested the code you posted in Microsoft Visual Studio 2017 and it works with me. When I call your function add with aligned and non-aliased parameters, your function add compiles to the addps instruction, not addss. Maybe you are using an older version of Visual Studio?
However, I was able to reproduce your problem by deliberately giving the function non-aligned or aliased parameters. In order to accomplish this, I replaced the function parameters with C-style array pointers (because I don't know how exactly std::array is implemented) and deliberately called the function with aliased pointers, by making the two arrays overlap. In that case, the generated code calls addss four times instead of addps once. Deliberately passing an unaligned pointer had the same effect.
This behavior also makes sense. For vectorization to be meaningful, the compiler must be sure that the arrays do not overlap and that they are properly aligned. I believe alignment is less of an issue with AVX than with SSE.
Of course, the compiler must be able to determine whether there is possible aliasing or alignment issues at compile-time, not at run-time. Therefore, maybe the problem is that you are calling the function in such a way that the compiler can't be sure at compile-time whether the paramaters are aliased and whether the parameters are aligned. Compilers are sometimes not very smart at determining these things. However, as you have pointed out in the comments section, since you are passing one parameter by value, the compiler should be able to determine that there is no danger of overlap. Therefore, my guess is that it is an alignment issue, as the compiler is unsure at compile-time how the contents of std:array is aligned. As I am unable to reproduce your problem using std::array, you may want to post your code on how you are calling the function.
You can also enforce vectorization by explicitly calling the corresponding compiler intrinsic _mm_add_ps for the instruction addps.

Related

Eigen::VectorXd::operator += seems ~69% slower than looping through a std::vector

The below code (needs google benchmark) fills up two vectors and adds them up, storing the result in the first one. For the vector types I've used Eigen::VectorXd and std::vector for performance comparison:
#include <Eigen/Core>
#include <benchmark/benchmark.h>
#include <vector>
auto constexpr N = 1024u;
template <typename TVector>
TVector generate(unsigned min) {
TVector v(N);
for (unsigned i = 0; i < N; ++i)
v[i] = static_cast<double>(min + i);
return v;
}
auto ev1 = generate<Eigen::VectorXd>(0);
auto ev2 = generate<Eigen::VectorXd>(N);
auto sv1 = generate<std::vector<double>>(0);
auto sv2 = generate<std::vector<double>>(N);
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
v1 += v2;
}
void add_vectors(std::vector<double>& v1, std::vector<double> const& v2) {
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
static void eigen(benchmark::State& state) {
for (auto _ : state) {
add_vectors(ev1, ev2);
benchmark::DoNotOptimize(ev1);
}
}
static void standard(benchmark::State& state) {
for (auto _ : state) {
add_vectors(sv1, sv2);
benchmark::DoNotOptimize(sv1);
}
}
BENCHMARK(standard);
BENCHMARK(eigen);
I'm running it on Intel Xeon E-2286M #2.40Ghz, using Eigen 3.3.9, MSVC 16.11.2 with (among others) these relevant compiler swicthes /GL, /Gy, /O2, /D "NDEBUG", /Oi, and /arch:AVX. A tipical output looks like this:
Run on (16 X 2400 MHz CPU s)
CPU Caches:
L1 Data 32K (x8)
L1 Instruction 32K (x8)
L2 Unified 262K (x8)
L3 Unified 16777K (x1)
--------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------
standard 99 ns 100 ns 7466667
eigen 169 ns 169 ns 4072727
which seems to show that operating on std::vector is ~69% faster than on Eigen::VectorXd. In the disassembly, the tight loops look like these:
// For Eigen::VectorXd
00007FF672221A11 vmovupd ymm0,ymmword ptr [rcx+rax*8]
00007FF672221A16 vaddpd ymm1,ymm0,ymmword ptr [r8+rax*8]
00007FF672221A1C vmovupd ymmword ptr [r8+rax*8],ymm1
00007FF672221A22 add rax,4
00007FF672221A26 cmp rax,rdx
00007FF672221A29 jge eigen+0C7h (07FF672221A37h)
00007FF672221A2B mov rcx,qword ptr [rsp+48h]
00007FF672221A30 mov r8,qword ptr [rsp+58h]
00007FF672221A35 jmp eigen+0A1h (07FF672221A11h)
// For std::vector
00007FF672221B40 vmovups ymm1,ymmword ptr [rax+rdx-20h]
00007FF672221B46 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx-20h]
00007FF672221B4C vmovups ymmword ptr [rax+rcx-20h],ymm1
00007FF672221B52 vmovups ymm1,ymmword ptr [rax+rdx]
00007FF672221B57 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx]
00007FF672221B5C vmovups ymmword ptr [rax+rcx],ymm1
00007FF672221B61 lea rax,[rax+40h]
00007FF672221B65 sub r8,1
00007FF672221B69 jne standard+0C0h (07FF672221B40h)
One can notice that both use vaddpd to add 4 doubles at time. However, for std::vector the compiler unrolled the loop to perform 2 vaddpd per iteration but it didn't do the same for Eigen::VectorXd. Another potentially important difference is that the loop for std::vector is aligned to 32 bytes (address ends in 0x40 = 64 = 2*32).
FWIW: I've added /Qvec-report:2 and the compiler reports:
[...]\Core\AssignEvaluator.h(415) : info C5002: loop not vectorized due to reason '1305'
and reason 1305 means "Not enough type information."
My educated guess is that Eigen's effort to use intrinsics (here _mm256_add_pd) is counterproductive and confuses the compiler. Just leaving the compiler do its business (auto-vectorisation) seems to be a better idea. Am I missing something or could this be considered an Eigen bug (missed optimisation opportunity)?
TL;DR: The problem mainly comes from the constant loop bound and not directly from Eigen. Indeed, in the first case, Eigen store the size of the vectors in vector attributes while in the second case, you explicitly use the constant N.
Clever compilers can use this information to unroll loops more aggressively because they know that N is quite big. Unrolling a loop with a small N is a bad idea since the code will be bigger and has to read by the processor. If the code is not already loaded in the L1 cache, it must be loaded from the other caches, the RAM or even the storage device in the worst case. The added latency is often bigger than executing a sequential loop with a small unroll factor. This is why compilers do not always unroll loops (at least not with a big unroll factor).
Inlining also plays an important role in this code. Indeed, if the functions are inlined, the compiler can propagate constants and know the size of the vector enabling it to further optimize the code by unrolling the loop more aggressively. However, if the functions are not inlined, then there is no way the compiler can know the loop bounds. Clever compilers can still generate conditional algorithm to optimize both small loops and big ones but this makes the program bigger and introduces a small overhead. Compilers like ICC and Clang do generate the different code alternatives when the code can be vectorized but the loop bounds are unknown or also when aliasing is not known at compile time (the number of generated variants can quickly be huge and so the code size).
Note that inlining functions may not be enough since the constant propagation can be trapped by a complex conditionals dealing with runtime-defined variables or non-inlined function calls. Alternatively, the quality of the constant propagation may not be sufficient for the target example.
Finally, aliasing also play a critical role in the ability of compilers to generate SIMD instructions (and possibly better unroll the loop) in this code. Indeed, aliasing often prevent the use of SIMD instructions and it is not always easy for compilers to check aliasing and generate fast implementations accordingly.
Testing the hypothesises
If the vector-based implementation use a loop bound stored in the vector objects, then the code generated by MSVC is not vectorized in the benchmark: the constant is not propagated correctly despite the inlining of the function. The resulting code should be much slower. Here is the generated code:
$LL24#standard:
vmovsd xmm0, QWORD PTR [r9+rcx*8]
vaddsd xmm1, xmm0, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r8+rcx*8], xmm1
mov rax, QWORD PTR std::vector<double,std::allocator<double> > sv1+8
inc edx
sub rax, QWORD PTR std::vector<double,std::allocator<double> > sv1
sar rax, 3
mov ecx, edx
cmp rcx, rax
jb SHORT $LL24#standard
If the Eigen-based implementation use a constant loop bound, then the generated code by MSVC is well vectorized and unrolled correctly in the benchmark: the compile-time constant helps the compiler to generate an loop unrolled 2 times. It does that by mixing SSE and AVX instructions which is very surprising (this point is discussed below). The resulting code should be significantly faster than the original Eigen implementation. However, it may not be as fast as the initial vector implementation due to the unexpected use of SSE instructions. Here is the generated code:
$LL24#eigen:
vmovupd xmm1, XMMWORD PTR [rdx+rcx-16]
vaddpd xmm1, xmm1, XMMWORD PTR [rcx-16]
vmovupd xmm2, XMMWORD PTR [rcx+rdx]
vmovupd XMMWORD PTR [rcx-16], xmm1
vaddpd xmm1, xmm2, XMMWORD PTR [rcx]
vmovupd XMMWORD PTR [rcx], xmm1
vmovups ymm1, YMMWORD PTR [rdx+rcx+16]
vaddpd ymm1, ymm1, YMMWORD PTR [rcx+16]
vmovups YMMWORD PTR [rcx+16], ymm1
lea rcx, QWORD PTR [rcx+64]
sub rax, 1
jne SHORT $LL24#eigen
Additional notes
It is worth noting that the generated code for the non-inlined version use a very inefficient scalar code (typically due to N being unknown and pointer aliasing expected to be possible).
Mixing SSE and AVX instructions in such a loop in your case is clearly a sub-optimal strategy and likely a compiler issue/bug. Indeed, the execution speed of the resulting code is certainly bounded by the store instructions on Intel processors like your. Your processor can execute 1 store instruction per cycle, 2 load instructions per cycle and can compute 2 vectorized addition per cycle. It can execute up to 6 micro-instructions per cycle (coming from 5 independent instructions and possibly 4 cached additional instructions). As a result, the generated code mixing SSE and AVX will at least take 3 cycles per iterations. Meanwhile, the original vector-based implementation could execute 4 loads, 2 stores, 2 additions and 3 other instructions like lea/sub/branch in only 2 cycles (possibly 3 in practice due to to complex hardware stuff like the actual micro-instruction port scheduling, the micro-instruction cache). However, note that the compiler argument do not specify to optimize the code for your specific processor architecture (ie. Intel Coffee Lake). Still, I highly doubt mixing SSE and AVX code would result in any significant boost in performance on an AMD processors too (or any mainstream x86 processors). Alternatively, I might be because the MSVC fails to fully detect that there is no aliasing in this case.
To get rid of the most aliasing problems preventing code vectorization and loop unrolling, OpenMP SIMD directives (eg. #pragma omp simd) can be used. MSVC support this experimentally using the flag /openmp:experimental. Here is resulting code:
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
#pragma omp simd
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
MSVC surprisingly generates an assembly code with only SSE instructions, but if you enable AVX2, then it generate a relatively good code:
$LL26#eigen:
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
lea rdx, QWORD PTR [rdx+128]
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-192]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-192]
vmovupd YMMWORD PTR [rdx+rcx-192], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-160]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-160]
vmovupd YMMWORD PTR [rdx+rcx-160], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-128]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-128]
vmovupd YMMWORD PTR [rdx+rcx-128], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-96]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-96]
vmovupd YMMWORD PTR [rdx+rcx-96], ymm0
sub r8, 1
jne $LL26#eigen
This code is still not perfect due to the unexpected useless mov instructions.
Alternatively, it may be possible to use fixed-size Eigen vectors for better performance.
Finally, note that other compilers (like Clang, ICC and GCC) behave very differently on this benchmark.

C++ performance std::array vs std::vector

Good evening.
I know C-style arrays or std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have some situation in which the use of std::array performs better than with std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
return m + f;
}
And the benchmark functions:
void assemble_vec()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(v1[0],v2[0]);
glob[i+1] += comb(v1[1],v2[1]);
glob[i+2] += comb(v1[2],v2[2]);
}
}
void assemble_arr()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(a1[0],a2[0]);
glob[i+1] += comb(a1[1],a2[1]);
glob[i+2] += comb(a1[2],a2[2]);
}
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively, where for the vector version, the numbers are being added together through less efficient opcodes, as compared to the array version, which is using (more) SSE instructions. The vector version also involves more memory lookups compared to the array version. These factors in combination with each other is going to result in code that executes faster for the std::array version of the code than it will for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed to not be modified after it's constructed, but the memory is still dynamically allocated an all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In #Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v1.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it load the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but it maybe used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
double *__restrict g = &glob[0]; // Helps MSVC, but not gcc/clang/ICC
// std::vector<double> &g = glob; // actually hurts ICC it seems?
// #define g glob // so use this as the alternative to __restrict
for (size_t i=0; i<N-2; ++i)
{
g[i] += comb(v1[0],v2[0]);
g[i+1] += comb(v1[1],v2[1]);
g[i+2] += comb(v1[2],v2[2]);
}
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
I think the point is that you use too small storage size (six doubles), this allows the compiler, in the std::array case, to completely eliminate in RAM storing by placing values in the registers. The compiler can store stack variables to registers if it more optimal. This decrease memory accesses by half (only writing to glob remains). In the case of a std::vector, the compiler cannot perform such an optimization since dynamic memory is used. Try to use significantly larger sizes for a1, a2, v1, v2

Why isn't this C++ member function optimized by the compiler with -O3?

The norm member function in the C++ vector class declared below is marked as const and (as far as I can tell) does not contain any side effects.
template <unsigned int N>
struct vector {
double v[N];
double norm() const {
double ret = 0;
for (int i=0; i<N; ++i) {
ret += v[i]*v[i];
}
return ret;
}
};
double test(const vector<100>& x) {
return x.norm() + x.norm();
}
If I call norm multiple times on a const instantiation of vector (see test function above) with the gcc compiler (version 5.4) and optimizations turned on (i.e. -O3) then the compiler inlines norm, but still calculates the result of norm multiple times, even though the result should not change. Why doesn't the compiler optimize out the second call to norm and only calculate this result once? This answer seems to indicate that the compiler should perform this optimization if the compiler determines that the norm function does not have any side-effects. Why doesn't this happen in this case?
Note that I'm determining what the compiler produces using the Compiler Explorer, and that the assembly output for gcc version 5.4 is given below. The clang compiler gives similar results. Note also that if I use gcc's compiler attributes to manually mark norm as a const function using __attribute__((const)), then the second call is optimized out, as I wanted, but my question is why gcc (and clang) do not do this automatically since the norm definition is available?
test(vector<100u>&):
pxor xmm2, xmm2
lea rdx, [rdi+800]
mov rax, rdi
.L2:
movsd xmm1, QWORD PTR [rax]
add rax, 8
cmp rdx, rax
mulsd xmm1, xmm1
addsd xmm2, xmm1
jne .L2
pxor xmm0, xmm0
.L3:
movsd xmm1, QWORD PTR [rdi]
add rdi, 8
cmp rdx, rdi
mulsd xmm1, xmm1
addsd xmm0, xmm1
jne .L3
addsd xmm0, xmm2
ret
The compiler can calculate the result of norm and reuse it multiple times. E.g. with the -Os switch:
test(vector<100u> const&):
xorps xmm0, xmm0
xor eax, eax
.L2:
movsd xmm1, QWORD PTR [rdi+rax]
add rax, 8
cmp rax, 800
mulsd xmm1, xmm1
addsd xmm0, xmm1
jne .L2
addsd xmm0, xmm0
ret
The missing optimization isn't due to not-associative-floating-point-math or some observable-behavior-issue.
In a not properly mutexed environment another function might change the contents in the array in between the calls of norm
It may happen, but it's not a concern for the compiler (e.g. https://stackoverflow.com/a/25472679/3235496).
Compiling the example with the -O2 -fdump-tree-all switch you can see that:
g++ correctly detects vector<N>::norm() as a pure function (output file .local-pure-const1);
inlining happens at early stages (output file .einline).
Also note that marking norm with __attribute__ ((noinline)) the compiler performs CSE:
test(vector<100u> const&):
sub rsp, 8
call vector<100u>::norm() const
add rsp, 8
addsd xmm0, xmm0
ret
Marc Glisse is (probably) right.
A more advanced form of Common Subexpression Elimination is required to un-inline the recurrent expression.

What is the purpose of xorps on the same register?

I was looking at the following disassembled c++ code
auto test2 = convert<years, weeks>(2.0);
00007FF6D6475ECC mov eax,16Dh
00007FF6D6475ED1 xorps xmm1,xmm1
00007FF6D6475ED4 cvtsi2sd xmm1,rax
00007FF6D6475ED9 mulsd xmm1,mmword ptr [__real#4000000000000000 (07FF6D64AFE38h)]
00007FF6D6475EE1 divsd xmm1,mmword ptr [__real#401c000000000000 (07FF6D64AFE58h)]
and was curious as to what the point of the xorps xmm1, xmm1 instruction was. It seems like any number xor itself would just give 0? If so, what's the purpose of clearing the register?
Note: I'm just asking this out of pure curiosity. I know very little about assembly language.
The XMM register has 128 bits and using cvtsi2sd only fills up the low 64 bits. Therefore, the xorps instruction is used to clear the possible garbage values and/or dependency chains that would otherwise affect subsequent operations.
Basically, the sequence of operations you have is:
mov eax, 16Dh ; load 0x16D into lower 32 bits of RAX register
xorps xmm1, xmm1 ; zero xmm1
cvtsi2sd xmm1, rax ; load lower 32 bits from RAX into xmm1
<do more stuff with xmm1>
The necessity of zeroing a register is very frequent in assembly when only loading parts of registers where the subsequent instructions operate on their full range. Doing xor x, x is one of the usual register clearing patterns.
See also this (very exhaustive and great, as per comments) answer for more details on why xor can be preferrable to other alternatives (mov x, 0, and x, 0).

Strange compiler behavior when inlining(ASM code included) [duplicate]

This question already exists:
Strange compiler behavior when inlining(ASM code included)
Closed 8 years ago.
My problem is, that the compiler chooses not to inline a function in a specific case, thus making the code a LOT slower.The function is supposed to compute the dot product for a vector(SIMD accelerated).I have it written in two different styles:
1.The Vector class aggregates a __m128 member.
2.The Vector is just a typedef of the __m128 member.
In case 1 I get 2 times slower code, the function doesn't inline. In case 2 I get optimal code, very fast, inlined.
In case 1 the Vector and the Dot functions look like this:
__declspec(align(16)) class Vector
{
public:
__m128 Entry;
Vector(__m128 s)
{
Entry = s;
}
};
Vector Vector4Dot(Vector v1, Vector v2)
{
return(Vector(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4)));
}
In case 2 the Vector and the Dot functions look like this:
typedef __m128 Vector;
Vector Vector4Dot(Vector v1, Vector v2)
{
return(_mm_dp_ps(v1, v2, DotMask4));
}
I'm compiling on MSVC in Visual Studio 2012 on x86 in Release mode with all optimizations enabled, optimize for speed, whole program optimization, etc.Whether I put all the code of case 1 in the header or use this with combination with forceinline, it doesn't matter, it doesn't get inlined.Here is the generated ASM:
Case 1:
movaps xmm0, XMMWORD PTR [esi]
lea eax, DWORD PTR $T1[esp+32]
movaps xmm1, xmm0
push eax
call ?Vector4Dot#Framework##SA?AVVector#23#T__m128##0#Z ; Framework::Vector4Dot
movaps xmm0, XMMWORD PTR [eax]
add esp, 4
movaps XMMWORD PTR [esi], xmm0
lea esi, DWORD PTR [esi+16]
dec edi
jne SHORT $LL3#Test89
This is at the place where I call Vector4Dot.Here is the inside of it(the function):
mov eax, DWORD PTR _v2$[esp-4]
movaps xmm0, XMMWORD PTR [edx]
dpps xmm0, XMMWORD PTR [eax], 255 ; 000000ffH
movaps XMMWORD PTR [ecx], xmm0
mov eax, ecx
For case 2 I just get:
movaps xmm0, XMMWORD PTR [eax]
dpps xmm0, xmm0, 255 ; 000000ffH
movaps XMMWORD PTR [eax], xmm0
lea eax, DWORD PTR [eax+16]
dec ecx
jne SHORT $LL3#Test78
Which is a LOT faster.I'm not sure why the compiler can't deal with that constructor.If I change case1 like this:
__m128 Vector4Dot(Vector v1, Vector v2)
{
return(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4));
}
It compiles at maximum speed the same as case 2.It's this "class overhead" that is giving me the performance penalty.Is there any way to get around this?Or am I stuck with using raw __m128's instead of the Vector class?