Strange compiler behavior when inlining(ASM code included) [duplicate] - c++

My problem is that the compiler chooses not to inline a function in one specific case, making the code a LOT slower. The function is supposed to compute the dot product of a vector (SIMD accelerated). I have it written in two different styles:
1. The Vector class aggregates a __m128 member.
2. The Vector is just a typedef of __m128.
In case 1 I get code that is 2 times slower and the function doesn't inline. In case 2 I get optimal code: very fast, inlined.
In case 1 the Vector and the Dot functions look like this:
__declspec(align(16)) class Vector
{
public:
    __m128 Entry;

    Vector(__m128 s)
    {
        Entry = s;
    }
};

Vector Vector4Dot(Vector v1, Vector v2)
{
    return(Vector(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4)));
}
In case 2 the Vector and the Dot functions look like this:
typedef __m128 Vector;
Vector Vector4Dot(Vector v1, Vector v2)
{
    return(_mm_dp_ps(v1, v2, DotMask4));
}
I'm compiling with MSVC in Visual Studio 2012, targeting x86, in Release mode with all optimizations enabled (optimize for speed, whole program optimization, etc.). Whether I put all the code of case 1 in the header or combine that with __forceinline, it doesn't matter: it doesn't get inlined. Here is the generated ASM:
Case 1:
movaps xmm0, XMMWORD PTR [esi]
lea eax, DWORD PTR $T1[esp+32]
movaps xmm1, xmm0
push eax
call ?Vector4Dot@Framework@@SA?AVVector@23@T__m128@@0@Z ; Framework::Vector4Dot
movaps xmm0, XMMWORD PTR [eax]
add esp, 4
movaps XMMWORD PTR [esi], xmm0
lea esi, DWORD PTR [esi+16]
dec edi
jne SHORT $LL3@Test89
This is the call site of Vector4Dot. Here is the body of the function itself:
mov eax, DWORD PTR _v2$[esp-4]
movaps xmm0, XMMWORD PTR [edx]
dpps xmm0, XMMWORD PTR [eax], 255 ; 000000ffH
movaps XMMWORD PTR [ecx], xmm0
mov eax, ecx
For case 2 I just get:
movaps xmm0, XMMWORD PTR [eax]
dpps xmm0, xmm0, 255 ; 000000ffH
movaps XMMWORD PTR [eax], xmm0
lea eax, DWORD PTR [eax+16]
dec ecx
jne SHORT $LL3@Test78
Which is a LOT faster. I'm not sure why the compiler can't deal with that constructor. If I change case 1 like this:
__m128 Vector4Dot(Vector v1, Vector v2)
{
    return(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4));
}
It compiles at maximum speed, the same as case 2. It's this "class overhead" that is giving me the performance penalty. Is there any way to get around this? Or am I stuck with using raw __m128's instead of the Vector class?

Related

Performance differences in Eigen between auto and Eigen::Ref and concrete type

I am trying to understand how Eigen::Ref works to see if I can take some advantage of it in my code.
I have designed a benchmark like this:
static void value(benchmark::State &state) {
    for (auto _ : state) {
        const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
            Eigen::Matrix<double, 9, 1>::Random();
        auto start = std::chrono::high_resolution_clock::now();
        const Eigen::Vector3d v0 = vs.segment<3>(0);
        const Eigen::Vector3d v1 = vs.segment<3>(3);
        const Eigen::Vector3d v2 = vs.segment<3>(6);
        const Eigen::Vector3d vt = v0 + v1 + v2;
        const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
        benchmark::DoNotOptimize(v);
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed_seconds =
            std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        state.SetIterationTime(elapsed_seconds.count());
    }
}
I have two more tests like these, one using const Eigen::Ref<const Eigen::Vector3d> and one using auto for v0, v1, v2, vt.
The results of these benchmarks are:
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------------
value/manual_time          23.4 ns          113 ns     29974946
ref/manual_time            23.0 ns          111 ns     29934053
with_auto/manual_time      23.6 ns          112 ns     29891056
As you can see, all the tests behave exactly the same. So I thought that maybe the compiler was doing its magic and decided to test with -O0. These are the results:
--------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------------
value/manual_time          2475 ns         3070 ns       291032
ref/manual_time            2482 ns         3077 ns       289258
with_auto/manual_time      2436 ns         3012 ns       263170
Again, the three cases behave the same.
If I understand correctly, the first case, using Eigen::Vector3d, should be slower, as it has to make the copies, perform the v0 + v1 + v2 operation and save the result, and then perform another operation and save again.
The auto case should be the fastest, as it should skip all the intermediate writes.
The ref case, I think, should be as fast as the auto one. If I understand correctly, all my operations can be stored in a reference to a const Eigen::Vector3d, so the intermediate writes should be skipped, right?
Why are the results all the same? Am I misunderstanding something, or is the benchmark just badly designed?
One big issue with the benchmark is that you measure the time inside the hot benchmarking loop. The thing is, measuring the time takes some time itself, and it can be far more expensive than the actual computation. In fact, I think this is what is happening in your case. Indeed, on Clang 13 with -O3, here is the assembly code actually benchmarked (available on GodBolt):
mov rbx, rax
mov rax, qword ptr [rsp + 24]
cmp rax, 2
jle .LBB0_17
cmp rax, 5
jle .LBB0_17
cmp rax, 8
jle .LBB0_17
mov rax, qword ptr [rsp + 16]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16] # xmm1 = mem[0],zero
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
movapd xmm2, xmm0
mulpd xmm2, xmm0
movapd xmm3, xmm2
unpckhpd xmm3, xmm2 # xmm3 = xmm3[1],xmm2[1]
addsd xmm3, xmm2
movapd xmm2, xmm1
mulsd xmm2, xmm1
addsd xmm2, xmm3
movapd xmm3, xmm1
mulsd xmm3, xmm2
unpcklpd xmm2, xmm2 # xmm2 = xmm2[0,0]
mulpd xmm2, xmm0
addpd xmm2, xmm0
movapd xmmword ptr [rsp + 32], xmm2
addsd xmm3, xmm1
movsd qword ptr [rsp + 48], xmm3
This code can be executed in a few dozen cycles, so probably in less than 10-15 ns on a 4~5 GHz modern x86 processor. Meanwhile, high_resolution_clock::now() should use an RDTSC/RDTSCP instruction that also takes dozens of cycles to complete. For example, on a Skylake processor it should take about 25 cycles (similar on newer Intel processors). On an AMD Zen processor, it takes about 35-38 cycles. Additionally, it adds a synchronization that may not be representative of the actual application. Please consider measuring the time of a benchmarking loop with many iterations instead.
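For illustration, here is a minimal restructuring sketch (my own, not from the original answer; the name value_loop is made up) that measures only the computation and lets the framework time many iterations, instead of calling high_resolution_clock::now() inside the hot loop. It assumes the question's Eigen and google-benchmark includes:
static void value_loop(benchmark::State &state) {
    // setup is hoisted out of the timed region
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    for (auto _ : state) {
        const Eigen::Vector3d v0 = vs.segment<3>(0);
        const Eigen::Vector3d v1 = vs.segment<3>(3);
        const Eigen::Vector3d v2 = vs.segment<3>(6);
        const Eigen::Vector3d vt = v0 + v1 + v2;
        const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
        benchmark::DoNotOptimize(v);   // keep the result observable
    }
}
BENCHMARK(value_loop);                 // registered without UseManualTime()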
Because everything happens inside a function, the compiler can do escape analysis and optimize away the copies into the vectors.
To check this, I put the code in a function, to look at the assembler:
Eigen::Vector3d foo(const Eigen::VectorXd& vs)
{
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}
which turns into this assembler
push rax
mov rax, qword ptr [rsi + 8]
...
mov rax, qword ptr [rsi]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16]
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
...
movupd xmmword ptr [rdi], xmm2
addsd xmm3, xmm1
movsd qword ptr [rdi + 16], xmm3
mov rax, rdi
pop rcx
ret
Notice how the only memory operations are two GP register loads to get the start pointer and the length, then a couple of memory loads to get the vector contents into registers, before we write the result to memory at the end.
This only works since we deal with fixed-size vectors. With VectorXd, copies would definitely take place.
Alternative benchmarks
Ref is typically used on function calls. Why not try it with a function that cannot be inlined? Or come up with an example where escape analysis cannot work and the objects really have to be materialized. Something like this:
struct Foo
{
public:
    Eigen::Vector3d v0;
    Eigen::Vector3d v1;
    Eigen::Vector3d v2;

    Foo(const Eigen::VectorXd& vs) __attribute__((noinline));
    Eigen::Vector3d operator()() const __attribute__((noinline));
};

Foo::Foo(const Eigen::VectorXd& vs)
    : v0(vs.segment<3>(0)),
      v1(vs.segment<3>(3)),
      v2(vs.segment<3>(6))
{}

Eigen::Vector3d Foo::operator()() const
{
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}

Eigen::Vector3d bar(const Eigen::VectorXd& vs)
{
    Foo f(vs);
    return f();
}
By splitting initialization and usage into non-inline functions, the copies really have to be done. Of course we now change the entire use case. You have to decide if this is still relevant to you.
Purpose of Ref
Ref exists for the sole purpose of providing a function interface that can accept both a full matrix/vector and a slice of one. Consider this:
Eigen::VectorXd foo(const Eigen::VectorXd&)
This interface can only accept a full vector as its input. If you want to call foo(vector.head(10)) you have to allocate a new vector to hold the vector segment. Likewise it always returns a newly allocated vector which is wasteful if you want to call it as output.head(10) = foo(input). So we can instead write
void foo(Eigen::Ref<Eigen::VectorXd> out, const Eigen::Ref<const Eigen::VectorXd>& in);
and use it as foo(output.head(10), input.head(10)) without any copies being created. This is only ever useful across compilation units. If you have one cpp file declaring a function that is used in another, Ref allows this to happen. Within a cpp file, you can simply use a template.
template<class Derived1, class Derived2>
void foo(const Eigen::MatrixBase<Derived1>& out,
         const Eigen::MatrixBase<Derived2>& in)
{
    Eigen::MatrixBase<Derived1>& mutable_out =
        const_cast<Eigen::MatrixBase<Derived1>&>(out);
    mutable_out = ...;
}
A template will always be faster because it can make use of the concrete data type. For example if you pass an entire vector, Eigen knows that the array is properly aligned. And in a full matrix it knows that there is no stride between columns. It doesn't know either of these with Ref. In this regard, Ref is just a fancy wrapper around Eigen::Map<Type, Eigen::Unaligned, Eigen::OuterStride<>>.
Likewise there are cases where Ref has to create temporary copies. The most common case is when the inner stride is not 1. This happens, for example, if you pass a row of a matrix (but not a column; Eigen is column-major by default) or the real part of a complex-valued matrix. You will not even receive a warning for this; your code will simply run slower than expected.
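To illustrate that stride rule, here is a small sketch (my own example, not from the original answer; the helper names are hypothetical):
#include <Eigen/Dense>

void takes_col(const Eigen::Ref<const Eigen::VectorXd>&) {}     // default Ref expects inner stride 1
void takes_row(const Eigen::Ref<const Eigen::RowVectorXd>&) {}  // likewise

int main()
{
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(8, 8);
    takes_col(m.col(3));  // column of a column-major matrix: inner stride 1, mapped directly
    takes_row(m.row(3));  // row: inner stride 8, so the const Ref silently evaluates a temporary copy
}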
The only reasons to use Ref inside a single cpp file are:
1. To make the code more readable. The template pattern shown above certainly doesn't tell you much about the expected types.
2. To reduce code size, which may have a performance benefit, but usually doesn't. It does help with compile time, though.
Use with fixed-size types
Since your use case seems to involve fixed-sized vectors, let's consider this case in particular and look at the internals.
void foo(const Eigen::Vector3d&);
void bar(const Eigen::Ref<const Eigen::Vector3d>&);

int main()
{
    Eigen::VectorXd in = ...;
    foo(in.segment<3>(6));
    bar(in.segment<3>(6));
}
The following things will happen when you call foo:
1. We copy 3 doubles from in[6] to the stack. This takes 4 instructions (2 movapd, 2 movsd).
2. A pointer to these values is passed to foo. (Even fixed-size Eigen vectors declare a destructor, therefore they are always passed on the stack, even if we declare them pass-by-value.)
3. foo then loads the values via that pointer, taking 2 instructions (movapd + movsd).
The following happens when we call bar:
1. We create a Ref<Vector> object. For this, we put a pointer to in.data() + 6 on the stack.
2. A pointer to this pointer is passed to bar.
3. bar loads the pointer from the stack, then loads the values.
Notice how there is barely any difference. Maybe Ref saves a few instructions but it also introduces one indirection more. Compared to everything else, this is hardly significant. It is certainly too little to measure.
We are also entering the realm of micro-optimizations here, which can lead to situations where just the arrangement of code results in measurably different optimizations; see for example Eigen: Why is Map slower than Vector3d for this template expression?

Can I initialize std::unique_ptr of array without default initialization ( I want just let it have dummy value )

I'm trying to use a unique_ptr of an array to prevent memory leaks,
but it looks like the values are always initialized with 0 (look at "movups XMMWORD PTR [rax+48], xmm0").
I don't need this; I just want vector2 to hold dummy values.
Can I do this?
#include <memory>

struct Vector2
{
    float x, y;
};

struct A
{
    std::unique_ptr<Vector2[]> vector2;
};

int main()
{
    A a;
    a.vector2 = std::make_unique<Vector2[]>(10);
}
sub rsp, 40 ; 00000028H
mov QWORD PTR a$[rsp], 0
mov ecx, 80 ; 00000050H
call void * operator new[](unsigned __int64) ; operator new[]
test rax, rax
je SHORT $LN15#main
xorps xmm0, xmm0
movups XMMWORD PTR [rax], xmm0
movups XMMWORD PTR [rax+16], xmm0
movups XMMWORD PTR [rax+32], xmm0
movups XMMWORD PTR [rax+48], xmm0
movups XMMWORD PTR [rax+64], xmm0
make_unique performs value-initialization on the array. You can use make_unique_for_overwrite (since C++20) instead, which performs default-initialization.
a.vector2 = std::make_unique_for_overwrite<Vector2[]>(10);
Before C++20, you can new the array yourself.
a.vector2 = std::unique_ptr<Vector2[]>(new Vector2[10]);
// or
a.vector2.reset(new Vector2[10]);
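One caveat worth adding (my note, not from the original answer): since Vector2 is trivially default-constructible, default-initialization leaves x and y with indeterminate values, so every element has to be written before it is read:
a.vector2 = std::make_unique_for_overwrite<Vector2[]>(10);  // no zeroing performed
a.vector2[0] = {1.0f, 2.0f};  // assign first; reading an indeterminate value is undefined behaviour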

How to force the compiler to pass a "vector of 4" wrapper class as single XMM register?

I'm trying to optimize a small "vector of 4 floats" wrapper class, and of course I want to make it convenient as well. For example:
typedef float v4f __attribute__ ((vector_size (16)));

struct V4 {
    union {
        v4f packed;
#if 1
        struct { float r, g, b, a; };
#endif
#if 1
        float data[4];
#endif
    };

    V4() = default;
    V4(v4f v) : packed(v) {}
};

V4 AddV4(V4 a, V4 b) {
    return a.packed + b.packed;
}

V4 MulV4(V4 a, V4 b) {
    return a.packed * b.packed;
}

static_assert(sizeof(V4) == 16);
I know the union is undefined behavior in theory, but in practice it's working fine ;-)
The problem is the following: I tested this in godbolt (see https://godbolt.org/z/fXbtre), using both gcc and clang, with the command line arguments:
-O3 -fomit-frame-pointer -fno-rtti -fno-exceptions -mavx -ffast-math
If I disable both the struct and the array in the union (i.e. set both to #if 0), I get really compact AddV4 and MulV4 functions, e.g.:
AddV4(V4, V4):
vaddps xmm0, xmm0, xmm1
ret
But if I enable ANY of those two, I get:
AddV4(V4, V4):
vmovq QWORD PTR [rsp-32], xmm1
vmovq QWORD PTR [rsp-40], xmm0
vmovaps xmm5, XMMWORD PTR [rsp-40]
vmovq QWORD PTR [rsp-24], xmm2
vmovq QWORD PTR [rsp-16], xmm3
vaddps xmm4, xmm5, XMMWORD PTR [rsp-24]
vmovaps XMMWORD PTR [rsp-40], xmm4
mov rax, QWORD PTR [rsp-32]
vmovq xmm0, QWORD PTR [rsp-40]
vmovq xmm1, rax
mov QWORD PTR [rsp-24], rax
ret
Can someone explain why? Is there a compiler flag for gcc/clang I could use to fix this? Or is using only the packed data structure really the only option? (In that case I would need to write accessor methods x(), y(), z(), w(), and that would be quite a big change in our codebase, hence I would prefer another option first.)
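For reference, a minimal sketch (my own, not from the original post) of the fallback the question mentions: dropping the union and exposing the lanes through accessor methods, using the GCC/Clang vector-subscript extension. Whether this restores the single-register calling convention in a given setup is something to verify on godbolt; the sketch only shows the accessor-based shape of the class.
typedef float v4f __attribute__ ((vector_size (16)));

struct V4 {
    v4f packed;                  // single member, so the type stays a plain SSE vector

    V4() = default;
    V4(v4f v) : packed(v) {}

    // accessors instead of the union members r/g/b/a and data[]
    float x() const { return packed[0]; }
    float y() const { return packed[1]; }
    float z() const { return packed[2]; }
    float w() const { return packed[3]; }
};

V4 AddV4(V4 a, V4 b) {
    return a.packed + b.packed;  // ideally compiles back to a single vaddps
}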

C++ performance std::array vs std::vector

Good evening.
I know C-style arrays or std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which using std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>

// some size constant
const size_t N = 100;

// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;

// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};

// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
    return m + f;
}
And the benchmark functions:
void assemble_vec()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(v1[0],v2[0]);
        glob[i+1] += comb(v1[1],v2[1]);
        glob[i+2] += comb(v1[2],v2[2]);
    }
}

void assemble_arr()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(a1[0],a2[0]);
        glob[i+1] += comb(a1[1],a2[1]);
        glob[i+2] += comb(a1[2],a2[2]);
    }
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively, where for the vector version the numbers are being added together through less efficient opcodes, compared to the array version, which uses (more) SSE instructions. The vector version also involves more memory lookups compared to the array version. These factors in combination are going to result in code that executes faster for the std::array version than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed not to be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In #Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
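For illustration, here is a rough sketch (my own, not from the original answer) of the hand-hoisted loop that such a no-overlap version would amount to, using the question's glob, v1, v2 and comb:
void assemble_vec_hoisted()
{
    // safe only when glob does not alias v1/v2, which the programmer knows but the compiler cannot prove
    const double c0 = comb(v1[0], v2[0]);
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);
    for (size_t i = 0; i < N - 2; ++i)
    {
        glob[i]   += c0;
        glob[i+1] += c1;
        glob[i+2] += c2;
    }
}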
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but maybe it used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
    double *__restrict g = &glob[0];   // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob;  // actually hurts ICC it seems?
    // #define g glob                  // so use this as the alternative to __restrict
    for (size_t i=0; i<N-2; ++i)
    {
        g[i] += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
I think the point is that you use a very small storage size (six doubles). This allows the compiler, in the std::array case, to eliminate the in-RAM storage entirely by placing the values in registers. The compiler can keep stack variables in registers if that is more optimal. This cuts the memory accesses in half (only the writes to glob remain). In the case of a std::vector, the compiler cannot perform such an optimization since dynamic memory is used. Try using significantly larger sizes for a1, a2, v1, v2.

Why isn't this C++ member function optimized by the compiler with -O3?

The norm member function in the C++ vector class declared below is marked as const and (as far as I can tell) does not contain any side effects.
template <unsigned int N>
struct vector {
    double v[N];

    double norm() const {
        double ret = 0;
        for (int i=0; i<N; ++i) {
            ret += v[i]*v[i];
        }
        return ret;
    }
};

double test(const vector<100>& x) {
    return x.norm() + x.norm();
}
If I call norm multiple times on a const instantiation of vector (see test function above) with the gcc compiler (version 5.4) and optimizations turned on (i.e. -O3) then the compiler inlines norm, but still calculates the result of norm multiple times, even though the result should not change. Why doesn't the compiler optimize out the second call to norm and only calculate this result once? This answer seems to indicate that the compiler should perform this optimization if the compiler determines that the norm function does not have any side-effects. Why doesn't this happen in this case?
Note that I'm determining what the compiler produces using the Compiler Explorer, and that the assembly output for gcc version 5.4 is given below. The clang compiler gives similar results. Note also that if I use gcc's compiler attributes to manually mark norm as a const function using __attribute__((const)), then the second call is optimized out, as I wanted, but my question is why gcc (and clang) do not do this automatically since the norm definition is available?
test(vector<100u>&):
pxor xmm2, xmm2
lea rdx, [rdi+800]
mov rax, rdi
.L2:
movsd xmm1, QWORD PTR [rax]
add rax, 8
cmp rdx, rax
mulsd xmm1, xmm1
addsd xmm2, xmm1
jne .L2
pxor xmm0, xmm0
.L3:
movsd xmm1, QWORD PTR [rdi]
add rdi, 8
cmp rdx, rdi
mulsd xmm1, xmm1
addsd xmm0, xmm1
jne .L3
addsd xmm0, xmm2
ret
The compiler can calculate the result of norm and reuse it multiple times. E.g. with the -Os switch:
test(vector<100u> const&):
xorps xmm0, xmm0
xor eax, eax
.L2:
movsd xmm1, QWORD PTR [rdi+rax]
add rax, 8
cmp rax, 800
mulsd xmm1, xmm1
addsd xmm0, xmm1
jne .L2
addsd xmm0, xmm0
ret
The missing optimization isn't due to not-associative-floating-point-math or some observable-behavior-issue.
In a not properly mutexed environment another function might change the contents in the array in between the calls of norm
It may happen, but it's not a concern for the compiler (e.g. https://stackoverflow.com/a/25472679/3235496).
Compiling the example with the -O2 -fdump-tree-all switches, you can see that:
1. g++ correctly detects vector<N>::norm() as a pure function (output file .local-pure-const1);
2. inlining happens at early stages (output file .einline).
Also note that if you mark norm with __attribute__ ((noinline)), the compiler performs CSE:
test(vector<100u> const&):
sub rsp, 8
call vector<100u>::norm() const
add rsp, 8
addsd xmm0, xmm0
ret
Marc Glisse is (probably) right.
A more advanced form of Common Subexpression Elimination is required to un-inline the recurrent expression.
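For completeness, a sketch of the attribute-based workaround the question mentions (a GCC/Clang extension; __attribute__((pure)) is the semantically safer choice here, since norm() reads memory through this), together with the trivial manual hoist in the caller. This is my own sketch, not part of the original answers:
template <unsigned int N>
struct vector {
    double v[N];

    // promise to the compiler: no side effects, result depends only on the arguments
    // and the memory they point to, so repeated calls can be merged
    __attribute__((pure)) double norm() const {
        double ret = 0;
        for (unsigned int i = 0; i < N; ++i) {
            ret += v[i] * v[i];
        }
        return ret;
    }
};

double test(const vector<100>& x) {
    const double n = x.norm();   // or simply hoist the common subexpression by hand
    return n + n;
}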