I have a struct defined as follows:
struct s_zoneData {
    bool finep = true;
    double pzone_tcp = 1.0;
    double pzone_ori = 1.0;
    double pzone_eax = 1.0;
    double zone_ori = 0.1;
    double zone_leax = 1.0;
    double zone_reax = 0.1;
};
I created a comparison operator:
bool operator==(struct s_zoneData i, struct s_zoneData j) {
    return (memcmp(&i, &j, sizeof(struct s_zoneData)) == 0);
}
Most of the time, the comparisons failed, even for identical variables. It took me some time (and messing with gdb) to realize that the problem is that the padding bytes after the finep member are uninitialized rubbish. For reference, on my machine (x64), sizeof(struct s_zoneData) is 56, which means there are 7 padding bytes after finep.
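(For illustration, not part of the original question: the layout can be checked with sizeof and offsetof, assuming a typical x86-64 ABI where double requires 8-byte alignment, so the bool at offset 0 is followed by 7 padding bytes.)
#include <cstddef>   // offsetof
#include <cstdio>

struct s_zoneData {          // same members as above
    bool finep = true;
    double pzone_tcp = 1.0;
    double pzone_ori = 1.0;
    double pzone_eax = 1.0;
    double zone_ori = 0.1;
    double zone_leax = 1.0;
    double zone_reax = 0.1;
};

int main() {
    std::printf("sizeof            = %zu\n", sizeof(s_zoneData));               // 56 on a typical x64 ABI
    std::printf("pzone_tcp offset  = %zu\n", offsetof(s_zoneData, pzone_tcp));  // 8: bytes 1..7 are padding
}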
At first, I solved the problem by replacing the memcmp with an ULP-based floating-point comparison for each member of the struct, because I thought there might be rounding issues at play. But now I want to dig deeper into this problem and see possible alternative solutions.
The question is, is there any way to specify a value for the padding bytes, for different compilers and platforms? Or, rewriting it as a more general question because I might be too focused on my approach, what would be the correct way to compare two struct s_zoneData variables?
I know that creating a dummy variable such as char pad[7] and initializing it with zeros should solve the problem (at least for my particular case), but I've read multiple cases where people had struct alignment issues for different compilers and member order, so I'd prefer to go with a standard-defined solution, if that exists. Or at least, something that guarantees compatibility for different platforms and compilers.
While what you're doing would seem logical to a C or assembly programmer (and indeed to many C++ programmers), you are inadvertently breaking the C++ object model and invoking undefined behaviour.
You might want to consider comparisons of value types in terms of tuples of references to their data members.
Comparing two such tuples yields the correct behaviour for ordering comparisons as well as equality.
They also optimise very well.
e.g.:
#include <tuple>
struct s_zoneData {
    bool finep = true;
    double pzone_tcp = 1.0;
    double pzone_ori = 1.0;
    double pzone_eax = 1.0;
    double zone_ori = 0.1;
    double zone_leax = 1.0;
    double zone_reax = 0.1;

    friend auto as_tuple(s_zoneData const& z)
    {
        using std::tie;
        return tie(z.finep, z.pzone_tcp, z.pzone_ori, z.pzone_eax, z.zone_ori, z.zone_leax, z.zone_reax);
    }
};

auto operator ==(s_zoneData const& l, s_zoneData const& r) -> bool
{
    return as_tuple(l) == as_tuple(r);
}
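As a side note (my addition, not in the original answer), the same helper gives you ordering for free, because std::tuple compares lexicographically:
auto operator <(s_zoneData const& l, s_zoneData const& r) -> bool
{
    // lexicographic comparison over (finep, pzone_tcp, ..., zone_reax)
    return as_tuple(l) < as_tuple(r);
}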
example assembler output:
operator==(s_zoneData const&, s_zoneData const&):
xor eax, eax
movzx ecx, BYTE PTR [rsi]
cmp BYTE PTR [rdi], cl
je .L20
ret
.L20:
movsd xmm0, QWORD PTR [rdi+8]
ucomisd xmm0, QWORD PTR [rsi+8]
jp .L13
jne .L13
movsd xmm0, QWORD PTR [rdi+16]
ucomisd xmm0, QWORD PTR [rsi+16]
jp .L13
jne .L13
movsd xmm0, QWORD PTR [rdi+24]
ucomisd xmm0, QWORD PTR [rsi+24]
jp .L13
jne .L13
movsd xmm0, QWORD PTR [rdi+32]
ucomisd xmm0, QWORD PTR [rsi+32]
jp .L13
jne .L13
movsd xmm0, QWORD PTR [rdi+40]
ucomisd xmm0, QWORD PTR [rsi+40]
jp .L13
jne .L13
movsd xmm0, QWORD PTR [rdi+48]
ucomisd xmm0, QWORD PTR [rsi+48]
mov edx, 0
setnp al
cmovne eax, edx
ret
.L13:
xor eax, eax
ret
The only way to set the padding bytes to something predictable is to use memset to set the entire structure to something predictable -- if you always use memset to clear the structure before setting the fields to something else, then you can rely on the padding bytes to remain unchanged even when you copy the entire structure (as when you pass it as an argument). In addition, a variable with static storage duration will have its padding bytes initialized to 0.
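A sketch of that approach (my example, not the answerer's code; the helper name make_zone is made up):
#include <cstring>

s_zoneData make_zone()
{
    s_zoneData z;
    std::memset(&z, 0, sizeof z);   // padding bytes are now zero as well
    z.finep     = true;             // re-assign every member after the memset
    z.pzone_tcp = 1.0;
    z.pzone_ori = 1.0;
    z.pzone_eax = 1.0;
    z.zone_ori  = 0.1;
    z.zone_leax = 1.0;
    z.zone_reax = 0.1;
    return z;
}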
#pragma pack can remove the extra padding.
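For example (a sketch; #pragma pack is a widely supported but non-standard extension, and the packed type name here is mine):
#pragma pack(push, 1)            // no padding inserted between members
struct s_zoneData_packed {
    bool finep = true;
    double pzone_tcp = 1.0;
    double pzone_ori = 1.0;
    double pzone_eax = 1.0;
    double zone_ori = 0.1;
    double zone_leax = 1.0;
    double zone_reax = 0.1;
};
#pragma pack(pop)

static_assert(sizeof(s_zoneData_packed) == sizeof(bool) + 6 * sizeof(double),
              "no padding bytes left to compare");
Note that packing can make member access slower (or even unsafe on some architectures), so it trades the memcmp problem for a different one.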
You can prevent the extra-padding addition by adding it manually so that it can be set explicitly to a predefined value (but the initialization will have to be done outside the struct):
struct s_zoneData {
    char pad[sizeof(double)-sizeof(bool)];
    bool finep;
    double pzone_tcp;
    double pzone_ori;
    double pzone_eax;
    double zone_ori;
    double zone_leax;
    double zone_reax;
};
...
s_zoneData X = {{}, true, 1.0, 1.0, 1.0, 0.1, 1.0, 0.1};
Edit: Per @Guille's comment, the padding should be placed next to the bool member to prevent internal padding. So either pad should be immediately before/after finep (I changed the sample accordingly), or finep should be moved to the end of the structure.
Related
I am trying to understand how Eigen::Ref works to see if I can take some advantage of it in my code.
I have designed a benchmark like this
static void value(benchmark::State &state) {
  for (auto _ : state) {
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    auto start = std::chrono::high_resolution_clock::now();
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
    benchmark::DoNotOptimize(v);
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed_seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
    state.SetIterationTime(elapsed_seconds.count());
  }
}
I have two more tests like this: one using const Eigen::Ref<const Eigen::Vector3d> and one using auto for v0, v1, v2 and vt.
The results of these benchmarks are
Benchmark Time CPU Iterations
--------------------------------------------------------------------
value/manual_time 23.4 ns 113 ns 29974946
ref/manual_time 23.0 ns 111 ns 29934053
with_auto/manual_time 23.6 ns 112 ns 29891056
As you can see, all the tests behave exactly the same. So I thought that maybe the compiler was doing its magic and decided to test with -O0. These are the results:
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
value/manual_time 2475 ns 3070 ns 291032
ref/manual_time 2482 ns 3077 ns 289258
with_auto/manual_time 2436 ns 3012 ns 263170
Again, the three cases behave the same.
If I understand correctly, the first case, using Eigen::Vector3d, should be slower, as it has to make the copies, perform the v0+v1+v2 operation and save it, and then perform another operation and save it again.
The auto case should be the fastest, as it should be skipping all the writes.
The ref case, I think, should be as fast as auto. If I understand correctly, all my operations can be stored in a reference to a const Eigen::Vector3d, so the operations should be skipped, right?
Why are the results all the same? Am I misunderstanding something, or is the benchmark just badly designed?
One big issue with the benchmark is that you measure the time inside the hot benchmarking loop. The thing is, measuring the time takes some time itself, and it can be far more expensive than the actual computation. In fact, I think this is what is happening in your case. Indeed, on Clang 13 with -O3, here is the assembly code that is actually benchmarked (available on GodBolt):
mov rbx, rax
mov rax, qword ptr [rsp + 24]
cmp rax, 2
jle .LBB0_17
cmp rax, 5
jle .LBB0_17
cmp rax, 8
jle .LBB0_17
mov rax, qword ptr [rsp + 16]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16] # xmm1 = mem[0],zero
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
movapd xmm2, xmm0
mulpd xmm2, xmm0
movapd xmm3, xmm2
unpckhpd xmm3, xmm2 # xmm3 = xmm3[1],xmm2[1]
addsd xmm3, xmm2
movapd xmm2, xmm1
mulsd xmm2, xmm1
addsd xmm2, xmm3
movapd xmm3, xmm1
mulsd xmm3, xmm2
unpcklpd xmm2, xmm2 # xmm2 = xmm2[0,0]
mulpd xmm2, xmm0
addpd xmm2, xmm0
movapd xmmword ptr [rsp + 32], xmm2
addsd xmm3, xmm1
movsd qword ptr [rsp + 48], xmm3
This code can be executed in a few dozen cycles, so probably in less than 10-15 ns on a modern 4~5 GHz x86 processor. Meanwhile, high_resolution_clock::now() should use an RDTSC/RDTSCP instruction that also takes dozens of cycles to complete. For example, on a Skylake processor it should take about 25 cycles (similar on newer Intel processors); on an AMD Zen processor it takes about 35-38 cycles. Additionally, it adds a synchronization that may not be representative of the actual application. Please consider measuring the time of a benchmarking loop with many iterations instead.
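For example, a sketch of what that could look like with Google Benchmark (my rewrite, not the original code): move the input generation out of the timed region and let the framework time the whole loop instead of calling the clock on every iteration:
#include <benchmark/benchmark.h>
#include <Eigen/Dense>

static void value(benchmark::State& state) {
  // Untimed setup: the random input is created once, outside the hot loop.
  const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
      Eigen::Matrix<double, 9, 1>::Random();
  for (auto _ : state) {
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
    benchmark::DoNotOptimize(v);   // the framework times the loop; no manual clock calls
  }
}
BENCHMARK(value);
BENCHMARK_MAIN();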
Because everything happens inside a function, the compiler can do escape analysis and optimize away the copies into the vectors.
To check this, I put the code in a function, to look at the assembler:
Eigen::Vector3d foo(const Eigen::VectorXd& vs)
{
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}
which turns into this assembler
push rax
mov rax, qword ptr [rsi + 8]
...
mov rax, qword ptr [rsi]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16]
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
...
movupd xmmword ptr [rdi], xmm2
addsd xmm3, xmm1
movsd qword ptr [rdi + 16], xmm3
mov rax, rdi
pop rcx
ret
Notice how the only memory operations are two GP register loads to get the start pointer and length, then a couple of memory loads to get the vector content into registers, before we write the result to memory at the end.
This only works because we deal with fixed-size vectors. With VectorXd, copies would definitely take place.
Alternative benchmarks
Ref is typically used on function calls. Why not try it with a function that cannot be inlined? Or come up with an example where escape analysis cannot work and the objects really have to be materialized. Something like this:
struct Foo
{
public:
    Eigen::Vector3d v0;
    Eigen::Vector3d v1;
    Eigen::Vector3d v2;

    Foo(const Eigen::VectorXd& vs) __attribute__((noinline));
    Eigen::Vector3d operator()() const __attribute__((noinline));
};

Foo::Foo(const Eigen::VectorXd& vs)
    : v0(vs.segment<3>(0)),
      v1(vs.segment<3>(3)),
      v2(vs.segment<3>(6))
{}

Eigen::Vector3d Foo::operator()() const
{
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}

Eigen::Vector3d bar(const Eigen::VectorXd& vs)
{
    Foo f(vs);
    return f();
}
By splitting initialization and usage into non-inline functions, the copies really have to be done. Of course we now change the entire use case. You have to decide if this is still relevant to you.
Purpose of Ref
Ref exists for the sole purpose of providing a function interface that can accept both a full matrix/vector and a slice of one. Consider this:
Eigen::VectorXd foo(const Eigen::VectorXd&)
This interface can only accept a full vector as its input. If you want to call foo(vector.head(10)) you have to allocate a new vector to hold the vector segment. Likewise it always returns a newly allocated vector which is wasteful if you want to call it as output.head(10) = foo(input). So we can instead write
void foo(Eigen::Ref<Eigen::VectorXd> out, const Eigen::Ref<const Eigen::VectorXd>& in);
and use it as foo(output.head(10), input.head(10)) without any copies being created. This is only ever useful across compilation units. If you have one cpp file declaring a function that is used in another, Ref allows this to happen. Within a cpp file, you can simply use a template.
template<class Derived1, class Derived2>
void foo(const Eigen::MatrixBase<Derived1>& out,
         const Eigen::MatrixBase<Derived2>& in)
{
    Eigen::MatrixBase<Derived1>& mutable_out =
        const_cast<Eigen::MatrixBase<Derived1>&>(out);
    mutable_out = ...;
}
A template will always be faster because it can make use of the concrete data type. For example if you pass an entire vector, Eigen knows that the array is properly aligned. And in a full matrix it knows that there is no stride between columns. It doesn't know either of these with Ref. In this regard, Ref is just a fancy wrapper around Eigen::Map<Type, Eigen::Unaligned, Eigen::OuterStride<>>.
Likewise there are cases where Ref has to create temporary copies. The most common case is if the inner stride is not 1. This happens for example if you pass a row of a matrix (but not a column. Eigen is column-major by default) or the real part of a complex valued matrix. You will not even receive a warning for this, your code will simply run slower than expected.
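As an illustration (mine, not from the original answer; the function name is made up):
#include <Eigen/Dense>

// The default Ref requires an inner stride of 1 (contiguous coefficients).
double sum_row(const Eigen::Ref<const Eigen::RowVectorXd>& v) { return v.sum(); }

int main() {
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(4, 4);
    Eigen::RowVectorXd r = m.row(0);
    sum_row(r);        // no copy: r owns contiguous storage
    sum_row(m.row(0)); // silently evaluated into a temporary: a row of a
                       // column-major 4x4 matrix has inner stride 4, not 1
}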
The only reasons to use Ref inside a single cpp file are
To make the code more readable. That template pattern shown above certainly doesn't tell you much about the expected types
To reduce code size, which may have a performance benefit, but usually doesn't. It does help with compile time, though
Use with fixed-size types
Since your use case seems to involve fixed-sized vectors, let's consider this case in particular and look at the internals.
void foo(const Eigen::Vector3d&);
void bar(const Eigen::Ref<const Eigen::Vector3d>&);

int main()
{
    Eigen::VectorXd in = ...;
    foo(in.segment<3>(6));
    bar(in.segment<3>(6));
}
The following things will happen when you call foo:
We copy 3 doubles from in[6] to the stack. This takes 4 instructions (2 movapd, 2 movsd).
A pointer to these values is passed to foo. (Even fixed-size Eigen vectors declare a destructor, therefore they are always passed on the stack, even if we declare them pass-by-value)
foo then loads the values via that pointer, taking 2 instructions (movapd + movsd)
The following happens when we call bar:
We create a Ref<Vector> object. For this, we put a pointer to in.data() + 6 on the stack
A pointer to this pointer is passed to bar
bar loads the pointer from the stack, then loads the values
Notice how there is barely any difference. Maybe Ref saves a few instructions but it also introduces one indirection more. Compared to everything else, this is hardly significant. It is certainly too little to measure.
We are also entering the realm of micro-optimizations here. This can lead to situations where just the arrangement of code results in measurably different optimizations; see, for example, Eigen: Why is Map slower than Vector3d for this template expression?
I have been performing performance optimisations on some code at work, and stumbled upon some strange behaviour, which I've boiled down to the simple snippet of C++ code below:
#include <stdint.h>
void Foo(uint8_t*& out)
{
    out[0] = 1;
    out[1] = 2;
    out[2] = 3;
    out[3] = 4;
}
I then compile it with clang (on Windows) with the following: clang -S -O3 -masm=intel test.cpp. This results in the following assembly:
mov rax, qword ptr [rcx]
mov byte ptr [rax], 1
mov rax, qword ptr [rcx]
mov byte ptr [rax + 1], 2
mov rax, qword ptr [rcx]
mov byte ptr [rax + 2], 3
mov rax, qword ptr [rcx]
mov byte ptr [rax + 3], 4
ret
Why has clang generated code that repeatedly dereferences the out parameter into the rax register? This seems like a really obvious optimization that it is deliberately not making, so the question is why?
Interestingly, I've tried changing uint8_t to uint16_t and this much better machine code is generated as a result:
mov rax, qword ptr [rcx]
movabs rcx, 1125912791875585
mov qword ptr [rax], rcx
ret
The compiler cannot do such an optimization simply because of the aliasing rules: uint8_t is always* defined as unsigned char, so it can point to any memory location, which means it can also point to itself. And because you pass the pointer by reference, the writes can have side effects inside the function (they can change out itself).
Here is an obscure, yet correct, usage that depends on the reads not being cached:
#include <cassert>
#include <stdint.h>
void Foo(uint8_t*& out)
{
    uint8_t local;
    // CANNOT be used as a cached value further down in the code.
    uint8_t* tmp = out;
    // Recover the stored pointer.
    uint8_t** orig = reinterpret_cast<uint8_t**>(out);
    // CHANGES `out` itself.
    *orig = &local;
    **orig = 5;
    assert(local == 5);
    // IS NOT EQUAL even though we did not touch `out` at all.
    assert(tmp != out);
    assert(out == &local);
    assert(*out == 5);
}

int main() {
    // True type of the stored ptr is uint8_t**.
    uint8_t* ptr = reinterpret_cast<uint8_t*>(&ptr);
    Foo(ptr);
}
This also explains why uint16_t generates "optimized" code: uint16_t can never* be (unsigned) char, so the compiler is free to assume that it does not alias other types, such as the pointer itself.
*Maybe some irrelevant obscure platforms with differently-sized bytes. That is beside the point.
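A common way to get the merged store back (my sketch, not part of the original answer) is to copy the pointer into a local whose address never escapes; the byte stores can still change out, but since out is never re-read, the compiler no longer has to reload it:
#include <stdint.h>

void Foo(uint8_t*& out)
{
    uint8_t* p = out;  // read the referenced pointer once
    p[0] = 1;          // the stores may still modify out, but p is a local whose
    p[1] = 2;          // address never escapes, so the compiler can keep it in a
    p[2] = 3;          // register and merge the four byte stores into one wider store
    p[3] = 4;
}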
Suppose we have 2 POD structs composed of only integral data types (including enums and raw pointers).
struct A
{
    std::int64_t x = 0;
    std::int64_t y = 1;
};

struct B
{
    std::int64_t x = 0;
    std::int32_t y = 1;
    std::int32_t z = 2;
};
Note that A and B are both 128 bits in size.
Let's also assume we're on 64-bit architecture.
Of course, if x, y, and z weren't integral types, the cost of copy, move, construction and destruction may be different between A and B depending on the implementation details of the members.
But if we assume that x, y, and z are only integral types, is there any cost difference between A and B in terms of:
Construction
Copy Construction/Assignment
Member Access (does alignment play any role here?)
Specifically, is the copy and initialization of two side-by-side 32-bit integers universally more expensive than a single 64-bit integer?
Or is this something specific to compiler and optimization flags?
But if we assume that x, y, and z are only integral types, is there any cost difference between A and B...
Provided that both A and B are trivial types of the same size, there shouldn't be any difference in cost of construction and copying. That's because modern compilers implement store merging:
-fstore-merging
Perform merging of narrow stores to consecutive memory addresses. This pass merges contiguous stores of immediate values narrower than a word into fewer wider stores to reduce the number of instructions. This is enabled by default at -O2 and higher as well as -Os.
Example code:
#include <cstdint>

struct A {
    std::int64_t x = 0;
    std::int64_t y = 1;
};

struct B {
    std::int64_t x = 0;
    std::int32_t y = 1;
    std::int32_t z = 2;
};

A f0(std::int64_t x, std::int64_t y) {
    return {x, y};
}

B f1(std::int64_t x, std::int32_t y, std::int32_t z) {
    return {x, y, z};
}

void g0(A);
void g1(B);

void h0(A a) { g0(a); }
void h1(B b) { g1(b); }
Here is generated assembly for construction and copy:
gcc-9.2 -O3 -std=gnu++17 -march=skylake:
f0(long, long):
mov rax, rdi
mov rdx, rsi
ret
f1(long, int, int):
mov QWORD PTR [rsp-16], 0
mov QWORD PTR [rsp-24], rdi
vmovdqa xmm1, XMMWORD PTR [rsp-24]
vpinsrd xmm0, xmm1, esi, 2
vpinsrd xmm2, xmm0, edx, 3
vmovaps XMMWORD PTR [rsp-24], xmm2
mov rax, QWORD PTR [rsp-24]
mov rdx, QWORD PTR [rsp-16]
ret
h0(int, A):
mov rdi, rsi
mov rsi, rdx
jmp g0(A)
h1(int, B):
mov rdi, rsi
mov rsi, rdx
jmp g1(B)
clang-9.0 -O3 -std=gnu++17 -march=skylake:
f0(long, long): # #f0(long, long)
mov rdx, rsi
mov rax, rdi
ret
f1(long, int, int): # #f1(long, int, int)
mov rax, rdi
shl rdx, 32
mov ecx, esi
or rdx, rcx
ret
h0(int, A): # #h0(int, A)
mov rdi, rsi
mov rsi, rdx
jmp g0(A) # TAILCALL
h1(int, B): # #h1(int, B)
mov rdi, rsi
mov rsi, rdx
jmp g1(B) # TAILCALL
Note how both structures are passed in registers in h0 and h1.
However, gcc botches the code for construction of B by generating unnecessary AVX instructions. I filed a bug report.
Good evening.
I know that C-style arrays and std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which the use of std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
return m + f;
}
And the benchmark functions:
void assemble_vec()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i]   += comb(v1[0],v2[0]);
        glob[i+1] += comb(v1[1],v2[1]);
        glob[i+2] += comb(v1[2],v2[2]);
    }
}

void assemble_arr()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i]   += comb(a1[0],a2[0]);
        glob[i+1] += comb(a1[1],a2[1]);
        glob[i+2] += comb(a1[2],a2[2]);
    }
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively: for the vector version, the numbers are added together through less efficient opcodes, compared to the array version, which uses (more) SSE instructions. The vector version also involves more memory lookups than the array version. These factors in combination are going to result in code that executes faster for the std::array version than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed to not be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In #Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but it maybe used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
    double *__restrict g = &glob[0];   // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob;  // actually hurts ICC it seems?
    // #define g glob                  // so use this as the alternative to __restrict
    for (size_t i=0; i<N-2; ++i)
    {
        g[i]   += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
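Until then, you can simply do the hoisting yourself (a sketch of mine, reusing the globals and comb() from the question); with the loads done before any store to glob, the aliasing question disappears:
void assemble_vec_hoisted()
{
    // Read v1/v2 once, before the loop; the compiler no longer has to assume that
    // the stores into glob below can change them.
    const double c0 = comb(v1[0], v2[0]);
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);
    for (size_t i = 0; i < N - 2; ++i)
    {
        glob[i]   += c0;
        glob[i+1] += c1;
        glob[i+2] += c2;
    }
}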
I think the point is that you use too small a storage size (six doubles). This allows the compiler, in the std::array case, to eliminate storing to RAM completely by placing the values in registers. The compiler can keep stack variables in registers if that is more optimal. This decreases memory accesses by half (only the writes to glob remain). In the case of a std::vector, the compiler cannot perform such an optimization, since dynamic memory is used. Try using significantly larger sizes for a1, a2, v1 and v2.
My problem is that the compiler chooses not to inline a function in a specific case, thus making the code a LOT slower. The function is supposed to compute the dot product for a vector (SIMD accelerated). I have it written in two different styles:
1. The Vector class aggregates a __m128 member.
2. The Vector is just a typedef of __m128.
In case 1 I get code that is twice as slow; the function doesn't get inlined. In case 2 I get optimal code, very fast, inlined.
In case 1 the Vector and the Dot functions look like this:
__declspec(align(16)) class Vector
{
public:
    __m128 Entry;

    Vector(__m128 s)
    {
        Entry = s;
    }
};

Vector Vector4Dot(Vector v1, Vector v2)
{
    return(Vector(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4)));
}
In case 2 the Vector and the Dot functions look like this:
typedef __m128 Vector;

Vector Vector4Dot(Vector v1, Vector v2)
{
    return(_mm_dp_ps(v1, v2, DotMask4));
}
I'm compiling with MSVC in Visual Studio 2012 on x86 in Release mode with all optimizations enabled: optimize for speed, whole program optimization, etc. Whether I put all the code of case 1 in the header or use it in combination with forceinline, it doesn't matter; it doesn't get inlined. Here is the generated ASM:
Case 1:
movaps xmm0, XMMWORD PTR [esi]
lea eax, DWORD PTR $T1[esp+32]
movaps xmm1, xmm0
push eax
call ?Vector4Dot#Framework##SA?AVVector#23#T__m128##0#Z ; Framework::Vector4Dot
movaps xmm0, XMMWORD PTR [eax]
add esp, 4
movaps XMMWORD PTR [esi], xmm0
lea esi, DWORD PTR [esi+16]
dec edi
jne SHORT $LL3#Test89
This is at the place where I call Vector4Dot. Here is the inside of the function:
mov eax, DWORD PTR _v2$[esp-4]
movaps xmm0, XMMWORD PTR [edx]
dpps xmm0, XMMWORD PTR [eax], 255 ; 000000ffH
movaps XMMWORD PTR [ecx], xmm0
mov eax, ecx
For case 2 I just get:
movaps xmm0, XMMWORD PTR [eax]
dpps xmm0, xmm0, 255 ; 000000ffH
movaps XMMWORD PTR [eax], xmm0
lea eax, DWORD PTR [eax+16]
dec ecx
jne SHORT $LL3#Test78
This is a LOT faster. I'm not sure why the compiler can't deal with that constructor. If I change case 1 like this:
__m128 Vector4Dot(Vector v1, Vector v2)
{
    return(_mm_dp_ps(v1.Entry, v2.Entry, DotMask4));
}
it compiles at maximum speed, the same as case 2. It's this "class overhead" that is giving me the performance penalty. Is there any way to get around this? Or am I stuck with using raw __m128s instead of the Vector class?