The performance of multidimensional arrays and arrays of arrays - c++

I had always thought that a multidimensional array, indexed once via a multiplication, is faster than an array of arrays, indexed via two pointer dereferences, thanks to better locality and lower space overhead.
I ran a small test a while ago, and the result was quite surprising. At least according to my callgrind profiler, the same function ran slightly faster with the array of arrays.
I wonder whether I should change the definition of my matrix class to use an array of arrays internally. This class is used virtually everywhere in my simulation engine (not exactly sure what to call it), and I do want to find the best way to save a few seconds.
test_matrix has the cost of 350 200 020 and test_array_array has the cost of 325 200 016. The code was compiled with -O3 by clang++. All member functions are inlined according to the profiler.
#include <iostream>
#include <memory>
template<class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
BasicArray() = default;
BasicArray(std::size_t);
};
template<class T>
BasicArray<T>::BasicArray(std::size_t size)
: std::unique_ptr<T[]>(new T[size]) {}
template<class T>
class Matrix : public BasicArray<T> {
public:
Matrix() = default;
Matrix(std::size_t, std::size_t);
T &operator()(std::size_t, std::size_t) const;
std::size_t get_index(std::size_t, std::size_t) const;
std::size_t get_size(std::size_t) const;
private:
std::size_t sizes[2];
};
template<class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
: BasicArray<T>(i * j)
, sizes {i, j} {}
template<class T>
T &Matrix<T>::operator()(std::size_t i, std::size_t j) const {
return (*this)[get_index(i, j)];
}
template<class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const {
return i * get_size(2) + j;
}
template<class T>
std::size_t Matrix<T>::get_size(std::size_t d) const {
return sizes[d - 1];
}
template<class T>
class Array : public BasicArray<T> {
public:
Array() = default;
Array(std::size_t);
std::size_t get_size() const;
private:
std::size_t size;
};
template<class T>
Array<T>::Array(std::size_t size)
: BasicArray<T>(size)
, size(size) {}
template<class T>
std::size_t Array<T>::get_size() const {
return size;
}
static void __attribute__((noinline)) test_matrix(const Matrix<int> &m) {
for (std::size_t i = 0; i < m.get_size(1); ++i) {
for (std::size_t j = 0; j < m.get_size(2); ++j) {
static_cast<volatile void>(m(i, j) = i + j);
}
}
}
static void __attribute__((noinline))
test_array_array(const Array<Array<int>> &aa) {
for (std::size_t i = 0; i < aa.get_size(); ++i) {
for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
static_cast<volatile void>(aa[i][j] = i + j);
}
}
}
int main() {
constexpr int N = 1000;
Matrix<int> m(N, N);
Array<Array<int>> aa(N);
for (std::size_t i = 0; i < aa.get_size(); ++i) {
aa[i] = Array<int>(N);
}
test_matrix(m);
test_array_array(aa);
}

The performance of the two approaches is nearly the same because the innermost loop can be optimized the same way in both cases and the computation is likely memory-bound. This means the overhead of the indirection is diluted in the rest of the computation, which takes most of the time and is subject to variations that can actually be bigger than the overhead. Thus the benchmark is not very sensitive to the difference between the two methods. Here is the assembly code of the innermost loop (left side: matrix, right side: array of arrays):
.LBB0_17: .LBB1_30:
movdqa xmm5, xmm1 movdqa xmm5, xmm1
paddq xmm5, xmm4 paddq xmm5, xmm4
movdqa xmm6, xmm0 movdqa xmm6, xmm0
paddq xmm6, xmm4 paddq xmm6, xmm4
shufps xmm5, xmm6, 136 shufps xmm5, xmm6, 136
movdqa xmm6, xmm3 movdqa xmm6, xmm3
paddq xmm6, xmm1 paddq xmm6, xmm1
movdqa xmm7, xmm3 movdqa xmm7, xmm3
paddq xmm7, xmm0 paddq xmm7, xmm0
shufps xmm6, xmm7, 136 shufps xmm6, xmm7, 136
movups xmmword ptr [rdi + 4*rbx - 48], xmm5 movups xmmword ptr [rsi + 4*rcx], xmm5
movups xmmword ptr [rdi + 4*rbx - 32], xmm6 movups xmmword ptr [rsi + 4*rcx + 16], xmm6
movdqa xmm5, xmm0 movdqa xmm5, xmm0
paddq xmm5, xmm10 paddq xmm5, xmm10
movdqa xmm6, xmm1 movdqa xmm6, xmm1
paddq xmm6, xmm10 paddq xmm6, xmm10
movdqa xmm7, xmm3 movdqa xmm7, xmm3
paddq xmm7, xmm6 paddq xmm7, xmm6
paddq xmm6, xmm4 paddq xmm6, xmm4
movdqa xmm2, xmm3 movdqa xmm2, xmm3
paddq xmm2, xmm5 paddq xmm2, xmm5
paddq xmm5, xmm4 paddq xmm5, xmm4
shufps xmm6, xmm5, 136 shufps xmm6, xmm5, 136
shufps xmm7, xmm2, 136 shufps xmm7, xmm2, 136
movups xmmword ptr [rdi + 4*rbx - 16], xmm6 movups xmmword ptr [rsi + 4*rcx + 32], xmm6
movups xmmword ptr [rdi + 4*rbx], xmm7 movups xmmword ptr [rsi + 4*rcx + 48], xmm7
add rbx, 16 add rcx, 16
paddq xmm1, xmm11 paddq xmm1, xmm11
paddq xmm0, xmm11 paddq xmm0, xmm11
add rbp, 2 add rax, 2
jne .LBB0_17 jne .LBB1_30
As we can see, the loop basically contains the same instructions for the two methods. The order of the stores (movups) is not the same but this should not impact the execution time (especially if the array is aligned in memory). The same thing applies for the different register names. The loop is vectorized using SIMD instructions (SSE) and unrolled 4 times so it can be pretty fast (4 items can be computed per SIMD unit and 16 items per iteration). About 62 iterations are needed for the inner-most loop to complete.
That being said, in both cases, the loops write 4*1000*1000 bytes = 3.81 MiB of data. This typically fits in the L3 cache on relatively recent processors (or only in the RAM on old processors). The throughput of the L3/RAM available to a single core is limited (far lower than that of the L1 or even the L2 cache), so a single core will likely stall waiting for the memory hierarchy to be ready. As a result, the loops are not so fast, since they spend most of their time waiting for the memory hierarchy. Hardware prefetchers are pretty efficient on modern x86-64 processors, so they can prefetch data before a core actually requests it, especially for stores and if the written data is contiguous.
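To make the arithmetic above concrete, here is a minimal sketch of the working-set computation. The helper names (working_set_bytes, to_mib, fits_in_cache) are made up for illustration, and any cache size you pass in is an assumption about your machine, not a measured value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes written by one full pass over a rows x cols matrix of T.
template <class T>
constexpr std::size_t working_set_bytes(std::size_t rows, std::size_t cols) {
    return rows * cols * sizeof(T);
}

// Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Hypothetical helper: does the working set fit in a cache of `cache_bytes`?
// The cache size is supplied by the caller; it is not detected.
constexpr bool fits_in_cache(std::size_t bytes, std::size_t cache_bytes) {
    return bytes <= cache_bytes;
}
```

For a 1000x1000 matrix of 32-bit integers this gives 4,000,000 bytes, which is far larger than a typical 32 KiB L1 data cache but would fit in, say, a 16 MiB L3.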
The array of arrays method is generally less efficient because the sub-arrays are not guaranteed to be allocated contiguously. Modern memory allocators typically use a bucket-based strategy to find memory blocks fitting the requested size. In a program like this benchmark, the requested memory can be contiguous (or very close to it), since all the arrays are allocated in a row and the bucket memory is generally not fragmented when a program starts. However, when the memory is fragmented, the arrays tend to be located in non-contiguous regions, causing an effect called memory diffusion. Memory diffusion makes it harder for prefetchers to be efficient, resulting in slower loads/stores. This is especially true for loads, but stores also cause loads here on most x86-64 processors (Intel processors or recent AMD ones) due to the write-allocate cache policy. In short, this is one main reason why the array of arrays method is generally less efficient in real applications. But it is not the only one: the other comes from the indirections.
The overhead of the additional indirections is pretty small in this benchmark, mainly because of the memory-bound inner loop. The pointers of the sub-arrays are stored contiguously, so they can fit in the L1 cache and be efficiently prefetched. This means the indirections can be fast because they are unlikely to cause a cache miss. The indirections require additional load instructions, but since most of the time is spent waiting for the L3 cache or the RAM, the overhead of such instructions is very small, if not negligible. Indeed, modern processors execute instructions in parallel and out of order, so the L1 accesses can be overlapped with L3/RAM loads/stores. For example, Intel processors have dedicated units for that: the Line Fill Buffers (between the L1 and L2 caches), the Super-Queue (between the L2 and L3 caches) and the Integrated Memory Controller (between the L3 and the RAM). Most operations are done more or less asynchronously. That being said, things start to be synchronous when cores stall waiting on incoming data or when buffers/queues are saturated.
This can happen with a smaller innermost loop or if the 2D array is traversed non-contiguously. Indeed, if the innermost loop only computes a few items, or if it is even replaced with 1 statement, then the overhead of the indirections is much more visible. The processor cannot (easily) overlap the overhead, and the array of arrays method becomes slower than the matrix-based approach. Here is the result of this new benchmark. The gap between the two methods seems small, but one should keep in mind that the cache is hot during the benchmark while it may not be in a real-world application. A cold cache benefits the matrix-based method, which needs less data to be loaded from the cache (no need to load the array of pointers).
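For reference, a non-contiguous traversal of this kind can be sketched as follows. This is not the exact code of the benchmark mentioned above (which I am not reproducing here), just an illustration with hypothetical names (FlatMatrix, RowTable, fill_column) built on std::vector: one column is written, so the flat version only computes an address per store, while the array of arrays version has to load a row pointer before every store.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat row-major storage: element (i, j) lives at data[i * cols + j].
struct FlatMatrix {
    std::size_t rows, cols;
    std::vector<int> data;
    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
    int& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// Array of arrays: every row is a separate heap allocation.
using RowTable = std::vector<std::vector<int>>;

// Write column j of the flat matrix: consecutive stores are `cols` ints apart.
void fill_column(FlatMatrix& m, std::size_t j) {
    for (std::size_t i = 0; i < m.rows; ++i)
        m(i, j) = static_cast<int>(i + j);
}

// Write column j of the row table: each store first loads the row's pointer.
void fill_column(RowTable& aa, std::size_t j) {
    for (std::size_t i = 0; i < aa.size(); ++i)
        aa[i][j] = static_cast<int>(i + j);
}
```

Both functions write the same values; only the way the address of each element is obtained differs, which is what the assembly comparison below is about.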
To understand why the gap is not so huge, we need to analyse the assembly code again. The full assembly code can be seen on Godbolt. Clang uses 3 different strategies to speed up the loop (SIMD, scalar+unrolling and scalar), but the unrolled one is the one that should actually be used in this case. Here is the hot loop for the matrix-based method:
.LBB0_27:
mov dword ptr [r12 + rdi], esi
lea ebx, [rsi + 1]
mov dword ptr [r12 + rdx], ebx
lea ebx, [rsi + 2]
mov dword ptr [r12 + rcx], ebx
lea ebx, [rsi + 3]
mov dword ptr [r12 + rax], ebx
add rsi, 4
add r12, r8
cmp rsi, r9
jne .LBB0_27
Here is the one for the array of array:
.LBB1_28:
mov rbp, qword ptr [rdi - 48]
mov dword ptr [rbp], edx
mov rbp, qword ptr [rdi - 32]
lea ebx, [rdx + 1]
mov dword ptr [rbp], ebx
mov rbp, qword ptr [rdi - 16]
lea ebx, [rdx + 2]
mov dword ptr [rbp], ebx
mov rbp, qword ptr [rdi]
lea ebx, [rdx + 3]
mov dword ptr [rbp], ebx
add rdx, 4
add rdi, 64
cmp rsi, rdx
jne .LBB1_28
At first glance, the second one seems clearly less efficient because there are far more instructions to execute. But as said previously, modern processors execute instructions in parallel. Thus, the instruction dependencies, and especially the critical path, play a significant role in the resulting performance (e.g. dependency chains), not to mention the saturation of the processor units and, more specifically, of the back-end ports of the computing cores. Since the performance of this loop strongly depends on the target architecture, we should consider a specific processor architecture in order to analyse how fast each method is in this case. Let's choose a relatively recent mainstream architecture: Intel CoffeeLake.
The first loop is clearly bound by the store instructions (mov dword ptr [...], ...), since there is only 1 store port on this architecture, while the lea and add instructions can be executed on multiple ports (and the cmp+jne is cheap because it can be macro-fused and predicted). The loop should take 4 cycles per iteration unless it is bound by the memory hierarchy.
The second loop is more complex, but it is also bound by the store instructions (mov dword ptr [rbp], edx). Indeed, CoffeeLake has two load ports, so 2 mov rbp, qword ptr [...] instructions can be executed per cycle; the same is true for the lea, which can also be executed on 2 ports; the add and cmp+jne are still cheap. The number of instructions is not big enough to saturate the front-end, so the ports are the bottleneck here. In the end, the loop also takes 4 cycles per iteration, assuming the memory hierarchy is not a problem. The thing is, the scheduling of the instructions is not always perfect in practice, so the dependencies on load instructions can introduce a significant latency if something goes wrong. Since there is a higher pressure on the memory hierarchy, a cache miss would cause the second loop to stall for many cycles, as opposed to the first loop (which only does writes). Not to mention that a cache miss is more likely to happen in the second case, since there is an 8 KB buffer of pointers (1000 pointers of 8 bytes each) to keep in the L1 cache for this computation to be fast: loading items from the L2 takes about a dozen cycles, and loading data from the L3 can cause some cache lines to be evicted. This is why the second loop is slightly slower in this new benchmark.
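The port-counting argument above can be written down as a tiny back-of-the-envelope model. This is only a sketch with hypothetical names (LoopBody, Ports, port_bound_cycles): it considers load and store ports only and ignores the front-end, latencies, dependency chains and the memory hierarchy, so it gives a lower bound rather than a prediction.

```cpp
#include <algorithm>
#include <cassert>

// Per-iteration counts of the operations that compete for scarce ports.
struct LoopBody {
    unsigned loads;
    unsigned stores;
};

// How many loads and stores the core can issue per cycle (assumed values).
struct Ports {
    unsigned load_ports;
    unsigned store_ports;
};

// Ceiling division for a non-zero divisor.
constexpr unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Lower bound on cycles per iteration if only port throughput limits the loop.
constexpr unsigned port_bound_cycles(LoopBody body, Ports ports) {
    return std::max(ceil_div(body.loads, ports.load_ports),
                    ceil_div(body.stores, ports.store_ports));
}
```

With 2 load ports and 1 store port, the first loop (0 loads, 4 stores) and the second loop (4 loads, 4 stores) both come out at 4 cycles per iteration, which matches the reasoning above. With 2 store ports the same model gives 2 cycles for both, but as noted this kind of estimate becomes less trustworthy once a single port is no longer the obvious bottleneck.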
What if we use another processor? The result can be significantly different, especially starting from IceLake (Intel) and Zen2 (AMD), as they have 2 store ports. Things are pretty difficult to analyse on such processors (since the bottleneck may not be a single port, nor even the back-end at all). This is especially true for Zen2/Zen3, which have 2 shared load/store ports and one dedicated only to stores (meaning 2 loads + 1 store scheduled in 1 cycle, or 1 load + 2 stores, or no load + 3 stores). Thus, the best option is certainly to run practical benchmarks on such platforms while taking care to avoid benchmarking biases.
Note that the memory alignment of the sub-arrays is pretty critical too. With N=1024, the matrix-based method can be significantly slower. This is because the memory layout of the matrix-based method is likely to cause cache thrashing in this case, while the array-of-arrays method typically adds some padding that prevents the issue in this very specific case. The thing is, the added padding is typically sizeof(size_t) for mainstream bucket-based allocators, so the issue just happens for another value of N and is not really prevented. In fact, for N=1022, the array-of-arrays method is significantly slower. This perfectly matches the above explanation, since sizeof(size_t) = 2*sizeof(int) = 8 on the target machine (64-bit). Thus, both methods suffer from this issue, but it can be easily controlled with the matrix-based method by adding some padding, while it cannot be easily controlled with the array-of-arrays method because the implementation of the allocator is platform-dependent by default.
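As an illustration of "adding some padding" to the matrix-based method, here is a minimal sketch. PaddedMatrix and its pad parameter are hypothetical names, not part of the question's code, and the right amount of padding is an assumption that depends on the cache geometry of the target machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix whose rows are `stride` elements apart, with stride >= cols.
// The extra `pad` elements at the end of each row are never accessed; they only
// shift the start of each row so rows do not all map to the same cache sets.
template <class T>
class PaddedMatrix {
public:
    PaddedMatrix(std::size_t rows, std::size_t cols, std::size_t pad)
        : rows_(rows), cols_(cols), stride_(cols + pad), data_(rows * stride_) {}

    T& operator()(std::size_t i, std::size_t j) { return data_[i * stride_ + j]; }
    const T& operator()(std::size_t i, std::size_t j) const {
        return data_[i * stride_ + j];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    std::size_t stride() const { return stride_; }

private:
    // stride_ is declared before data_ so it is initialized first.
    std::size_t rows_, cols_, stride_;
    std::vector<T> data_;
};
```

For example, PaddedMatrix<int> m(1024, 1024, 16) keeps the logical size at 1024x1024 but places rows 1040 ints apart instead of 1024. Whether 16 is a good value has to be checked by benchmarking on the actual platform.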

I haven't looked through your code in a lot of detail. Instead, I tested your implementations against a really simple wrapper around an std::vector, then added a little bit of timing code so I didn't have to run under a profiler to get a meaningful result. Oh, and I really didn't like the code taking a reference to const, then using a cast to void to allow the code to modify the matrix. I certainly can't imagine expecting people to do that in normal use.
The result looked like this:
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <locale>
#include <memory>
#include <string>
#include <vector>
template <class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
BasicArray() = default;
BasicArray(std::size_t);
};
template <class T>
BasicArray<T>::BasicArray(std::size_t size)
: std::unique_ptr<T[]>(new T[size])
{
}
template <class T>
class Matrix : public BasicArray<T> {
public:
Matrix() = default;
Matrix(std::size_t, std::size_t);
T& operator()(std::size_t, std::size_t) const;
std::size_t get_index(std::size_t, std::size_t) const;
std::size_t get_size(std::size_t) const;
private:
std::size_t sizes[2];
};
template <class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
: BasicArray<T>(i * j)
, sizes { i, j }
{
}
template <class T>
T& Matrix<T>::operator()(std::size_t i, std::size_t j) const
{
return (*this)[get_index(i, j)];
}
template <class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const
{
return i * get_size(2) + j;
}
template <class T>
std::size_t Matrix<T>::get_size(std::size_t d) const
{
return sizes[d - 1];
}
template <class T>
class Array : public BasicArray<T> {
public:
Array() = default;
Array(std::size_t);
std::size_t get_size() const;
private:
std::size_t size;
};
template <class T>
Array<T>::Array(std::size_t size)
: BasicArray<T>(size)
, size(size)
{
}
template <class T>
std::size_t Array<T>::get_size() const
{
return size;
}
static void test_matrix(Matrix<int>& m)
{
for (std::size_t i = 0; i < m.get_size(1); ++i) {
for (std::size_t j = 0; j < m.get_size(2); ++j) {
m(i, j) = i + j;
}
}
}
static void
test_array_array(Array<Array<int>>& aa)
{
for (std::size_t i = 0; i < aa.get_size(); ++i) {
for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
aa[i][j] = i + j;
}
}
}
namespace JVC {
template <class T>
class matrix {
std::vector<T> data;
size_t cols;
size_t rows;
public:
matrix(size_t y, size_t x)
: data(x * y) // initializers listed in member declaration order
, cols(x)
, rows(y)
{
}
T& operator()(size_t y, size_t x)
{
return data[y * cols + x];
}
T operator()(size_t y, size_t x) const
{
return data[y * cols + x];
}
std::size_t get_rows() const { return rows; }
std::size_t get_cols() const { return cols; }
};
static void test_matrix(matrix<int>& m)
{
for (std::size_t i = 0; i < m.get_rows(); ++i) {
for (std::size_t j = 0; j < m.get_cols(); ++j) {
m(i, j) = i + j;
}
}
}
}
template <class F, class C>
void do_test(F f, C &c, std::string const &label) {
using namespace std::chrono;
auto start = high_resolution_clock::now();
f(c);
auto stop = high_resolution_clock::now();
std::cout << std::setw(20) << label << " time: ";
std::cout << duration_cast<milliseconds>(stop - start).count() << " ms\n";
}
int main()
{
std::cout.imbue(std::locale(""));
constexpr int N = 20000;
Matrix<int> m(N, N);
Array<Array<int>> aa(N);
JVC::matrix<int> m2 { N, N };
for (std::size_t i = 0; i < aa.get_size(); ++i) {
aa[i] = Array<int>(N);
}
using namespace std::chrono;
do_test(test_matrix, m, "Matrix");
do_test(test_array_array, aa, "array of arrays");
do_test(JVC::test_matrix, m2, "JVC Matrix");
}
And the result looked like this:
Matrix time: 1,893 ms
array of arrays time: 1,812 ms
JVC Matrix time: 620 ms
So, a trivial wrapper around std::vector is faster than either of your implementations by a factor of about 3.
I would suggest that with this much overhead, it's difficult to be at all certain the timing difference you're seeing stems from storage layout.

To my surprise, your tests are basically correct.
They go against historical knowledge too. (see Dynamic Arrays in C—The Wrong Way).
I corroborated the result with Quickbench and the two timings are almost the same.
https://quick-bench.com/q/FhhJTV8IdIym0rUMkbUxvgnXPeA
I have no other explanation than that, since your code is so regular, the compiler is figuring out that you are asking for consecutive equal-sized allocations, which can be replaced by a single block, and, in turn, the hardware can later predict the access pattern.
However, I tried making N volatile and inserting a bunch of randomly interleaved allocations at initialization, and I still got the same result.
I even lowered the optimization level to -Og, raised it up to -Ofast, and increased N, and I am still getting the same result.
It was only when I used benchmark::ClobberMemory that I saw a very small but appreciable difference (with clang, but not with GCC).
So it could have to do with the memory access pattern.
https://quick-bench.com/q/FhhJTV8IdIym0rUMkbUxvgnXPeA
Another thing that made a (small) difference, and is important in real applications, was to include the initialization step inside the timing. Still surprisingly, the difference was only between 5 and 10% (in favor of the single-block array).
Conclusion: The compiler, or most likely the hardware, must be doing something really amazing.
The fact the pointer-indirection version is never really faster than the block array makes me think that something is reducing one case to the other in effect.
This deserves more research.
Here is the machine code, if someone is interested: https://godbolt.org/z/ssGj7aq7j
Afterthought: Before abandoning contiguous arrays I would at least remain suspicious that this result could be an oddity for 2 dimensions and it is not valid for structures of 3 or 4 dimensions.
Disclaimer: This is interesting to me because I am implementing a multidimensional array library and I care about performance.
The library is a generalization of your class Matrix to arbitrary dimensions: https://gitlab.com/correaa/boost-multi.

Related

Performance differences in Eigen between auto and Eigen::Ref and concrete type

I am trying to understand how Eigen::Ref works to see if I can take some advantage of it in my code.
I have designed a benchmark like this
static void value(benchmark::State &state) {
for (auto _ : state) {
const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
Eigen::Matrix<double, 9, 1>::Random();
auto start = std::chrono::high_resolution_clock::now();
const Eigen::Vector3d v0 = vs.segment<3>(0);
const Eigen::Vector3d v1 = vs.segment<3>(3);
const Eigen::Vector3d v2 = vs.segment<3>(6);
const Eigen::Vector3d vt = v0 + v1 + v2;
const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
benchmark::DoNotOptimize(v);
auto end = std::chrono::high_resolution_clock::now();
auto elapsed_seconds =
std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
state.SetIterationTime(elapsed_seconds.count());
}
}
I have two more tests like this one, using const Eigen::Ref<const Eigen::Vector3d> and auto, respectively, for v0, v1, v2 and vt.
The results of these benchmarks are
Benchmark Time CPU Iterations
--------------------------------------------------------------------
value/manual_time 23.4 ns 113 ns 29974946
ref/manual_time 23.0 ns 111 ns 29934053
with_auto/manual_time 23.6 ns 112 ns 29891056
As you can see, all the tests behave exactly the same. So I thought that maybe the compiler was doing its magic and decided to test with -O0. These are the results:
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
value/manual_time 2475 ns 3070 ns 291032
ref/manual_time 2482 ns 3077 ns 289258
with_auto/manual_time 2436 ns 3012 ns 263170
Again, the three cases behave the same.
If I understand correctly, the first case, using Eigen::Vector3d, should be slower, as it has to keep the copies, perform the v0 + v1 + v2 operation and save it, and then perform another operation and save that too.
The auto case should be the fastest, as it should skip all the writes.
I think the ref case should be as fast as auto. If I understand correctly, all my operations can be stored in a reference to a const Eigen::Vector3d, so the operations should be skipped, right?
Why are the results all the same? Am I misunderstanding something, or is the benchmark just badly designed?
One big issue with the benchmark is that you measure the time inside the hot benchmarking loop. Measuring the time itself takes time, and it can be far more expensive than the actual computation. In fact, I think this is what is happening in your case. Indeed, with Clang 13 and -O3, here is the assembly code actually benchmarked (available on Godbolt):
mov rbx, rax
mov rax, qword ptr [rsp + 24]
cmp rax, 2
jle .LBB0_17
cmp rax, 5
jle .LBB0_17
cmp rax, 8
jle .LBB0_17
mov rax, qword ptr [rsp + 16]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16] # xmm1 = mem[0],zero
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
movapd xmm2, xmm0
mulpd xmm2, xmm0
movapd xmm3, xmm2
unpckhpd xmm3, xmm2 # xmm3 = xmm3[1],xmm2[1]
addsd xmm3, xmm2
movapd xmm2, xmm1
mulsd xmm2, xmm1
addsd xmm2, xmm3
movapd xmm3, xmm1
mulsd xmm3, xmm2
unpcklpd xmm2, xmm2 # xmm2 = xmm2[0,0]
mulpd xmm2, xmm0
addpd xmm2, xmm0
movapd xmmword ptr [rsp + 32], xmm2
addsd xmm3, xmm1
movsd qword ptr [rsp + 48], xmm3
This code can be executed in a few dozen cycles, so probably in less than 10-15 ns on a modern 4~5 GHz x86 processor. Meanwhile, high_resolution_clock::now() should use an RDTSC/RDTSCP instruction that also takes dozens of cycles to complete. For example, on a Skylake processor, it should take about 25 cycles (similar on newer Intel processors). On an AMD Zen processor, it takes about 35-38 cycles. Additionally, it adds a synchronization that may not be representative of the actual application. Please consider measuring the time of a benchmarking loop with many iterations.
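To illustrate the last suggestion, here is a minimal sketch in plain standard C++ (no Eigen, no Google Benchmark). The helper name average_ns is hypothetical. It reads the clock only twice around the whole loop, so the cost of the clock reads is spread over all iterations instead of being paid on every one.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Time `iterations` calls of `f` with a single pair of clock reads and
// return the average duration of one call in nanoseconds.
template <class F>
double average_ns(F&& f, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        f();
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> total = stop - start;
    return total.count() / static_cast<double>(iterations);
}
```

A call would look like average_ns([&] { /* computation under test */ }, 1000000). You still need something that keeps the compiler from optimizing the computation away (in Google Benchmark, benchmark::DoNotOptimize, as in the question). With Google Benchmark itself, the simpler route is probably to drop the manual timing and let the for (auto _ : state) loop do the measurement.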
Because everything happens inside a function, the compiler can do escape analysis and optimize away the copies into the vectors.
To check this, I put the code in a function, to look at the assembler:
Eigen::Vector3d foo(const Eigen::VectorXd& vs)
{
const Eigen::Vector3d v0 = vs.segment<3>(0);
const Eigen::Vector3d v1 = vs.segment<3>(3);
const Eigen::Vector3d v2 = vs.segment<3>(6);
const Eigen::Vector3d vt = v0 + v1 + v2;
return vt.transpose() * vt * vt + vt;
}
which turns into this assembler
push rax
mov rax, qword ptr [rsi + 8]
...
mov rax, qword ptr [rsi]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16]
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
...
movupd xmmword ptr [rdi], xmm2
addsd xmm3, xmm1
movsd qword ptr [rdi + 16], xmm3
mov rax, rdi
pop rcx
ret
Notice how the only memory operations are two GP register loads to get start pointer and length, then a couple of memory loads to get the vector content into registers, before we write the result to memory in the end.
This only works since we deal with fixed-sized vectors. With VectorXd copies would definitely take place.
Alternative benchmarks
Ref is typically used on function calls. Why not try it with a function that cannot be inlined? Or come up with an example where escape analysis cannot work and the objects really have to be materialized. Something like this:
struct Foo
{
public:
Eigen::Vector3d v0;
Eigen::Vector3d v1;
Eigen::Vector3d v2;
Foo(const Eigen::VectorXd& vs) __attribute__((noinline));
Eigen::Vector3d operator()() const __attribute__((noinline));
};
Foo::Foo(const Eigen::VectorXd& vs)
: v0(vs.segment<3>(0)),
v1(vs.segment<3>(3)),
v2(vs.segment<3>(6))
{}
Eigen::Vector3d Foo::operator()() const
{
const Eigen::Vector3d vt = v0 + v1 + v2;
return vt.transpose() * vt * vt + vt;
}
Eigen::Vector3d bar(const Eigen::VectorXd& vs)
{
Foo f(vs);
return f();
}
By splitting initialization and usage into non-inline functions, the copies really have to be done. Of course we now change the entire use case. You have to decide if this is still relevant to you.
Purpose of Ref
Ref exists for the sole purpose of providing a function interface that can accept both a full matrix/vector and a slice of one. Consider this:
Eigen::VectorXd foo(const Eigen::VectorXd&)
This interface can only accept a full vector as its input. If you want to call foo(vector.head(10)) you have to allocate a new vector to hold the vector segment. Likewise it always returns a newly allocated vector which is wasteful if you want to call it as output.head(10) = foo(input). So we can instead write
void foo(Eigen::Ref<Eigen::VectorXd> out, const Eigen::Ref<const Eigen::VectorXd>& in);
and use it as foo(output.head(10), input.head(10)) without any copies being created. This is only ever useful across compilation units. If you have one cpp file declaring a function that is used in another, Ref allows this to happen. Within a cpp file, you can simply use a template.
template<class Derived1, class Derived2>
void foo(const Eigen::MatrixBase<Derived1>& out,
const Eigen::MatrixBase<Derived2>& in)
{
Eigen::MatrixBase<Derived1>& mutable_out =
const_cast<Eigen::MatrixBase<Derived1>&>(out);
mutable_out = ...;
}
A template will always be faster because it can make use of the concrete data type. For example if you pass an entire vector, Eigen knows that the array is properly aligned. And in a full matrix it knows that there is no stride between columns. It doesn't know either of these with Ref. In this regard, Ref is just a fancy wrapper around Eigen::Map<Type, Eigen::Unaligned, Eigen::OuterStride<>>.
Likewise there are cases where Ref has to create temporary copies. The most common case is if the inner stride is not 1. This happens for example if you pass a row of a matrix (but not a column. Eigen is column-major by default) or the real part of a complex valued matrix. You will not even receive a warning for this, your code will simply run slower than expected.
The only reasons to use Ref inside a single cpp file are:
1. To make the code more readable. That template pattern shown above certainly doesn't tell you much about the expected types.
2. To reduce code size, which may have a performance benefit, but usually doesn't. It does help with compile time, though.
Use with fixed-size types
Since your use case seems to involve fixed-sized vectors, let's consider this case in particular and look at the internals.
void foo(const Eigen::Vector3d&);
void bar(const Eigen::Ref<const Eigen::Vector3d>&);
int main()
{
Eigen::VectorXd in = ...;
foo(in.segment<3>(6));
bar(in.segment<3>(6));
}
The following things will happen when you call foo:
1. We copy 3 doubles from in[6] to the stack. This takes 4 instructions (2 movapd, 2 movsd).
2. A pointer to these values is passed to foo. (Even fixed-size Eigen vectors declare a destructor, therefore they are always passed on the stack, even if we declare them pass-by-value.)
3. foo then loads the values via that pointer, taking 2 instructions (movapd + movsd).
The following happens when we call bar:
1. We create a Ref<Vector> object. For this, we put a pointer to in.data() + 6 on the stack.
2. A pointer to this pointer is passed to bar.
3. bar loads the pointer from the stack, then loads the values.
Notice how there is barely any difference. Maybe Ref saves a few instructions but it also introduces one indirection more. Compared to everything else, this is hardly significant. It is certainly too little to measure.
We are also entering the realm of micro-optimizations here. This can lead to situations where just the arrangement of the code results in measurably different optimizations. See, for example, the question "Eigen: Why is Map slower than Vector3d for this template expression?".

Eigen::VectorXd::operator += seems ~69% slower than looping through a std::vector

The below code (needs google benchmark) fills up two vectors and adds them up, storing the result in the first one. For the vector types I've used Eigen::VectorXd and std::vector for performance comparison:
#include <Eigen/Core>
#include <benchmark/benchmark.h>
#include <vector>
auto constexpr N = 1024u;
template <typename TVector>
TVector generate(unsigned min) {
TVector v(N);
for (unsigned i = 0; i < N; ++i)
v[i] = static_cast<double>(min + i);
return v;
}
auto ev1 = generate<Eigen::VectorXd>(0);
auto ev2 = generate<Eigen::VectorXd>(N);
auto sv1 = generate<std::vector<double>>(0);
auto sv2 = generate<std::vector<double>>(N);
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
v1 += v2;
}
void add_vectors(std::vector<double>& v1, std::vector<double> const& v2) {
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
static void eigen(benchmark::State& state) {
for (auto _ : state) {
add_vectors(ev1, ev2);
benchmark::DoNotOptimize(ev1);
}
}
static void standard(benchmark::State& state) {
for (auto _ : state) {
add_vectors(sv1, sv2);
benchmark::DoNotOptimize(sv1);
}
}
BENCHMARK(standard);
BENCHMARK(eigen);
I'm running it on an Intel Xeon E-2286M @ 2.40GHz, using Eigen 3.3.9 and MSVC 16.11.2 with (among others) these relevant compiler switches: /GL, /Gy, /O2, /D "NDEBUG", /Oi, and /arch:AVX. A typical output looks like this:
Run on (16 X 2400 MHz CPU s)
CPU Caches:
L1 Data 32K (x8)
L1 Instruction 32K (x8)
L2 Unified 262K (x8)
L3 Unified 16777K (x1)
--------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------
standard 99 ns 100 ns 7466667
eigen 169 ns 169 ns 4072727
which seems to show that operating on std::vector is ~69% faster than on Eigen::VectorXd. In the disassembly, the tight loops look like these:
// For Eigen::VectorXd
00007FF672221A11 vmovupd ymm0,ymmword ptr [rcx+rax*8]
00007FF672221A16 vaddpd ymm1,ymm0,ymmword ptr [r8+rax*8]
00007FF672221A1C vmovupd ymmword ptr [r8+rax*8],ymm1
00007FF672221A22 add rax,4
00007FF672221A26 cmp rax,rdx
00007FF672221A29 jge eigen+0C7h (07FF672221A37h)
00007FF672221A2B mov rcx,qword ptr [rsp+48h]
00007FF672221A30 mov r8,qword ptr [rsp+58h]
00007FF672221A35 jmp eigen+0A1h (07FF672221A11h)
// For std::vector
00007FF672221B40 vmovups ymm1,ymmword ptr [rax+rdx-20h]
00007FF672221B46 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx-20h]
00007FF672221B4C vmovups ymmword ptr [rax+rcx-20h],ymm1
00007FF672221B52 vmovups ymm1,ymmword ptr [rax+rdx]
00007FF672221B57 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx]
00007FF672221B5C vmovups ymmword ptr [rax+rcx],ymm1
00007FF672221B61 lea rax,[rax+40h]
00007FF672221B65 sub r8,1
00007FF672221B69 jne standard+0C0h (07FF672221B40h)
One can notice that both use vaddpd to add 4 doubles at a time. However, for std::vector the compiler unrolled the loop to perform 2 vaddpd per iteration, but it didn't do the same for Eigen::VectorXd. Another potentially important difference is that the loop for std::vector is aligned to 32 bytes (its address ends in 0x40 = 64 = 2*32).
FWIW: I've added /Qvec-report:2 and the compiler reports:
[...]\Core\AssignEvaluator.h(415) : info C5002: loop not vectorized due to reason '1305'
and reason 1305 means "Not enough type information."
My educated guess is that Eigen's effort to use intrinsics (here _mm256_add_pd) is counterproductive and confuses the compiler. Just letting the compiler do its business (auto-vectorisation) seems to be a better idea. Am I missing something, or could this be considered an Eigen bug (missed optimisation opportunity)?
TL;DR: The problem mainly comes from the constant loop bound and not directly from Eigen. Indeed, in the first case, Eigen stores the size of the vectors in the vector objects, while in the second case you explicitly use the constant N.
Clever compilers can use this information to unroll loops more aggressively because they know that N is quite big. Unrolling a loop with a small N is a bad idea since the code will be bigger and has to be read by the processor. If the code is not already loaded in the L1 cache, it must be loaded from the other caches, the RAM, or even the storage device in the worst case. The added latency is often bigger than the cost of executing a sequential loop with a small unroll factor. This is why compilers do not always unroll loops (at least not with a big unroll factor).
Inlining also plays an important role in this code. Indeed, if the functions are inlined, the compiler can propagate constants and know the size of the vector, enabling it to further optimize the code by unrolling the loop more aggressively. However, if the functions are not inlined, then there is no way the compiler can know the loop bounds. Clever compilers can still generate conditional code paths to optimize both small loops and big ones, but this makes the program bigger and introduces a small overhead. Compilers like ICC and Clang do generate different code variants when the code can be vectorized but the loop bounds are unknown, or when aliasing is not known at compile time (the number of generated variants, and so the code size, can quickly become huge).
Note that inlining functions may not be enough, since constant propagation can be blocked by complex conditionals dealing with runtime-defined variables or by non-inlined function calls. Alternatively, the quality of the constant propagation may not be sufficient for the target example.
Finally, aliasing also plays a critical role in the ability of compilers to generate SIMD instructions (and possibly to better unroll the loop) in this code. Indeed, aliasing often prevents the use of SIMD instructions, and it is not always easy for compilers to check for aliasing and generate fast implementations accordingly.
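To make the loop-bound distinction concrete, here is a minimal sketch of the two variants for std::vector. The names kN, add_runtime_bound and add_constant_bound are hypothetical and only stand in for the benchmark's N and add_vectors; whether a given compiler actually unrolls or vectorizes either one has to be checked in the generated assembly.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical compile-time size, standing in for the benchmark's N.
constexpr std::size_t kN = 512;

// Bound read from the container: the trip count is only known at run time
// unless the compiler can see through size() after inlining.
void add_runtime_bound(std::vector<double>& v1, const std::vector<double>& v2) {
    for (std::size_t i = 0; i < v1.size(); ++i)
        v1[i] += v2[i];
}

// Bound is a compile-time constant: the trip count is visible to the optimizer.
// The caller must guarantee that both vectors hold at least kN elements.
void add_constant_bound(std::vector<double>& v1, const std::vector<double>& v2) {
    for (std::size_t i = 0; i < kN; ++i)
        v1[i] += v2[i];
}
```

Both functions compute the same result when the vectors hold exactly kN elements; only the information available to the optimizer differs.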
Testing the hypotheses
If the vector-based implementation uses a loop bound stored in the vector objects, then the code generated by MSVC is not vectorized in the benchmark: the constant is not propagated correctly despite the inlining of the function. The resulting code should be much slower. Here is the generated code:
$LL24@standard:
vmovsd xmm0, QWORD PTR [r9+rcx*8]
vaddsd xmm1, xmm0, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r8+rcx*8], xmm1
mov rax, QWORD PTR std::vector<double,std::allocator<double> > sv1+8
inc edx
sub rax, QWORD PTR std::vector<double,std::allocator<double> > sv1
sar rax, 3
mov ecx, edx
cmp rcx, rax
jb SHORT $LL24@standard
If the Eigen-based implementation uses a constant loop bound, then the code generated by MSVC is vectorized and unrolled in the benchmark: the compile-time constant helps the compiler to generate a loop unrolled 2 times. It does that by mixing SSE and AVX instructions, which is very surprising (this point is discussed below). The resulting code should be significantly faster than the original Eigen implementation. However, it may not be as fast as the initial vector implementation due to the unexpected use of SSE instructions. Here is the generated code:
$LL24@eigen:
vmovupd xmm1, XMMWORD PTR [rdx+rcx-16]
vaddpd xmm1, xmm1, XMMWORD PTR [rcx-16]
vmovupd xmm2, XMMWORD PTR [rcx+rdx]
vmovupd XMMWORD PTR [rcx-16], xmm1
vaddpd xmm1, xmm2, XMMWORD PTR [rcx]
vmovupd XMMWORD PTR [rcx], xmm1
vmovups ymm1, YMMWORD PTR [rdx+rcx+16]
vaddpd ymm1, ymm1, YMMWORD PTR [rcx+16]
vmovups YMMWORD PTR [rcx+16], ymm1
lea rcx, QWORD PTR [rcx+64]
sub rax, 1
jne SHORT $LL24@eigen
Additional notes
It is worth noting that the generated code for the non-inlined version uses very inefficient scalar code (typically due to N being unknown and pointer aliasing being considered possible).
Mixing SSE and AVX instructions in such a loop is clearly a sub-optimal strategy in your case and likely a compiler issue/bug. Indeed, the execution speed of the resulting code is certainly bounded by the store instructions on Intel processors like yours. Your processor can execute 1 store instruction per cycle and 2 load instructions per cycle, and it can compute 2 vectorized additions per cycle. It can execute up to 6 micro-instructions per cycle (coming from 5 independent instructions and possibly 4 cached additional instructions). As a result, the generated code mixing SSE and AVX will take at least 3 cycles per iteration. Meanwhile, the original vector-based implementation could execute 4 loads, 2 stores, 2 additions and 3 other instructions like lea/sub/branch in only 2 cycles (possibly 3 in practice due to complex hardware details like the actual micro-instruction port scheduling and the micro-instruction cache). However, note that the compiler arguments do not specify to optimize the code for your specific processor architecture (i.e. Intel Coffee Lake). Still, I highly doubt that mixing SSE and AVX code would result in any significant performance boost on AMD processors either (or on any mainstream x86 processor). Alternatively, it might be because MSVC fails to fully detect that there is no aliasing in this case.
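The cycle estimates above can be reproduced with a small back-of-the-envelope calculation. This sketch only encodes the per-cycle limits stated in the paragraph (2 loads, 1 store, 2 vector additions); LoopMix and min_cycles are hypothetical names, the instruction counts are read off the two listings, and the model ignores everything else the hardware does.

```cpp
#include <algorithm>

// Per-iteration instruction mix of a loop body.
struct LoopMix {
    int loads;        // explicit loads plus memory operands of vaddpd
    int stores;
    int vector_adds;
};

constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Lower bound in cycles per iteration, assuming 2 loads, 1 store and
// 2 vector additions can be issued per cycle (figures taken from the text).
constexpr int min_cycles(LoopMix m) {
    return std::max(ceil_div(m.loads, 2),
                    std::max(m.stores, ceil_div(m.vector_adds, 2)));
}

// Original std::vector loop: 4 loads, 2 stores, 2 vaddpd.
constexpr LoopMix original_vector_loop{4, 2, 2};
// Mixed SSE/AVX loop: 6 loads, 3 stores, 3 vaddpd.
constexpr LoopMix mixed_sse_avx_loop{6, 3, 3};

static_assert(min_cycles(original_vector_loop) == 2, "store/load bound");
static_assert(min_cycles(mixed_sse_avx_loop) == 3, "store/load bound");
```

Under these assumptions both loops process 8 doubles per iteration, so the mixed loop needs at least 3 cycles for the work the original loop can do in 2.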
To get rid of most of the aliasing problems preventing code vectorization and loop unrolling, OpenMP SIMD directives (e.g. #pragma omp simd) can be used. MSVC supports this experimentally using the flag /openmp:experimental. Here is the resulting code:
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
#pragma omp simd
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
MSVC surprisingly generates assembly code with only SSE instructions, but if you enable AVX2, then it generates relatively good code:
$LL26@eigen:
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
lea rdx, QWORD PTR [rdx+128]
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-192]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-192]
vmovupd YMMWORD PTR [rdx+rcx-192], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-160]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-160]
vmovupd YMMWORD PTR [rdx+rcx-160], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-128]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-128]
vmovupd YMMWORD PTR [rdx+rcx-128], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-96]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-96]
vmovupd YMMWORD PTR [rdx+rcx-96], ymm0
sub r8, 1
jne $LL26@eigen
This code is still not perfect due to the unexpected useless mov instructions.
Alternatively, it may be possible to use fixed-size Eigen vectors for better performance.
Finally, note that other compilers (like Clang, ICC and GCC) behave very differently on this benchmark.
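Another way to hand the compiler a no-aliasing promise, besides the OpenMP directive, is to run the loop on raw pointers qualified with __restrict. This is only a sketch: __restrict is a compiler extension (accepted by GCC, Clang and MSVC), add_no_alias and add_vectors_no_alias are hypothetical names, and I have not measured whether it changes the MSVC output for this benchmark.

```cpp
#include <cstddef>
#include <vector>

// __restrict promises that dst and src never overlap; breaking that
// promise is undefined behaviour.
void add_no_alias(double* __restrict dst, const double* __restrict src,
                  std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

// Hypothetical wrapper for std::vector. The two vectors must be distinct
// objects and v2 must hold at least v1.size() elements.
void add_vectors_no_alias(std::vector<double>& v1,
                          const std::vector<double>& v2) {
    add_no_alias(v1.data(), v2.data(), v1.size());
}
```

The same wrapper idea applies to Eigen::VectorXd through its data() and size() members, under the same no-overlap assumption.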

Combining __restrict__ and __attribute__((aligned(32)))

I want to ensure that gcc knows:
The pointers refer to non-overlapping chunks of memory
The pointers have 32 byte alignments
Is the following correct?
template<typename T, typename T2>
void f(const T* __restrict__ __attribute__((aligned(32))) x,
T2* __restrict__ __attribute__((aligned(32))) out) {}
Thanks.
Update:
I try to use one read and lots of writes to saturate the CPU ports for writing. I hope that would make the performance gain from aligned moves more significant.
But the assembly still uses unaligned moves instead of aligned moves.
Code (also at godbolt.org)
void square(const float* __restrict__ __attribute__((aligned(32))) x,
const int size,
float* __restrict__ __attribute__((aligned(32))) out0,
float* __restrict__ __attribute__((aligned(32))) out1,
float* __restrict__ __attribute__((aligned(32))) out2,
float* __restrict__ __attribute__((aligned(32))) out3,
float* __restrict__ __attribute__((aligned(32))) out4) {
for (int i = 0; i < size; ++i) {
out0[i] = x[i];
out1[i] = x[i] * x[i];
out2[i] = x[i] * x[i] * x[i];
out3[i] = x[i] * x[i] * x[i] * x[i];
out4[i] = x[i] * x[i] * x[i] * x[i] * x[i];
}
}
Assembly compiled with gcc 8.2 and "-march=haswell -O3"
It is full of vmovups, which are unaligned moves.
.L3:
vmovups ymm1, YMMWORD PTR [rbx+rax]
vmulps ymm0, ymm1, ymm1
vmovups YMMWORD PTR [r14+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r15+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r12+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [rbp+0+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L3
and r13d, -8
vzeroupper
Same behavior even for sandybridge:
.L3:
vmovups xmm2, XMMWORD PTR [rbx+rax]
vinsertf128 ymm1, ymm2, XMMWORD PTR [rbx+16+rax], 0x1
vmulps ymm0, ymm1, ymm1
vmovups XMMWORD PTR [r14+rax], xmm0
vextractf128 XMMWORD PTR [r14+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r13+0+rax], xmm0
vextractf128 XMMWORD PTR [r13+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r12+rax], xmm0
vextractf128 XMMWORD PTR [r12+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [rbp+0+rax], xmm0
vextractf128 XMMWORD PTR [rbp+16+rax], ymm0, 0x1
add rax, 32
cmp rax, rdx
jne .L3
and r15d, -8
vzeroupper
Using addition instead of multiplication (godbolt).
Still unaligned moves.
No, using float *__attribute__((aligned(32))) x means that the pointer itself is stored in aligned memory, not that it points to aligned memory (see Footnote 1).
There is a way to do this, but it only helps for gcc, not clang or ICC.
See How to tell GCC that a pointer argument is always double-word-aligned? for __builtin_assume_aligned which works on all GNU C compatible compilers, and How can I apply __attribute__(( aligned(32))) to an int *? for more details about __attribute__((aligned(32))), which does work for GCC.
I used __restrict instead of __restrict__ because that C++ extension name for C99 restrict is portable to all the mainstream x86 C++ compilers, including MSVC.
typedef float aligned32_float __attribute__((aligned(32)));
void prod(const aligned32_float * __restrict x,
const aligned32_float * __restrict y,
int size,
aligned32_float* __restrict out0)
{
size &= -16ULL;
#if 0 // this works for clang, ICC, and GCC
x = (const float*)__builtin_assume_aligned(x, 32); // have to cast the result in C++
y = (const float*)__builtin_assume_aligned(y, 32);
out0 = (float*)__builtin_assume_aligned(out0, 32);
#endif
for (int i = 0; i < size; ++i) {
out0[i] = x[i] * y[i]; // auto-vectorized with a memory operand for mulps
// note clang using two separate movups loads
// instead of a memory operand for mulps
}
}
(gcc, clang, and ICC output on the Godbolt compiler explorer).
GCC and clang will use movaps / vmovaps instead of ups any time they have a compile-time alignment guarantee. (Unlike MSVC and ICC, which never use movaps for loads/stores, a missed optimization for anything that runs on Core2 / K10 or older.) And as you noticed, it's applying the -mavx256-split-unaligned-load/store effects for tunings other than Haswell (see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?), another clue that your syntax didn't work.
vmovups is not a performance problem when used on aligned memory; it performs identically to vmovaps on all AVX-supporting CPUs when the address is aligned at runtime. So in practice there's no real problem with your -march=haswell output. Only older CPUs, before Nehalem and Bulldozer, always decoded movups to multiple uops.
The real benefit (these days) to telling the compiler about alignment guarantees is that compilers sometimes emit extra code for startup/cleanup loops to reach an alignment boundary. Or without AVX, compilers can't fold a load into a memory operand for mulps unless it's aligned.
A good test case for this is out0[i] = x[i] * y[i], where the load result is only needed once. Or out0[i] *= x[i]. Knowing alignment enables movaps/mulps xmm0, [rsi], otherwise it's 2x movups + mulps. You can check for this optimization even on compilers like ICC or MSVC, which use movups even when they do know they have an alignment guarantee, but they will still make alignment-required code when they can fold a load into an ALU operation.
It seems __builtin_assume_aligned is the only really portable (to GNU C compilers) way to do this. You can do hacks like passing pointers to struct aligned_floats { alignas(32) float f[8]; };, but that's just cumbersome to use, and unless you actually access memory through objects of that type, it doesn't get compilers to assume alignment (e.g. casting a pointer to that back to float *).
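As a minimal sketch of that approach, the function below re-derives each pointer through __builtin_assume_aligned and is fed buffers declared with alignas(32), so the promise actually holds. prod_aligned and the three arrays are hypothetical names for illustration; the builtin is guarded because it only exists on GNU C compatible compilers.

```cpp
#include <cstddef>

// Hypothetical kernel: the caller promises 32-byte aligned,
// non-overlapping buffers of at least n floats each.
void prod_aligned(const float* __restrict x, const float* __restrict y,
                  float* __restrict out, std::size_t n) {
#if defined(__GNUC__)  // GCC, Clang, ICC: the builtin returns void*
    x   = static_cast<const float*>(__builtin_assume_aligned(x, 32));
    y   = static_cast<const float*>(__builtin_assume_aligned(y, 32));
    out = static_cast<float*>(__builtin_assume_aligned(out, 32));
#endif
    for (std::size_t i = 0; i < n; ++i)
        out[i] = x[i] * y[i];
}

// alignas(32) on the arrays makes the alignment promise true here.
alignas(32) float in_a[16];
alignas(32) float in_b[16];
alignas(32) float result[16];
```

Passing a pointer that is not really 32-byte aligned would be undefined behaviour, so the alignment has to be guaranteed by how the buffers are declared or allocated.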
"I try to use one read and lots of writes to saturate the CPU ports for writing."
Using more than 4 output streams can hurt by resulting in more conflict misses in the cache. Skylake's L2 cache is only 4-way, for example. But L1d is 8-way so you're probably ok for small buffers.
If you want to saturate the store port uop throughput, use narrower stores (e.g. scalar), not wide SIMD stores that need more bandwidth per uop. Back-to-back stores to the same cache line may be able to merge in the store buffer before committing to L1d, so it depends what you want to test.
Semi-related: a 2x load + 1x store memory access pattern like c[i] = a[i]+b[i] or STREAM triad will come closest to maxing out total L1d cache load+store bandwidth on Intel Sandybridge-family CPUs. On SnB/IvB, 256-bit vectors take 2 cycles per load/store, leaving time for store-address uops to use the AGUs on ports 2 or 3 during the 2nd cycle of a load. On Haswell and later (256-bit wide load/store ports), the stores need to use a non-indexed addressing mode so they can use the simple-addressing-mode store AGU on port 7.
But AMD CPUs can do up-to-2 memory ops per clock, with at most one being a store, so they'd max out with a copy-and-operate stores = loads pattern.
BTW, Intel recently announced Sunny Cove (the core microarchitecture used in Ice Lake), which will have 2x load + 2x store throughput per clock, a 2nd vector shuffle ALU, and 5-wide issue/rename. So that's fun! Compilers will need to unroll loops by at least 2 to not bottleneck on 1-per-clock loop branches.
Footnote 1: That's why (if you compile without AVX), you get a warning, and gcc omits an and rsp,-32 because it assumes RSP is already aligned. (It doesn't actually spill any YMM regs, so it should have optimized this out anyway, but gcc has had this missed-optimization bug for a while with locals or auto-vectorization-created objects with extra alignment.)
<source>:4:6: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6

C++ performance std::array vs std::vector

Good evening.
I know C-style arrays or std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have some situation in which the use of std::array performs better than with std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share some simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
return m + f;
}
And the benchmark functions:
void assemble_vec()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(v1[0],v2[0]);
glob[i+1] += comb(v1[1],v2[1]);
glob[i+2] += comb(v1[2],v2[2]);
}
}
void assemble_arr()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(a1[0],a2[0]);
glob[i+1] += comb(a1[1],a2[1]);
glob[i+2] += comb(a1[2],a2[2]);
}
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively, where, for the vector version, the numbers are being added together through less efficient opcodes, as compared to the array version, which is using (more) SSE instructions. The vector version also involves more memory lookups compared to the array version. These factors in combination are going to result in code that executes faster for the std::array version than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed to not be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In @Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
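If you know the vectors never overlap, you can do that hoisting by hand instead of waiting for the compiler. The sketch below repeats the question's globals so it stands alone; assemble_vec_hoisted is a hypothetical name, and the code is only equivalent to assemble_vec under the assumption that glob's storage does not alias v1 or v2.

```cpp
#include <cstddef>
#include <vector>

const std::size_t N = 100;
const std::vector<double> v1{1.0, -1.0, 1.0};
const std::vector<double> v2{1.0, 2.0, 1.0};
std::vector<double> glob(N, 0.0);

double comb(double m, double f) { return m + f; }

// The three sums are computed once, before any store to glob, so the loop
// body no longer reads v1 or v2 and nothing has to be reloaded.
void assemble_vec_hoisted() {
    const double s0 = comb(v1[0], v2[0]);
    const double s1 = comb(v1[1], v2[1]);
    const double s2 = comb(v1[2], v2[2]);
    for (std::size_t i = 0; i < N - 2; ++i) {
        glob[i]     += s0;
        glob[i + 1] += s1;
        glob[i + 2] += s2;
    }
}
```

Copying the values into locals removes the aliasing question entirely, because a store to glob cannot modify a local whose address is never taken.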
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but it maybe used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem, because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
double *__restrict g = &glob[0]; // Helps MSVC, but not gcc/clang/ICC
// std::vector<double> &g = glob; // actually hurts ICC it seems?
// #define g glob // so use this as the alternative to __restrict
for (size_t i=0; i<N-2; ++i)
{
g[i] += comb(v1[0],v2[0]);
g[i+1] += comb(v1[1],v2[1]);
g[i+2] += comb(v1[2],v2[2]);
}
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
I think the point is that you use too small a storage size (six doubles). In the std::array case, this allows the compiler to completely eliminate the storage in RAM by placing the values in registers; the compiler can keep stack variables in registers if that is more optimal. This cuts memory accesses in half (only the writes to glob remain). In the case of std::vector, the compiler cannot perform such an optimization since dynamic memory is used. Try using significantly larger sizes for a1, a2, v1, and v2.

Optimisation of a write out of a loop

Moving a member variable to a local variable reduces the number of writes in this loop despite the presence of the __restrict keyword. This is using GCC -O3. Clang and MSVC optimise the writes in both cases. [Note that since this question was posted we observed that adding __restrict to the calling function caused GCC to also move the store out of the loop. See the godbolt link below and the comments]
class X
{
public:
void process(float * __restrict d, int size)
{
for (int i = 0; i < size; ++i)
{
d[i] = v * c + d[i];
v = d[i];
}
}
void processFaster(float * __restrict d, int size)
{
float lv = v;
for (int i = 0; i < size; ++i)
{
d[i] = lv * c + d[i];
lv = d[i];
}
v = lv;
}
float c{0.0f};
float v{0.0f};
};
With gcc -O3 the first one has an inner loop that looks like:
.L3:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rax, rdi
movss DWORD PTR x[rip+4], xmm0 ;<<< the extra store
jne .L3
.L1:
rep ret
The second here:
.L8:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rdi, rax
jne .L8
.L7:
movss DWORD PTR x[rip+4], xmm0
ret
See https://godbolt.org/g/a9nCP2 for the complete code.
Why does the compiler not perform the lv optimisation here?
I'm assuming the 3 memory accesses per loop are worse than the 2 (assuming size is not a small number), though I've not measured this yet.
Am I right to make that assumption?
I think the observable behaviour should be the same in both cases.
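For what it's worth, the equivalence can at least be spot-checked for a given input. This is a minimal sketch that repeats the class so it stands alone; same_result is a hypothetical helper, and a passing check on one input is of course not a proof for all inputs.

```cpp
#include <vector>

class X {
public:
    void process(float* __restrict d, int size) {
        for (int i = 0; i < size; ++i) {
            d[i] = v * c + d[i];
            v = d[i];  // member updated on every iteration
        }
    }
    void processFaster(float* __restrict d, int size) {
        float lv = v;  // local copy, written back once after the loop
        for (int i = 0; i < size; ++i) {
            d[i] = lv * c + d[i];
            lv = d[i];
        }
        v = lv;
    }
    float c{0.0f};
    float v{0.0f};
};

// Hypothetical helper: runs both variants on copies of the same input and
// reports whether the buffers and the final v agree.
bool same_result(std::vector<float> data, float c) {
    std::vector<float> copy = data;
    X a; a.c = c;
    X b; b.c = c;
    a.process(data.data(), static_cast<int>(data.size()));
    b.processFaster(copy.data(), static_cast<int>(copy.size()));
    return data == copy && a.v == b.v;
}
```

Calling same_result({1, 2, 3, 4, 5, 6, 7, 8}, 0.5f) compares the two paths on values that are exactly representable in float, so rounding does not blur the comparison.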
This seems to be caused by the missing __restrict qualifier on the f_original function. __restrict is a GCC extension; it is not quite clear how it is expected to behave in C++. Maybe it is a compiler bug (missed optimization) that it appears to disappear after inlining.
The two methods are not identical. In the first, the value of v is updated multiple times during the execution. That may be or may not be what you want, but it is not the same as the second method, so it is not something the compiler can decide for itself as a possible optimization.
The restrict keyword says there is no aliasing with anything else, in effect the same as if the value had been local (and no local had any references to it).
In the second case there is no externally visible effect on v, so it doesn't need to store it.
In the first case there is a potential that something external might see it. The compiler doesn't know at this time that there will be no threads that could change it, but it knows that it doesn't have to read it, as it's neither atomic nor volatile. And the change of d[], another externally visible variable, makes the store necessary.
If the compiler writers reasoned, well, neither d nor v is volatile or atomic, so we can just do it all using 'as-if', then the compiler would have to be sure that no one can touch v at all. I'm pretty sure this will come in one of the new versions, as there is no synchronisation before the return, and this will be the case in 99+% of all cases anyway. Programmers will then have to put either volatile or atomic on variables that are changed, which I think I could live with.