I want to ensure that gcc knows:
The pointers refer to non-overlapping chunks of memory
The pointers have 32 byte alignments
Is the following correct?
template<typename T, typename T2>
void f(const T* __restrict__ __attribute__((aligned(32))) x,
T2* __restrict__ __attribute__((aligned(32))) out) {}
Thanks.
Update:
I try to use one read and lots of writes to saturate the CPU ports for writing. I hope that will make the performance gain from aligned moves more significant.
But the assembly still uses unaligned moves instead of aligned moves.
Code (also at godbolt.org)
void square(const float* __restrict__ __attribute__((aligned(32))) x,
const int size,
float* __restrict__ __attribute__((aligned(32))) out0,
float* __restrict__ __attribute__((aligned(32))) out1,
float* __restrict__ __attribute__((aligned(32))) out2,
float* __restrict__ __attribute__((aligned(32))) out3,
float* __restrict__ __attribute__((aligned(32))) out4) {
for (int i = 0; i < size; ++i) {
out0[i] = x[i];
out1[i] = x[i] * x[i];
out2[i] = x[i] * x[i] * x[i];
out3[i] = x[i] * x[i] * x[i] * x[i];
out4[i] = x[i] * x[i] * x[i] * x[i] * x[i];
}
}
Assembly compiled with gcc 8.2 and "-march=haswell -O3"
It is full of vmovups, which are unaligned moves.
.L3:
vmovups ymm1, YMMWORD PTR [rbx+rax]
vmulps ymm0, ymm1, ymm1
vmovups YMMWORD PTR [r14+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r15+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r12+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [rbp+0+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L3
and r13d, -8
vzeroupper
Same behavior even for sandybridge:
.L3:
vmovups xmm2, XMMWORD PTR [rbx+rax]
vinsertf128 ymm1, ymm2, XMMWORD PTR [rbx+16+rax], 0x1
vmulps ymm0, ymm1, ymm1
vmovups XMMWORD PTR [r14+rax], xmm0
vextractf128 XMMWORD PTR [r14+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r13+0+rax], xmm0
vextractf128 XMMWORD PTR [r13+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r12+rax], xmm0
vextractf128 XMMWORD PTR [r12+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [rbp+0+rax], xmm0
vextractf128 XMMWORD PTR [rbp+16+rax], ymm0, 0x1
add rax, 32
cmp rax, rdx
jne .L3
and r15d, -8
vzeroupper
Using addition instead of multiplication (godbolt).
Still unaligned moves.
No. Writing float *__attribute__((aligned(32))) x means that the pointer itself is stored in aligned memory, not that it points to aligned memory (see footnote 1).
There is a way to do this, but it only helps for gcc, not clang or ICC.
See How to tell GCC that a pointer argument is always double-word-aligned? for __builtin_assume_aligned which works on all GNU C compatible compilers, and How can I apply __attribute__(( aligned(32))) to an int *? for more details about __attribute__((aligned(32))), which does work for GCC.
I used __restrict instead of __restrict__ because that C++ extension name for C99 restrict is portable to all the mainstream x86 C++ compilers, including MSVC.
typedef float aligned32_float __attribute__((aligned(32)));
void prod(const aligned32_float * __restrict x,
const aligned32_float * __restrict y,
int size,
aligned32_float* __restrict out0)
{
size &= -16ULL;
#if 0 // this works for clang, ICC, and GCC
x = (const float*)__builtin_assume_aligned(x, 32); // have to cast the result in C++
y = (const float*)__builtin_assume_aligned(y, 32);
out0 = (float*)__builtin_assume_aligned(out0, 32);
#endif
for (int i = 0; i < size; ++i) {
out0[i] = x[i] * y[i]; // auto-vectorized with a memory operand for mulps
// note clang using two separate movups loads
// instead of a memory operand for mulps
}
}
(gcc, clang, and ICC output on the Godbolt compiler explorer).
GCC and clang use movaps / vmovaps instead of movups / vmovups whenever they have a compile-time alignment guarantee. (MSVC and ICC, by contrast, never use movaps for loads/stores, which is a missed optimization for anything that runs on Core2 / K10 or older.) And as you noticed, gcc is applying the -mavx256-split-unaligned-load/store effects for tunings other than Haswell (see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?), which is another clue that your syntax didn't work.
vmovups is not a performance problem when used on aligned memory; it performs identically to vmovaps on all AVX-supporting CPUs when the address is aligned at runtime. So in practice there's no real problem with your -march=haswell output. Only older CPUs, before Nehalem and Bulldozer, always decoded movups to multiple uops.
The real benefit (these days) to telling the compiler about alignment guarantees is that compilers sometimes emit extra code for startup/cleanup loops to reach an alignment boundary. Or without AVX, compilers can't fold a load into a memory operand for mulps unless it's aligned.
A good test case for this is out0[i] = x[i] * y[i], where the load result is only needed once. Or out0[i] *= x[i]. Knowing alignment enables movaps/mulps xmm0, [rsi], otherwise it's 2x movups + mulps. You can check for this optimization even on compilers like ICC or MSVC, which use movups even when they do know they have an alignment guarantee, but they will still make alignment-required code when they can fold a load into an ALU operation.
It seems __builtin_assume_aligned is the only really portable (to GNU C compilers) way to do this. You can do hacks like passing pointers to struct aligned_floats { alignas(32) float f[8]; };, but that's cumbersome to use, and unless you actually access memory through objects of that type, it doesn't get compilers to assume alignment (e.g. casting a pointer to that back to float *).
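As a minimal sketch of the __builtin_assume_aligned approach in a standalone function: the name scale_by_two and the kernel itself are made up for illustration, and the code assumes a GNU C compatible compiler. The asserts only check the promise in debug builds; passing a misaligned pointer with the asserts disabled would be undefined behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative kernel (hypothetical name): writes 2 * x[i] into out[i].
// The caller promises that both pointers are 32-byte aligned and do not overlap.
void scale_by_two(const float* __restrict x, float* __restrict out, int size) {
    // Debug-build check that the caller kept the alignment promise.
    assert(reinterpret_cast<std::uintptr_t>(x) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 32 == 0);

    // The builtin returns void*, so C++ needs a cast back to the element type.
    x = static_cast<const float*>(__builtin_assume_aligned(x, 32));
    out = static_cast<float*>(__builtin_assume_aligned(out, 32));

    for (int i = 0; i < size; ++i) {
        out[i] = x[i] * 2.0f;
    }
}
```

A caller would typically get suitable buffers from alignas(32) arrays or an aligned allocator.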
"I try to use one read and lots of writes to saturate the CPU ports for writing."
Using more than 4 output streams can hurt by resulting in more conflict misses in the cache. Skylake's L2 cache is only 4-way, for example. But L1d is 8-way so you're probably ok for small buffers.
If you want to saturate the store port uop throughput, use narrower stores (e.g. scalar), not wide SIMD stores that need more bandwidth per uop. Back-to-back stores to the same cache line may be able to merge in the store buffer before committing to L1d, so it depends what you want to test.
Semi-related: a 2x load + 1x store memory access pattern like c[i] = a[i]+b[i] or STREAM triad will come closest to maxing out total L1d cache load+store bandwidth on Intel Sandybridge-family CPUs. On SnB/IvB, 256-bit vectors take 2 cycles per load/store, leaving time for store-address uops to use the AGUs on ports 2 or 3 during the 2nd cycle of a load. On Haswell and later (256-bit wide load/store ports), the stores need to use a non-indexed addressing mode so they can use the simple-addressing-mode store AGU on port 7.
But AMD CPUs can do up-to-2 memory ops per clock, with at most one being a store, so they'd max out with a copy-and-operate stores = loads pattern.
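For reference, the access patterns mentioned above look like this as plain loops. This is only a sketch: the function names are made up, and the triad form a[i] = b[i] + s * c[i] follows the usual STREAM definition rather than anything measured here.

```cpp
#include <cassert>
#include <cstddef>

// 2 loads + 1 store per element: c[i] = a[i] + b[i]
void add_arrays(const double* __restrict a, const double* __restrict b,
                double* __restrict c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// STREAM-style triad, also 2 loads + 1 store per element.
void triad(double* __restrict a, const double* __restrict b,
           const double* __restrict c, double s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + s * c[i];
}

// Copy-and-operate: 1 load + 1 store per element (stores = loads).
void scale_copy(const double* __restrict in, double* __restrict out,
                double s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = s * in[i];
}
```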
BTW, Intel recently announced Sunny Cove (the core microarchitecture used in Ice Lake), which will have 2x load + 2x store throughput per clock, a 2nd vector shuffle ALU, and 5-wide issue/rename. So that's fun! Compilers will need to unroll loops by at least 2 to not bottleneck on 1-per-clock loop branches.
Footnote 1: That's why (if you compile without AVX), you get a warning, and gcc omits an and rsp,-32 because it assumes RSP is already aligned. (It doesn't actually spill any YMM regs, so it should have optimized this out anyway, but gcc has had this missed-optimization bug for a while with locals or auto-vectorization-created objects with extra alignment.)
<source>:4:6: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
Related
The below code (needs google benchmark) fills up two vectors and adds them up, storing the result in the first one. For the vector types I've used Eigen::VectorXd and std::vector for performance comparison:
#include <Eigen/Core>
#include <benchmark/benchmark.h>
#include <vector>
auto constexpr N = 1024u;
template <typename TVector>
TVector generate(unsigned min) {
TVector v(N);
for (unsigned i = 0; i < N; ++i)
v[i] = static_cast<double>(min + i);
return v;
}
auto ev1 = generate<Eigen::VectorXd>(0);
auto ev2 = generate<Eigen::VectorXd>(N);
auto sv1 = generate<std::vector<double>>(0);
auto sv2 = generate<std::vector<double>>(N);
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
v1 += v2;
}
void add_vectors(std::vector<double>& v1, std::vector<double> const& v2) {
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
static void eigen(benchmark::State& state) {
for (auto _ : state) {
add_vectors(ev1, ev2);
benchmark::DoNotOptimize(ev1);
}
}
static void standard(benchmark::State& state) {
for (auto _ : state) {
add_vectors(sv1, sv2);
benchmark::DoNotOptimize(sv1);
}
}
BENCHMARK(standard);
BENCHMARK(eigen);
I'm running it on an Intel Xeon E-2286M @ 2.40GHz, using Eigen 3.3.9 and MSVC 16.11.2 with (among others) these relevant compiler switches: /GL, /Gy, /O2, /D "NDEBUG", /Oi, and /arch:AVX. A typical output looks like this:
Run on (16 X 2400 MHz CPU s)
CPU Caches:
L1 Data 32K (x8)
L1 Instruction 32K (x8)
L2 Unified 262K (x8)
L3 Unified 16777K (x1)
--------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------
standard 99 ns 100 ns 7466667
eigen 169 ns 169 ns 4072727
which seems to show that operating on std::vector is ~69% faster than on Eigen::VectorXd. In the disassembly, the tight loops look like these:
// For Eigen::VectorXd
00007FF672221A11 vmovupd ymm0,ymmword ptr [rcx+rax*8]
00007FF672221A16 vaddpd ymm1,ymm0,ymmword ptr [r8+rax*8]
00007FF672221A1C vmovupd ymmword ptr [r8+rax*8],ymm1
00007FF672221A22 add rax,4
00007FF672221A26 cmp rax,rdx
00007FF672221A29 jge eigen+0C7h (07FF672221A37h)
00007FF672221A2B mov rcx,qword ptr [rsp+48h]
00007FF672221A30 mov r8,qword ptr [rsp+58h]
00007FF672221A35 jmp eigen+0A1h (07FF672221A11h)
// For std::vector
00007FF672221B40 vmovups ymm1,ymmword ptr [rax+rdx-20h]
00007FF672221B46 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx-20h]
00007FF672221B4C vmovups ymmword ptr [rax+rcx-20h],ymm1
00007FF672221B52 vmovups ymm1,ymmword ptr [rax+rdx]
00007FF672221B57 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx]
00007FF672221B5C vmovups ymmword ptr [rax+rcx],ymm1
00007FF672221B61 lea rax,[rax+40h]
00007FF672221B65 sub r8,1
00007FF672221B69 jne standard+0C0h (07FF672221B40h)
One can notice that both use vaddpd to add 4 doubles at a time. However, for std::vector the compiler unrolled the loop to perform 2 vaddpd per iteration, but it didn't do the same for Eigen::VectorXd. Another potentially important difference is that the loop for std::vector is aligned to 32 bytes (its address ends in 0x40 = 64 = 2*32).
FWIW: I've added /Qvec-report:2 and the compiler reports:
[...]\Core\AssignEvaluator.h(415) : info C5002: loop not vectorized due to reason '1305'
and reason 1305 means "Not enough type information."
My educated guess is that Eigen's effort to use intrinsics (here _mm256_add_pd) is counterproductive and confuses the compiler. Just letting the compiler do its business (auto-vectorisation) seems to be a better idea. Am I missing something, or could this be considered an Eigen bug (missed optimisation opportunity)?
TL;DR: The problem mainly comes from the constant loop bound and not directly from Eigen. Indeed, in the first case, Eigen stores the size of the vectors in vector attributes, while in the second case you explicitly use the constant N.
Clever compilers can use this information to unroll loops more aggressively because they know that N is quite big. Unrolling a loop with a small N is a bad idea since the code will be bigger and has to be read by the processor. If the code is not already in the L1 cache, it must be loaded from the other caches, the RAM or even the storage device in the worst case. The added latency is often bigger than the cost of executing a sequential loop with a small unroll factor. This is why compilers do not always unroll loops (at least not with a big unroll factor).
Inlining also plays an important role in this code. Indeed, if the functions are inlined, the compiler can propagate constants and know the size of the vector, enabling it to further optimize the code by unrolling the loop more aggressively. However, if the functions are not inlined, then there is no way the compiler can know the loop bounds. Clever compilers can still generate conditional code paths to optimize both small loops and big ones, but this makes the program bigger and introduces a small overhead. Compilers like ICC and Clang do generate such alternative code paths when the code can be vectorized but the loop bounds are unknown, or when aliasing is not known at compile time (the number of generated variants, and therefore the code size, can quickly become huge).
Note that inlining functions may not be enough, since constant propagation can be blocked by complex conditionals dealing with runtime-defined variables or by non-inlined function calls. Alternatively, the quality of the constant propagation may not be sufficient for the target example.
Finally, aliasing also plays a critical role in the ability of compilers to generate SIMD instructions (and possibly to better unroll the loop) in this code. Indeed, aliasing often prevents the use of SIMD instructions, and it is not always easy for compilers to check aliasing and generate fast implementations accordingly.
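To make the difference in loop bounds concrete, here is a small sketch of the three shapes the loop can take with std::vector. The function names are made up for illustration, and whether a given compiler vectorizes or unrolls each variant is not something this sketch verifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1024;

// Bound is a compile-time constant: the trip count is known to the compiler.
void add_fixed_bound(std::vector<double>& v1, const std::vector<double>& v2) {
    for (std::size_t i = 0; i < N; ++i) v1[i] += v2[i];
}

// Bound is read from the vector object: unless the compiler can prove that the
// stores do not change it, the size may be reloaded on every iteration.
void add_runtime_bound(std::vector<double>& v1, const std::vector<double>& v2) {
    for (std::size_t i = 0; i < v1.size(); ++i) v1[i] += v2[i];
}

// Size and data pointers copied into locals before the loop, so the stores
// cannot affect the bound. Assumes v2 has at least v1.size() elements.
void add_hoisted_bound(std::vector<double>& v1, const std::vector<double>& v2) {
    const std::size_t n = v1.size();
    double* dst = v1.data();
    const double* src = v2.data();
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
}
```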
Testing the hypotheses
If the vector-based implementation uses a loop bound stored in the vector objects, then the code generated by MSVC is not vectorized in the benchmark: the constant is not propagated correctly despite the inlining of the function. The resulting code should be much slower. Here is the generated code:
$LL24@standard:
vmovsd xmm0, QWORD PTR [r9+rcx*8]
vaddsd xmm1, xmm0, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r8+rcx*8], xmm1
mov rax, QWORD PTR std::vector<double,std::allocator<double> > sv1+8
inc edx
sub rax, QWORD PTR std::vector<double,std::allocator<double> > sv1
sar rax, 3
mov ecx, edx
cmp rcx, rax
jb SHORT $LL24@standard
If the Eigen-based implementation uses a constant loop bound, then the code generated by MSVC is vectorized and unrolled in the benchmark: the compile-time constant helps the compiler generate a loop unrolled 2 times. It does that by mixing SSE-width (128-bit) and AVX-width (256-bit) instructions, which is very surprising (this point is discussed below). The resulting code should be significantly faster than the original Eigen implementation. However, it may not be as fast as the initial vector implementation due to the unexpected use of the 128-bit instructions. Here is the generated code:
$LL24@eigen:
vmovupd xmm1, XMMWORD PTR [rdx+rcx-16]
vaddpd xmm1, xmm1, XMMWORD PTR [rcx-16]
vmovupd xmm2, XMMWORD PTR [rcx+rdx]
vmovupd XMMWORD PTR [rcx-16], xmm1
vaddpd xmm1, xmm2, XMMWORD PTR [rcx]
vmovupd XMMWORD PTR [rcx], xmm1
vmovups ymm1, YMMWORD PTR [rdx+rcx+16]
vaddpd ymm1, ymm1, YMMWORD PTR [rcx+16]
vmovups YMMWORD PTR [rcx+16], ymm1
lea rcx, QWORD PTR [rcx+64]
sub rax, 1
jne SHORT $LL24@eigen
Additional notes
It is worth noting that the generated code for the non-inlined version is very inefficient scalar code (typically due to N being unknown and pointer aliasing being assumed possible).
Mixing SSE and AVX instructions in such a loop is clearly a sub-optimal strategy in your case and likely a compiler issue/bug. Indeed, the execution speed of the resulting code is certainly bounded by the store instructions on Intel processors like yours. Your processor can execute 1 store instruction per cycle and 2 load instructions per cycle, and can compute 2 vectorized additions per cycle. It can execute up to 6 micro-instructions per cycle (coming from 5 independent instructions and possibly 4 additional cached instructions). As a result, the generated code mixing SSE and AVX will take at least 3 cycles per iteration. Meanwhile, the original vector-based implementation could execute 4 loads, 2 stores, 2 additions and 3 other instructions like lea/sub/branch in only 2 cycles (possibly 3 in practice due to complex hardware details such as the actual micro-instruction port scheduling and the micro-instruction cache). However, note that the compiler arguments do not tell the compiler to optimize the code for your specific processor architecture (i.e. Intel Coffee Lake). Still, I highly doubt mixing SSE and AVX code would result in any significant performance boost on AMD processors either (or on any mainstream x86 processor). Alternatively, it might be because MSVC fails to fully detect that there is no aliasing in this case.
To get rid of most of the aliasing problems preventing code vectorization and loop unrolling, OpenMP SIMD directives (e.g. #pragma omp simd) can be used. MSVC supports this experimentally using the flag /openmp:experimental. Here is the resulting code:
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
#pragma omp simd
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
MSVC surprisingly generates assembly code with only SSE instructions, but if you enable AVX2, it generates relatively good code:
$LL26@eigen:
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
lea rdx, QWORD PTR [rdx+128]
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-192]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-192]
vmovupd YMMWORD PTR [rdx+rcx-192], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-160]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-160]
vmovupd YMMWORD PTR [rdx+rcx-160], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-128]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-128]
vmovupd YMMWORD PTR [rdx+rcx-128], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-96]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-96]
vmovupd YMMWORD PTR [rdx+rcx-96], ymm0
sub r8, 1
jne $LL26@eigen
This code is still not perfect due to the unexpected, useless mov instructions.
Alternatively, it may be possible to use fixed-size Eigen vectors for better performance.
Finally, note that other compilers (like Clang, ICC and GCC) behave very differently on this benchmark.
Good evening.
I know C-style arrays or std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
return m + f;
}
And the benchmark functions:
void assemble_vec()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(v1[0],v2[0]);
glob[i+1] += comb(v1[1],v2[1]);
glob[i+2] += comb(v1[2],v2[2]);
}
}
void assemble_arr()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(a1[0],a2[0]);
glob[i+1] += comb(a1[1],a2[1]);
glob[i+2] += comb(a1[2],a2[2]);
}
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively. Both loops use the same scalar SSE2 addsd instruction, but the vector version reloads the elements of v1 and v2 from memory and adds them again on every iteration (six addsd per iteration, four of them with memory operands), whereas the array version adds precomputed constants held in registers (three addsd per iteration). Together, the extra additions and memory accesses make the std::vector version slower than the std::array version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed to not be modified after it's constructed, but the memory is still dynamically allocated and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In @Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..2] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but it maybe used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem, because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt:
//__attribute__((noinline))
void assemble_vec()
{
double *__restrict g = &glob[0]; // Helps MSVC, but not gcc/clang/ICC
// std::vector<double> &g = glob; // actually hurts ICC it seems?
// #define g glob // so use this as the alternative to __restrict
for (size_t i=0; i<N-2; ++i)
{
g[i] += comb(v1[0],v2[0]);
g[i+1] += comb(v1[1],v2[1]);
g[i+2] += comb(v1[2],v2[2]);
}
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
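Until compilers do this on their own, the hoisting can be done by hand. The following is a minimal sketch, with a made-up function name and with the vectors passed as parameters instead of being globals; it is only equivalent to the original loop if glob really does not overlap v1 or v2, which is exactly what the compiler cannot prove.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

double comb(double m, double f) { return m + f; }

// Hypothetical variant of assemble_vec with the three comb() results computed
// once, before any store to glob.
void assemble_vec_hoisted(std::vector<double>& glob,
                          const std::vector<double>& v1,
                          const std::vector<double>& v2) {
    const double c0 = comb(v1[0], v2[0]);
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);

    const std::size_t n = glob.size();
    // i + 2 < n is the same range as i < N - 2, but safe for n < 2.
    for (std::size_t i = 0; i + 2 < n; ++i) {
        glob[i] += c0;
        glob[i + 1] += c1;
        glob[i + 2] += c2;
    }
}
```

Since the locals c0, c1 and c2 never have their address taken, stores into glob cannot modify them, so nothing has to be reloaded inside the loop.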
I think the point is that the storage is very small (six doubles). In the std::array case this allows the compiler to eliminate the memory accesses entirely by keeping the values in registers; the compiler can keep stack variables in registers when that is more efficient. This cuts the memory accesses roughly in half (only the writes to glob remain). In the case of a std::vector, the compiler cannot perform such an optimization since dynamic memory is used. Try using significantly larger sizes for a1, a2, v1 and v2.
Here's some code which GCC 6 and 7 fail to optimize when using std::array:
#include <array>
static constexpr size_t my_elements = 8;
class Foo
{
public:
#ifdef C_ARRAY
typedef double Vec[my_elements] alignas(32);
#else
typedef std::array<double, my_elements> Vec alignas(32);
#endif
void fun1(const Vec&);
Vec v1{{}};
};
void Foo::fun1(const Vec& __restrict__ v2)
{
for (unsigned i = 0; i < my_elements; ++i)
{
v1[i] += v2[i];
}
}
Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:
vmovapd ymm0, YMMWORD PTR [rdi]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi]
vmovapd YMMWORD PTR [rdi], ymm0
vmovapd ymm0, YMMWORD PTR [rdi+32]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi+32]
vmovapd YMMWORD PTR [rdi+32], ymm0
vzeroupper
That's basically two unrolled iterations of adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:
mov rax, rdi
shr rax, 3
neg rax
and eax, 3
je .L7
The code generated in this case (using std::array instead of a plain C array) seems to check for alignment of the input array--even though it is specified in the typedef as aligned to 32 bytes.
It seems that GCC doesn't understand that the contents of an std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.
Is there something simple I'm missing that would fix this? So far I came up with an ugly hack:
void Foo::fun2(const Vec& __restrict__ v2)
{
typedef double V2 alignas(Foo::Vec);
const V2* v2a = static_cast<const V2*>(&v2[0]);
for (unsigned i = 0; i < my_elements; ++i)
{
v1[i] += v2a[i];
}
}
Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.
You can see it live here: https://godbolt.org/g/IXIOst
Interestingly, if you replace v1[i] += v2a[i]; with v1._M_elems[i] += v2._M_elems[i]; (which is obviously not portable), gcc manages to optimize the std::array case as well as the case of the C array.
Possible interpretation: in the gcc dumps (-fdump-tree-all-all), one can see MEM[(struct FooD.25826 *)this_7(D) clique 1 base 0].v1D.25832[i_15] in the C array case, and MEM[(const value_typeD.25834 &)v2_7(D) clique 1 base 1][_1] for std::array. That is, in the second case, gcc may have forgotten that this is part of type Foo and only remembers that it is accessing a double.
This is an abstraction penalty that comes from all the inline functions one has to go through to finally see the array access. Clang still manages to vectorize nicely (even after removing alignas!). This likely means that clang vectorizes without caring about alignment, and indeed it uses instructions like vmovupd that do not require an aligned address.
The hack you found, casting to Vec, is another way to let the compiler see, when it handles the memory access, that the type being handled is aligned. For a regular std::array::operator[], the memory access happens inside a member function of std::array, which doesn't have any clue that *this happens to be aligned.
Gcc also has a builtin to let the compiler know about alignment:
const double*v2a=static_cast<const double*>(__builtin_assume_aligned(v2.data(),32));
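Put into a complete function, that builtin could be used roughly as follows. This is a sketch under stated assumptions: the names kElements and add_assuming_aligned are made up, a GNU C compatible compiler is assumed, and the caller is responsible for both arrays really starting on a 32-byte boundary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kElements = 8;
using Vec = std::array<double, kElements>;

// The caller promises that dst and src both start on a 32-byte boundary.
void add_assuming_aligned(Vec& dst, const Vec& src) {
    // Take raw pointers once and attach the alignment promise to them, so the
    // loop below no longer goes through std::array::operator[].
    double* d = static_cast<double*>(__builtin_assume_aligned(dst.data(), 32));
    const double* s =
        static_cast<const double*>(__builtin_assume_aligned(src.data(), 32));

    for (std::size_t i = 0; i < kElements; ++i) {
        d[i] += s[i];
    }
}
```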
Over the years, I have a few times seen intrinsic functions with float parameters that get transformed to __m128 with the following code: __m128 b = _mm_move_ss(m, _mm_set_ss(a));.
For instance:
void MyFunction(float y)
{
__m128 a = _mm_move_ss(m, _mm_set_ss(y)); //m is __m128
//do whatever it is with 'a'
}
I wonder if there is a similar way of using _mm_move and _mm_set intrinsics to do the same for doubles (__m128d)?
Almost every _ss and _ps intrinsic / instruction has a double version with a _sd or _pd suffix. (Scalar Double or Packed Double).
For example, search for "(double" in Intel's intrinsic finder to find intrinsic functions that take a double as the first arg. Or just figure out what the optimal asm would be, then look up the intrinsics for those instructions in the insn ref manual. Except that it doesn't list all the intrinsics for movsd, so searching for an instruction name in the intrinsics finder often works.
re: header files: always just include <immintrin.h>. It includes all Intel SSE/AVX intrinsics.
See also ways to put a float into a vector, and the sse tag wiki for links about how to shuffle vectors. (i.e. the tables of shuffle instructions in Agner Fog's optimizing assembly guide)
(see below for a godbolt link to some interesting compiler output)
re: your sequence
Only use _mm_move_ss (or sd) if you actually want to merge two vectors.
You don't show how m is defined. Your use of a as the variable name for both the float and the vector implies that the only useful information in the vector is the float arg. The variable-name clash of course means it doesn't compile.
There unfortunately doesn't seem to be any way to just "cast" a float or double into a vector with garbage in the upper 3 elements, like there is for __m128 -> __m256:
__m256 _mm256_castps128_ps256 (__m128 a). I posted a new question about this limitation with intrinsics: How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
I tried using _mm_undefined_ps() (or _pd) to achieve this, hoping it would clue in the compiler that it can just leave the incoming high garbage in place:
// don't use this, it doesn't make better code
__m128d double_to_vec_highgarbage(double x) {
__m128d undef = _mm_undefined_pd();
__m128d x_zeroupper = _mm_set_sd(x);
return _mm_move_sd(undef, x_zeroupper);
}
but clang3.8 compiles it to
# clang3.8 -O3 -march=core2
movq xmm0, xmm0 # xmm0 = xmm0[0],zero
ret
So no advantage, still zeroing the upper half instead of compiling it to just a ret. gcc actually makes pretty bad code:
double_to_vec_highgarbage: # gcc5.3 -march=nehalem
movsd QWORD PTR [rsp-16], xmm0 # %sfp, x
movsd xmm1, QWORD PTR [rsp-16] # D.26885, %sfp
pxor xmm0, xmm0 # __Y
movsd xmm0, xmm1 # tmp93, D.26885
ret
_mm_set_sd appears to be the best way to turn a scalar into a vector.
__m128d double_to_vec(double x) {
return _mm_set_sd(x);
}
clang compiles it to a movq xmm0,xmm0, gcc to a store/reload with -march=generic.
Other interesting compiler outputs from the float and double versions on the Godbolt compiler explorer
float_to_vec: # gcc 5.3 -O3 -march=core2
movd eax, xmm0 # x, x
movd xmm0, eax # D.26867, x
ret
float_to_vec: # gcc5.3 -O3 -march=nehalem
insertps xmm0, xmm0, 0xe # D.26867, x
ret
double_to_vec: # gcc5.3 -O3 -march=nehalem. It could still have used movq or insertps, instead of this longer-latency store-forwarding round trip
movsd QWORD PTR [rsp-16], xmm0 # %sfp, x
movsd xmm0, QWORD PTR [rsp-16] # D.26881, %sfp
ret
float_to_vec: # clang3.8 -O3 -march=core2 or generic (no -march)
xorps xmm1, xmm1
movss xmm1, xmm0 # xmm1 = xmm0[0],xmm1[1,2,3]
movaps xmm0, xmm1
ret
double_to_vec: # clang3.8 -O3 -march=core2, nehalem, or generic (no -march)
movq xmm0, xmm0 # xmm0 = xmm0[0],zero
ret
float_to_vec: # clang3.8 -O3 -march=nehalem
xorps xmm1, xmm1
blendps xmm0, xmm1, 14 # xmm0 = xmm0[0],xmm1[1,2,3]
ret
So both clang and gcc use different strategies for float vs. double, even when they could use the same strategy.
Using integer operations like movq between floating-point operations causes extra bypass delay latency. Using insertps to zero the upper elements of the input register should be the best strategy for float or double, so all compilers should use that when SSE4.1 is available. xorps + blend is good, too, and can run on more ports than insertps. The store/reload is probably the worst, unless we're bottlenecked on ALU throughput, and latency doesn't matter.
_mm_move_sd, _mm_set_sd. They're SSE2 intrinsics (and not SSE), so you'll need #include <emmintrin.h>.
I have always thought that a multidimensional array indexed once via a multiplication is faster than an array of arrays indexed via two pointer dereferences, due to better locality and space savings.
I ran a small test a while ago, and the result was quite surprising. At least according to my callgrind profiler, the same function ran slightly faster using an array of arrays.
I wonder whether I should change the definition of my matrix class to use an array of arrays internally. This class is used virtually everywhere in my simulation engine (not exactly sure what to call it), and I do want to find the best way to save a few seconds.
test_matrix has the cost of 350 200 020 and test_array_array has the cost of 325 200 016. The code was compiled with -O3 by clang++. All member functions are inlined according to the profiler.
#include <iostream>
#include <memory>
template<class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
BasicArray() = default;
BasicArray(std::size_t);
};
template<class T>
BasicArray<T>::BasicArray(std::size_t size)
: std::unique_ptr<T[]>(new T[size]) {}
template<class T>
class Matrix : public BasicArray<T> {
public:
Matrix() = default;
Matrix(std::size_t, std::size_t);
T &operator()(std::size_t, std::size_t) const;
std::size_t get_index(std::size_t, std::size_t) const;
std::size_t get_size(std::size_t) const;
private:
std::size_t sizes[2];
};
template<class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
: BasicArray<T>(i * j)
, sizes {i, j} {}
template<class T>
T &Matrix<T>::operator()(std::size_t i, std::size_t j) const {
return (*this)[get_index(i, j)];
}
template<class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const {
return i * get_size(2) + j;
}
template<class T>
std::size_t Matrix<T>::get_size(std::size_t d) const {
return sizes[d - 1];
}
template<class T>
class Array : public BasicArray<T> {
public:
Array() = default;
Array(std::size_t);
std::size_t get_size() const;
private:
std::size_t size;
};
template<class T>
Array<T>::Array(std::size_t size)
: BasicArray<T>(size)
, size(size) {}
template<class T>
std::size_t Array<T>::get_size() const {
return size;
}
static void __attribute__((noinline)) test_matrix(const Matrix<int> &m) {
for (std::size_t i = 0; i < m.get_size(1); ++i) {
for (std::size_t j = 0; j < m.get_size(2); ++j) {
static_cast<volatile void>(m(i, j) = i + j);
}
}
}
static void __attribute__((noinline))
test_array_array(const Array<Array<int>> &aa) {
for (std::size_t i = 0; i < aa.get_size(); ++i) {
for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
static_cast<volatile void>(aa[i][j] = i + j);
}
}
}
int main() {
constexpr int N = 1000;
Matrix<int> m(N, N);
Array<Array<int>> aa(N);
for (std::size_t i = 0; i < aa.get_size(); ++i) {
aa[i] = Array<int>(N);
}
test_matrix(m);
test_array_array(aa);
}
The performance of the two approaches is nearly the same because the inner-most loop can be optimized the same way in both cases and the computation is likely memory-bound. This means the overhead of the indirection is diluted in the rest of the computation, which takes most of the time and is subject to variations that can actually be bigger than the overhead. Thus the benchmark is not very sensitive to the difference between the two methods. Here is the assembly code of the inner-most loop (left side: matrix, right side: array of arrays):
.LBB0_17: .LBB1_30:
movdqa xmm5, xmm1 movdqa xmm5, xmm1
paddq xmm5, xmm4 paddq xmm5, xmm4
movdqa xmm6, xmm0 movdqa xmm6, xmm0
paddq xmm6, xmm4 paddq xmm6, xmm4
shufps xmm5, xmm6, 136 shufps xmm5, xmm6, 136
movdqa xmm6, xmm3 movdqa xmm6, xmm3
paddq xmm6, xmm1 paddq xmm6, xmm1
movdqa xmm7, xmm3 movdqa xmm7, xmm3
paddq xmm7, xmm0 paddq xmm7, xmm0
shufps xmm6, xmm7, 136 shufps xmm6, xmm7, 136
movups xmmword ptr [rdi + 4*rbx - 48], xmm5 movups xmmword ptr [rsi + 4*rcx], xmm5
movups xmmword ptr [rdi + 4*rbx - 32], xmm6 movups xmmword ptr [rsi + 4*rcx + 16], xmm6
movdqa xmm5, xmm0 movdqa xmm5, xmm0
paddq xmm5, xmm10 paddq xmm5, xmm10
movdqa xmm6, xmm1 movdqa xmm6, xmm1
paddq xmm6, xmm10 paddq xmm6, xmm10
movdqa xmm7, xmm3 movdqa xmm7, xmm3
paddq xmm7, xmm6 paddq xmm7, xmm6
paddq xmm6, xmm4 paddq xmm6, xmm4
movdqa xmm2, xmm3 movdqa xmm2, xmm3
paddq xmm2, xmm5 paddq xmm2, xmm5
paddq xmm5, xmm4 paddq xmm5, xmm4
shufps xmm6, xmm5, 136 shufps xmm6, xmm5, 136
shufps xmm7, xmm2, 136 shufps xmm7, xmm2, 136
movups xmmword ptr [rdi + 4*rbx - 16], xmm6 movups xmmword ptr [rsi + 4*rcx + 32], xmm6
movups xmmword ptr [rdi + 4*rbx], xmm7 movups xmmword ptr [rsi + 4*rcx + 48], xmm7
add rbx, 16 add rcx, 16
paddq xmm1, xmm11 paddq xmm1, xmm11
paddq xmm0, xmm11 paddq xmm0, xmm11
add rbp, 2 add rax, 2
jne .LBB0_17 jne .LBB1_30
As we can see, the loop basically contains the same instructions for the two methods. The order of the stores (movups) is not the same but this should not impact the execution time (especially if the array is aligned in memory). The same thing applies for the different register names. The loop is vectorized using SIMD instructions (SSE) and unrolled 4 times so it can be pretty fast (4 items can be computed per SIMD unit and 16 items per iteration). About 62 iterations are needed for the inner-most loop to complete.
That being said, in both cases, the loops write 4*1000*1000 bytes = 3.81 MiB of data. This typically fits in the L3 cache on relatively recent processors (on older processors it spills to RAM). The L3/RAM throughput available to a single core is limited (far lower than that of the L1 or even the L2 cache), so the core will likely stall waiting for the memory hierarchy to be ready. As a result, the loops are not so fast since they spend most of their time waiting for the memory hierarchy. Hardware prefetchers are pretty efficient on modern x86-64 processors, so they can prefetch data before a core actually requests it, especially for stores and when the written data is contiguous.
The array-of-arrays method is generally less efficient because the sub-arrays are not guaranteed to be allocated contiguously. Modern memory allocators typically use a bucket-based strategy to find memory blocks fitting the requested size. In a program like this benchmark, the requested memory can be contiguous (or very close to it), since all the arrays are allocated in a row and the bucket memory is generally not fragmented when a program starts. However, when the memory is fragmented, the arrays tend to be located in non-contiguous regions, causing an effect called memory diffusion. Memory diffusion makes it harder for prefetchers to be efficient, which makes loads and stores less efficient. This is especially true for loads, but stores also cause loads here on most x86-64 processors (Intel processors or recent AMD ones) due to the write-allocate cache policy. In short, this is one main reason why the array-of-arrays method is generally less efficient in applications. But it is not the only one: the other comes from the indirections.
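Before turning to the indirections, here is a rough way to check how contiguous back-to-back row allocations actually are on a given platform. This is only a sketch: the helper name row_gaps is made up, and the gaps it reports depend entirely on the allocator and on the state of the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical helper: allocate `rows` arrays of `cols` ints one after another
// and return the byte distance between the starts of consecutive rows.
std::vector<std::ptrdiff_t> row_gaps(std::size_t rows, std::size_t cols) {
    assert(cols > 0);
    std::vector<std::unique_ptr<int[]>> data;
    data.reserve(rows);
    for (std::size_t i = 0; i < rows; ++i)
        data.push_back(std::make_unique<int[]>(cols));

    std::vector<std::ptrdiff_t> gaps;
    for (std::size_t i = 1; i < rows; ++i) {
        // Compare addresses as integers; subtracting unrelated pointers directly is not allowed.
        auto prev = reinterpret_cast<std::uintptr_t>(data[i - 1].get());
        auto cur = reinterpret_cast<std::uintptr_t>(data[i].get());
        gaps.push_back(static_cast<std::ptrdiff_t>(cur) - static_cast<std::ptrdiff_t>(prev));
    }
    return gaps;
}
```

If every gap equals cols * sizeof(int) plus a small constant, the rows were placed back to back, which is the favourable case described above. Large or irregular gaps suggest a fragmented heap.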
The overhead of the additional indirections is pretty small in this benchmark, mainly because the inner loop is memory-bound. The pointers to the sub-arrays are stored contiguously, so they can fit in the L1 cache and be efficiently prefetched. This means the indirections can be fast because they are unlikely to cause a cache miss. The indirections cause additional load instructions, but since most of the time is spent waiting for the L3 cache or the RAM, the overhead of such instructions is very small, if not negligible. Indeed, modern processors execute instructions in parallel and out of order, so the L1 accesses can be overlapped with L3/RAM loads/stores. For example, Intel processors have dedicated units for that: the Line Fill Buffers (between the L1 and L2 caches), the Super-Queue (between the L2 and L3 caches) and the Integrated Memory Controller (between the L3 and the RAM). Most operations are done more or less asynchronously. That being said, things start to be synchronous when cores stall waiting on incoming data or when buffers/queues are saturated.
This can happen with a smaller inner-most loop or if the 2D array is traversed non-contiguously. Indeed, if the inner-most loop only computes a few items, or if it is even replaced with 1 statement, then the overhead of the indirections is much more visible. The processor cannot (easily) overlap the overhead, and the array-of-arrays method becomes slower than the matrix-based approach. Here is the result of this new benchmark. The gap between the two methods seems small, but one should keep in mind that the cache is hot during the benchmark while it may not be in a real-world application. A cold cache favors the matrix-based method, which needs less data to be loaded from the cache (no need to load the array of pointers).
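The exact code of this new benchmark is not reproduced here. The sketch below is an assumption about what "inner loop replaced with 1 statement" could look like: one store per row, so the flat version only computes an offset while the nested version must first load the row pointer. The function names and the use of std::vector are illustrative, not the original classes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Flat storage: element (i, j) lives at data[i * cols + j].
void fill_column_flat(std::vector<int>& data, std::size_t rows,
                      std::size_t cols, std::size_t j) {
    assert(j < cols && data.size() >= rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        data[i * cols + j] = static_cast<int>(i + j);  // one strided store per row
}

// Array of arrays: every row is its own allocation.
void fill_column_nested(std::vector<std::unique_ptr<int[]>>& rows, std::size_t j) {
    for (std::size_t i = 0; i < rows.size(); ++i)
        rows[i][j] = static_cast<int>(i + j);  // load the row pointer, then store
}
```

Both functions write the same values. The only difference is the extra pointer load per row in the nested version, which is the overhead discussed above.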
To understand why the gap is not so huge, we need to analyse the assembly code again. The full assembly code can be seen on Godbolt. Clang uses 3 different strategies to speed up the loop (SIMD, scalar+unrolling and scalar), but the unrolled one is the one that should actually be used in this case. Here is the hot loop for the matrix-based method:
.LBB0_27:
mov dword ptr [r12 + rdi], esi
lea ebx, [rsi + 1]
mov dword ptr [r12 + rdx], ebx
lea ebx, [rsi + 2]
mov dword ptr [r12 + rcx], ebx
lea ebx, [rsi + 3]
mov dword ptr [r12 + rax], ebx
add rsi, 4
add r12, r8
cmp rsi, r9
jne .LBB0_27
Here is the one for the array of array:
.LBB1_28:
mov rbp, qword ptr [rdi - 48]
mov dword ptr [rbp], edx
mov rbp, qword ptr [rdi - 32]
lea ebx, [rdx + 1]
mov dword ptr [rbp], ebx
mov rbp, qword ptr [rdi - 16]
lea ebx, [rdx + 2]
mov dword ptr [rbp], ebx
mov rbp, qword ptr [rdi]
lea ebx, [rdx + 3]
mov dword ptr [rbp], ebx
add rdx, 4
add rdi, 64
cmp rsi, rdx
jne .LBB1_28
At first glance, the second one seems clearly less efficient because there are far more instructions to execute. But as said previously, modern processors execute instructions in parallel. Thus, the instruction dependencies, and especially the critical path (e.g. dependency chains), play a significant role in the resulting performance, not to mention the saturation of the processor units and more specifically of the back-end ports of the cores. Since the performance of this loop depends strongly on the target architecture, we should consider a specific processor architecture in order to analyse how fast each method is in this case. Let's choose a relatively recent mainstream architecture: Intel CoffeeLake.
The first loop is clearly bound by the store instructions (mov dword ptr [...], ...) since there is only 1 store port on this architecture, while lea and add instructions can be executed on multiple ports (and the cmp+jne is cheap because it can be macro-fused and predicted). With 4 stores per iteration, the loop should take 4 cycles per iteration unless it is bound by the memory hierarchy.
The second loop is more complex, but it is also bound by the store instructions (mov dword ptr [rbp], ...). Indeed, CoffeeLake has two load ports, so 2 mov rbp, qword ptr [...] instructions can be executed per cycle; the same is true for lea, which can also be executed on 2 ports; the add and cmp+jne are still cheap. The number of instructions is not large enough to saturate the front-end, so the ports are the bottleneck here. In the end, this loop also takes 4 cycles per iteration, assuming the memory hierarchy is not a problem. The thing is, instruction scheduling is not always perfect in practice, so the dependencies on the load instructions can introduce significant latency if something goes wrong. Since there is higher pressure on the memory hierarchy, a cache miss would cause the second loop to stall for many cycles, as opposed to the first loop (which only does writes). Not to mention that a cache miss is more likely to happen in the second case, since there is an 8 KB buffer of pointers to keep in the L1 cache for this computation to be fast: loading items from the L2 takes a dozen cycles, and loading data into the L3 can cause some cache lines to be evicted. This is why the second loop is slightly slower in this new benchmark.
What if we use another processor? The result can be significantly different, especially from IceLake (Intel) and Zen2 (AMD) onward, as they have 2 store ports. Things are pretty difficult to analyse on such processors (since there may not be a single port that is the bottleneck, nor may the back-end be the bottleneck at all). This is especially true for Zen2/Zen3, which have 2 shared load/store ports and one dedicated only to stores (meaning 2 loads + 1 store scheduled in 1 cycle, or 1 load + 2 stores, or no load + 3 stores). Thus, the best approach is certainly to run practical benchmarks on such platforms while taking care to avoid benchmarking biases.
Note that the memory alignment of the sub-arrays is pretty critical too. With N=1024, the matrix-based method can be significantly slower. This is because the memory layout of the matrix-based method is likely to cause cache thrashing in this case, while the array-of-arrays method typically adds some padding that prevents the issue in this very specific case. The thing is, the added padding is typically sizeof(size_t) for mainstream bucket-based allocators, so the issue just happens for another value of N and is not really prevented. In fact, for N=1022, the array-of-arrays method is significantly slower. This perfectly matches the above explanation, since sizeof(size_t) = 2*sizeof(int) = 8 on the target machine (64-bit). Thus, both methods suffer from this issue, but it can be easily controlled with the matrix-based method by adding some padding, while it cannot be easily controlled with the array-of-arrays method because the implementation of the allocator depends on the platform by default.
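As a sketch of what "adding some padding" to the matrix-based method could look like: the class below keeps a row stride larger than the logical number of columns. The class name PaddedMatrix and the pad parameter are made up for illustration, and the right amount of padding is an assumption that has to be tuned for the target cache.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat matrix whose consecutive rows start `stride` elements apart,
// with stride = cols + pad, so that rows do not all map to the same cache sets.
class PaddedMatrix {
public:
    PaddedMatrix(std::size_t rows, std::size_t cols, std::size_t pad)
        : rows_(rows), cols_(cols), stride_(cols + pad), data_(rows * (cols + pad)) {}

    int& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return data_[i * stride_ + j];
    }
    int operator()(std::size_t i, std::size_t j) const {
        assert(i < rows_ && j < cols_);
        return data_[i * stride_ + j];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    std::size_t stride() const { return stride_; }

private:
    std::size_t rows_, cols_, stride_;
    std::vector<int> data_;  // padding elements are allocated but never addressed
};
```

For example, PaddedMatrix m(1024, 1024, 16) uses a stride of 1040 ints, so successive rows are no longer exactly 4096 bytes apart. Indexing is unchanged for the caller.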
I haven't looked through your code in a lot of detail. Instead, I tested your implementations against a really simple wrapper around an std::vector, then added a little bit of timing code so I didn't have to run under a profiler to get a meaningful result. Oh, and I really didn't like the code taking a reference to const, then using a cast to void to allow the code to modify the matrix. I certainly can't imagine expecting people to do that in normal use.
The result looked like this:
#include <chrono>
#include <iomanip>
#include <iostream>
#include <locale>
#include <memory>
#include <string>
#include <vector>
template <class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
BasicArray() = default;
BasicArray(std::size_t);
};
template <class T>
BasicArray<T>::BasicArray(std::size_t size)
: std::unique_ptr<T[]>(new T[size])
{
}
template <class T>
class Matrix : public BasicArray<T> {
public:
Matrix() = default;
Matrix(std::size_t, std::size_t);
T& operator()(std::size_t, std::size_t) const;
std::size_t get_index(std::size_t, std::size_t) const;
std::size_t get_size(std::size_t) const;
private:
std::size_t sizes[2];
};
template <class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
: BasicArray<T>(i * j)
, sizes { i, j }
{
}
template <class T>
T& Matrix<T>::operator()(std::size_t i, std::size_t j) const
{
return (*this)[get_index(i, j)];
}
template <class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const
{
return i * get_size(2) + j;
}
template <class T>
std::size_t Matrix<T>::get_size(std::size_t d) const
{
return sizes[d - 1];
}
template <class T>
class Array : public BasicArray<T> {
public:
Array() = default;
Array(std::size_t);
std::size_t get_size() const;
private:
std::size_t size;
};
template <class T>
Array<T>::Array(std::size_t size)
: BasicArray<T>(size)
, size(size)
{
}
template <class T>
std::size_t Array<T>::get_size() const
{
return size;
}
static void test_matrix(Matrix<int>& m)
{
for (std::size_t i = 0; i < m.get_size(1); ++i) {
for (std::size_t j = 0; j < m.get_size(2); ++j) {
m(i, j) = i + j;
}
}
}
static void
test_array_array(Array<Array<int>>& aa)
{
for (std::size_t i = 0; i < aa.get_size(); ++i) {
for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
aa[i][j] = i + j;
}
}
}
namespace JVC {
template <class T>
class matrix {
std::vector<T> data;
size_t cols;
size_t rows;
public:
matrix(size_t y, size_t x)
: cols(x)
, rows(y)
, data(x * y)
{
}
T& operator()(size_t y, size_t x)
{
return data[y * cols + x];
}
T operator()(size_t y, size_t x) const
{
return data[y * cols + x];
}
std::size_t get_rows() const { return rows; }
std::size_t get_cols() const { return cols; }
};
static void test_matrix(matrix<int>& m)
{
for (std::size_t i = 0; i < m.get_rows(); ++i) {
for (std::size_t j = 0; j < m.get_cols(); ++j) {
m(i, j) = i + j;
}
}
}
}
template <class F, class C>
void do_test(F f, C &c, std::string const &label) {
using namespace std::chrono;
auto start = high_resolution_clock::now();
f(c);
auto stop = high_resolution_clock::now();
std::cout << std::setw(20) << label << " time: ";
std::cout << duration_cast<milliseconds>(stop - start).count() << " ms\n";
}
int main()
{
std::cout.imbue(std::locale(""));
constexpr int N = 20000;
Matrix<int> m(N, N);
Array<Array<int>> aa(N);
JVC::matrix<int> m2 { N, N };
for (std::size_t i = 0; i < aa.get_size(); ++i) {
aa[i] = Array<int>(N);
}
using namespace std::chrono;
do_test(test_matrix, m, "Matrix");
do_test(test_array_array, aa, "array of arrays");
do_test(JVC::test_matrix, m2, "JVC Matrix");
}
And the result looked like this:
Matrix time: 1,893 ms
array of arrays time: 1,812 ms
JVC Matrix time: 620 ms
So, a trivial wrapper around std::vector is faster than either of your implementations by a factor of about 3.
I would suggest that with this much overhead, it's difficult to be at all certain the timing difference you're seeing stems from storage layout.
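One way to reduce the noise in such a comparison is to time each test several times and keep the fastest run, so that one-off costs such as the first touch of freshly allocated memory do not dominate. The helper below is only a sketch (the name best_time_ms is made up), not a replacement for a proper benchmarking framework.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: run f `reps` times and return the fastest run in milliseconds.
template <class F>
double best_time_ms(F&& f, int reps) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to measuring intervals
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        auto start = clock::now();
        f();
        auto stop = clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}
```

It could be used as best_time_ms([&] { test_matrix(m); }, 5) for each of the three containers.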
To my surprise, your tests are basically correct.
They go against historical knowledge too. (see Dynamic Arrays in C—The Wrong Way).
I corroborated the result with Quickbench and the two timings are almost the same.
https://quick-bench.com/q/FhhJTV8IdIym0rUMkbUxvgnXPeA
The only explanation I can offer is that, since your code is so regular, the compiler figures out that you are asking for consecutive equal-sized allocations, which can be replaced by a single block, and in turn the hardware can later predict the access pattern.
However, I tried making N volatile and inserting a bunch of randomly interleaved allocations at initialization and still get the same result.
I even varied the optimization level from -Og up to -Ofast and increased N, and I still get the same result.
It was only when I used benchmark::ClobberMemory that I saw a very small but appreciable difference (with clang, but not with GCC).
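For readers without Google Benchmark at hand: as far as I know, on GCC and Clang ClobberMemory boils down to an empty inline-asm statement with a "memory" clobber. The sketch below uses a hand-written stand-in (clobber_memory and fill_with_clobber are made-up names) to show where such a barrier could sit in the fill loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compiler-level barrier (GCC/Clang inline asm): emits no instructions, but the
// compiler must assume that any memory may be read or written at this point.
inline void clobber_memory() {
    asm volatile("" : : : "memory");
}

// Hypothetical benchmark body: fill a flat rows x cols matrix, clobbering after each row.
void fill_with_clobber(std::vector<int>& data, std::size_t rows, std::size_t cols) {
    assert(data.size() >= rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j)
            data[i * cols + j] = static_cast<int>(i + j);
        clobber_memory();  // the stores of this row must be done before continuing
    }
}
```

The barrier only constrains the compiler. It does not flush caches or order anything at the hardware level.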
So it could have to do with the memory access pattern.
https://quick-bench.com/q/FhhJTV8IdIym0rUMkbUxvgnXPeA
Another thing that made a (small) difference, but is important in real applications, was including the initialization step inside the timing; still surprisingly, the difference was only between 5 and 10% (in favor of the single-block array).
Conclusion: The compiler, or most likely the hardware, must be doing something really amazing.
The fact the pointer-indirection version is never really faster than the block array makes me think that something is reducing one case to the other in effect.
This deserves more research.
Here is the machine code if someone is interested: https://godbolt.org/z/ssGj7aq7j
Afterthought: Before abandoning contiguous arrays I would at least remain suspicious that this result could be an oddity for 2 dimensions and it is not valid for structures of 3 or 4 dimensions.
Disclaimer: This is interesting to me because I am implementing a multidimensional array library and I care about performance.
The library is a generalization of your class Matrix to arbitrary dimensions: https://gitlab.com/correaa/boost-multi.