The code below (needs Google Benchmark) fills two vectors and adds them, storing the result in the first one. For the vector types I've used Eigen::VectorXd and std::vector to compare performance:
#include <Eigen/Core>
#include <benchmark/benchmark.h>
#include <vector>
auto constexpr N = 1024u;
template <typename TVector>
TVector generate(unsigned min) {
TVector v(N);
for (unsigned i = 0; i < N; ++i)
v[i] = static_cast<double>(min + i);
return v;
}
auto ev1 = generate<Eigen::VectorXd>(0);
auto ev2 = generate<Eigen::VectorXd>(N);
auto sv1 = generate<std::vector<double>>(0);
auto sv2 = generate<std::vector<double>>(N);
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
v1 += v2;
}
void add_vectors(std::vector<double>& v1, std::vector<double> const& v2) {
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
static void eigen(benchmark::State& state) {
for (auto _ : state) {
add_vectors(ev1, ev2);
benchmark::DoNotOptimize(ev1);
}
}
static void standard(benchmark::State& state) {
for (auto _ : state) {
add_vectors(sv1, sv2);
benchmark::DoNotOptimize(sv1);
}
}
BENCHMARK(standard);
BENCHMARK(eigen);
I'm running it on an Intel Xeon E-2286M @ 2.40 GHz, using Eigen 3.3.9 and MSVC 16.11.2 with (among others) these relevant compiler switches: /GL, /Gy, /O2, /D "NDEBUG", /Oi, and /arch:AVX. A typical output looks like this:
Run on (16 X 2400 MHz CPU s)
CPU Caches:
L1 Data 32K (x8)
L1 Instruction 32K (x8)
L2 Unified 262K (x8)
L3 Unified 16777K (x1)
--------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------
standard 99 ns 100 ns 7466667
eigen 169 ns 169 ns 4072727
which seems to show that operating on std::vector is ~69% faster than on Eigen::VectorXd. In the disassembly, the tight loops look like these:
// For Eigen::VectorXd
00007FF672221A11 vmovupd ymm0,ymmword ptr [rcx+rax*8]
00007FF672221A16 vaddpd ymm1,ymm0,ymmword ptr [r8+rax*8]
00007FF672221A1C vmovupd ymmword ptr [r8+rax*8],ymm1
00007FF672221A22 add rax,4
00007FF672221A26 cmp rax,rdx
00007FF672221A29 jge eigen+0C7h (07FF672221A37h)
00007FF672221A2B mov rcx,qword ptr [rsp+48h]
00007FF672221A30 mov r8,qword ptr [rsp+58h]
00007FF672221A35 jmp eigen+0A1h (07FF672221A11h)
// For std::vector
00007FF672221B40 vmovups ymm1,ymmword ptr [rax+rdx-20h]
00007FF672221B46 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx-20h]
00007FF672221B4C vmovups ymmword ptr [rax+rcx-20h],ymm1
00007FF672221B52 vmovups ymm1,ymmword ptr [rax+rdx]
00007FF672221B57 vaddpd ymm1,ymm1,ymmword ptr [rax+rcx]
00007FF672221B5C vmovups ymmword ptr [rax+rcx],ymm1
00007FF672221B61 lea rax,[rax+40h]
00007FF672221B65 sub r8,1
00007FF672221B69 jne standard+0C0h (07FF672221B40h)
One can notice that both use vaddpd to add 4 doubles at a time. However, for std::vector the compiler unrolled the loop to perform 2 vaddpd per iteration, but it didn't do the same for Eigen::VectorXd. Another potentially important difference is that the loop for std::vector is aligned to 32 bytes (the address ends in 0x40 = 64 = 2*32).
FWIW: I've added /Qvec-report:2 and the compiler reports:
[...]\Core\AssignEvaluator.h(415) : info C5002: loop not vectorized due to reason '1305'
and reason 1305 means "Not enough type information."
My educated guess is that Eigen's effort to use intrinsics (here _mm256_add_pd) is counterproductive and confuses the compiler. Just letting the compiler do its business (auto-vectorisation) seems to be a better idea. Am I missing something, or could this be considered an Eigen bug (missed optimisation opportunity)?
TL;DR: The problem mainly comes from the constant loop bound and not directly from Eigen. Indeed, in the first case, Eigen stores the size of the vectors in vector attributes, while in the second case, you explicitly use the constant N.
Clever compilers can use this information to unroll loops more aggressively because they know that N is quite big. Unrolling a loop with a small N is a bad idea since the code will be bigger and has to be read by the processor. If the code is not already loaded in the L1 cache, it must be loaded from the other caches, the RAM or even the storage device in the worst case. The added latency is often bigger than the cost of executing a sequential loop with a small unroll factor. This is why compilers do not always unroll loops (at least not with a big unroll factor).
Inlining also plays an important role in this code. Indeed, if the functions are inlined, the compiler can propagate constants and know the size of the vector, enabling it to further optimize the code by unrolling the loop more aggressively. However, if the functions are not inlined, then there is no way the compiler can know the loop bounds. Clever compilers can still generate conditional code paths to optimize both small loops and big ones, but this makes the program bigger and introduces a small overhead. Compilers like ICC and Clang do generate the different code alternatives when the code can be vectorized but the loop bounds are unknown, or when aliasing is not known at compile time (the number of generated variants, and so the code size, can quickly become huge).
Note that inlining functions may not be enough since the constant propagation can be trapped by complex conditionals dealing with runtime-defined variables or by non-inlined function calls. Alternatively, the quality of the constant propagation may not be sufficient for the target example.
Finally, aliasing also plays a critical role in the ability of compilers to generate SIMD instructions (and possibly better unroll the loop) in this code. Indeed, aliasing often prevents the use of SIMD instructions, and it is not always easy for compilers to check aliasing and generate fast implementations accordingly.
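To make the aliasing point concrete, here is a minimal sketch that is not part of the benchmark. The function names are made up for illustration, and __restrict is a non-standard extension (accepted by MSVC, GCC and Clang) rather than anything Eigen or the question uses:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The compiler must assume dst and src may overlap, so it either emits a
// runtime overlap check or falls back to a more conservative loop.
void add_may_alias(double* dst, const double* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

// __restrict promises that dst and src do not overlap, which removes one
// obstacle to vectorization and unrolling.
void add_no_alias(double* __restrict dst, const double* __restrict src,
                  std::size_t n) {
    // Debug-only sanity check of the most obvious violation of the promise.
    assert(dst != src);
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

// Hypothetical helper: runs both variants on non-overlapping buffers and
// reports whether they produce the same result.
bool variants_agree(std::size_t n) {
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 1.0);
    add_may_alias(a.data(), b.data(), n);
    add_no_alias(c.data(), b.data(), n);
    return a == c;
}
```

Both functions compute the same thing when the buffers are distinct; only the information available to the optimizer differs, so the interesting comparison is in the generated assembly.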
Testing the hypotheses
If the vector-based implementation uses a loop bound stored in the vector objects, then the code generated by MSVC is not vectorized in the benchmark: the constant is not propagated correctly despite the inlining of the function. The resulting code should be much slower. Here is the generated code:
$LL24@standard:
vmovsd xmm0, QWORD PTR [r9+rcx*8]
vaddsd xmm1, xmm0, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r8+rcx*8], xmm1
mov rax, QWORD PTR std::vector<double,std::allocator<double> > sv1+8
inc edx
sub rax, QWORD PTR std::vector<double,std::allocator<double> > sv1
sar rax, 3
mov ecx, edx
cmp rcx, rax
jb SHORT $LL24@standard
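For reference, the std::vector variant tested above presumably differs from the question's code only in its loop bound. The following is a sketch of that assumption, not the exact code that was compiled; the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1024;

// Loop bound is the compile-time constant N, as in the question.
void add_vectors_constant_bound(std::vector<double>& v1,
                                const std::vector<double>& v2) {
    assert(v1.size() >= N && v2.size() >= N);
    for (unsigned i = 0; i < N; ++i)
        v1[i] += v2[i];
}

// Loop bound is read from the vector object. The compiler only knows its
// value if it manages to propagate the size from the construction site.
void add_vectors_runtime_bound(std::vector<double>& v1,
                               const std::vector<double>& v2) {
    assert(v1.size() == v2.size());
    for (unsigned i = 0; i < v1.size(); ++i)
        v1[i] += v2[i];
}
```

The two functions give identical results for vectors of size N; the difference only shows up in how aggressively the loop is vectorized and unrolled.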
If the Eigen-based implementation uses a constant loop bound, then the code generated by MSVC is well vectorized and unrolled correctly in the benchmark: the compile-time constant helps the compiler generate a loop unrolled 2 times. It does that by mixing SSE and AVX instructions, which is very surprising (this point is discussed below). The resulting code should be significantly faster than the original Eigen implementation. However, it may not be as fast as the initial vector implementation due to the unexpected use of SSE instructions. Here is the generated code:
$LL24@eigen:
vmovupd xmm1, XMMWORD PTR [rdx+rcx-16]
vaddpd xmm1, xmm1, XMMWORD PTR [rcx-16]
vmovupd xmm2, XMMWORD PTR [rcx+rdx]
vmovupd XMMWORD PTR [rcx-16], xmm1
vaddpd xmm1, xmm2, XMMWORD PTR [rcx]
vmovupd XMMWORD PTR [rcx], xmm1
vmovups ymm1, YMMWORD PTR [rdx+rcx+16]
vaddpd ymm1, ymm1, YMMWORD PTR [rcx+16]
vmovups YMMWORD PTR [rcx+16], ymm1
lea rcx, QWORD PTR [rcx+64]
sub rax, 1
jne SHORT $LL24@eigen
Additional notes
It is worth noting that the generated code for the non-inlined version is very inefficient scalar code (typically due to N being unknown and pointer aliasing being assumed possible).
Mixing SSE and AVX instructions in such a loop is clearly a sub-optimal strategy in your case and likely a compiler issue/bug. Indeed, the execution speed of the resulting code is certainly bounded by the store instructions on Intel processors like yours. Your processor can execute 1 store instruction per cycle, 2 load instructions per cycle and can compute 2 vectorized additions per cycle. It can execute up to 6 micro-instructions per cycle (coming from 5 independent instructions and possibly 4 cached additional instructions). As a result, the generated code mixing SSE and AVX will take at least 3 cycles per iteration. Meanwhile, the original vector-based implementation could execute 4 loads, 2 stores, 2 additions and 3 other instructions like lea/sub/branch in only 2 cycles (possibly 3 in practice due to complex hardware details such as the actual micro-instruction port scheduling and the micro-instruction cache). However, note that the compiler arguments do not tell the compiler to optimize the code for your specific processor architecture (i.e. Intel Coffee Lake). Still, I highly doubt mixing SSE and AVX code would result in any significant boost in performance on AMD processors either (or on any mainstream x86 processor). Alternatively, it might be because MSVC fails to fully detect that there is no aliasing in this case.
To get rid of most of the aliasing problems preventing code vectorization and loop unrolling, OpenMP SIMD directives (e.g. #pragma omp simd) can be used. MSVC supports this experimentally using the flag /openmp:experimental. Here is the resulting code:
void add_vectors(Eigen::VectorXd& v1, Eigen::VectorXd const& v2) {
#pragma omp simd
for (unsigned i = 0; i < N; ++i)
v1[i] += v2[i];
}
MSVC surprisingly generates assembly code with only SSE instructions, but if you enable AVX2, it generates relatively good code:
$LL26@eigen:
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
lea rdx, QWORD PTR [rdx+128]
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-192]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-192]
vmovupd YMMWORD PTR [rdx+rcx-192], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-160]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-160]
vmovupd YMMWORD PTR [rdx+rcx-160], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-128]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-128]
vmovupd YMMWORD PTR [rdx+rcx-128], ymm0
mov rcx, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev1
mov rax, QWORD PTR Eigen::Matrix<double,-1,1,0,-1,1> ev2
vmovupd ymm0, YMMWORD PTR [rdx+rcx-96]
vaddpd ymm0, ymm0, YMMWORD PTR [rdx+rax-96]
vmovupd YMMWORD PTR [rdx+rcx-96], ymm0
sub r8, 1
jne $LL26@eigen
This code is still not perfect due to the unexpected useless mov instructions.
Alternatively, it may be possible to use fixed-size Eigen vectors for better performance.
Finally, note that other compilers (like Clang, ICC and GCC) behave very differently on this benchmark.
I want to ensure that gcc knows:
The pointers refer to non-overlapping chunks of memory
The pointers have 32 byte alignments
Is the following correct?
template<typename T, typename T2>
void f(const T* __restrict__ __attribute__((aligned(32))) x,
T2* __restrict__ __attribute__((aligned(32))) out) {}
Thanks.
Update:
I try to use one read and lots of write to saturate the cpu ports for writing. I hope that would make the performance gain by aligned moves more significant.
But the assembly still uses unaligned moves instead of aligned moves.
Code (also at godbolt.org)
int square(const float* __restrict__ __attribute__((aligned(32))) x,
const int size,
float* __restrict__ __attribute__((aligned(32))) out0,
float* __restrict__ __attribute__((aligned(32))) out1,
float* __restrict__ __attribute__((aligned(32))) out2,
float* __restrict__ __attribute__((aligned(32))) out3,
float* __restrict__ __attribute__((aligned(32))) out4) {
for (int i = 0; i < size; ++i) {
out0[i] = x[i];
out1[i] = x[i] * x[i];
out2[i] = x[i] * x[i] * x[i];
out3[i] = x[i] * x[i] * x[i] * x[i];
out4[i] = x[i] * x[i] * x[i] * x[i] * x[i];
}
}
Assembly compiled with gcc 8.2 and "-march=haswell -O3"
It is full of vmovups, which are unaligned moves.
.L3:
vmovups ymm1, YMMWORD PTR [rbx+rax]
vmulps ymm0, ymm1, ymm1
vmovups YMMWORD PTR [r14+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r15+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r12+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [rbp+0+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L3
and r13d, -8
vzeroupper
Same behavior even for sandybridge:
.L3:
vmovups xmm2, XMMWORD PTR [rbx+rax]
vinsertf128 ymm1, ymm2, XMMWORD PTR [rbx+16+rax], 0x1
vmulps ymm0, ymm1, ymm1
vmovups XMMWORD PTR [r14+rax], xmm0
vextractf128 XMMWORD PTR [r14+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r13+0+rax], xmm0
vextractf128 XMMWORD PTR [r13+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r12+rax], xmm0
vextractf128 XMMWORD PTR [r12+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [rbp+0+rax], xmm0
vextractf128 XMMWORD PTR [rbp+16+rax], ymm0, 0x1
add rax, 32
cmp rax, rdx
jne .L3
and r15d, -8
vzeroupper
Using addition instead of multiplication (godbolt).
Still unaligned moves.
No, using float *__attribute__((aligned(32))) x means that the pointer itself is stored in aligned memory, not that it points to aligned memory (see footnote 1).
There is a way to do this, but it only helps for gcc, not clang or ICC.
See How to tell GCC that a pointer argument is always double-word-aligned? for __builtin_assume_aligned which works on all GNU C compatible compilers, and How can I apply __attribute__(( aligned(32))) to an int *? for more details about __attribute__((aligned(32))), which does work for GCC.
I used __restrict instead of __restrict__ because that C++ extension name for C99 restrict is portable to all the mainstream x86 C++ compilers, including MSVC.
typedef float aligned32_float __attribute__((aligned(32)));
void prod(const aligned32_float * __restrict x,
const aligned32_float * __restrict y,
int size,
aligned32_float* __restrict out0)
{
size &= -16ULL;
#if 0 // this works for clang, ICC, and GCC
x = (const float*)__builtin_assume_aligned(x, 32); // have to cast the result in C++
y = (const float*)__builtin_assume_aligned(y, 32);
out0 = (float*)__builtin_assume_aligned(out0, 32);
#endif
for (int i = 0; i < size; ++i) {
out0[i] = x[i] * y[i]; // auto-vectorized with a memory operand for mulps
// note clang using two separate movups loads
// instead of a memory operand for mulps
}
}
(gcc, clang, and ICC output on the Godbolt compiler explorer).
GCC and clang will use movaps / vmovaps instead of ups any time they have a compile-time alignment guarantee. (Unlike MSVC and ICC, which never use movaps for loads/stores, a missed optimization for anything that runs on Core2 / K10 or older.) And as you noticed, it's applying the -mavx256-split-unaligned-load/store effects for tunings other than Haswell (see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?), another clue that your syntax didn't work.
vmovups is not a performance problem when used on aligned memory; it performs identically to vmovaps on all AVX-supporting CPUs when the address is aligned at runtime. So in practice there's no real problem with your -march=haswell output. Only older CPUs, before Nehalem and Bulldozer, always decoded movups to multiple uops.
The real benefit (these days) to telling the compiler about alignment guarantees is that compilers sometimes emit extra code for startup/cleanup loops to reach an alignment boundary. Or without AVX, compilers can't fold a load into a memory operand for mulps unless it's aligned.
A good test case for this is out0[i] = x[i] * y[i], where the load result is only needed once. Or out0[i] *= x[i]. Knowing alignment enables movaps/mulps xmm0, [rsi], otherwise it's 2x movups + mulps. You can check for this optimization even on compilers like ICC or MSVC, which use movups even when they do know they have an alignment guarantee, but they will still make alignment-required code when they can fold a load into an ALU operation.
It seems __builtin_assume_aligned is the only really portable (to GNU C compilers) way to do this. You can do hacks like passing pointers to struct aligned_floats { alignas(32) float f[8]; };, but that's just cumbersome to use, and unless you actually access memory through objects of that type, it doesn't get compilers to assume alignment. (e.g. casting a pointer to that back to float *)
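As a self-contained illustration of the __builtin_assume_aligned approach, here is a minimal sketch. The names is_aligned and prod_assume_aligned are made up for this example, the builtin is only used when a GNU-compatible compiler is detected, and the assert is just a debug-time check that the caller really keeps the alignment promise:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is aligned to `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Element-wise product. The alignment promise is made inside the function
// with __builtin_assume_aligned, a GNU C extension.
void prod_assume_aligned(const float* __restrict x, const float* __restrict y,
                         std::size_t size, float* __restrict out) {
    // Debug-only check that the caller keeps the promise.
    assert(is_aligned(x, 32) && is_aligned(y, 32) && is_aligned(out, 32));
#if defined(__GNUC__) || defined(__clang__)
    // The builtin returns void*, so the result has to be cast back in C++.
    x = static_cast<const float*>(__builtin_assume_aligned(x, 32));
    y = static_cast<const float*>(__builtin_assume_aligned(y, 32));
    out = static_cast<float*>(__builtin_assume_aligned(out, 32));
#endif
    for (std::size_t i = 0; i < size; ++i)
        out[i] = x[i] * y[i];
}
```

A caller would typically declare its buffers with alignas(32) (or obtain them from an aligned allocator) before passing them in; passing a misaligned pointer after making this promise is undefined behaviour.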
I try to use one read and lots of write to saturate the cpu ports for writing.
Using more than 4 output streams can hurt by resulting in more conflict misses in the cache. Skylake's L2 cache is only 4-way, for example. But L1d is 8-way so you're probably ok for small buffers.
If you want to saturate the store port uop throughput, use narrower stores (e.g. scalar), not wide SIMD stores that need more bandwidth per uop. Back-to-back stores to the same cache line may be able to merge in the store buffer before committing to L1d, so it depends what you want to test.
Semi-related: a 2x load + 1x store memory access pattern like c[i] = a[i]+b[i] or STREAM triad will come closest to maxing out total L1d cache load+store bandwidth on Intel Sandybridge-family CPUs. On SnB/IvB, 256-bit vectors take 2 cycles per load/store, leaving time for store-address uops to use the AGUs on ports 2 or 3 during the 2nd cycle of a load. On Haswell and later (256-bit wide load/store ports), the stores need to use a non-indexed addressing mode so they can use the simple-addressing-mode store AGU on port 7.
But AMD CPUs can do up-to-2 memory ops per clock, with at most one being a store, so they'd max out with a copy-and-operate stores = loads pattern.
BTW, Intel recently announced Sunny Cove (the core microarchitecture used in Ice Lake), which will have 2x load + 2x store throughput per clock, a 2nd vector shuffle ALU, and 5-wide issue/rename. So that's fun! Compilers will need to unroll loops by at least 2 to not bottleneck on 1-per-clock loop branches.
Footnote 1: That's why (if you compile without AVX), you get a warning, and gcc omits an and rsp,-32 because it assumes RSP is already aligned. (It doesn't actually spill any YMM regs, so it should have optimized this out anyway, but gcc has had this missed-optimization bug for a while with locals or auto-vectorization-created objects with extra alignment.)
<source>:4:6: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
Good evening.
I know C-style arrays or std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have some situation in which the use of std::array performs better than with std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
return m + f;
}
And the benchmark functions:
void assemble_vec()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(v1[0],v2[0]);
glob[i+1] += comb(v1[1],v2[1]);
glob[i+2] += comb(v1[2],v2[2]);
}
}
void assemble_arr()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(a1[0],a2[0]);
glob[i+1] += comb(a1[1],a2[1]);
glob[i+2] += comb(a1[2],a2[2]);
}
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively, where for the vector version, the numbers are being added together through less efficient opcodes, as compared to the array version, which is using (more) SSE instructions. The vector version also involves more memory lookups compared to the array version. These factors in combination are going to result in code that executes faster for the std::array version than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed to not be modified after it's constructed, but the memory is still dynamically allocated and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In @Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
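To see that these sums really are available at compile time, here is a small sketch. It marks comb and the arrays constexpr, which the question's code does not do, purely so that static_assert can be used to check the values; check_constants is a made-up helper:

```cpp
#include <array>
#include <cassert>

// Same combination as in the question, marked constexpr for this sketch.
constexpr double comb(double m, double f) { return m + f; }

constexpr std::array<double, 3> a1{1.0, -1.0, 1.0};
constexpr std::array<double, 3> a2{1.0, 2.0, 1.0};

// These sums are evaluated by the compiler; no load from a1 or a2 is
// needed at run time.
static_assert(comb(a1[0], a2[0]) == 2.0, "a1[0] + a2[0]");
static_assert(comb(a1[1], a2[1]) == 1.0, "a1[1] + a2[1]");
static_assert(comb(a1[2], a2[2]) == 2.0, "a1[2] + a2[2]");

// Hypothetical runtime check: the first and last sums are the same value,
// which is why only two distinct constants are needed.
inline void check_constants() {
    assert(comb(a1[0], a2[0]) == comb(a1[2], a2[2]));
}
```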
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
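If you know that glob never overlaps v1 or v2, you can do the hoisting by hand. The following is a minimal sketch under that assumption; assemble_vec_hoisted is a made-up name, and the globals mirror the ones in the question:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 100;
const std::vector<double> v1{1.0, -1.0, 1.0};
const std::vector<double> v2{1.0, 2.0, 1.0};
std::vector<double> glob(N, 0.0);

inline double comb(double m, double f) { return m + f; }

// Hypothetical variant of assemble_vec: the three sums are computed once
// into locals, so stores to glob can no longer force reloads of v1/v2.
// This is only equivalent to the original if glob does not alias v1 or v2.
void assemble_vec_hoisted() {
    assert(v1.size() >= 3 && v2.size() >= 3 && glob.size() >= N);
    const double c0 = comb(v1[0], v2[0]);
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);
    for (std::size_t i = 0; i < N - 2; ++i) {
        glob[i]     += c0;
        glob[i + 1] += c1;
        glob[i + 2] += c2;
    }
}
```

Locals cannot be modified through a pointer the compiler has never seen escape, so it is free to keep c0, c1 and c2 in registers for the whole loop.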
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but it maybe used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem, because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
double *__restrict g = &glob[0]; // Helps MSVC, but not gcc/clang/ICC
// std::vector<double> &g = glob; // actually hurts ICC it seems?
// #define g glob // so use this as the alternative to __restrict
for (size_t i=0; i<N-2; ++i)
{
g[i] += comb(v1[0],v2[0]);
g[i+1] += comb(v1[1],v2[1]);
g[i+2] += comb(v1[2],v2[2]);
}
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
I think the point is that you use too small a storage size (six doubles). In the std::array case, this allows the compiler to eliminate the storage in RAM completely by placing the values in registers. The compiler can keep stack variables in registers if that is more optimal. This decreases memory accesses by half (only the writes to glob remain). In the case of std::vector, the compiler cannot perform such an optimization since dynamic memory is used. Try significantly larger sizes for a1, a2, v1 and v2.
I have a micro-controller that does not have an MMU, but we are using C and C++.
We are avoiding all dynamic memory usage (i.e. no new SomeClass() or malloc()) and most of the standard library.
Semi-Question 0:
From what I understand std::array does not use any dynamic memory so its usage should be OK (It is on the stack only). Looking at std::array source code, it looks fine since it creates a c-style array and then wraps functionality around that array.
The chip we are using has 1MB of flash memory for storing code.
Question 1:
I am worried that the use of templates in std::array will cause the binary to be larger, which will then potentially cause the binary to exceed the 1MB code memory limit.
I think if you create an instance of a std::array< int, 5 >, then all calls to functions on that std::array will occupy a certain amount of code memory, let's say X bytes of memory.
If you create another instance of std::array< SomeObject, 5 >, then call functions to that std::array, will each of those functions now be duplicated in the binary, thus taking up more code memory? X bytes of memory + Y bytes of memory.
If so, do you think the amount of code generated given the limited code memory capacity will be a concern?
Question 2:
In the above example, if you created a second std::array< int, 10 > instance, would the calls to functions also duplicate the function calls in the generated code? Even though both instances are of the same type, int?
std::array is considered a zero-cost abstraction, which means it should be fairly optimizable by the compiler.
As with any zero-cost abstraction, it may induce a small compile-time penalty, and if the optimizations required for it to be truly zero-cost are not supported, it may incur a small size or runtime penalty.
However, note that compilers are free to add padding at the end of a struct. Since std::array is a struct, you should check how your platform handles std::array, but I highly doubt this is an issue for you.
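One way to check the padding question on a given toolchain is to compare sizes directly. This is a small sketch, not something from the standard library; SomeObject, array_has_no_extra_padding and check_layout are made-up names for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct SomeObject { int id; char tag; };  // hypothetical payload type

// True if std::array<T, N> occupies exactly as much storage as T[N],
// i.e. the implementation added no padding of its own.
template <typename T, std::size_t N>
constexpr bool array_has_no_extra_padding() {
    return sizeof(std::array<T, N>) == sizeof(T[N]);
}

// Runtime sanity check: the elements start at the address of the object.
template <typename T, std::size_t N>
void check_layout(const std::array<T, N>& a) {
    assert(static_cast<const void*>(a.data()) == static_cast<const void*>(&a));
}
```

On an embedded target, the size comparison could equally be placed in a static_assert so that a surprising layout fails the build instead of going unnoticed.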
Take this array and std::array case:
#include <numeric>
#include <iterator>
template<std::size_t n>
int stuff(const int(&arr)[n]) {
return std::accumulate(std::begin(arr), std::end(arr), 0);
}
int main() {
int arr[] = {1, 2, 3, 4, 5, 6};
return stuff(arr);
}
#include <numeric>
#include <iterator>
#include <array>
template<std::size_t n>
int stuff(const std::array<int, n>& arr) {
return std::accumulate(std::begin(arr), std::end(arr), 0);
}
int main() {
std::array arr = {1, 2, 3, 4, 5, 6};
return stuff(arr);
}
Clang supports this case very well. All cases with std::array or raw arrays are handled the same way:
-O2 / -O3 both array and std::array with clang:
main: # #main
mov eax, 21
ret
However, GCC seems to have a problem optimizing it, for both the std::array and the raw array case:
-O3 with GCC for array and std::array:
main:
movdqa xmm0, XMMWORD PTR .LC0[rip]
movaps XMMWORD PTR [rsp-40], xmm0
mov edx, DWORD PTR [rsp-32]
mov eax, DWORD PTR [rsp-28]
lea eax, [rdx+14+rax]
ret
.LC0:
.long 1
.long 2
.long 3
.long 4
Then, it seems to optimize better with -O2 in the raw array case and fails with std::array:
-O2 GCC std::array:
main:
movabs rax, 8589934593
lea rdx, [rsp-40]
mov ecx, 1
mov QWORD PTR [rsp-40], rax
movabs rax, 17179869187
mov QWORD PTR [rsp-32], rax
movabs rax, 25769803781
lea rsi, [rdx+24]
mov QWORD PTR [rsp-24], rax
xor eax, eax
jmp .L3
.L5:
mov ecx, DWORD PTR [rdx]
.L3:
add rdx, 4
add eax, ecx
cmp rdx, rsi
jne .L5
rep ret
-O2 GCC raw array:
main:
mov eax, 21
ret
It seems that the GCC bug, which fails to optimize at -O3 but succeeds at -O2, is fixed in the most recent build.
Here's a compiler explorer with all the O2 and the O3
With all these cases stated, you can see a common pattern: no information about the std::array is emitted in the binary. There are no constructors, no operator[], no iterators and no algorithms. Everything is inlined. Compilers are good at inlining simple functions, and std::array member functions are usually very simple.
If you create another instance of std::array< SomeObject, 5 >, then call functions to that std::array, will each of those functions now be duplicated in the binary, thus taking up more flash memory? X bytes of memory + Y bytes of memory.
Well, you changed the data type your array contains. If you manually add overloads of all your functions to handle this additional case, then yes, all those new functions may take up some space. If your functions are small, there is a good chance they will be inlined and take less space. As you can see in the example above, inlining and constant folding may greatly reduce your binary size.
In the above example, if you created a second std::array instance, would the calls to functions also duplicate the function calls in flash memory? Even though both instances are of the same type, int?
Again, it depends. If you have many functions templated on the size of the array, both std::array and raw arrays may "create" different functions. But again, if they are inlined, there is no duplication to worry about.
With both a raw array and std::array, you can pass a pointer to the start of the array along with the size. If you find this more suitable for your case, then use that; both raw arrays and std::array can do it. A raw array implicitly decays to a pointer, and with std::array you must use arr.data() to get the pointer.
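The pointer-plus-size approach can be sketched roughly like this. The function names (sum, sum_raw_array, sum_std_array, check_sums) are made up for illustration; the point is only that a single non-template function can serve both kinds of array, whatever their length.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical helper (not from the answer above): a single non-template
// function, so only one copy exists no matter how many array sizes call it.
int sum(const int* data, std::size_t size) {
    return std::accumulate(data, data + size, 0);
}

int sum_raw_array() {
    int raw[] = {1, 2, 3, 4, 5, 6};
    // The raw array decays to int* implicitly; the size is passed separately.
    return sum(raw, sizeof(raw) / sizeof(raw[0]));
}

int sum_std_array() {
    std::array<int, 6> arr = {1, 2, 3, 4, 5, 6};
    // std::array does not decay, so ask for the pointer explicitly.
    return sum(arr.data(), arr.size());
}

// Small self-check: both call paths see the same six elements.
void check_sums() {
    assert(sum_raw_array() == 21);
    assert(sum_std_array() == 21);
}
```

The trade-off is that the size becomes a run-time value, so the compiler can only fold the loop away if it inlines sum() at the call site.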
Moving a member variable to a local variable reduces the number of writes in this loop despite the presence of the __restrict keyword. This is using GCC -O3. Clang and MSVC optimise the writes in both cases. [Note that since this question was posted we observed that adding __restrict to the calling function caused GCC to also move the store out of the loop. See the godbolt link below and the comments]
class X
{
public:
void process(float * __restrict d, int size)
{
for (int i = 0; i < size; ++i)
{
d[i] = v * c + d[i];
v = d[i];
}
}
void processFaster(float * __restrict d, int size)
{
float lv = v;
for (int i = 0; i < size; ++i)
{
d[i] = lv * c + d[i];
lv = d[i];
}
v = lv;
}
float c{0.0f};
float v{0.0f};
};
With gcc -O3 the first one has an inner loop that looks like:
.L3:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rax, rdi
movss DWORD PTR x[rip+4], xmm0 ;<<< the extra store
jne .L3
.L1:
rep ret
The second here:
.L8:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rdi, rax
jne .L8
.L7:
movss DWORD PTR x[rip+4], xmm0
ret
See https://godbolt.org/g/a9nCP2 for the complete code.
Why does the compiler not perform the lv optimisation here?
I'm assuming the 3 memory accesses per loop are worse than the 2 (assuming size is not a small number), though I've not measured this yet.
Am I right to make that assumption?
I think the observable behaviour should be the same in both cases.
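A quick way to sanity-check that claim for non-aliasing input is to run both methods on copies of the same buffer and compare the results. This is only a sketch: the helper same_observable_result and the test values are my own additions, chosen so that every intermediate value is exactly representable as a float. __restrict is kept as in the question; it is a compiler extension.

```cpp
#include <cassert>
#include <vector>

// Same class as in the question.
class X
{
public:
    void process(float * __restrict d, int size)
    {
        for (int i = 0; i < size; ++i)
        {
            d[i] = v * c + d[i];
            v = d[i];  // member written on every iteration
        }
    }
    void processFaster(float * __restrict d, int size)
    {
        float lv = v;
        for (int i = 0; i < size; ++i)
        {
            d[i] = lv * c + d[i];
            lv = d[i];
        }
        v = lv;  // member written once, after the loop
    }
    float c{0.0f};
    float v{0.0f};
};

// Hypothetical helper: runs both methods on identical buffers that do not
// alias the object, then compares the buffers and the final value of v.
bool same_observable_result()
{
    std::vector<float> a = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> b = a;
    X x1; x1.c = 0.5f;
    X x2; x2.c = 0.5f;
    x1.process(a.data(), static_cast<int>(a.size()));
    x2.processFaster(b.data(), static_cast<int>(b.size()));
    return a == b && x1.v == x2.v;
}
```

Calling same_observable_result() should return true. It says nothing about the case where d points at the object itself, which is exactly the case __restrict is meant to rule out.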
This seems to be caused by the missing __restrict qualifier on the f_original function. __restrict is a GCC extension; it is not quite clear how it is expected to behave in C++. Maybe it is a compiler bug (a missed optimization) that the qualifier's effect appears to be lost after inlining.
The two methods are not identical. In the first, the value of v is updated multiple times during the execution. That may be or may not be what you want, but it is not the same as the second method, so it is not something the compiler can decide for itself as a possible optimization.
The restrict keyword says there is no aliasing with anything else, which is in effect the same as if the value had been local (and no local had any references to it).
In the second case there is no external visible effect of v so it doesn't need to store it.
In the first case there is a possibility that something external might see it. The compiler doesn't know at this point that there will be no threads that could change it, but it knows that it doesn't have to read it, as it's neither atomic nor volatile. And the change to d[], another externally visible variable, makes the store necessary.
If the compiler writers reason, "well, neither d nor v is volatile or atomic, so we can just do it all using 'as-if'", then the compiler has to be sure no one can touch v at all. I'm pretty sure this will come in one of the new versions, as there is no synchronisation before the return and this will be the case in 99+% of all cases anyway. Programmers will then have to put either volatile or atomic on variables that are changed, which I think I could live with.